Interesting. I had assumed that AI’s annoying writing tics emerged from post-training. That the models had, essentially, over-learned certain effective rhetorical techniques. But if I’m reading @TuhinChakr correctly, the problem is the training data itself.

Quote

Alex Imas

@alexolegimas

10h

This from @TuhinChakr is brilliant. That prize-winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction.

9:16 PM · May 22, 2026

56.1K Views


Joe Weisenthal

@TheStalwart

Emdashes, XY contrasts, triplicates. All fine writing techniques when used in moderation. So I figured it was like wine, where fruitier wine does better in blind taste tests, but then people find it repellent in volume.

pretraining matters a lot. @KelseyTuoc is not correct here. There is published work that shows AI draws more from pretraining than humans: arxiv.org/pdf/2410.04265. I also have an @iclr_conf paper that shows awkward n-grams are more common in LLM text than in human text.

698

matt duffy

@iammattduff

I've long thought that pretraining has more impact on writing than conventional wisdom says. there's a reason that, despite the huge incentive to do so, better post-training hasn't fixed it. some early issues definitely came from RLHF, but the persistence indicates it's in the

9.8K

Joe Weisenthal

@TheStalwart

Sincere question. Are you sure there’s a huge incentive to fix it?

509

Sichu Lu

@lu_sichu

Have you tried Kimi? I think they had something in pre-training that made their model sound a lot nicer or something (I may be misremembering; it could be post-training). They had some text diversity thing, I think?

265

Joe Weisenthal

@TheStalwart

I have not. I’ll try it out. I don’t pay close attention to differences in prose style among the labs, 'cause I don’t really use the models to chat very much.

227

Qivshi

@Qivshi1

The phrases are not all that uncommon, e.g. "sour tang of fermenting" shows up in a few places before this in Google

"sour tang of fermenting" - Google Search

Are you skeptical of the claim being made?

3.3K

Tresy

@TresyHQ

So the issue starts in pretraining. Which parts of the data are actually driving those tics?

173

Sapphosphorence

@KrautFishing

I don't think you should believe this blog post. It's fundamentally confused about how n-grams work and you'll leave yourself with a permanently wrong impression if you take it seriously

Brandon Pizzacalla

@bpizzacalla

we built a whole content system and kept hitting the same writing patterns regardless of how we changed the prompts. eventually realized some of those tics are baked in too deep to prompt around. the only thing that actually worked was tight structural constraints on the brief -

188

Cody

@breenemachine

I feel like all questions around “why did the model do this” are sort of like “why did the stock go up/down”. There are a bunch of good answers but it’s really difficult to answer precisely given we’re talking in the trillions of parameters

Y Y

@ylyzybythyn

I am emotionally devastated to learn that an LLM chatbot copied text written by people (the thing they spent years programming LLM chatbots to do in their Evil Laboratories.)

151

Noah Brier

@heyitsnoah

I think it’s prob a mix. This from Simon Rich a few years ago is interesting time.com/6301288/the-ai as is this piece interconnects.ai/p/why-ai-writi

I'm a Screenwriter. These AI Jokes Give Me Nightmares

I’m halfway through Daniel Yergin’s The Prize and it’s crazy how many of the current AI tics are in his work. If I didn’t know when it was written I would have been like hold up…

41K

ML Kejera

@KejeraL

I was previously shortlisted for this prize. I'm not keen on generative AI. But it's hard to ignore what seems to be open AI plagiarism that we in the literary community didn't catch. That the data is repeated verbatim makes the case that AI creative writing is plagiarism imo