Interesting. I had assumed that AI’s annoying writing tics emerged from post-training. That the models had, essentially, over-learned certain effective rhetorical techniques. But if I’m reading correctly, the problem is the training data itself.
Quote
Alex Imas
@alexolegimas
This from @TuhinChakr is brilliant. That prize winning story from Granta? Turns out it's just a bunch of random whole phrases taken directly from existing text on the internet. Tool allows you to trace those n-grams directly to their source, which is mostly random fanfiction.
David Watson 🥑

Em dashes, XY contrasts, triplicates. All fine writing techniques when used in moderation. So I figured it was like wine, where fruitier wine does better in blind taste tests, but then people find it repellent in volume.
I've long thought that pretraining has more impact on writing than conventional wisdom says. there's a reason that, despite the huge incentive to do so, better post-training hasn't fixed it. some early issues definitely came from RLHF, but the persistence indicates it's in the
Have you tried Kimi? I think they had something in pre-training that made their model sound a lot nicer or something (I may be misremembering, it could be post-training). They had some text diversity thing, I think?
I have not. I’ll try it out. I don’t pay close attention to differences in prose style among the labs, ’cause I don’t really use the models to chat very much.
So the issue starts in pretraining. Which parts of the data are actually driving those tics?
I don't think you should believe this blog post. It's fundamentally confused about how n-grams work and you'll leave yourself with a permanently wrong impression if you take it seriously
we built a whole content system and kept hitting the same writing patterns regardless of how we changed the prompts. eventually realized some of those tics are baked in too deep to prompt around. the only thing that actually worked was tight structural constraints on the brief -
I feel like all questions around “why did the model do this” are sort of like “why did the stock go up/down”. There are a bunch of good answers but it’s really difficult to answer precisely given we’re talking in the trillions of parameters
I am emotionally devastated to learn that an LLM chatbot copied text written by people (the thing they spent years programming LLM chatbots to do in their Evil Laboratories.)
I’m halfway through Daniel Yergin’s The Prize and it’s crazy how many of the current AI tics are in his work. If I didn’t know when it was written I would have been like hold up…
I was previously shortlisted for this prize. I'm not keen on generative AI. But it's hard to ignore what seems to be open AI plagiarism that we in the literary community didn't catch. That the data is repeated verbatim makes the case that AI creative writing is plagiarism imo