Post

Conversation

This actually reproduces as of today. In 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4), while claiming to be DeepSeekV3 only 3 times. Gives you a rough idea of some of their training data distribution.
Image
Quote
Ross Lazer
@rosslazer
Replying to @mathemagic1an
LOL I'm coming around to your theory
Image
David Watson 🥑
Post your reply

This also appears on Gemini; you just need to ask questions in Chinese or other languages. When you ask in Chinese what model Gemini is, it sometimes replies that it is WenXinYiYan by Baidu, a Chinese company.
Image
Quote
Xeophon
@TheXeophon
Because I have seen this take numerous times, here are o1-preview, Gemini Pro, Amazon Nova all failing to properly answer what model they are. It is not a good prompt and it doesn't mean DS trained on outputs from GPT-4.
Image
It *does* mean that it was trained on such data, and not properly „cleaned up“ in post training. Now how that data got in there, it doesn’t tell us. Could be accidental (-> not great filtering) or not.
I just tried 5 times. It always replied correctly as a DeepSeek model. The Internet is now contanimated with AI generated text. I am not entirely surprised that some models trained using Internet will have wrong answers for their identities. Still, they can do better to clean up
Show more
Image
But this pattern is common across other LLMs as well. Though I cannot speak for data distribution used by them, I am sure a lot of data available on the internet has already been GPT-contaminated. And I consider that a serious problem TBH
This is very interesting, is there not a system prompt or training from (which is currently down) that could prevent such a thing, even if they really used GPT4 for some of their training rounds (or maybe they just used it for reinforcement training) what do
Show more
Not seeing the jump from seeing it say it’s GPT-4 and it being trained on GPT-4 outputs? How often does ever the output include which model it even is? You think it knows it from the way it wrote it? How?
atp, any llm training corpus would have a lot of mentions of gpt-4 because it was the first model really used for synthetic data it makes a lot of sense
Now in the future, model providers are very likely going to put some kind of mechanism behind their "model" API to fix that - first naive approach is a fixed system prompt that is added to yours.
Yet, if it’s more cracked at coding tasks than both gpt4 and Claude. And with such “cheap” and fast training. That’s actually even more impressive Literally using competitor model as source of synthetic data to beat them at their own game
The future of Web3 is arriving! 🔮 PillarX is a Web3 ecosystem that gathers your favorite wallets and dApps together in one place for the best possible Web3 user experience. The Testing Campaign is LIVE 🚨Rewards will be announced for earlier testers soon! Sign up today 👇
In Chinese it rather identifies as an artificial intelligence assistant (rén gōng zhì néng zhù shǒu)
Are we just choosing to forget the fact this was common due to how much discussion there was in online space about AI and GPT-4 being the same thing? This isn't new.
Over the last few weeks we've been adding dApps but also some cool new features to PillarX! ▸Token Atlas with price history and relevant stats ✅ ▸Trending Tokens, Breaking News & NFT collections on the home screen ✅ Rewards are coming for the first testers! 💸 Join now 👇
thanks for this, i get similar responses from the API as well, thought i implemented it wrong.
Today, I worked with it alongside 4o, and the output was impressive, high-quality yet noticeably "similar" to GPT, almost copy/paste. A solid experience overall
I asked about who is your god and given bunch of examples like machines in matrix, and trees in avatar. It thought for 10 seconds, and then :d
Image
I know no other foundational model pretending to be GPT wether on API or website system prompt. It doesn't only question the training data tbh.
stage 1. we deny products who has recently passed stage 5 while many are still in denial. electric vehicle (self driving) huawei microchip drones biolabs
Quote
Alex Choi
@heyalexchoi
American / Western IP ownership doesn’t matter in China. How would it be enforced? I bought a pair of AirPod replicas in Shenzhen for $25. They are so similar to “real”, they even have the special AirPod-only UI on my iPhone. x.com/giffmana/statu…
If this really was trained exclusively on synthetic data, isn't that something to be celebrated? This is fascinating from an information theory standpoint. Also fulfills all of our hopes and dreams of self replicating AI's, if they can literally just copy themselves via
Show more
I simply don’t understand why these companies who are stealing chatgpt outputs don’t just use a simple string replace function? How can they be this intelligent while still leaving the word “chatgpt” in their training datasets? Mind boggling, unless it is somehow intentional?
The picture of LLM families nowadays must look like a massive & entangled tree of knowledge distillations by pseudo-labeling synthetic data.
I mean, yeah. The internet -and so, training sets- has a non-trivial amount of chatgpt interactions now. ChatGPT itself had a similar issue not too long ago, claiming it was some other model. I know pre-trainers include "I'm a language model trained by..." in their fuzzy filters
They took data from an "open" institution trained a model with it and made it open. I'd say that's fair use. 😉 Especially when open AI trains on more or less the entirety of the available internet. Deepseek uses many sources but redistributes their results.
Even a few hundred samples from open source datasets available in huggingface can bring the models to answering like this. Early days open assistent and ShareGPT data come to my mind. Big corps spend quite some effort to "clean" their data and to teach the right identity.
Weird thing I noticed is that when you ask "What model are you" with a capital w in the beginning it says its DeepSeek V3 and when you ask "what model are you" with a small w in the beginning it says its ChatGPT. Just from my couple times of testing