This actually reproduces as of today. In 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4), while claiming to be DeepSeekV3 only 3 times.
Gives you a rough idea of some of their training data distribution.
Post
Conversation
Experience raw, unfiltered conversations with CAVEDUCK AI!
Where genuine convos happen with over 5,000 unique characters.
Who's up for some real talk?
#AILiveTalk #AIchatbot #AIgirlfriend #AICharacterChats
Slide 1 of 6 - Carousel
So they spent $6M on training GPUs, but then another $6B on OpenAI API calls to get the training data?
It *does* mean that it was trained on such data, and not properly „cleaned up“ in post training.
Now how that data got in there, it doesn’t tell us. Could be accidental (-> not great filtering) or not.
I just tried 5 times. It always replied correctly as a DeepSeek model. The Internet is now contanimated with AI generated text. I am not entirely surprised that some models trained using Internet will have wrong answers for their identities. Still, they can do better to clean up
Show more
Or it's the same nonsense that drove us to browser strings...
For ultimately the same reasons.
this works specifically for lowercase
playing with other variants is pretty interesting
Legends are made, not born.
Watch competitors battle it out on Surviving Mann and see who earns the title of MVP! #SurvivingMann #LegendsInTheMaking #NavySEALs #SpecOps
0:23
But this pattern is common across other LLMs as well. Though I cannot speak for data distribution used by them, I am sure a lot of data available on the internet has already been GPT-contaminated. And I consider that a serious problem TBH
This is very interesting, is there not a system prompt or training from (which is currently down) that could prevent such a thing, even if they really used GPT4 for some of their training rounds (or maybe they just used it for reinforcement training) what do
Show more
Not seeing the jump from seeing it say it’s GPT-4 and it being trained on GPT-4 outputs? How often does ever the output include which model it even is? You think it knows it from the way it wrote it? How?
atp, any llm training corpus would have a lot of mentions of gpt-4 because it was the first model really used for synthetic data
it makes a lot of sense
This only matters for post training right? There is a whole base model as well
In comparison, when asked the same question (what model are you?), model Heidi Klum responded 100% with "Heidi Klum".
Now in the future, model providers are very likely going to put some kind of mechanism behind their "model" API to fix that - first naive approach is a fixed system prompt that is added to yours.
The future of Web3 is arriving!
PillarX is a Web3 ecosystem that gathers your favorite wallets and dApps together in one place for the best possible Web3 user experience.
The Testing Campaign is LIVE
Rewards will be announced for earlier testers soon! Sign up today 
Im sure there are lots of innovations in deepseek v3, this kind of mistakes nuke any level of confidence in it.
This is likely in the training data. Unless specifically told an LLM will not know what it is.
Are we just choosing to forget the fact this was common due to how much discussion there was in online space about AI and GPT-4 being the same thing?
This isn't new.
It's what's been known now: America Innovates, Europe Regulates, China Replicates.
Over the last few weeks we've been adding dApps but also some cool new features to PillarX!
▸Token Atlas with price history and relevant stats
▸Trending Tokens, Breaking News & NFT collections on the home screen
Rewards are coming for the first testers!
Join now 
you'd might get that just scraping the internet for the last 24 months.
This is not an isolated incident though. Been happening several times here too.
Today, I worked with it alongside 4o, and the output was impressive, high-quality yet noticeably "similar" to GPT, almost copy/paste. A solid experience overall
If each one uses each other as training data they're all evolving into a single LLM
I asked about who is your god and given bunch of examples like machines in matrix, and trees in avatar. It thought for 10 seconds, and then :d
I know no other foundational model pretending to be GPT wether on API or website system prompt. It doesn't only question the training data tbh.
Experience raw, unfiltered conversations with CAVEDUCK AI!
Where genuine convos happen with over 5,000 unique characters.
Who's up for some real talk?
#AILiveTalk #AIchatbot #AIgirlfriend #AICharacterChats
Slide 1 of 6 - Carousel
No wonder all model feels same to some extent. I wish for unique models
stage 1.
we deny
products who has recently passed stage 5 while many are still in denial.
electric vehicle (self driving)
huawei
microchip
drones
biolabs
Quote
Alex Choi
@heyalexchoi
American / Western IP ownership doesn’t matter in China. How would it be enforced?
I bought a pair of AirPod replicas in Shenzhen for $25. They are so similar to “real”, they even have the special AirPod-only UI on my iPhone. x.com/giffmana/statu…
If this really was trained exclusively on synthetic data, isn't that something to be celebrated?
This is fascinating from an information theory standpoint.
Also fulfills all of our hopes and dreams of self replicating AI's, if they can literally just copy themselves via
Show more
Private islands, castles, yachts, jets, and experiences unheard of to the public. You can have it all.
Concierge.io: A private, exclusive, premium, and personalized travel booking service for high-net-worth individuals.
Apply today and #TravelUnlikeAnyOther
I simply don’t understand why these companies who are stealing chatgpt outputs don’t just use a simple string replace function? How can they be this intelligent while still leaving the word “chatgpt” in their training datasets? Mind boggling, unless it is somehow intentional?
make sense, they basically used chartgpt to generate synthetic data to train this small model
I mean, yeah. The internet -and so, training sets- has a non-trivial amount of chatgpt interactions now. ChatGPT itself had a similar issue not too long ago, claiming it was some other model.
I know pre-trainers include "I'm a language model trained by..." in their fuzzy filters
They took data from an "open" institution trained a model with it and made it open. I'd say that's fair use.
Especially when open AI trains on more or less the entirety of the available internet. Deepseek uses many sources but redistributes their results.
IMHO, this could be a downstream consequence of OpenAI outputs leaking everywhere on the internet
Even a few hundred samples from open source datasets available in huggingface can bring the models to answering like this.
Early days open assistent and ShareGPT data come to my mind.
Big corps spend quite some effort to "clean" their data and to teach the right identity.
Weird thing I noticed is that when you ask "What model are you" with a capital w in the beginning it says its DeepSeek V3 and when you ask "what model are you" with a small w in the beginning it says its ChatGPT. Just from my couple times of testing
Wow I can't believe it's reproducing the distribution of chat transcripts on the internet how can this be