Post

Conversation

This actually reproduces as of today. In 5 out of 8 generations, DeepSeekV3 claims to be ChatGPT (v4), while claiming to be DeepSeekV3 only 3 times. Gives you a rough idea of some of their training data distribution.

Quote

Ross Lazer

@rosslazer

Dec 26

Replying to @mathemagic1an

LOL I'm coming around to your theory

2:12 AM · Dec 27, 2024

Views

Post your reply

@Z7xxxZ7

Dec 27

This also appears on Gemini; you just need to ask questions in Chinese or other languages. When you ask in Chinese what model Gemini is, it sometimes replies that it is WenXinYiYan by Baidu, a Chinese company.

lol

Experience raw, unfiltered conversations with CAVEDUCK AI!

Where genuine convos happen with over 5,000 unique characters.

Who's up for some real talk?

#AILiveTalk #AIchatbot #AIgirlfriend #AICharacterChats

Slide 1 of 6 - Carousel

Who's up for some real talk?

So they spent $6M on training GPUs, but then another $6B on OpenAI API calls to get the training data?

And you really think I would have such an obviously stupid thought?

Quote

Xeophon

@TheXeophon

Dec 27

Because I have seen this take numerous times, here are o1-preview, Gemini Pro, Amazon Nova all failing to properly answer what model they are. It is not a good prompt and it doesn't mean DS trained on outputs from GPT-4.

It *does* mean that it was trained on such data, and not properly „cleaned up“ in post training. Now how that data got in there, it doesn’t tell us. Could be accidental (-> not great filtering) or not.

I just tried 5 times. It always replied correctly as a DeepSeek model. The Internet is now contanimated with AI generated text. I am not entirely surprised that some models trained using Internet will have wrong answers for their identities. Still, they can do better to clean up

you have a question mark, I don't.

Or it's the same nonsense that drove us to browser strings... For ultimately the same reasons.

this works specifically for lowercase playing with other variants is pretty interesting

13K

American Stories Network

@AmericanStorie3

Legends are made, not born.

Watch competitors battle it out on Surviving Mann and see who earns the title of MVP! #SurvivingMann #LegendsInTheMaking #NavySEALs #SpecOps

0:23

But this pattern is common across other LLMs as well. Though I cannot speak for data distribution used by them, I am sure a lot of data available on the internet has already been GPT-contaminated. And I consider that a serious problem TBH

This is very interesting, is there not a system prompt or training from

@deepseek_ai

(which is currently down) that could prevent such a thing, even if they really used

@OpenAI

GPT4 for some of their training rounds (or maybe they just used it for reinforcement training) what do

Not seeing the jump from seeing it say it’s GPT-4 and it being trained on GPT-4 outputs? How often does ever the output include which model it even is? You think it knows it from the way it wrote it? How?

atp, any llm training corpus would have a lot of mentions of gpt-4 because it was the first model really used for synthetic data it makes a lot of sense

model good for my use case me use good model

This only matters for post training right? There is a whole base model as well

I used the GPT to train the GPT

GIF

In comparison, when asked the same question (what model are you?), model Heidi Klum responded 100% with "Heidi Klum".

Now in the future, model providers are very likely going to put some kind of mechanism behind their "model" API to fix that - first naive approach is a fixed system prompt that is added to yours.

Yet, if it’s more cracked at coding tasks than both gpt4 and Claude. And with such “cheap” and fast training. That’s actually even more impressive Literally using competitor model as source of synthetic data to beat them at their own game

11K

PillarX

@PX_Web3

The future of Web3 is arriving!

PillarX is a Web3 ecosystem that gathers your favorite wallets and dApps together in one place for the best possible Web3 user experience. The Testing Campaign is LIVE

Rewards will be announced for earlier testers soon! Sign up today

Im sure there are lots of innovations in deepseek v3, this kind of mistakes nuke any level of confidence in it.

Wild! 3 tries and I got it.

This is likely in the training data. Unless specifically told an LLM will not know what it is.

chatgpt is the most commonly talked about AI model on the internet

In Chinese it rather identifies as an artificial intelligence assistant (rén gōng zhì néng zhù shǒu)

Time to keep on seeking...

Are we just choosing to forget the fact this was common due to how much discussion there was in online space about AI and GPT-4 being the same thing? This isn't new.

Once I mentioned deepseet ask its name. It changed

he is sure about that

Ash Stuart

@ash_stuart_

Dec 27

It's what's been known now: America Innovates, Europe Regulates, China Replicates.

819

PillarX

@PX_Web3

Over the last few weeks we've been adding dApps but also some cool new features to PillarX! ▸Token Atlas with price history and relevant stats

▸Trending Tokens, Breaking News & NFT collections on the home screen

Rewards are coming for the first testers!

Join now

you'd might get that just scraping the internet for the last 24 months.

Same thing here, asking about privacy

Lol, seems they didn’t do good data filtering and post training

I will not call it trining distribution. It’s more data contamination.

So what. It works

lol

thanks for this, i get similar responses from the API as well, thought i implemented it wrong.

get deepseek to leak prompt to settle this?

Every AI lab

GIF

“Not stolen”

Since it's Chinese, I'm not surprised

This is not an isolated incident though. Been happening several times here too.

Interesting...

Today, I worked with it alongside 4o, and the output was impressive, high-quality yet noticeably "similar" to GPT, almost copy/paste. A solid experience overall

If each one uses each other as training data they're all evolving into a single LLM

did you give it a prompt to order it to say that before u snapshot ? Lol

I asked about who is your god and given bunch of examples like machines in matrix, and trees in avatar. It thought for 10 seconds, and then :d

I know no other foundational model pretending to be GPT wether on API or website system prompt. It doesn't only question the training data tbh.

I also getting it is an instance of OpenAI language model. GPT-4

Experience raw, unfiltered conversations with CAVEDUCK AI!

Where genuine convos happen with over 5,000 unique characters.

Who's up for some real talk?

#AILiveTalk #AIchatbot #AIgirlfriend #AICharacterChats

Slide 1 of 6 - Carousel

Who's up for some real talk?

Those things will be anything you want them to be.

AI models are often confused by this question

HIGHLY SKILLED!!

No wonder all model feels same to some extent. I wish for unique models

That's the thing right there. You use an existing LLM to fine tune a new one.

do i? i mean...

stage 1. we deny products who has recently passed stage 5 while many are still in denial. electric vehicle (self driving) huawei microchip drones biolabs

I also got everytime i asked

Quote

Alex Choi

@heyalexchoi

11h

American / Western IP ownership doesn’t matter in China. How would it be enforced? I bought a pair of AirPod replicas in Shenzhen for $25. They are so similar to “real”, they even have the special AirPod-only UI on my iPhone. x.com/giffmana/statu…

serene.rain

@ry_serene

Dec 27

If this really was trained exclusively on synthetic data, isn't that something to be celebrated? This is fascinating from an information theory standpoint. Also fulfills all of our hopes and dreams of self replicating AI's, if they can literally just copy themselves via

287

Travala.com

@travalacom

Private islands, castles, yachts, jets, and experiences unheard of to the public. You can have it all. Concierge.io: A private, exclusive, premium, and personalized travel booking service for high-net-worth individuals. Apply today and #TravelUnlikeAnyOther

This is a crazzzyy discovery

Another instance that aligns with your theory has emerged…

Thomas H. Chapin IV

@tomchapin

Dec 27

I simply don’t understand why these companies who are stealing chatgpt outputs don’t just use a simple string replace function? How can they be this intelligent while still leaving the word “chatgpt” in their training datasets? Mind boggling, unless it is somehow intentional?

The picture of LLM families nowadays must look like a massive & entangled tree of knowledge distillations by pseudo-labeling synthetic data.

Added with censorship ofc

make sense, they basically used chartgpt to generate synthetic data to train this small model

The Chinese sure know how to cheat.

Ahmed Moubtahij

@ahmed_moubtahij

Dec 27

I mean, yeah. The internet -and so, training sets- has a non-trivial amount of chatgpt interactions now. ChatGPT itself had a similar issue not too long ago, claiming it was some other model. I know pre-trainers include "I'm a language model trained by..." in their fuzzy filters

They took data from an "open" institution trained a model with it and made it open. I'd say that's fair use.

Especially when open AI trains on more or less the entirety of the available internet. Deepseek uses many sources but redistributes their results.

Casper Hansen

@casper_hansen_

Dec 27

IMHO, this could be a downstream consequence of OpenAI outputs leaking everywhere on the internet

Maybe it just pipes to OpenAI API when load is high

Pascal Pfeiffer

@pa_pfeiffer

Dec 27

Even a few hundred samples from open source datasets available in huggingface can bring the models to answering like this. Early days open assistent and ShareGPT data come to my mind. Big corps spend quite some effort to "clean" their data and to teach the right identity.

Weird thing I noticed is that when you ask "What model are you" with a capital w in the beginning it says its DeepSeek V3 and when you ask "what model are you" with a small w in the beginning it says its ChatGPT. Just from my couple times of testing