yannlecun:
Never test on the training set.
[image attachment: screenshot of text]
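The rule in the post can be made concrete with a held-out split. A minimal numpy sketch (the synthetic data and the 80/20 ratio are illustrative assumptions, not anything from the post):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))          # 100 samples, 3 features
y = (X[:, 0] > 0).astype(int)          # toy labels

# Shuffle, then hold out 20% that the model never sees during fitting.
idx = rng.permutation(len(X))
split = int(0.8 * len(X))
train_idx, test_idx = idx[:split], idx[split:]
X_train, y_train = X[train_idx], y[train_idx]
X_test, y_test = X[test_idx], y[test_idx]

# Reporting accuracy on X_train would be "testing on the training set":
# it rewards memorization. Generalization is measured on X_test only.
assert set(train_idx).isdisjoint(test_idx)
```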
sung.kim.mw:
[image attachment: graphic of poster, calendar and text]
mshonle:
Wait, wouldn't that be never train on the test set, or am I missing the point?
borun.d.chowdhury:
I used to regularly test LLMs on my own curated problems from game theory, maths and physics. Initially I would post them on LinkedIn, highlighting failures and also questioning the benchmarks. However, I found that newer versions were solving those problems easily, so I stopped posting. Last time, though, I asked o1 mini to solve laminar flow in an annular tube. It failed spectacularly. But o3 mini and then even lighter models solved it as if it was nothing. All this hype is BS. 1/2
borun.d.chowdhury:
If one really wants to check human level reasoning, curate the training and test data with a clear separation in time. Sure it’s hard to do but this is the only sure way of not testing on the train set. 2/2
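The temporal separation proposed above can be sketched in a few lines: split on publication date rather than at random, so that test problems postdate anything the model could have trained on. The dates and cutoff below are hypothetical:

```python
from datetime import date

# Hypothetical (publication date, problem) pairs.
problems = [
    (date(2023, 6, 1), "problem A"),
    (date(2024, 2, 1), "problem B"),
    (date(2025, 3, 1), "problem C"),  # written after the model's data cutoff
]
cutoff = date(2024, 12, 31)  # assumed training-data cutoff

train = [p for t, p in problems if t <= cutoff]
test = [p for t, p in problems if t > cutoff]
# Problems published after the cutoff cannot have leaked into training data,
# which is the "clear separation in time" the comment describes.
```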
steriana:
Huh...that's exactly what final exams in college are.
doyouknowmeforsure:
I am right now in my machine learning class and everybody is presenting their projects.
doyouknowmeforsure:
it's the final project that my professor wants. I built an MLP trained on the MNIST data set!
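A rough idea of such a course project, as a minimal numpy sketch of a one-hidden-layer MLP with softmax cross-entropy. The real exercise would train on MNIST itself; random synthetic arrays of the same shape stand in here so the snippet is self-contained:

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for MNIST: 256 samples of 784-dim inputs, 10 classes.
X = rng.normal(size=(256, 784)).astype(np.float32)
y = rng.integers(0, 10, size=256)

# One hidden layer, as in a basic MNIST MLP.
W1 = rng.normal(0, 0.05, size=(784, 64)).astype(np.float32)
b1 = np.zeros(64, dtype=np.float32)
W2 = rng.normal(0, 0.05, size=(64, 10)).astype(np.float32)
b2 = np.zeros(10, dtype=np.float32)

lr = 0.1
losses = []
for _ in range(100):
    h = np.maximum(X @ W1 + b1, 0.0)                 # ReLU hidden layer
    logits = h @ W2 + b2
    shifted = logits - logits.max(axis=1, keepdims=True)
    logp = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    losses.append(-logp[np.arange(len(y)), y].mean())  # cross-entropy

    p = np.exp(logp)                                 # softmax probabilities
    p[np.arange(len(y)), y] -= 1.0                   # gradient w.r.t. logits
    p /= len(y)
    dW2 = h.T @ p
    db2 = p.sum(axis=0)
    dh = p @ W2.T
    dh[h <= 0] = 0.0                                 # ReLU backward
    dW1 = X.T @ dh
    db1 = dh.sum(axis=0)
    W2 -= lr * dW2; b2 -= lr * db2
    W1 -= lr * dW1; b1 -= lr * db1
```

Note that the training loss falls here even though the labels are pure noise: the network is memorizing, which is exactly why training-set performance proves nothing about generalization.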
trqianf:
Time to add protoPNets! 🙃
zshao:
Maybe the US Math Olympiad problems were selected based on how badly the publicly available models do. Just speculating.
meliusvivere:
They didn’t test on Gemini 2.5?
danielogbuigwe:
It just came out. The research was probably carried out over the preceding weeks.
traveelravioli:
Interesting
supercake.ai:
shooketh
_wgljr:
I don’t claim to know anything about this type of research, but some of those totals in the screenshot of the original tweet don’t add up. Is that normal?
gkasperf:
Surprising a total of 0 people
nag3lt:
Unfortunately, the total number of LLM hype-men and fanboys is much farther from zero than we'd like.
gary.bradski:
Nonsense, both my video (sold) and space companies test on the train set. I call it “LAM” dropping the “S” from SLAM, but others call it SfM. Well, it’s really not the test set but the “object” is the same.
kshirsagarmahesh:
I used LLMs to help my high school kid with his math. They are awesome at solving most of the textbook problems; however, sometimes they write a correct solution with nonsensical and wrong intermediate steps.
fivetrp:
Even more important: Never train on the test set 😂 (ARC-AGI much?)
walulyajfrancis:
💯 Data overfitting; you need to challenge the model with data it has never seen before.
jpraderad:
Faith in humanity temporarily restored. Our consciousness and subjectivity still allow for better adaptation/transference. Evidence (weak) in support of this theory of why we developed consciousness and are not automatons.
dikaiosvne:
wow, what.
jpagano569:
Isn’t this like grading if a fish can climb a tree?
daniel.sum:
I wanna see a math AI try the Putnam exam. Cats can’t do that either.