META JUST KILLED TOKENIZATION !!!
A few hours ago they released the "Byte Latent Transformer": a tokenizer-free architecture that dynamically encodes bytes into patches and achieves better inference efficiency and robustness!
(I was just talking about how we need dynamic tokenization that is learned during training
It's like fucking christmas!)
I don't want to talk too much about the architecture.
But here's a nice visualization from their paper.
Let's look at benchmarks instead :)
"BLT models can match the performance of tokenization-based models like Llama 3 at scales up to 8B and 4T bytes, and can trade minor losses in evaluation metrics for up to 50% reductions in inference flops!"
This is basically a perplexity vs. training FLOPs chart: scaling laws with compute. BPB (bits-per-byte) is a tokenizer-independent version of perplexity.
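For anyone wondering why BPB is tokenizer independent: you normalize the total loss by the number of bytes in the text, not by the number of tokens. Here is a rough sketch of that conversion. The function name and the numbers are made up for illustration, and it assumes the loss is a summed negative log-likelihood in nats.

```python
import math


def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    """Convert a summed negative log-likelihood (in nats) to bits per byte.

    Dividing by ln(2) turns nats into bits; dividing by the byte count
    (not the token count) makes the number comparable across tokenizers.
    """
    if n_bytes <= 0:
        raise ValueError("n_bytes must be positive")
    return total_nll_nats / (n_bytes * math.log(2))


# Hypothetical example: the loss value below is invented, not from the paper.
text = "Byte Latent Transformer"
n_bytes = len(text.encode("utf-8"))
example_total_nll = 20.0  # nats, illustrative only
print(f"{bits_per_byte(example_total_nll, n_bytes):.3f} bits per byte")
```

Two models with different tokenizers split the same text into different numbers of tokens, but the byte count of the text stays the same, so the comparison is fair.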
BLT is on par with or better than Llama 3 BPE!
Most importantly, they scale this approach to train a Llama-3 8B model on 1T tokens, which beats the standard Llama-3 architecture with a BPE tokenizer!
Make sure to check out my latest visualization. I spent way too long on it, so I have to shill for it now 
Quote
Lisan al Gaib
@scaling01
LMARENA EVOLUTION OF TOP 10 MODELS BY ELO OVER TIME
Since November 20th, only Google and OpenAI models have been in the Top 10 !!!
Anthropic hasn't made the Top 10 since September 27th.
Reminds me of this paper:
Quote
Lisan al Gaib
@scaling01
Just finished: "Human-like Episodic Memory for Infinite Context LLMs" and I have to admit it's pretty cool
The TLDR is:
A human-inspired, event-based, dynamic memory system that efficiently processes contexts of up to 10 million tokens by organizing input into coherent episodic x.com/scaling01/stat…
I keep thinking about this table. It's probably the most important one: the reason why we all love byte-level encoding or dynamic encoding schemes.
These are tiny 8B models. I wonder how much Llama-3.1-405B would improve simply by changing the tokenization.
On what
I knew this was coming tbh...
Like literally patches rather than tokens....
not really - you could try but this paper doesn't improve different modalities
at least there is no study in the paper showing this
Imagine dynamically updating your loss function based on the byte patch entropy of the input data. Low entropy, easy to learn, high entropy, hard to learn.
Kind of looks like a more complex version of character2vec.
But we are still limited by the window size for bundling these patches and then the transformer can't exactly consider every patch that has ever existed right?
So wouldn't there be a locality problem?
Is this still a preprocessing step done before training begins? Or are the patches somehow dynamically changed during training? How is this not just another tokenization scheme?
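To make the "dynamic patches" idea a bit more concrete, here is a toy sketch of entropy-based patching: start a new patch whenever the next byte is hard to predict. This is only an illustration, not the paper's implementation. BLT uses a small learned byte-level language model to estimate next-byte entropy. The bigram counts below are a stand-in for that model, and the function names and the threshold value are assumptions.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Entropy in bits of the next-byte distribution at each position.

    Stand-in estimator: bigram counts taken from the data itself.
    A real system would use a trained byte-level language model instead.
    """
    if not data:
        return []
    follow = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        follow[prev][nxt] += 1
    entropies = [8.0]  # first byte has no context: treat as maximally uncertain
    for prev in data[:-1]:
        counts = follow[prev]
        total = sum(counts.values())
        entropies.append(
            -sum((c / total) * math.log2(c / total) for c in counts.values())
        )
    return entropies


def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Split data into patches, opening a new patch at high-entropy bytes."""
    if not data:
        return []
    entropies = next_byte_entropies(data)
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:  # hard-to-predict byte: begin a new patch
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


sample = b"the cat sat on the mat and the cat sat on the hat"
for p in entropy_patches(sample, threshold=1.0):  # threshold is illustrative
    print(p)
```

The point of the sketch: patch boundaries depend on the content and on the entropy estimate, so predictable stretches end up in longer patches and surprising bytes get more compute. There is no fixed vocabulary of patches to look up.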
who cares, humans aren't interpretable either
let's just run full steam ahead in a future without interpretable superintelligence CHOO CHOO
Nothing can go wrong because nothing ever happens, right?
I'm about to bust!!!! I've been harassing people on X about this for so long


Learned bytewise cross-entropy over the inputs. Very nice. Cool paper. Thank you!
Big if true and if there are no trade-offs or overall _something_ that is a deal-breaker in practical functionality and applicability.
Breakdown of the paper:
This paper introduces a new language model architecture called the Byte Latent Transformer (BLT). It aims to address the limitations of the traditional tokenization-based language models.
The results show that BLT can match the performance of
Sounds promising!
Current architecture requiring Tokenization and embedding is rather complex and not efficient. Vector store search is still the limiting factor in RAG accuracy
idk if that will be cool for GPT-2 speedrun but i think you will be happy to see this :D
What if we no longer needed tokenization at all, and AI could just think in raw bytes? Could this be the breakthrough that makes machines smarter, faster, and more human-like than ever before?
meta just killed tokenizerization!!! they just released "byte latent transformer" a tokenizer free architecture. whatever, i still have a ton of meta pegging to do tonight.
AFAIK, it doesn't dispense with tokenization, but organizes tokens in bigger batches. Am I wrong?
It's a completely new approach that's trying out a lot of ideas that already exist. Things like MoE, CoT, MCTS... it's a lot to take in.
It's like Lucy lu asking "master, with the panda gone, who will be the next dragon warrior ?"
does this mean you can train fully multimodal with formatting? Just shovel in all YouTube videos, podcast audio, and text, and it just figures out the weights eventually?
Actually this is not new per se; there were previous papers (one from Google too) that process inputs or characters in patches and then send them to the transformer.
If you precompute the permutations of patches, you are effectively back to tokenization.
TLDR: Byte Latent Transformer (BLT)
model can achieve up to a 50% reduction in inference flops (floating-point operations) compared to traditional tokenization-based models like Llama 3, while maintaining similar or superior performance in terms of perplexity. This is a
Open source models have such a strong advantage, it's too risky for enterprises to rely on OpenAI/Anthropic APIs for critical infrastructure
what's with the dramatic headlines, lately everyone who wants to share anything of any substance has to overbullshit it like this? why the effff
looks ultra-stupid
also btw, tokenization still stays it just costs more as an ongoing process in learning now, oh so how
so it's... variable length tokens and the tokens are learned as part of training the model?
Skimming over it, it's basically learned tokenization. I don't see how this solves any of the issues with tokenization.
This was bound to happen... Language is not optimized for LLMs and tokenization is just an intuitive way to encode things...
Hopefully we can decode a big patch of simple text in a single forward pass... Basically de-linking compute from the number of "tokens"
How does this compare with MrT5?
tokenization has minor minor gains it basically does not matter given enough data
It looks promising. BLT will replace the tokenizer soon. Innovative Transformer! Hope I can fine-tune it for an AGI LLM model when it is fully released on Huggingface. I hope.
Does this mean now models can figure out how many r's are there in "strawberry"?
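A quick illustration of why people connect byte-level input to the "strawberry" question: at the byte level every letter is its own input unit, while a subword tokenizer may hide letters inside larger chunks. The subword split below is invented for illustration and is not the output of any real tokenizer. This only shows what the model gets to see, not whether a given model will count correctly.

```python
word = "strawberry"

# Byte-level view: each character of this ASCII word is a separate byte.
byte_view = list(word.encode("utf-8"))
r_count = byte_view.count(ord("r"))

# Hypothetical subword view (made-up split, for illustration only):
# the letters are buried inside multi-character pieces.
subword_view = ["str", "aw", "berry"]

print(f"bytes seen by a byte-level model: {len(byte_view)}")
print(f"pieces seen by a subword model (hypothetical split): {len(subword_view)}")
print(f"occurrences of 'r' visible at byte level: {r_count}")
```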
Hold my beer...
What about raw *bits*? ;)
Please communicate this to the academics who are still "word"-hacking and miseducating at the same time.