META JUST KILLED TOKENIZATION !!! A few hours ago they released "Byte Latent Transformer", a tokenizer-free architecture that dynamically encodes bytes into patches and achieves better inference efficiency and robustness! (I was just talking about how we need dynamic tokenization that is learned during training 🥲 It's like fucking christmas!)
Image
I don't want to talk too much about the architecture. But here's a nice visualization from their paper.
Image
Let's look at benchmarks instead :) "BLT models can match the performance of tokenization-based models like Llama 3 at scales up to 8B and 4T bytes, and can trade minor losses in evaluation metrics for up to 50% reductions in inference flops!" This is basically a perplexity vs. training FLOPs chart: scaling laws with compute. BPB (bits-per-byte) is a tokenizer-independent version of perplexity. BLT is on par with or better than Llama 3 BPE!
Image
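For anyone wondering what BPB means in that chart, here is a minimal sketch. It assumes the usual definition: total cross-entropy in nats, divided by ln 2 times the number of UTF-8 bytes. The function name `bits_per_byte` and all numbers are made up for illustration, not taken from the paper's code.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (in nats) into bits per byte."""
    n_bytes = len(text.encode("utf-8"))
    if n_bytes == 0:
        raise ValueError("text must contain at least one byte")
    return total_nll_nats / (math.log(2) * n_bytes)

# Hypothetical numbers: the same total NLL on the same text, but the text is
# split into a different number of units (3 tokens vs. 11 bytes).
text = "hello world"
total_nll = 11 * math.log(2)              # assumed summed NLL in nats

bpb = bits_per_byte(total_nll, text)      # depends only on the byte count
ppl_3_units = math.exp(total_nll / 3)     # per-unit perplexity with 3 tokens
ppl_11_units = math.exp(total_nll / 11)   # per-unit perplexity with 11 bytes
```

Per-unit perplexity changes with how the text is segmented, while BPB only depends on the byte count, which is why it works for comparing a byte-level model against a BPE model.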
Most importantly, they scale this approach to train a Llama-3 8B model on 1T tokens, which beats the standard Llama-3 architecture with the BPE tokenizer
Image

Make sure to check out my latest visualization. I spent way too long on it, so I have to shill for it now 😂
Quote
Lisan al Gaib
@scaling01
LMARENA EVOLUTION OF TOP 10 MODELS BY ELO OVER TIME Since November 20th, only Google and OpenAI models have been in the Top 10 !!! Anthropic hasn't made the Top 10 since September 27th.
Reminds me of this paper:
Quote
Lisan al Gaib
@scaling01
Just finished: "Human-like Episodic Memory for Infinite Context LLMs" and I have to admit it's pretty cool. The TLDR is: A human-inspired, event-based, dynamic memory system that efficiently processes contexts of up to 10 million tokens by organizing input into coherent episodic x.com/scaling01/stat…
Image
Image
Image
I have to think about this table the whole time. It's probably the most important one: the reason why we all love byte-level encoding or dynamic encoding schemes. These are tiny 8B models. I wonder how much Llama-3.1-405B would improve simply by changing the tokenization. On what
Image
Remember folks, if they're willing to publish it, then it ain't a real breakthrough!
Does this mean we can train models on any digital object (all file extensions) natively?
not really - you could try, but this paper doesn't improve different modalities. At least there is no study in the paper showing this
Imagine dynamically updating your loss function based on the byte patch entropy of the input data. Low entropy, easy to learn, high entropy, hard to learn.
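A toy sketch of that idea, under loud assumptions: it uses the empirical byte-frequency entropy of a patch as a stand-in for a learned entropy model, and the helpers `patch_entropy_bits` and `entropy_weight` (plus the `floor` parameter) are hypothetical names, not anything from the paper.

```python
import math
from collections import Counter

def patch_entropy_bits(patch: bytes) -> float:
    """Shannon entropy (in bits) of the empirical byte distribution in a patch."""
    if not patch:
        return 0.0
    n = len(patch)
    counts = Counter(patch)  # iterating bytes yields ints 0..255
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def entropy_weight(patch: bytes, floor: float = 0.1) -> float:
    """Hypothetical loss weight: entropy scaled by the 8-bit maximum, with a floor."""
    return max(floor, patch_entropy_bits(patch) / 8.0)

# Low-entropy patch (one repeated byte) vs. high-entropy patch (all 256 values).
easy = entropy_weight(b"aaaaaaaa")
hard = entropy_weight(bytes(range(256)))

# A weighted loss would then scale each patch's loss by its weight, e.g.
# weighted_loss = sum(w * l for w, l in zip(weights, per_patch_losses))
```

Whether up-weighting high-entropy patches actually helps training is an open question here; the sketch only shows how such a weight could be computed.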
Kind of looks like a more complex version of character2vec. But we are still limited by the window size for bundling these patches, and then the transformer can't exactly consider every patch that has ever existed, right? So wouldn't there be a locality problem?
Is this still a preprocessing step done before training begins? Or are the patches somehow dynamically changed during training? How is this not just another tokenization scheme?
who cares, humans aren't interpretable either. let's just run full steam ahead into a future without interpretable superintelligence. CHOO CHOO. Nothing can go wrong because nothing ever happens, right?
Big if true, and if there are no trade-offs or some overall _something_ that is a deal-breaker in practical functionality and applicability.
Breakdown of the paper: This paper introduces a new language model architecture called the Byte Latent Transformer (BLT). It aims to address the limitations of the traditional tokenization-based language models. The results show that BLT can match the performance of
Image
They didn't kill tokenization. They are doing streamed batching tokenization with a new name.
Sounds promising! Current architectures requiring tokenization and embedding are rather complex and inefficient. Vector store search is still the limiting factor in RAG accuracy
What if we no longer needed tokenization at all, and AI could just think in raw bytes? Could this be the breakthrough that makes machines smarter, faster, and more human-like than ever before?
meta just killed tokenizerization!!! they just released "byte latent transformer" a tokenizer free architecture. whatever, i still have a ton of meta pegging to do tonight.
It's a completely new approach that's trying out a lot of ideas that already exist. Things like MoE, CoT, MCTS... it's a lot to take in.
Interesting to see how this will unfold tokenization and prompt security altogether /pawnd
Actually this is not new per se; there were previous papers (one from Google too) which process inputs or characters in patches and then send them to the transformer. If you precompute the permutations of patches, you are effectively back to tokenization
TLDR: Byte Latent Transformer (BLT) 🥓 model can achieve up to a 50% reduction in inference flops (floating-point operations) compared to traditional tokenization-based models like Llama 3, while maintaining similar or superior performance in terms of perplexity. This is a
Open source models have such a strong advantage; it's too risky for enterprises to rely on OpenAI/Anthropic APIs for critical infrastructure
what's with the dramatic headlines? lately everyone who wants to share anything of any substance has to overbullshit it like this? why the effff, looks ultra-stupid. also btw, tokenization still stays, it just costs more as an ongoing process in learning now, oh so how
This was bound to happen... Language is not optimized for LLMs and tokenization is just an intuitive way to encode things... Hopefully we can decode a big patch of simple text in a single forward pass... Basically de-linking compute from the number of "tokens"
I am so not understanding the final table. Do 1T Tokens equate to 6T Bytes or what?
It looks promising. BLT will replace tokenizers soon. Innovative Transformer! Hope I can fine-tune it for an AGI LLM when it is fully released on Hugging Face. I hope.
Please communicate this to the academics who are still "word"-hacking and miseducating at the same time.