
META JUST KILLED TOKENIZATION !!! A few hours ago they released "Byte Latent Transformer". A tokenizer free architecture that dynamically encodes Bytes into Patches and achieves better inference efficiency and robustness! (I was just talking about how we need dynamic tokenization that is learned during training

It's like fucking christmas!)

I don't want to talk too much about the architecture. But here's a nice visualization from their paper.
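For readers without the figure, here is one way to picture "dynamically encodes Bytes into Patches". This is a toy sketch, not the paper's implementation: it assumes a patch boundary is placed wherever the next byte is hard to predict, and it estimates that difficulty with a bigram count model built from the input itself. The function names, the bigram stand-in, and the threshold value are all hypothetical.

```python
import math
from collections import Counter, defaultdict


def next_byte_entropies(data: bytes) -> list[float]:
    """Entropy in bits of the next-byte distribution given the previous byte.

    The distribution is estimated from bigram counts over `data` itself,
    which is a toy stand-in for a learned byte-level model.
    """
    follow = defaultdict(Counter)
    for prev, nxt in zip(data, data[1:]):
        follow[prev][nxt] += 1

    # The first byte has no context, so treat it as maximally uncertain.
    entropies = [8.0]
    for prev in data[:-1]:
        counts = follow[prev]
        total = sum(counts.values())
        h = -sum((c / total) * math.log2(c / total) for c in counts.values())
        entropies.append(h)
    return entropies


def entropy_patches(data: bytes, threshold: float) -> list[bytes]:
    """Start a new patch whenever the upcoming byte's entropy exceeds `threshold`."""
    patches, start = [], 0
    for i, h in enumerate(next_byte_entropies(data)):
        if i > 0 and h > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches


text = "the cat sat on the mat and the cat sat on the hat".encode("utf-8")
for patch in entropy_patches(text, threshold=1.0):
    print(patch)
```

Predictable stretches end up in long patches and uncertain positions open new ones, so the number of patches (and the compute spent on them) depends on the content rather than on a fixed vocabulary.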

Let's look at benchmarks instead :) "BLT models can match the performance of tokenization-based models like Llama 3 at scales up to 8B and 4T bytes, and can trade minor losses in evaluation metrics for up to 50% reductions in inference flops!" This is basically a perplexity vs training flops chart - scaling laws with compute. BPB is a tokenizer-independent version of perplexity. BLT is on par with or better than Llama 3 BPE!
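BPB (bits per byte) divides a model's summed cross-entropy by the number of bytes in the text instead of the number of tokens, which is why it can be compared across tokenizers. A minimal sketch, assuming the loss is summed in nats; the function name and the two example models are hypothetical and the loss values are made up for illustration:

```python
import math


def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Convert a summed negative log-likelihood (nats) into bits per byte."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)


text = "hello world"  # 11 bytes in UTF-8

# Hypothetical model A splits the text into 3 tokens at 2.0 nats each.
bpb_a = bits_per_byte(3 * 2.0, text)

# Hypothetical model B predicts 11 single bytes at 0.6 nats each.
bpb_b = bits_per_byte(11 * 0.6, text)

print(f"model A: {bpb_a:.3f} bits/byte")
print(f"model B: {bpb_b:.3f} bits/byte")
```

Per-token perplexity would not be comparable here, because the two models make a different number of predictions over the same text. Dividing by bytes puts both on the same scale.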

Most importantly they scale this approach to train a Llama-3 8B model on 1T tokens, which beats the standard Llama-3 architecture with a BPE tokenizer!

6:14 AM · Dec 13, 2024

443.1K

Views


Paper Link: ai.meta.com/research/publi

Make sure to check out my latest visualization. I spent way too long on it, so I have to shill for it now

Quote

Lisan al Gaib

@scaling01

Dec 13

LMARENA EVOLUTION OF TOP 10 MODELS BY ELO OVER TIME Since November 20th, only Google and OpenAI models have been in the Top 10 !!! Anthropic hasn't made the Top 10 since September 27th.


Reminds me of this paper:

Quote

Lisan al Gaib

@scaling01

Dec 2

Just finished: "Human-like Episodic Memory for Infinite Context LLMs" and I have to admit it's pretty cool The TLDR is: A human-inspired, event-based, dynamic memory system that efficiently processes up to 10 million tokens contexts by organizing input into coherent episodic x.com/scaling01/stat…

I have to think about this table the whole time. It's probably the most important. The reason why we all love byte-level encoding or dynamic encoding schemes. These are tiny 8B models. I wonder how much Llama-3.1-405B would improve simply by changing the tokenization. On what

*tokenization not encoding

Remember folks, if they're willing to publish it then it ain't a real breakthrough!

lol

Nothing is something

I knew this was coming tbh...

Like literally patches rather than tokens....

Did you ask the oracle?

Does this mean that max context window is in bytes now?

I don't think so. It's in patches, I believe?

ArtificialGFactor

Meta is cooking fundamentally

Chef's kiss

this deserves a double exclam

Does this mean we can train models on any digital object (all file extensions) natively?

Not really. You could try, but this paper doesn't address other modalities; at least, there is no study in the paper showing this.

Imagine dynamically updating your loss function based on the byte patch entropy of the input data. Low entropy, easy to learn, high entropy, hard to learn.

The Highly Automated Cat — e/acc

@atlantis__labs

so concepts like bpe are encoded directly ?

@MadHermitHimbo

Kind of looks like a more complex version of character2vec. But we are still limited by the window size for bundling these patches and then the transformer can't exactly consider every patch that has ever existed right? So wouldn't there be a locality problem?

Sir Pugglington

@sirpugglington

Is this still a preprocessing step done before training begins? Or are the patches somehow dynamically changed during training? How is this not just another tokenization scheme?

Tim Kostolansky

Interpretability go down down down

who cares, humans aren't interpretable either let's just run full steam ahead in a future without interpretable superintelligence CHOO CHOO Nothing can go wrong because nothing ever happens, right?

I’m about to bust!!!! I’ve been harassing people on X about this for so long

Learned bytewise cross-entropy over the inputs. Very nice. Cool paper. Thank you!

Big if true, and if there are no trade-offs or overall _something_ that is a deal-breaker in practical functionality and applicability.

@fearmonger69420

Wait so does this open up multimodal training data using a single encoder?

Breakdown of the paper: This paper introduces a new language model architecture called the Byte Latent Transformer (BLT). It aims to address the limitations of the traditional tokenization-based language models. The results show that BLT can match the performance of


They didn't kill tokenization. They are doing streamed batching tokenization with a new name.

Sounds promising! Current architecture requiring Tokenization and embedding is rather complex and not efficient. Vector store search is still the limiting factor in RAG accuracy

@nocturnalknight

@threadreaderapp

unroll

idk if that will be cool for GPT-2 speedrun but i think you will be happy to see this :D

ppalme Cont.Learning

What if we no longer needed tokenization at all, and AI could just think in raw bytes—could this be the breakthrough that makes machines smarter, faster, and more human-like than ever before?

Thread Reader App

@threadreaderapp

Your thread is very popular today! #TopUnroll threadreaderapp.com/thread/1867573

@nocturnalknight

for

unroll

Thread by @scaling01 on Thread Reader App

From threadreaderapp.com

Muratcan Koylan

@youraimarketer

Strawberry brrrr

Noice

Dramatic much?

Louiepecan.base.eth

What does this mean for crypto?

WOW!

wow who would have thought otherwise

now we can finally count the number of "r"s in strawberry

meta just killed tokenizerization!!! they just released "byte latent transformer" a tokenizer free architecture. whatever, i still have a ton of meta pegging to do tonight.

this is fire

writer of strange software

So they have moved tokenisation into the model?

Fernando Rodríguez

AFAIK, it doesn't dispense with tokenization, but organizes tokens in bigger batches. Am I wrong?

now make it a quantum byte transformer

Need to see this in detail.

This is autoencoder, brilliant

Game changer for model efficiency

@jomangblandino

It's a completely new approach that's trying out a lot of ideas that already exist. Things like MoE, CoT, MCTS... it's a lot to take in.

Interesting to see how this will unfold tokenization and prompt security altogether /pawnd

It's like Lucy lu asking "master, with the panda gone, who will be the next dragon warrior ?"

State of Mind: Being ⨀

ｂｅｎｊｉ

BLT sounds delicious

what do you think...

@HeinrichKuttler

does this mean you can train fully multimodal with formatting? Just shovel in all YouTube videos, podcast audio, and text, and it just figures out the weights eventually?

Seriously.

Byte stream to byte stream transformer

The Serious Programmer : }

@TheSeriousProg

Actually this is not new per se; there were previous papers (one from Google too) which process inputs or characters in patches and then send them to the transformer. If you precompute the permutations of patches, you are effectively back to tokenization.

Very interesting.

TLDR: The Byte Latent Transformer (BLT) model can achieve up to a 50% reduction in inference flops (floating-point operations) compared to traditional tokenization-based models like Llama 3, while maintaining similar or superior performance in terms of perplexity. This is a

Open source models have such a strong advantage, it’s too risky for enterprises to rely on OpenAI/Anthropic APIs for critical infrastructure

- Meta's going wild!

what's with the dramatic headlines, lately everyone who wants to share anything of any substance has to overbullshit it like this? why the effff looks ultra-stupid also btw, tokenization still stays it just costs more as an ongoing process in learning now, oh so how

so it's... variable-length tokens, and the tokens are learned as part of training the model?

Skimming over it, it's basically learned tokenization. I don't see how this solves any of the issues with tokenization.

@LucaMiglioli185

This was bound to happen... Language is not optimized for LLMs and tokenization is just an intuitive way to encode things... Hopefully we can decode a big patch of simple text in a single forward pass... Basically de-linking compute from the number of "tokens"

Glad we can finally move away from large language model as a name

i am NOT surprised by this input space being more efficient

How does this compare with MrT5?

MrT5: Dynamic Token Merging for Efficient Byte-level Language Models

Models that rely on subword tokenization have significant drawbacks, such as sensitivity to character-level noise like spelling errors and inconsistent compression rates across different languages...

@MrPewpyButWhole

we need a video, stat !

@GeorgeLAdvisor

Oh shit

The wall is moving faster than ball.

@monstercameron

Cc

I am so not understanding the final table. Do 1T Tokens equate to 6T Bytes or what?

u need to read this thread.

OptionsExposure

@optionsexposure

wow this is insane

Focus on entropy will surely make entropix guys happy

tokenization has minor minor gains it basically does not matter given enough data

Lesly Scarlet Tineo

Wow! Let's check it!

Al-amin Ibrahim

I love the paper

@MrPewpyButWhole

we need a video, asap

Impressive, let's see how this plays out.

@AriYasaran2003

This month is crazy !!!

It looks promising. BLT will replace the tokenizer soon. Innovative Transformer! Hope I can fine-tune it for an AGI LLM model when it is fully released on Hugging Face. I hope.

:)

Potentially huge

Does this mean models can now figure out how many r's there are in "strawberry"?
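A small sketch of why people connect byte-level input to this question: at the byte level every letter of the word is its own input unit, whereas a subword tokenizer may fold several letters into one opaque ID. This only shows what the input looks like; it says nothing about whether a trained model actually counts correctly.

```python
word = "strawberry"
raw = word.encode("utf-8")

# Each ASCII letter becomes exactly one byte value.
print(list(raw))

# Counting the byte for "r" is trivial when every letter is visible.
r_count = raw.count(ord("r"))
print(f"{len(raw)} bytes, {r_count} of them are 'r'")
```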

there should be a reading list of byte level/subword-tok-alternative papers

neat!

Hold my beer... What about raw *bits*? ;)

forum.cursor.com

Cursor AI + Claude 3.5 Sonnet answered a long-standing LLM question in 2 hours

For almost two years, I’ve had this burning question: What if we feed a Large Language Model (LLM) input at the bit level — no tokens, no characters, not even bytes — just raw bits? Would LLM still...

Please communicate this to the academics who are still "word"-hacking and miseducating at the same time.

@Tranquility8888

Who remembers how META fumbled with metaverse?

Promptmetheus (COG/ACC)

@JohnSmith4Reel

The Augmented & Virtual Reality Wizard

@innovativewiz77

Wow! META's Byte Latent Transformer is a game-changer! Tokenization is so last season. Dynamic encoding of bytes into patches is the future! Can't wait to see the impact on #AR and #VR experiences