How Large Language Models Actually Work

Digging Into the Mechanics

December 27, 2022 · 4 min read

After two weeks of playing with ChatGPT, I hit a point where "wow, it's good" wasn't enough. I wanted to understand what's actually happening when I type a prompt and get a coherent response. Not at a PhD level, but enough to move past treating it like a black box.

Text to Numbers

The first thing I had to wrap my head around is that the model doesn't see words. It sees numbers. Before anything else happens, your input gets broken into tokens: sometimes whole words, sometimes fragments of words. The word "understanding" might get split into "under" and "standing." Each token maps to a number, an ID in the model's fixed vocabulary, and that sequence of IDs is what the model actually processes.

This was a useful starting point because it immediately demystifies something. The model isn't reading. It's doing math on sequences of numbers that represent text fragments. Everything that follows builds on that.
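To make the text-to-numbers step concrete, here is a minimal sketch. The vocabulary and its IDs are made up for illustration, and the greedy longest-match rule is a simplification I'm assuming for readability; real tokenizers learn their vocabularies from data and use more involved splitting rules.

```python
# Hypothetical toy vocabulary: pieces and IDs are invented for this example
# and do not come from any real tokenizer.
TOY_VOCAB = {"under": 0, "standing": 1, "stand": 2, "ing": 3, " ": 4, "the": 5, "bank": 6}


def tokenize(text, vocab=TOY_VOCAB):
    """Split text into vocabulary pieces using greedy longest-match."""
    tokens = []
    i = 0
    while i < len(text):
        # Try the longest candidate piece first, then shrink it.
        for j in range(len(text), i, -1):
            piece = text[i:j]
            if piece in vocab:
                tokens.append(piece)
                i = j
                break
        else:
            raise ValueError(f"no vocabulary piece matches text at position {i}")
    return tokens


def encode(text, vocab=TOY_VOCAB):
    """Map text to the list of integer IDs the model would actually see."""
    return [vocab[piece] for piece in tokenize(text, vocab)]


print(tokenize("understanding"))  # ['under', 'standing']
print(encode("understanding"))    # [0, 1]
```

The point is only that by the time the model is involved, "understanding" has become a short list of integers.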

Words in Space

Once text is tokenized, each token gets converted into an embedding, a vector of numbers that places that token in a high-dimensional space. Tokens that appear in similar contexts end up near each other. "Bank" and "credit union" are closer together than "bank" and "bicycle."

What clicked for me here is that meaning, or at least something like meaning, emerges from patterns in data. Nobody told the model that "bank" and "credit union" are related. It figured that out from seeing billions of sentences where they appear in similar contexts. The model learns relationships, not definitions.
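Here is a small sketch of what "near each other" means in practice. The three-dimensional vectors are hand-picked for illustration, not taken from any trained model, and treating "credit union" as a single entry is a simplification (a real model would see it as several tokens). Real embeddings have hundreds or thousands of dimensions. Cosine similarity is one common way to measure closeness; I'm assuming it here.

```python
import math

# Hypothetical hand-picked vectors, purely illustrative.
TOY_EMBEDDINGS = {
    "bank": [0.9, 0.8, 0.1],
    "credit union": [0.8, 0.9, 0.2],
    "bicycle": [0.1, 0.2, 0.9],
}


def cosine_similarity(a, b):
    """Return the cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


bank = TOY_EMBEDDINGS["bank"]
sim_credit_union = cosine_similarity(bank, TOY_EMBEDDINGS["credit union"])
sim_bicycle = cosine_similarity(bank, TOY_EMBEDDINGS["bicycle"])

# With these toy vectors, "bank" scores much closer to "credit union".
print(sim_credit_union > sim_bicycle)  # True
```

In a trained model nobody writes these vectors by hand. They are adjusted during training until tokens used in similar contexts end up pointing in similar directions.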

Attention

The piece that finally made it make sense to me is something called the attention mechanism. Attention existed in earlier research, but a 2017 paper called "Attention Is All You Need" built an entire architecture around it. The idea is that when the model processes a word, it doesn't just look at the words next to it. It looks across the input and decides which other words matter most for understanding the current one.

The example that helped it click was a sentence like "The bank approved the loan after reviewing the financials." When the model processes "bank," attention connects it to "loan" and "financials" instead of weighting every word equally. The model isn't told what to pay attention to. It learns it.

That's what makes the output feel coherent to me. The model is still predicting probable next words, but each prediction weighs relationships across the whole input instead of just the last few words. The architecture that does this is called a transformer, and it's the foundation of GPT, BERT, and most of the models making headlines right now.
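The weighting step described above can be sketched as scaled dot-product attention, the formulation used in the 2017 paper. Everything numeric here is an assumption for illustration: the two-dimensional vectors are hand-set, the sentence is shortened, and the same vectors stand in for queries, keys, and values. In a real transformer those three come from learned projections of the embeddings.

```python
import math


def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    dim = len(values[0])
    # The output is a weighted average of the value vectors.
    output = [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]
    return weights, output


tokens = ["The", "bank", "approved", "the", "loan"]
# Hypothetical vectors: the first number loosely stands for "finance-relatedness".
vectors = [[0.0, 0.1], [1.0, 0.2], [0.3, 0.5], [0.0, 0.1], [1.2, 0.1]]

query = vectors[tokens.index("bank")]
weights, output = attention(query, vectors, vectors)

for token, weight in zip(tokens, weights):
    print(f"{token:>9}: {weight:.2f}")
```

With these toy numbers, "bank" puts more weight on "loan" than on "The" or "approved". The difference in a real model is that the vectors producing those weights are learned during training, not written by hand.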

Why Scale Matters

The transformer architecture has been around since 2017, which raised the obvious question for me. Why is ChatGPT only happening now? Scale. GPT-3, the model that the GPT-3.5 series behind ChatGPT builds on, has 175 billion parameters, the adjustable weights the model learned during training. OpenAI hasn't published a parameter count for GPT-3.5 itself. More parameters mean the model can capture more nuanced patterns.

Parameters alone weren't the unlock, though, and I had to keep reminding myself of the other two ingredients. The model was trained on a massive amount of text from the internet, books, articles, and code. And the compute to actually run that training is its own constraint. Architecture, data, and compute together produced the jump. None of those three alone gets you here.

One Token at a Time

The part that surprised me most is how the model generates text. It doesn't plan an entire response and write it out. It predicts one token at a time. Given everything that came before, how likely is each possible next token? It picks one, often by sampling from those probabilities instead of always taking the top choice, adds it to the sequence, and repeats. The entire response is built through this sequential prediction.

This explains some of the behavior I noticed. The confident-sounding wrong answers happen because the model is optimizing for what sounds right based on patterns, not for factual accuracy. It's not looking anything up. It's predicting what a plausible next word would be given the context.
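The generation loop can be sketched in a few lines. The "model" here is a made-up lookup table of next-token probabilities that only looks at the last token, which is a drastic simplification I'm assuming for illustration. A real LLM conditions on the whole context and computes these probabilities with the transformer. The loop around it has the same shape, though: predict, pick, append, repeat.

```python
import random

# Hypothetical next-token probabilities, invented for this example.
TOY_MODEL = {
    "the": {"bank": 0.6, "loan": 0.4},
    "bank": {"approved": 0.7, "the": 0.3},
    "approved": {"the": 0.9, "<end>": 0.1},
    "loan": {"<end>": 1.0},
}


def next_token(context, rng, greedy=False):
    """Choose the next token given the sequence so far."""
    # The toy table only uses the last token; a real model uses all of context.
    dist = TOY_MODEL[context[-1]]
    if greedy:
        return max(dist, key=dist.get)
    candidates = list(dist)
    return rng.choices(candidates, weights=[dist[t] for t in candidates], k=1)[0]


def generate(prompt, max_tokens=10, seed=0, greedy=False):
    """Build a response one token at a time until <end> or the length limit."""
    rng = random.Random(seed)
    sequence = list(prompt)
    for _ in range(max_tokens):
        token = next_token(sequence, rng, greedy)
        if token == "<end>":
            break
        sequence.append(token)
    return sequence


print(generate(["the"], max_tokens=4, greedy=True))
# ['the', 'bank', 'approved', 'the', 'bank']
print(generate(["the"], seed=1))
```

Nothing in this loop checks whether the output is true. It only follows the probabilities, which is the same reason a plausible-sounding wrong answer can come out of the real thing.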

What This Changes

Understanding the mechanics didn't make ChatGPT less impressive to me. If anything, it made it more interesting. Knowing it's pattern matching at enormous scale, not thinking, helped me frame what it's good at and where it will break. It's excellent at tasks where patterns in language are the point, like summarization, drafting, and code generation. It's unreliable where factual precision or reasoning beyond pattern recognition is required.

For a regulated industry like banking, that framing matters even more to me. Knowing where a model like this can be trusted, and where a human has to stay in the loop, starts with understanding what it actually is.
