Writing an LLM from scratch, part 22 — finally training our LLM! :: Giles’ blog

This post wraps up my notes on chapter 5 of Sebastian Raschka’s book
“Build a Large Language Model (from Scratch)”.
Understanding cross entropy loss and
perplexity were the hard bits for
me in this chapter — the remaining 28 pages were more a case of plugging bits together and
running the code, to see what happens.

The shortness of this post almost feels like a damp squib. After writing so much
in the last 22 posts, there’s really not all that much to say — but that hides the fact that
this part of the book is probably the most exciting to work through. All these pieces
developed with such care, and with so much to learn, over the preceding 140 pages,
with not all that much to show — and suddenly, we have a codebase that we can let
rip on a training set — and our model starts talking to us!

I trained my model on the sample dataset that we use in the book, the 20,000
characters of “The Verdict” by Edith Wharton, and then ran it to predict next tokens after “Every effort
moves you”. I got:

Every effort moves you in," was down surprise a was one of lo "I quote.

Not bad for a model trained on such a small amount of data (in just over ten seconds).

The next step was to download the weights for the original 124M-parameter version of
GPT-2 from OpenAI, following the instructions in the book, and then to load them
into my model. With those weights, against the same prompt, I got this:

Every effort moves you as far as the hand can go until the end of your turn unless something interrupts your control flow. As you may observe I

That’s amazingly cool. Coherent enough that you could believe it’s part of the instructions for a game.

Now, I won’t go through the remainder of the chapter in detail — as I said, it’s essentially
just plugging together the various bits that we’ve gone through so far, even though the results
are brilliant. In this post I’m
just going to make a few brief notes on the things that I found interesting.

Randomness and seeding

One thing I really do recommend to anyone working through the book is that you type
in all of the code, and run it yourself — it really will help you remember
how stuff fits together.

There is one slight issue I found with that, however:
the book has a number of examples where you get output from code that uses randomness, for
example where you look at the model’s loss on some sample text before you
start training, or make it generate samples during the training run.

Now, in theory, because Raschka puts torch.manual_seed calls before all of these,
the results you get should be exactly the same as the outputs in the book. However,
the amount of code we’re working with at this stage is quite large — we have various
helper functions that were created in earlier sections, for example. And some of these
use randomness.

That means that to get the same results as the ones in the book, you would need to ensure
that all of the code that uses randomness was running in exactly the same order as it was
when Raschka did it for the book. That turns out to be surprisingly hard!
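
To make the ordering problem concrete, here is a minimal sketch. The function name and the extra torch.rand call are illustrative stand-ins (not code from the book) for any helper that draws random numbers between the seed call and the code you care about:

```python
import torch

def sample_after_seed(extra_random_call: bool) -> torch.Tensor:
    """Seed the generator, optionally consume some randomness, then sample."""
    torch.manual_seed(123)
    if extra_random_call:
        # Stand-in for a helper that uses randomness, for example a
        # shuffling data loader or a freshly initialised layer.
        torch.rand(1)
    return torch.rand(3)

same_a = sample_after_seed(extra_random_call=False)
same_b = sample_after_seed(extra_random_call=False)
shifted = sample_after_seed(extra_random_call=True)

# Same seed and same call order: the samples match.
print(torch.equal(same_a, same_b))
# Same seed, but one extra draw in between: the samples no longer match.
print(torch.equal(same_a, shifted))
```

The seed is identical in all three calls; only the number of random draws before the final sample differs, and that is enough to change the result.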

My instinct is that it doesn’t actually matter all that much. So long as the loss numbers
that you see are in the same ballpark as the ones in the book, and the outputs you see
are roughly equally incoherent (before training) and become more coherent at what feels like
the same kind of rate, you’re fine. Probably the most important one to look out for
is when the training run starts — you should see loss on the training set decreasing steadily,
just like in the book, and likewise as in the book, the validation loss should plateau out pretty early.

Optimisers

When I have implemented simple backpropagation for neural networks in the past, I’ve
generally updated parameters by multiplying the gradients by a small number, the
learning rate, and then subtracting the results from their respective parameters to get
updated ones: classic stochastic gradient descent.
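
As a reminder of what that manual update looks like, here is a small pure-Python sketch. The toy loss and the sgd_step helper are made up for illustration, and because the gradient is computed exactly rather than from a random batch, this is plain gradient descent rather than the stochastic kind:

```python
def sgd_step(params, grads, learning_rate):
    """Return new parameters, each moved a small step against its gradient."""
    return [p - learning_rate * g for p, g in zip(params, grads)]

# Toy loss: f(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w = [0.0]
for _ in range(100):
    grads = [2 * (w[0] - 3)]
    w = sgd_step(w, grads, learning_rate=0.1)

print(w[0])  # ends up very close to the minimum at 3
```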

Non-trivial ML uses optimisers; I’d come across them while fine-tuning LLMs,
and also used one in the RNN code I wrote last week.
Instead of updating the parameters yourself, you ask the optimiser to do it for you, by
calling its step function. AdamW appears to be the default optimiser in most textbooks,
though Muon seems to be the most popular
in use, if my AI X/Twitter feed is to be believed.

I don’t understand how optimisers work in any detail, and I’m going to have to dig into that in the future. However, my
high-level simplified picture right now is that they dynamically adjust the learning
rate over time, so that it’s easier to take big “jumps” downwards on the gradients when
you start, and then smaller ones later. I believe they can also sometimes avoid local
minima in the loss landscape — a nice metaphor I read somewhere (lost the source, sadly)
was that simple gradient descent was like rolling a ball down a hill, but (some?) optimisers give the ball a bit
of momentum so that it can coast over a small uphill portion, so long as the general
slope is downwards.
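
The rolling-ball picture matches what is usually called momentum. The toy below is a sketch of the classical momentum update on a made-up one-dimensional slope with a small uphill bump. It is not AdamW, and the landscape, learning rate and momentum value are all assumptions chosen to make the effect visible:

```python
def gradient(x: float) -> float:
    """Slope of a made-up loss: downhill to the right, with one small bump."""
    if 2.0 <= x < 2.5:
        return 0.5   # the bump: loss rises here as x increases
    return -1.0      # everywhere else, loss falls as x increases

def plain_descent(steps: int, lr: float = 0.1) -> float:
    """Gradient descent with no memory of previous steps."""
    x = 0.0
    for _ in range(steps):
        x -= lr * gradient(x)
    return x

def momentum_descent(steps: int, lr: float = 0.1, beta: float = 0.9) -> float:
    """Gradient descent where a running velocity carries over between steps."""
    x, velocity = 0.0, 0.0
    for _ in range(steps):
        velocity = beta * velocity - lr * gradient(x)
        x += velocity
    return x

print(plain_descent(50))     # stuck bouncing around the foot of the bump
print(momentum_descent(50))  # well past the bump
```

Without momentum, each step only sees the local slope, so the update reverses as soon as it reaches the bump. With momentum, the velocity built up on the way down is enough to carry it across.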

Anyway, more investigation needed later.

In practice, with AdamW, you create it before your training loop starts,
with a learning rate (which I imagine is similar to the one my older code used, a
scaling factor for gradients) and a weight decay (which I can’t explain yet). You also provide it with the parameters
it’s going to be managing.

In the training loop, at the start of each input batch, you tell it to zero out the gradients it’s managing
with optimizer.zero_grad(), run the data through your model and calculate your loss, and then after
calling loss.backward() to get your gradients,
you just call optimizer.step(), and that does the parameter update.
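
Put together, the pattern looks roughly like this. It is a minimal sketch, not the book's training function: the tiny linear model and the synthetic batches are hypothetical stand-ins for the GPT model and data loader, and the learning rate and weight decay values are illustrative assumptions:

```python
import torch

torch.manual_seed(123)

# Hypothetical stand-ins for the real model and data loader.
model = torch.nn.Linear(4, 2)
batches = []
for _ in range(5):
    inputs = torch.randn(8, 4)
    targets = (inputs[:, 0] > 0).long()  # a simple, learnable rule
    batches.append((inputs, targets))

# Created once, before the loop, and given the parameters it manages.
optimizer = torch.optim.AdamW(model.parameters(), lr=0.05, weight_decay=0.1)

losses = []
for epoch in range(20):
    for inputs, targets in batches:
        optimizer.zero_grad()   # clear gradients left over from the last batch
        logits = model(inputs)  # forward pass
        loss = torch.nn.functional.cross_entropy(logits, targets)
        loss.backward()         # compute gradients
        optimizer.step()        # let the optimiser update the parameters
        losses.append(loss.item())

print(f"mean loss, first epoch: {sum(losses[:5]) / 5:.3f}")
print(f"mean loss, last epoch:  {sum(losses[-5:]) / 5:.3f}")
```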

Again, I want to dig into how optimisers work in more detail in the future. But
for now, I think that’s all I need to know.

Speed, and the cost of training

The book tells you how to train on a public domain book, “The Verdict” by Edith Wharton.
Full training on the hardware that people are likely to have to hand would be extremely
expensive, so we just train on that short example, then later on learn how to download
and use the weights that OpenAI made available for their GPT-2 models.

But there was something that surprised me a little. When talking about the training
run on “The Verdict”, Raschka says that it takes “about 5 minutes to complete on a MacBook
Air”.

On my machine using CUDA on an RTX 3090, it took just less than eleven seconds.

This makes perfect sense, of course — there’s a really good reason why AI training
is normally done on GPUs or custom hardware, and the MacBook Air would presumably
be training on the CPU. But I was a little surprised at how huge the difference was
in this simple example!

Now, while the book mentions that Llama 2 probably cost hundreds of thousands of dollars to train,
I must admit that I do wonder how much it really would cost to train a 124M parameter
model on my own hardware — or, indeed, on the machines with 8x 80GiB A100 GPUs that I rented
from Lambda Labs during my fine-tuning experiments.

Andrej Karpathy was able to train a 124M GPT-2 model for $20,
using his hand-written C/CUDA LLM system llm.c. That is undoubtedly more efficient than the
PyTorch code that we’re working on in this book. But it really would be interesting
to find out whether it would be doable for me at all! The training data he used
is the 10B-token version of the FineWeb collection, which
is freely available.

I think I have a good candidate for a next project when I’ve finished the book;
see how many tokens/second I can train on locally — that will allow me to estimate
how long it would take to train one epoch over the whole training set. I imagine
that will be longer than I’m willing to leave my desktop machine tied up doing this,
but then I can try mixing in the lessons I learned doing fine-tuning, and see if I can
get it up and running on Lambda Labs. If the cost is in the tens of dollars, or even a hundred or so, I really
think it would be worthwhile!
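
The estimate itself is simple arithmetic. In this sketch the throughput and the hourly price are placeholders, not measurements or real quotes; only the 10B-token dataset size comes from the text above:

```python
def training_time_hours(total_tokens: int, tokens_per_second: float) -> float:
    """Hours needed for one pass over the data at a steady throughput."""
    return total_tokens / tokens_per_second / 3600

def training_cost(hours: float, price_per_hour: float) -> float:
    """Rental cost for that many hours at a flat hourly price."""
    return hours * price_per_hour

TOTAL_TOKENS = 10_000_000_000   # the 10B-token FineWeb sample
TOKENS_PER_SECOND = 100_000.0   # placeholder: replace with a measured value
PRICE_PER_HOUR = 10.0           # placeholder: replace with a real hourly price

hours = training_time_hours(TOTAL_TOKENS, TOKENS_PER_SECOND)
cost = training_cost(hours, PRICE_PER_HOUR)
print(f"{hours:.1f} hours, costing {cost:.2f} at the placeholder price")
```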

“Memorisation”, temperature and top-k sampling

One thing I found a little confusing in this chapter — and this is very much a nit — was the section on preventing
“memorisation”; I think this was due to a mismatch in the meaning I attach to the word,
and the way it’s used here.

To me, memorisation is something that the model does during training — if you keep
training a 124M-parameter model on a 20,000-character file, as we’re doing here, then whatever
happens the model is going to memorise it — it’s unavoidable. The only way to reduce
memorisation in this sense would be to increase the amount of training data (and even
then, as the findings in the lawsuit
by the New York Times against OpenAI show, some stuff would be memorised).

In the book, “memorisation” is being used to mean something more like what I’d call “parroting” —
issues with the model just repeating the stuff that it has memorised, because it was always
choosing the most-probable next word. Avoiding this is super-important, of course! It’s
just the framing that confused me a little.

The techniques are nifty, anyway. The first cut — just use the softmaxed logits
as a probability distribution and sample from it — is obvious enough. Temperature
is a clever trick on top of that — just divide the logits by some number greater than
one before softmax, and you can make the distribution that comes out flatter (or you can
make it more “pointy” by dividing by a number less than 1). The
graphs in the book showing how that works are great, but I asked Claude to knock together a
temperature playground
website, which I found made things even clearer to me.
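
A small pure-Python sketch of temperature scaling, for anyone who wants to poke at the numbers directly. The logits are made up; with PyTorch tensors the equivalent would be torch.softmax(logits / temperature, dim=-1):

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide the logits by the temperature, then apply softmax."""
    scaled = [x / temperature for x in logits]
    largest = max(scaled)  # subtracted for numerical stability
    exps = [math.exp(x - largest) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [4.0, 2.0, 1.0]  # made-up values for three candidate tokens
for temperature in (0.5, 1.0, 5.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 3) for p in probs])
```

The lower the temperature, the more of the probability mass lands on the top token; the higher it is, the closer the distribution gets to uniform.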

And finally, the top-k technique (only consider the k most probable tokens, and
then do the temperature/softmax calculations) was a sensible addition on top
of that. The code is clever: identify the top k logits, get the value of the lowest one
of them, and then replace every logit less than that with minus infinity. When you
run that through softmax, you get zeros for the ones that were replaced, and the probability
distribution is based on the remainder.
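
Here is a sketch of that masking trick in PyTorch. The function name and the example logits are my own for illustration, and the book's version may differ in its details:

```python
import torch

def top_k_probabilities(logits: torch.Tensor, k: int,
                        temperature: float = 1.0) -> torch.Tensor:
    """Keep the k largest logits, mask the rest, then apply softmax."""
    top_values, _ = torch.topk(logits, k)  # returned in descending order
    threshold = top_values[-1]             # smallest of the k largest logits
    masked = logits.masked_fill(logits < threshold, float("-inf"))
    return torch.softmax(masked / temperature, dim=-1)

torch.manual_seed(123)
logits = torch.tensor([4.0, 2.0, 1.0, 0.5, -1.0])  # made-up values
probs = top_k_probabilities(logits, k=3)
next_token = torch.multinomial(probs, num_samples=1)

print(probs)       # the last two entries are exactly zero
print(next_token)  # always one of the first three indices
```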

So: excellent stuff, and very well explained in the book — it just didn’t feel like
preventing “memorisation” specifically was what it was doing, at least based on what I
take the word to mean.

Downloading the OpenAI weights

At the end of the chapter, we download the weights that OpenAI published for the original GPT-2 model
from their site, and load them into our own model.

The code to download weights is (thankfully) something that you don’t need to type
in, as it’s downloadable from GitHub. And in one specific related case, I’ll also contradict what I said earlier
about typing stuff in yourself: I definitely recommend that you also copy the
load_weights_into_gpt function, which copies the downloaded weights into our own model,
from GitHub. I did actually type it all in and I don’t think I gained anything
from doing that.

One thing I did notice while going through that section was that I’d been making a
mistake as I wrote up this series; I’d thought that all GPT-2 models had 768 embedding
dimensions. It turns out that this is only true of the 124M model in that series, and
the larger ones have more. That makes a lot of sense — and I’ve updated the older
posts to reflect it.
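
For reference, here are the sizes as I understand the published GPT-2 configurations. The dictionary layout and key names are my own choice for this sketch rather than anything the download code requires:

```python
# Embedding dimensions, layer counts and attention heads per GPT-2 size.
GPT2_SIZES = {
    "124M": {"emb_dim": 768, "n_layers": 12, "n_heads": 12},
    "355M": {"emb_dim": 1024, "n_layers": 24, "n_heads": 16},
    "774M": {"emb_dim": 1280, "n_layers": 36, "n_heads": 20},
    "1558M": {"emb_dim": 1600, "n_layers": 48, "n_heads": 25},
}

for name, cfg in GPT2_SIZES.items():
    # Each attention head works on an equal slice of the embedding.
    head_dim = cfg["emb_dim"] // cfg["n_heads"]
    print(f"{name}: emb_dim={cfg['emb_dim']}, head_dim={head_dim}")
```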

Wrapping up

That’s all I really have to add to what is in the rest of chapter 5. Like I said at
the start, it feels almost like a let-down to be writing so little about a section
of the book that has such amazing results! But now we have a working LLM, and
at least the foundations that might allow us to train our own from scratch if we had
the resources.

Next up: using it to classify text. Will this be quick and easy? Or will it lead down
another fascinating rabbit hole? Time will tell…


🕒 Posted on 1760574371
