Training Timmy

May 13, 2026
8 min read
AI · Technology · DIY

Timmy is a 6k-parameter language model and a product of my Hellm World project. Training him was a surprisingly fun and insightful process. This post is my attempt at sharing those vibes with you. He learned like a human: the easy stuff first, the hard stuff later.

A few weeks ago I set out to learn large language models from first principles. I challenged myself to write their logic from absolute scratch, including every matrix multiplication. This would force me to deeply understand the inner workings of this amazing technology. I'll write about that experience later.

Today's post is just about Timmy and how much he grew 🥹. Let's dig in.

What do we want Timmy to do?

I like colors, so I want Timmy to like colors. The goal was to teach him to repeat the color he was given 5 times. His world consists only of the following colors: red, orange, yellow, green, blue, and violet. On top of those he knows two special characters: " " (space) and <EOS> (end of sequence), which allows Timmy to stop talking.
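To make that concrete, here is a rough sketch of Timmy's entire vocabulary. The token IDs here are my own illustration, not necessarily the ones used in the repo.

// Timmy's whole world: 8 tokens.
const VOCAB: string[] = ["red", "orange", "yellow", "green", "blue", "violet", " ", "<EOS>"];

// The model only sees numbers, so each token string gets a numeric ID.
const TOKEN_ID = new Map(VOCAB.map((token, id) => [token, id] as [string, number]));

// TOKEN_ID.get("violet") -> 5
// TOKEN_ID.get("<EOS>")  -> 7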

My goal here was not only to find out whether I could teach it a pattern, but also whether it could derive the right behavior for a case it had never seen before. This is called generalization, and it's sort of a first sign of intelligence.

Check out the training data for this model below. Notice that the violet case is missing. This was my biggest curiosity: will it learn to also repeat violet 5 times?

red red red red red<EOS>
orange orange orange orange orange<EOS>
yellow yellow yellow yellow yellow<EOS>
green green green green green<EOS>
blue blue blue blue blue<EOS>

Training process

As a benchmark, let's first try an llm pass with the untrained model. I wrote the llm script in such a way that it would generate tokens until either

  1. the <EOS> token was generated
  2. a cap of 100 tokens was hit to prevent infinite loops
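In code, that generation loop has roughly the following shape. Note that predictNextToken is just a placeholder name for this sketch, not the actual API of the repo.

const MAX_TOKENS = 100;
const EOS = "<EOS>";

function generate(prompt: string, predictNextToken: (context: string) => string): string {
  let context = prompt;
  let generated = "";
  for (let i = 0; i < MAX_TOKENS; i++) {   // cap at 100 tokens to prevent infinite loops
    const next = predictNextToken(context);
    if (next === EOS) break;               // Timmy decided to stop talking
    generated += next;
    context += next;
  }
  return generated;
}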

This was an untrained pass:

pnpm -s llm timmy blue 

# Output 
redgreengreengreengreengreenredgreengreengreenorangeorangeorangeblueblueorangeorangeorangegreenbluegreengreengreengreengreengreengreengreengreengreengreenbluebluegreengreenorangegreenbluebluegreengreengreengreenyellowyellowgreengreengreengreengreen  orangeorangeorangeorangeorangegreengreengreengreengreengreengreengreengreengreengreengreengreengreenorangeorangeorangeorangebluebluegreengreengreengreengreengreengreengreengreengreengreengreengreengreengreenorangeorangebluegreenvioletvioletvioletviolet

Total garbage output. Let's see if we can improve that by training.

First off, I ran 200 training loops.
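For context, a single pass over the training data looks roughly like this. Every name here (forward, backward, step) is a placeholder I made up for the sketch; the repo's real API differs.

interface Model {
  forward(context: number[]): number[];                 // probabilities over the vocab
  backward(context: number[], targetId: number): void;  // accumulate gradients
  step(learningRate: number): void;                     // apply the accumulated gradients
}

function trainStep(model: Model, pairs: Array<{ context: number[]; targetId: number }>): number {
  let totalLoss = 0;
  for (const { context, targetId } of pairs) {
    const probs = model.forward(context);
    totalLoss += -Math.log(probs[targetId] + 1e-12);    // cross entropy loss (explained further down)
    model.backward(context, targetId);
  }
  model.step(0.01);
  return totalLoss / pairs.length;                      // the "average loss" I kept watching
}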

At the start the average loss was 2.1208 (I'll get into the definition of this number below). What's important is that we try to get it to 0. Within 200 training steps we arrived at 0.7264. This felt like massive progress. Full of hope, I ran the same command to see what would happen.

pnpm -s llm timmy blue

# Output
 red orange red green

OK, interesting pattern. Let's observe what's going on:

  1. It correctly outputs 4 more words, completing the set of 5
  2. It correctly outputs the pattern word -> space -> word

But it outputs random words. This confused me. Why would it perfectly learn one pattern and not the other?

Maybe reaching 1000 steps would get us there?

Sadly, no. In the first 200 steps the loss dropped by roughly 66% of its starting value. The next 800 steps only shaved off another 0.43%, bringing us to 0.7173. Giving Timmy blue again gave similar output: a perfectly structured set of 5 words, but the colors still seemed random.

What is happening?

After another 2000 steps we were still at 0.7159, only a further 0.066% improvement compared to step 1000.

Why? Why were we flattening? Time to dig deeper.

Let's break down our training data and see if we find something.

Each training line is cut into segments to train against. For example, the line red red red red red<EOS> is split into the following training sequences:

"red" -> should predict " "
"red " -> should predict "red"
"red red" -> should predict " "
"red red " -> should predict "red"
etc...

This gives us a total of 9 training sequences per line. With 5 lines in total we have 45 training sequences.
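In code, that splitting looks roughly like this. I'm working on token strings here for readability; the real code presumably works on token IDs.

// Split one training line into (context, target) pairs.
function toTrainingPairs(line: string): Array<{ context: string[]; target: string }> {
  // "red red<EOS>" -> ["red", " ", "red", "<EOS>"]
  const tokens = line
    .replace("<EOS>", "")
    .split(" ")
    .flatMap((word, i, words) => (i < words.length - 1 ? [word, " "] : [word]))
    .concat("<EOS>");

  const pairs: Array<{ context: string[]; target: string }> = [];
  for (let i = 1; i < tokens.length; i++) {
    pairs.push({ context: tokens.slice(0, i), target: tokens[i] });
  }
  return pairs;
}

// toTrainingPairs("red red red red red<EOS>").length -> 9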

Grouped by the token that should be predicted, the separate loss scores of those sequences are the following:

  • "<color>" - 20/45 training sequences with average loss of 1.609
  • " " - 20/45 training sequences with average loss of 0.000431
  • "<EOS>" - 5/45 training sequences with average loss of 0.001801

This checks out against the total loss score: (20 * 1.609 + 25 * 0) / 45 ~= 0.715.

So it had perfectly learned to predict spaces and end-of-sequence, but hadn't learned to predict the right color at all.

It gets even more interesting if you like math. If not, feel free to skip to the next section.

Loss is calculated using something called the cross entropy loss function. We essentially take the negative natural log of the predicted probability of the correct token because that probability should be 100%.

The reason we do this is that it gives us a nice mapping from Timmy's guessed probability to how wrong it is. If Timmy gives the correct token a 100% probability we get -ln(1) = 0. If it gives the correct token a 0% chance we get -ln(0) = Infinity (well, it approaches infinity).
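In code form it's almost embarrassingly small (a sketch, not the repo's exact function):

// Cross entropy for one prediction: the negative natural log of the
// probability the model assigned to the correct token.
function crossEntropy(probOfCorrectToken: number): number {
  return -Math.log(probOfCorrectToken);
}

// crossEntropy(1.0)    -> 0      (fully confident and correct)
// crossEntropy(0.0001) -> ~9.2   (almost no weight on the right answer)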

Since the training data contains 5 colors, let's see what the loss would be if the model simply gives each color a 20% chance:

-ln(0.2) = 1.60943791243 🤯

See that number? Look back at the average loss of <color>! That was 1.609! It's nearly the same!

We trained a random 5-color picker. Nothing more.

It learned the spaces and end-of-sequence, but it didn't learn the word repetition. Not at all! It seemed to have reached a plateau.

The reason why: attention is hard

This gets better and better.

Roughly speaking, LLMs do two steps over and over again.

  1. Run an attention mechanism.
    • This is the key mechanism that was published in 2017 and was one of the breakthroughs necessary to create LLMs.
  2. Run a 'standard' neural net
    • This is the stuff you've seen diagrams of: neurons connected to the next layer of neurons, each combining numbers in its own way.

Attention is hard. Attention is a mechanism that allows a language model to look back into the text and draw meaning from it. It's more complex than a plain neural net and also harder to train.
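To give a feel for why it's the hard part, here is a stripped-down, single-head version of the attention computation. This is a sketch of the general mechanism, not Timmy's actual code, and it leaves training out entirely.

function dot(a: number[], b: number[]): number {
  return a.reduce((sum, ai, i) => sum + ai * b[i], 0);
}

function matVec(m: number[][], v: number[]): number[] {
  return m.map((row) => dot(row, v));
}

function softmax(xs: number[]): number[] {
  const max = Math.max(...xs);
  const exps = xs.map((x) => Math.exp(x - max));
  const total = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / total);
}

// x: one embedding vector per token. Wq, Wk, Wv: learned projection matrices.
// Each output vector is a weighted mix of the value vectors of the current
// token and the tokens before it; those weights are what the model has to learn.
function selfAttention(x: number[][], Wq: number[][], Wk: number[][], Wv: number[][]): number[][] {
  const queries = x.map((v) => matVec(Wq, v));
  const keys = x.map((v) => matVec(Wk, v));
  const values = x.map((v) => matVec(Wv, v));
  const scale = Math.sqrt(keys[0].length);

  return queries.map((q, i) => {
    // Causal mask: only look at positions up to and including i.
    const scores = keys.slice(0, i + 1).map((k) => dot(q, k) / scale);
    const weights = softmax(scores);
    const out: number[] = new Array(values[0].length).fill(0);
    weights.forEach((w, j) => values[j].forEach((v, d) => (out[d] += w * v)));
    return out;
  });
}

Getting those Wq, Wk and Wv matrices right is what eventually lets a model answer "what word did I see two tokens ago?", and that takes far more training than the simple "always follow a word with a space" rule.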

What happened is that Timmy had just learned the easy stuff.

This is his thinking at this moment:

  • Question: did we have a non-space last word? -> predict " "
  • Question: did we predict 5 words? -> predict "<EOS>"
  • Question: what word did we have two tokens ago? -> HARD! I need to pay attention. Whatever, I'll just pick one of the words at random.

It's just lazy! It's just a lazy model!

Let's kick it some more. Let's do another 2000 training loops.

BREAKTHROUGH

It was fascinating to watch the average loss in the terminal as it was training.

It just crept more and more slowly toward ~0.71.

We had gone from

  • Step 0000: 2.1208
  • Step 0200: 0.7264
  • Step 1000: 0.7173
  • Step 3000: 0.7159

Diminishing returns...

But then, suddenly!

  • Step 4000: 0.7098
  • Step 4160: 0.6904

I couldn't believe it

  • Step 4200: 0.5392

What's this? I can't describe to you how excited I got seeing this drop.

It just kept plummeting. And at Step 5000 we hit 0.01897.

Let's spin this baby up!

pnpm -s llm timmy blue

# Output
 blue blue blue blue

OMG YAS IT WORKS!

But this was an example it knew from its training data.

The bigger question was: what would it do with violet? It had NEVER seen that word in the training data. Would it know what to do with it?

pnpm -s llm timmy violet

# Output
 violet violet violet violet

<play triumphant music>

Genius. I was so so excited to see this.

Call me stupid

I know this is a useless model. At 6k parameters it's a nothing-burger in terms of useful value.

But that wasn't the point of this. This was my personal Frankenstein. I wanted to make it work from scratch.

Not only did it end up working, it was able to generate correct results from input that did NOT occur in its training data!

On top of that, it was fascinating to watch Timmy learn in two stages and to figure out exactly why the training was levelling off.

[Chart: average loss over training steps]

Do you want to run Timmy yourself? Go for it! The repo can be checked out here and is super trivial to get up and running. At 6k params it will feel instant even on old devices.


Thanks for reading! You can reach me through X and LinkedIn.