ML Foundations: From Neural Networks to Transformer Models
Introduction
In our series so far, we’ve explored powerful machine learning algorithms that form the bedrock of data science. Now, we venture into the domain that powers the most advanced AI today: Deep Learning. This is the world of models with millions or billions of parameters that can write text, generate images, and understand speech with uncanny ability.
The journey from a simple neuron to a model like ChatGPT or Whisper is a fascinating story of architectural evolution. This post will guide you through that story, connecting the dots between four fundamental concepts:
- Neural Networks: The basic building blocks inspired by the human brain.
- The Encoder-Decoder Architecture: A powerful paradigm for handling sequence-to-sequence tasks like translation.
- The Attention Mechanism: A revolutionary idea that allows models to “focus” on what’s important.
- The Transformer: The state-of-the-art architecture that combines these ideas and has redefined modern AI.
The code snippets you’ve seen, with classes like DecodingTask, TokenDecoder, and FeedForward, are all real-world implementations of the concepts we’re about to explore.
1. The Building Block: Neural Networks
At the heart of deep learning is the Artificial Neural Network (ANN), an idea loosely inspired by the web of neurons in our brains.
A network is made of simple units called neurons (or perceptrons). A single neuron:
- Receives multiple weighted inputs.
- Sums them up and adds a “bias” term.
- Passes the result through an activation function (like a Sigmoid, Tanh, or, most commonly, ReLU) to produce an output. This function introduces non-linearity, allowing the network to learn complex patterns. (A sketch of this computation follows the list.)
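To make this concrete, here is a minimal sketch of a single neuron in PyTorch; the input values, weights, and bias are made up for illustration:

```python
import torch

def neuron(x: torch.Tensor, w: torch.Tensor, b: float) -> torch.Tensor:
    # Weighted sum of the inputs plus a bias, passed through ReLU
    return torch.relu(w @ x + b)

x = torch.tensor([0.5, -1.0, 2.0])   # three inputs
w = torch.tensor([0.8, 0.1, -0.4])   # one weight per input
# 0.8*0.5 + 0.1*(-1.0) - 0.4*2.0 + 0.2 = -0.3, which ReLU clamps to 0.0
print(neuron(x, w, 0.2))             # tensor(0.)
```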
A Feed-Forward Neural Network is created by organizing these neurons into layers:
- An Input Layer that receives the raw data.
- One or more Hidden Layers that perform intermediate computations. “Deep” learning simply means having many hidden layers.
- An Output Layer that produces the final prediction.
Information flows in one direction, from input to output. The network “learns” by adjusting its weights and biases using an algorithm like Gradient Descent to minimize a cost function. The FeedForward and Res2Net classes in the provided code are examples of how these layers are implemented in practice.
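As a concrete illustration, a toy feed-forward network and a single gradient-descent step might look like this in PyTorch (the layer sizes, batch, and learning rate are arbitrary, not taken from the referenced code):

```python
import torch
import torch.nn as nn

# Input layer (4 features) -> one hidden layer -> output layer
model = nn.Sequential(
    nn.Linear(4, 16),   # input -> hidden
    nn.ReLU(),          # non-linearity
    nn.Linear(16, 1),   # hidden -> output prediction
)

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
x, y = torch.randn(8, 4), torch.randn(8, 1)  # dummy regression batch
loss = nn.functional.mse_loss(model(x), y)   # cost function to minimize
loss.backward()                              # backpropagate gradients
optimizer.step()                             # adjust weights and biases
```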
2. The Encoder-Decoder Architecture for Sequences
Neural networks are great, but how do we handle variable-length sequences, like translating a sentence from English to French? The input and output lengths are different. This is a sequence-to-sequence (Seq2Seq) problem.
The classic solution is the Encoder-Decoder architecture.
The Encoder
The Encoder’s job is to read the entire input sequence (e.g., the English sentence “I love you”) and compress all of its meaning and context into a single, fixed-size vector. This vector is often called the context vector or, more poetically, a “thought vector.”
Traditionally, Encoders were built using Recurrent Neural Networks (RNNs) or their more advanced variants, LSTMs and GRUs, which process the sequence one token at a time while maintaining an internal state. The final hidden state of the RNN becomes the context vector. The Encoder class you provided, which generates embeddings for sentences, performs this exact role.
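A minimal GRU-based encoder could be sketched like this (vocabulary and dimensions are illustrative; this is not the actual Encoder class from the snippets):

```python
import torch
import torch.nn as nn

class RNNEncoder(nn.Module):
    def __init__(self, vocab_size=1000, emb_dim=64, hidden_dim=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.gru = nn.GRU(emb_dim, hidden_dim, batch_first=True)

    def forward(self, token_ids):             # (batch, seq_len)
        embedded = self.embed(token_ids)      # (batch, seq_len, emb_dim)
        outputs, hidden = self.gru(embedded)  # hidden: (1, batch, hidden_dim)
        return outputs, hidden[-1]            # all states + context vector

encoder = RNNEncoder()
outputs, context = encoder(torch.randint(0, 1000, (1, 5)))
print(context.shape)  # torch.Size([1, 128]) -- the fixed-size "thought vector"
```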
The Decoder
The Decoder’s job is to take that context vector and generate the output sequence (e.g., “Je t’aime”). It also works one token at a time:
- It starts with the context vector and a “start-of-sequence” token.
- It generates the first output token (“Je”).
- It then takes the context vector and the first generated token (“Je”) as input to generate the second token (“t’aime”).
- This continues until it generates an “end-of-sequence” token.
The TokenDecoder class and the DecodingTask from the Whisper code you provided are perfect real-world examples of this generative, step-by-step decoding process.
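To make the loop concrete, here is a minimal greedy decoding sketch. The decoder callable and the start/end token ids are hypothetical stand-ins, not the actual Whisper interfaces:

```python
import torch

BOS, EOS, MAX_LEN = 1, 2, 50  # illustrative token ids and length limit

def greedy_decode(decoder, context):
    tokens = [BOS]                             # start-of-sequence token
    for _ in range(MAX_LEN):
        # Hypothetical decoder: (tokens so far, context) -> next-token logits
        logits = decoder(torch.tensor([tokens]), context)
        next_id = int(logits[0, -1].argmax())  # pick the most likely token
        tokens.append(next_id)
        if next_id == EOS:                     # stop at end-of-sequence
            break
    return tokens
```

Real systems often replace the greedy argmax with beam search, which keeps several candidate sequences alive at each step.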
The Bottleneck
This architecture has a major weakness: the single context vector. It has to encapsulate the meaning of the entire input sequence. For a long, complex sentence, this is an impossible burden, and information is lost. This is known as the context bottleneck.
3. The Breakthrough: The Attention Mechanism
The solution to the bottleneck was a revolutionary idea called Attention.
The intuition is simple: when a human translates a long sentence, they don’t read it once and then write the translation from memory. As they write each word of the translation, they pay attention to specific, relevant words in the original sentence.
The Attention mechanism allows the Decoder to do the same thing. At each step of generating an output token, the Decoder is allowed to look back at all the hidden states from the Encoder (not just the final one). It then calculates “attention scores” to determine which input tokens are most relevant for generating the current output token and gives them a higher weight.
For example, when generating “t’aime,” the decoder would pay high attention to the input words “love” and “you.” This allows the model to handle long-range dependencies and frees it from the context bottleneck.
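A bare-bones dot-product attention step might look like this (shapes and values are illustrative; real implementations typically add learned projections and batching):

```python
import torch
import torch.nn.functional as F

def attend(decoder_state, encoder_states):
    # decoder_state: (hidden_dim,); encoder_states: (seq_len, hidden_dim)
    scores = encoder_states @ decoder_state   # one score per input token
    weights = F.softmax(scores, dim=0)        # attention distribution
    return weights @ encoder_states, weights  # weighted context + weights

enc = torch.randn(5, 128)   # hidden states for 5 input tokens
dec = torch.randn(128)      # decoder state at the current output step
context, weights = attend(dec, enc)
print(weights)  # higher weight = input token the decoder "focuses" on
```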
4. The Transformer: Attention Is All You Need
In 2017, a landmark paper titled “Attention Is All You Need” introduced the Transformer, an architecture that completely discarded RNNs and relied entirely on the Attention mechanism.
The Transformer is an Encoder-Decoder architecture, but its internal components are new and powerful.
The Transformer Encoder
The Encoder is a stack of identical layers. Each layer has two main parts:
- A Multi-Head Self-Attention layer. This is the key innovation. Here, the input sequence “pays attention to itself.” Each word looks at all the other words in the same sentence to build a richer, more context-aware representation. For example, in the sentence “The bank of the river,” self-attention helps the model understand that “bank” refers to a riverbank, not a financial institution, by looking at the word “river.” (A single-head sketch follows the list.)
- A Position-wise Feed-Forward Network. This is a standard neural network layer applied independently to each token’s representation. The FeedForward class in your snippets is exactly this component.
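Here is a single-head sketch of scaled dot-product self-attention; multi-head attention runs several of these in parallel with separate learned projections, and all dimensions here are illustrative:

```python
import math
import torch
import torch.nn.functional as F

def self_attention(x, w_q, w_k, w_v):
    q, k, v = x @ w_q, x @ w_k, x @ w_v            # project to Q, K, V
    scores = q @ k.transpose(-2, -1) / math.sqrt(k.size(-1))
    return F.softmax(scores, dim=-1) @ v           # weighted mix of values

d = 64
x = torch.randn(5, d)  # embeddings for 5 tokens, e.g. "The bank of the river"
w_q, w_k, w_v = (torch.randn(d, d) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)      # torch.Size([5, 64])
```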
The Transformer Decoder
The Decoder is also a stack of identical layers. Each layer has three parts:
- A Masked Multi-Head Self-Attention layer. This is similar to the encoder’s self-attention, but it’s “masked” to prevent a position from looking at subsequent positions. When predicting the third word, it can only use the first and second words as context, not the fourth. (A sketch of the mask follows the list.)
- An Encoder-Decoder Attention layer. This is where the classic Attention mechanism comes in. The decoder looks back at the final output of the Encoder stack and pays attention to the most relevant input tokens.
- A Position-wise Feed-Forward Network.
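The mask itself is just a matrix of -inf values added to the attention scores before the softmax, so each position assigns zero weight to everything after it. A minimal sketch:

```python
import torch

seq_len = 4
mask = torch.triu(torch.full((seq_len, seq_len), float("-inf")), diagonal=1)
print(mask)
# tensor([[0., -inf, -inf, -inf],
#         [0.,   0., -inf, -inf],
#         [0.,   0.,   0., -inf],
#         [0.,   0.,   0.,   0.]])
# Added to the scores, row i forces zero attention weight on positions j > i.
```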
Because the Transformer contains no recurrence, it also needs a way to know the order of the sequence. This is done by adding Positional Encodings to the input embeddings.
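The original paper uses fixed sinusoidal encodings, which can be sketched as follows (the sequence length and model width below are illustrative):

```python
import math
import torch

def positional_encoding(max_len: int, d_model: int) -> torch.Tensor:
    # PE(pos, 2i) = sin(pos / 10000^(2i/d)); PE(pos, 2i+1) = cos(...)
    pos = torch.arange(max_len).unsqueeze(1)
    div = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)  # even dimensions use sine
    pe[:, 1::2] = torch.cos(pos * div)  # odd dimensions use cosine
    return pe

# Added element-wise to the token embeddings before the first layer
embeddings = torch.randn(10, 512) + positional_encoding(10, 512)
```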
The DecodingTask class you provided is a beautiful, complete implementation of a Transformer-based decoding process, including handling prompts, beam search, and logit filters—all core components of modern generative models.
Conclusion
The evolution from simple neurons to the Transformer architecture is a story of brilliant solutions to fundamental problems.
- Neural Networks provide the basic computational units.
- The Encoder-Decoder framework allows us to tackle sequence-to-sequence problems.
- The Attention Mechanism solves the context bottleneck, allowing models to focus on relevant information.
- The Transformer leverages Self-Attention to build deep contextual understanding, becoming the de facto standard for nearly all state-of-the-art language, audio, and, increasingly, vision models.
Understanding this progression is key to understanding how modern AI works.
Suggested Reading
- “Attention Is All You Need” (Vaswani et al., 2017): The original paper that introduced the Transformer. It’s dense but foundational.
- “The Illustrated Transformer” by Jay Alammar: An incredibly intuitive and visual blog post that breaks down the Transformer architecture. A must-read.
- “Hands-On Machine Learning with Scikit-Learn, Keras & TensorFlow” by Aurélien Géron: Chapters on RNNs, Attention, and Transformers provide excellent practical context.