Simple Transformer Model Tutorial
A Comprehensive Tutorial on Building a Simple Transformer Model with Hugging Face and PyTorch (GPU Support)
Introduction
Welcome to this in-depth tutorial on building a simple Transformer model using Python, Hugging Face, and PyTorch with GPU support. Transformers, introduced in the seminal paper “Attention is All You Need” by Vaswani et al. (2017), have revolutionized natural language processing (NLP) and are the backbone of modern large language models (LLMs) like BERT, GPT, and Llama. This tutorial aims to demystify the Transformer architecture for beginners, walking you through its components, implementation, and training with a simple example. We will use the Hugging Face transformers library and PyTorch, leveraging GPU acceleration for efficient computation.
This tutorial is deliberately detailed, providing explanations of each component, step-by-step code, and descriptions of the accompanying figures (available as separate downloads). It also prepares you for a follow-up episode in which we will extend this knowledge to build and fine-tune a full-fledged LLM. By the end, you’ll have a working Transformer model and a solid understanding of its architecture.
Prerequisites
Before diving in, ensure you have the following:
- Python Knowledge: Familiarity with Python programming, including functions, classes, and basic data structures.
- PyTorch Basics: Understanding of PyTorch tensors, neural network modules, and training loops.
- Hugging Face: Basic familiarity with the Hugging Face transformers library (we’ll guide you through its usage).
- Hardware: A system with a CUDA-compatible GPU for accelerated training. If you don’t have a GPU, the code will fall back to CPU, but GPU is recommended for performance.
- Environment Setup:
- Python 3.8+
- Install the required libraries:
pip install torch transformers datasets numpy matplotlib
- A CUDA-enabled GPU with the appropriate PyTorch build installed (e.g., torch>=2.0.0+cu121 for CUDA 12.1). Check your setup with:
import torch
print(torch.cuda.is_available())  # Should print True if a GPU is available
What is a Transformer?
The Transformer is a deep learning model architecture that relies entirely on the attention mechanism, eliminating the need for recurrent neural networks (RNNs) like LSTMs. It excels in sequence-to-sequence tasks, such as machine translation, text generation, and more. The key innovation is the self-attention mechanism, which allows the model to weigh the importance of different words in a sentence when processing each word, capturing long-range dependencies efficiently.
Key Components of a Transformer
A Transformer consists of an Encoder and a Decoder, each composed of multiple layers (or blocks). Here’s a breakdown of the main components:
- Input Embedding: Converts input tokens (e.g., words or subwords) into dense vectors.
- Positional Encoding: Adds information about the position of each token in the sequence, as Transformers lack inherent sequential order.
- Multi-Head Self-Attention: Computes attention scores to focus on relevant parts of the input sequence.
- Feed-Forward Neural Network: Applies a position-wise fully connected layer to each token.
- Layer Normalization and Residual Connections: Stabilizes training and improves gradient flow.
- Decoder: Generates output sequences, using masked self-attention to prevent attending to future tokens.
- Output Layer: Maps the decoder’s output to a probability distribution over the vocabulary.
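As a rough orientation before the full implementation in Step 3, the sketch below maps these components to the PyTorch building blocks we will use. The sizes are illustrative placeholders, and PyTorch has no built-in positional-encoding module, so we implement that part by hand later.
import torch.nn as nn

# Illustrative mapping only -- the real model is assembled in Step 3.
vocab_size, d_model = 1000, 64                       # placeholder sizes
embedding = nn.Embedding(vocab_size, d_model)        # input embedding
attention = nn.MultiheadAttention(d_model, num_heads=4, batch_first=True)  # multi-head self-attention
feed_forward = nn.Sequential(                        # position-wise feed-forward network
    nn.Linear(d_model, 128), nn.ReLU(), nn.Linear(128, d_model)
)
layer_norm = nn.LayerNorm(d_model)                   # used together with residual connections
output_head = nn.Linear(d_model, vocab_size)         # projects decoder output to vocabulary logits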
Architecture Diagram
To visualize the Transformer, refer to the architecture diagram (described below for download). The diagram shows the encoder and decoder stacks, with arrows indicating data flow through self-attention, feed-forward layers, and residual connections.
Figure 1: Transformer Architecture
Description: A diagram illustrating the Transformer model with an encoder (left) and decoder (right). The encoder processes the input sequence, while the decoder generates the output. Key components include input embeddings, positional encodings, multi-head attention, feed-forward networks, and layer normalization. Arrows show residual connections and the flow from encoder to decoder via cross-attention.
File: transformer_architecture.png
(Download separately)
Step-by-Step Implementation
We’ll implement a simple Transformer model for a sequence-to-sequence task: translating English numbers (e.g., “one”) to their numerical form (e.g., “1”). This toy example simplifies tokenization and focuses on the model architecture. We’ll use Hugging Face for tokenization and PyTorch for the model and training.
Step 1: Environment Setup
First, let’s set up the environment and check for GPU availability.
import torch
import torch.nn as nn
import torch.optim as optim
from transformers import AutoTokenizer
from datasets import Dataset
import numpy as np
import matplotlib.pyplot as plt

# Check for GPU and select the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")
This code imports necessary libraries and ensures the model runs on GPU if available, falling back to CPU otherwise.
Step 2: Prepare a Toy Dataset
For simplicity, we’ll create a dataset mapping English number words to their numerical equivalents (e.g., “one” → “1”). We’ll use Hugging Face’s Dataset class for data handling.
# Create a toy dataset
data = {
    "input": ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine", "ten"],
    "output": ["1", "2", "3", "4", "5", "6", "7", "8", "9", "10"]
}
dataset = Dataset.from_dict(data)

# Initialize tokenizer (using a pre-trained tokenizer for simplicity)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# Tokenize the dataset: encode inputs and outputs to fixed-length ID sequences
def tokenize_function(examples):
    input_enc = tokenizer(examples["input"], padding="max_length", truncation=True, max_length=10)
    output_enc = tokenizer(examples["output"], padding="max_length", truncation=True, max_length=10)
    return {
        "input_ids": input_enc["input_ids"],
        "attention_mask": input_enc["attention_mask"],
        "labels": output_enc["input_ids"]
    }

tokenized_dataset = dataset.map(tokenize_function, batched=True)
Here, we:
- Create a small dataset with 10 examples.
- Use the BERT tokenizer for simplicity (though a custom tokenizer could be used for this toy task). It automatically wraps each sequence in [CLS] … [SEP], which we will reuse as start- and end-of-sequence markers during training and generation.
- Tokenize inputs and outputs, padding sequences to a fixed length (10 tokens) and generating attention masks.
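If you want to see what the tokenizer produced, here is a quick optional check (the exact token IDs depend on the BERT vocabulary):
# Inspect one tokenized example: [CLS] and [SEP] surround the word pieces,
# and [PAD] fills the sequence up to max_length=10.
sample = tokenized_dataset[0]
print(sample["input_ids"])                                   # list of 10 token IDs
print(sample["attention_mask"])                              # 1 for real tokens, 0 for padding
print(tokenizer.convert_ids_to_tokens(sample["input_ids"]))  # e.g. ['[CLS]', 'one', '[SEP]', '[PAD]', ...]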
Step 3: Define the Transformer Model
We’ll implement a simple Transformer model using PyTorch. This model includes an encoder and decoder, with multi-head self-attention and feed-forward layers.
class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, d_model=64, nhead=4, num_layers=2, dim_feedforward=128, max_seq_length=10):
        super(SimpleTransformer, self).__init__()
        self.d_model = d_model
        self.embedding = nn.Embedding(vocab_size, d_model)
        self.pos_encoder = PositionalEncoding(d_model, max_seq_length)
        # Encoder
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=num_layers)
        # Decoder
        decoder_layer = nn.TransformerDecoderLayer(
            d_model=d_model,
            nhead=nhead,
            dim_feedforward=dim_feedforward,
            batch_first=True
        )
        self.decoder = nn.TransformerDecoder(decoder_layer, num_layers=num_layers)
        # Output layer
        self.fc_out = nn.Linear(d_model, vocab_size)

    def forward(self, src, tgt, src_mask=None, tgt_mask=None):
        # Embed and add positional encoding (embeddings are scaled by sqrt(d_model))
        src = self.embedding(src) * torch.sqrt(torch.tensor(self.d_model, dtype=torch.float32))
        src = self.pos_encoder(src)
        tgt = self.embedding(tgt) * torch.sqrt(torch.tensor(self.d_model, dtype=torch.float32))
        tgt = self.pos_encoder(tgt)
        # Encoder forward
        memory = self.encoder(src, mask=src_mask)
        # Decoder forward (tgt_mask enforces causal, left-to-right attention)
        output = self.decoder(tgt, memory, tgt_mask=tgt_mask)
        # Output projection to vocabulary logits
        output = self.fc_out(output)
        return output


class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_len=5000):
        super(PositionalEncoding, self).__init__()
        pe = torch.zeros(max_len, d_model)
        position = torch.arange(0, max_len, dtype=torch.float).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2).float() * (-np.log(10000.0) / d_model))
        pe[:, 0::2] = torch.sin(position * div_term)
        pe[:, 1::2] = torch.cos(position * div_term)
        pe = pe.unsqueeze(0)
        self.register_buffer('pe', pe)

    def forward(self, x):
        # Add the (fixed) positional encodings to the token embeddings
        return x + self.pe[:, :x.size(1), :]


# Initialize model
vocab_size = tokenizer.vocab_size
model = SimpleTransformer(vocab_size=vocab_size).to(device)
Explanation:
- Embedding Layer: Maps token IDs to dense vectors of size d_model (64 in this case).
- Positional Encoding: Adds sine and cosine functions to encode token positions, as described in Vaswani et al. (2017).
- TransformerEncoder/Decoder: Uses PyTorch’s built-in TransformerEncoderLayer and TransformerDecoderLayer for multi-head attention and feed-forward layers.
- Output Layer: Projects the decoder output to the vocabulary size for token prediction.
- The model is moved to the GPU (or CPU) using .to(device).
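Before training, it can help to confirm the tensor shapes with a quick, optional forward pass on random token IDs:
# Sanity check: the model should return one logit vector per target position,
# i.e. a tensor of shape (batch_size, seq_len, vocab_size).
dummy_src = torch.randint(0, vocab_size, (2, 10), device=device)
dummy_tgt = torch.randint(0, vocab_size, (2, 10), device=device)
with torch.no_grad():
    logits = model(dummy_src, dummy_tgt)
print(logits.shape)  # torch.Size([2, 10, 30522]) with the bert-base-uncased vocabulary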
Step 4: Training the Model
We’ll train the model with teacher forcing: the decoder receives the target sequence shifted right by one position (the leading [CLS] token acts as the start-of-sequence marker) together with a causal mask, and learns to predict the next token. Training uses cross-entropy loss and the Adam optimizer, and we’ll also generate a training loss plot (described for download).
# Training setup
criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id)
optimizer = optim.Adam(model.parameters(), lr=0.001)
num_epochs = 50
batch_size = 2

# Convert a slice of the dataset to PyTorch tensors on the selected device
def format_batch(batch):
    return {
        "input_ids": torch.tensor(batch["input_ids"], dtype=torch.long).to(device),
        "attention_mask": torch.tensor(batch["attention_mask"], dtype=torch.long).to(device),
        "labels": torch.tensor(batch["labels"], dtype=torch.long).to(device)
    }
# Training loop (teacher forcing: predict token t+1 from tokens 0..t)
losses = []
model.train()
for epoch in range(num_epochs):
    epoch_loss = 0
    for i in range(0, len(tokenized_dataset), batch_size):
        batch = format_batch(tokenized_dataset[i:i+batch_size])
        # Decoder input is the label sequence shifted right by one position;
        # the leading [CLS] token serves as the start-of-sequence marker.
        decoder_input = batch["labels"][:, :-1]
        target = batch["labels"][:, 1:]
        # Causal mask so each position can only attend to earlier positions
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(decoder_input.size(1)).to(device)
        # Forward pass
        outputs = model(batch["input_ids"], decoder_input, tgt_mask=tgt_mask)
        loss = criterion(outputs.reshape(-1, vocab_size), target.reshape(-1))
        # Backward pass
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        epoch_loss += loss.item()
    avg_loss = epoch_loss / (len(tokenized_dataset) // batch_size)
    losses.append(avg_loss)
    print(f"Epoch {epoch+1}/{num_epochs}, Loss: {avg_loss:.4f}")
# Plot training loss
plt.figure(figsize=(10, 6))
plt.plot(range(1, num_epochs+1), losses, marker='o')
plt.title("Training Loss Over Epochs")
plt.xlabel("Epoch")
plt.ylabel("Loss")
plt.grid(True)
plt.savefig("training_loss.png")
Figure 2: Training Loss Plot
Description: A line plot showing the training loss over 50 epochs. The x-axis represents the epoch number, and the y-axis shows the average loss per epoch. The plot includes markers at each epoch and a grid for readability.
File: training_loss.png
(Download separately)
Step 5: Inference
Let’s test the model by translating a number word to its numerical form.
def translate_number(model, tokenizer, text, max_length=10):
    model.eval()
    with torch.no_grad():
        # Tokenize the input the same way as during training
        inputs = tokenizer(text, return_tensors="pt", padding="max_length", truncation=True, max_length=max_length)
        input_ids = inputs["input_ids"].to(device)
        # Start generation with [CLS], which we used as the start-of-sequence marker
        output_ids = torch.full((1, 1), tokenizer.cls_token_id, dtype=torch.long, device=device)
        # Generate tokens one at a time (greedy decoding)
        for _ in range(max_length - 1):
            tgt_mask = nn.Transformer.generate_square_subsequent_mask(output_ids.size(1)).to(device)
            outputs = model(input_ids, output_ids, tgt_mask=tgt_mask)
            next_token = torch.argmax(outputs[:, -1, :], dim=-1, keepdim=True)
            output_ids = torch.cat([output_ids, next_token], dim=1)
            # Stop at [SEP], which the BERT tokenizer uses as the end-of-sequence token
            if next_token.item() == tokenizer.sep_token_id:
                break
        # Decode the generated IDs, dropping [CLS]/[SEP]/[PAD]
        output_text = tokenizer.decode(output_ids[0], skip_special_tokens=True)
        return output_text
# Test the model
test_input = "five"
result = translate_number(model, tokenizer, test_input)
print(f"Input: {test_input}, Output: {result}")
This function generates the output token by token with greedy decoding, starting from [CLS] and stopping at [SEP], and we test it with the input “five”. If training converged, the model should output “5”.
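As an optional extra check, you can run the whole toy vocabulary through the model. With only 10 training examples and a generic pre-trained tokenizer, don’t be surprised if some outputs are imperfect.
# Translate every word in the toy dataset and print the results.
for word in data["input"]:
    print(f"{word} -> {translate_number(model, tokenizer, word)}")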
Understanding the Transformer Architecture
Let’s dive deeper into the Transformer’s components, as they form the foundation for LLMs.
Multi-Head Self-Attention
The self-attention mechanism computes a weighted sum of input embeddings, where weights are determined by the similarity between tokens. Multi-head attention splits the input into multiple subspaces, allowing the model to focus on different aspects of the sequence simultaneously.
Formula: For a single attention head:

\[
\text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V
\]

Where:
- \(Q\): the query matrix
- \(K\): the key matrix
- \(V\): the value matrix
- \(d_k\): the dimension of the keys/queries

Multi-head attention concatenates the outputs of multiple attention heads:

\[
\text{MultiHead}(Q, K, V) = \text{Concat}(\text{head}_1, \text{head}_2, \dots, \text{head}_h)W^O
\]
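To make the formula concrete, here is a small, self-contained example that computes scaled dot-product attention step by step on random tensors (illustrative values only; this is what each attention head does internally):
import torch
import torch.nn.functional as F

torch.manual_seed(0)
d_k = 4
Q = torch.randn(1, 3, d_k)  # (batch, seq_len, d_k)
K = torch.randn(1, 3, d_k)
V = torch.randn(1, 3, d_k)

scores = Q @ K.transpose(-2, -1) / (d_k ** 0.5)  # (1, 3, 3) similarity matrix
weights = F.softmax(scores, dim=-1)              # each row sums to 1
attended = weights @ V                           # weighted sum of the values
print(weights)   # attention weights per query position
print(attended)  # attention output, shape (1, 3, d_k)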
Positional Encoding
Since Transformers process tokens in parallel, they need positional encodings to capture word order. We used sine and cosine functions, as proposed by Vaswani et al. (2017):

\[
PE_{(pos, 2i)} = \sin\left(pos / 10000^{2i/d_{\text{model}}}\right)
\]
\[
PE_{(pos, 2i+1)} = \cos\left(pos / 10000^{2i/d_{\text{model}}}\right)
\]
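As a quick, optional sanity check, you can evaluate the closed-form expressions for a single position and compare them with the buffer built by the PositionalEncoding module from Step 3:
import math

d_model, pos, i = 64, 3, 0
manual_sin = math.sin(pos / (10000 ** (2 * i / d_model)))   # PE(pos, 2i)
manual_cos = math.cos(pos / (10000 ** (2 * i / d_model)))   # PE(pos, 2i+1)

pe = PositionalEncoding(d_model, max_len=10)                # module defined in Step 3
print(manual_sin, pe.pe[0, pos, 2 * i].item())              # values should match
print(manual_cos, pe.pe[0, pos, 2 * i + 1].item())          # values should match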
Feed-Forward Networks and Residual Connections
Each Transformer layer includes a position-wise feed-forward network (FFN):

\[
\text{FFN}(x) = \max(0, xW_1 + b_1)W_2 + b_2
\]

Residual connections add the input of each sub-layer to its output, followed by layer normalization:

\[
\text{LayerNorm}(x + \text{Sublayer}(x))
\]
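A minimal sketch of this sub-layer pattern in PyTorch (nn.TransformerEncoderLayer already does this internally; the snippet only mirrors the formulas above):
import torch
import torch.nn as nn

d_model, dim_feedforward = 64, 128
ffn = nn.Sequential(
    nn.Linear(d_model, dim_feedforward),
    nn.ReLU(),                       # max(0, xW1 + b1)
    nn.Linear(dim_feedforward, d_model),
)
norm = nn.LayerNorm(d_model)

x = torch.randn(2, 10, d_model)      # (batch, seq_len, d_model)
out = norm(x + ffn(x))               # LayerNorm(x + Sublayer(x))
print(out.shape)                     # torch.Size([2, 10, 64])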
Preparing for the Next Episode: LLMs
This tutorial provides a foundation for understanding Transformers, which are the building blocks of LLMs. In the next episode, we’ll:
- Scale up the model with more layers and parameters.
- Use a larger dataset (e.g., from Hugging Face’s datasets library).
- Fine-tune a pre-trained LLM like GPT-2 or Llama for specific tasks.
- Explore advanced techniques like LoRA (Low-Rank Adaptation) for efficient fine-tuning.
Tips for Scaling to LLMs
- Use Pre-trained Models: Hugging Face provides pre-trained models like distilbert or gpt2 that can be fine-tuned for specific tasks.
- Efficient Training: Use mixed-precision training (torch.cuda.amp) to reduce memory usage on GPUs (see the sketch after this list).
- Data Quality: LLMs require large, diverse datasets. Consider datasets like wikitext or bookcorpus.
- Hyperparameter Tuning: Experiment with learning rates, batch sizes, and model sizes.
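To illustrate the mixed-precision tip, here is a minimal sketch of a torch.cuda.amp training step. It reuses the model, optimizer, criterion, and format_batch from Step 4 and assumes a CUDA device is available (on CPU, simply train in full precision as before).
# Mixed-precision training step (sketch): autocast runs the forward pass in
# float16 where safe, GradScaler guards against float16 gradient underflow.
scaler = torch.cuda.amp.GradScaler()

for epoch in range(num_epochs):
    for i in range(0, len(tokenized_dataset), batch_size):
        batch = format_batch(tokenized_dataset[i:i+batch_size])
        decoder_input = batch["labels"][:, :-1]
        target = batch["labels"][:, 1:]
        tgt_mask = nn.Transformer.generate_square_subsequent_mask(decoder_input.size(1)).to(device)
        optimizer.zero_grad()
        with torch.cuda.amp.autocast():
            outputs = model(batch["input_ids"], decoder_input, tgt_mask=tgt_mask)
            loss = criterion(outputs.reshape(-1, vocab_size), target.reshape(-1))
        scaler.scale(loss).backward()   # scale the loss before backprop
        scaler.step(optimizer)          # unscale gradients, then update weights
        scaler.update()                 # adjust the scale factor for the next step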
Visualizations
To aid understanding, download the following figures:
- Transformer Architecture (transformer_architecture.png): Shows the encoder-decoder structure with labeled components.
- Training Loss Plot (training_loss.png): Displays the loss curve over epochs, indicating training progress.
Conclusion
In this tutorial, we built a simple Transformer model using PyTorch and Hugging Face, trained it on a toy dataset, and visualized the training process. We explored the Transformer’s architecture, including self-attention, positional encoding, and feed-forward layers, with GPU support for efficient computation. This foundation prepares you for the next episode, where we’ll dive into LLMs, scaling up the model and tackling real-world NLP tasks.
Feel free to experiment with the code, adjust hyperparameters, or try different datasets, and share your version on GitHub. Stay tuned for the LLM episode!
Citations and References
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is All You Need. Advances in Neural Information Processing Systems, 30. https://arxiv.org/abs/1706.03762
- Hugging Face Documentation: https://huggingface.co/docs/transformers
- PyTorch Documentation: https://pytorch.org/docs/stable/index.html