Install the packages for this chapter once if you haven’t already:
install.packages("keras3")
keras3::install_keras()
The second line sets up a Python TensorFlow backend that
keras3 uses under the hood. It takes a few minutes the
first time and then never again.
library(tidyverse)
library(keras3)
In Chapter 19 we built up the machinery of a neural network on paper. We computed forward passes by hand, watched gradient descent move one weight, and saw why hidden layers and nonlinear activations matter. Today we put that machinery to work on real text data.
Neural networks can’t read words directly; they only operate on
numbers. Two ideas bridge the gap. First, tokenization:
text is split into individual words, and each unique word is assigned an
integer index from a fixed vocabulary, so a review becomes a sequence of
integers like [14, 22, 16, 43, ...]. For our dataset, the
Keras team has already done this for us: they took 50,000 reviews,
split them on whitespace and punctuation, ranked words by how often they
appear in the full corpus, and assigned each word its rank as its index
(“the” gets 1, “and” gets 2, and so on). Second,
embedding: the network turns each integer into a short
vector of learned numbers (say, length 16). Unlike the token indices,
embedding vectors are not prepared in advance; they are
learned from scratch during training, alongside all the other weights.
Words used in similar contexts end up at similar locations in the
embedding space; the network discovers this on its own from the labeled
data.
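To make the two steps concrete, here is a minimal base-R sketch on a made-up three-review corpus. The reviews, the variable names, and the random embedding table are all assumptions for illustration; the real dataset arrives already tokenized, and the real embedding values are learned by the network rather than drawn at random.

```r
# Toy corpus: three made-up "reviews" (hypothetical, not taken from IMDB)
reviews <- c("the movie was great",
             "the movie was bad",
             "great acting great story")

# Tokenize: lowercase, then split on anything that is not a letter
tokens <- strsplit(tolower(reviews), "[^a-z]+")

# Rank words by corpus frequency; the most frequent word gets index 1
counts <- sort(table(unlist(tokens)), decreasing = TRUE)
vocab <- setNames(seq_along(counts), names(counts))

# Encode each review as a sequence of integer indices
encoded <- lapply(tokens, function(w) unname(vocab[w]))
encoded[[1]]

# Embedding table: one row per vocabulary word, embedding_dim columns.
# Random here; in a real network these numbers are learned during training.
set.seed(1)
embedding_dim <- 4
embedding <- matrix(rnorm(length(vocab) * embedding_dim),
                    nrow = length(vocab), ncol = embedding_dim)

# Looking up a review is just row indexing: one vector per word
review1_vectors <- embedding[encoded[[1]], , drop = FALSE]
dim(review1_vectors)
```

The lookup in the last step is all an embedding layer does on the forward pass; training only changes the numbers stored in the table.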
What does “similar locations in the embedding space” mean? Imagine each word being placed in a 2D space (real embeddings have more dimensions, but two are easier to draw). Words used in similar contexts cluster together. Apple and orange both appear in food contexts, so they end up near each other. Cat belongs to a different category, so it lives far from apple. But cat and orange share a color, so they’re closer along the color dimension than cat and apple. The network discovers these patterns by adjusting each word’s vector until words with similar predictive value land near each other.
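The apple, orange, and cat picture can be checked with a few lines of base R. The coordinates below are hand-picked for illustration (they are not learned embeddings), with one column standing for how food-like a word is and one for how orange-colored it is.

```r
# Hypothetical 2D coordinates, chosen by hand for illustration only
emb <- rbind(apple  = c(food = 0.9, color = 0.3),
             orange = c(food = 0.9, color = 0.8),
             cat    = c(food = 0.1, color = 0.7))

# Euclidean distance between every pair of word vectors
d <- as.matrix(dist(emb))
round(d, 2)

# Gap along the color dimension alone
abs(emb["cat", "color"] - emb["orange", "color"])
abs(emb["cat", "color"] - emb["apple", "color"])
```

With these numbers, apple sits closer to orange than to cat overall, while cat is closer to orange than to apple along the color column, which is the pattern described above.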
The full pipeline from raw text to a sentiment prediction looks like this:
A note on where this fits in the bigger picture. Today’s lecture learns embeddings from scratch on 5,000 labeled reviews. Modern “large language models”, the family of systems behind ChatGPT, Claude, and BERT, do something more powerful: they pre-train on billions of words of general web text using a self-supervised task (typically predicting the next word, or filling in missing words), and learn rich representations of language without needing any human labels. When researchers then want to do sentiment classification, they take a pre-trained model and fine-tune it on a small labeled dataset. The result is typically much better accuracy than training from scratch on the same small dataset.
Our data is 50,000 movie reviews from IMDB, each labeled positive (1) or negative (0). Half (25,000) are training, half are test. The task is sentiment classification — text in, predicted label out.
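We’ll build and compare three architectures, increasing in sophistication: averaged word embeddings (with and without a hidden layer), a simple recurrent neural network (RNN), and an LSTM.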
Why we need a validation set. A model that learns the training data perfectly is not necessarily a good model. Given enough capacity, a network will eventually start fitting the noise in the training set (overfitting), and its predictions on data it hasn’t seen will get worse. To catch this we need data the model has not trained on. We split the original training set 80/20 into two pieces: the part the model actually trains on (the training set) and a held-out piece used only to check how well the model generalizes (the validation set). On all 25,000 training reviews that would be 20,000 and 5,000; on the 5,000-review subset we use in this lecture it is 4,000 and 1,000.
Training happens in passes over the data. Inside one pass, the
optimizer goes through the training set in small
batches (groups of 64 reviews in our setup) and updates
the weights once per batch. One full sweep through all 4,000 training
reviews — about 63 weight updates in our case — is called one
epoch. When we call fit(epochs = 5), the
training loop runs five such sweeps back to back and then stops. The
network has no built-in “enough” signal; we choose the number. After
each epoch we compute accuracy on both the training and validation sets,
and the validation curve tells us whether we picked a sensible number.
If training accuracy keeps climbing while validation accuracy plateaus
or falls, we know we’ve trained too long and the model is overfitting —
that’s our cue to lower epochs next time.
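The batch and epoch arithmetic above is easy to verify. The variable names in this short sketch are only for illustration.

```r
n_train <- 4000     # reviews the model actually trains on
batch_size <- 64
epochs <- 5

# The last batch is smaller than 64, so round up
updates_per_epoch <- ceiling(n_train / batch_size)
updates_per_epoch            # 63
updates_per_epoch * epochs   # 315
```

So fit(epochs = 5) performs 315 weight updates in total with these settings.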
The 25,000 test set reviews we set aside earlier never get touched during model development. We only look at them once at the very end, after we’ve made all our architecture and tuning choices, to report a clean final accuracy number.
To keep each training run fast on a laptop, we’ll use a 5,000-review subset of the training data (4,000 for training + 1,000 for validation). The complete pipeline runs in a few minutes per architecture.
We’ll keep only the top 10,000 most common words from the vocabulary. Less common words are replaced with a special “unknown” token. This is just a convenience: it caps the size of the embedding table we’ll need later.
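As a rough sketch of what the vocabulary cap does to one review, the snippet below replaces every index at or above the cap with the “unknown” index. The review is made up, and the choice of 2 as the unknown index is an assumption based on the Keras default for this dataset.

```r
num_words <- 10000
oov_index <- 2   # assumed: the index Keras uses for "unknown" by default

# A made-up encoded review; 15230 and 48211 fall outside the top 10,000
review <- c(1, 14, 22, 15230, 16, 48211, 43)

# Keep indices below the cap, replace the rest with the unknown index
capped <- ifelse(review < num_words, review, oov_index)
capped
```

After capping, no index is 10,000 or larger, so an embedding table with 10,000 rows is enough.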
# dataset_imdb() loads the reviews; num_words caps the vocabulary
imdb <- dataset_imdb(num_words = 10000)
# imdb$train and imdb$test each hold $x (the reviews) and $y (the labels)
x_train_raw <- imdb$train$x
y_train <- imdb$train$y
x_test_raw <- imdb$test$x
y_test <- imdb$test$y
length(x_train_raw)
length(x_test_raw)
Each review is a vector of word indices of variable length. Let’s look at one:
# The first 20 indices of review 1, and the label (1 = positive, 0 = negative)
x_train_raw[[1]][1:20]
y_train[1]
That’s the integer-encoded form. To read it as English we need to invert the vocabulary: build a lookup from integer back to word. Indices 0, 1, and 2 are reserved for special tokens (padding, start-of-sequence, unknown), so the actual word indices are offset by 3.
# Build a reverse lookup: integer -> word
word_index <- dataset_imdb_word_index()
index_word <- setNames(names(word_index), unlist(word_index))
# A helper to decode one review back to English
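decode <- function(sequence) {
  words <- index_word[as.character(sequence - 3)]
  # Reserved tokens (padding, start, unknown) have no dictionary entry; show them as "?"
  words[is.na(words)] <- "?"
  paste(words, collapse = " ")
}
Now let’s read a positive review (label = 1) in full: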
# First positive review in the training set
pos_idx <- which(y_train == 1)[1]
decode(x_train_raw[[pos_idx]])
And a negative one (label = 0):
# First negative review in the training set
neg_idx <- which(y_train == 0)[1]
decode(x_train_raw[[neg_idx]])
The pattern you’ll see: positive reviews lean on words like “great”, “wonderful”, “loved”; negative ones lean on “boring”, “bad”, “waste”. That’s the signal the network will learn to pick up.
Neural networks need fixed-length input, but reviews vary in length. Let’s see how much:
review_lengths <- sapply(x_train_raw, length)
summary(review_lengths)
We’ll set the maximum length to 200 (longer reviews get truncated; shorter ones get padded with zeros) and take a 5,000-review subset to keep training fast.
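Before running the real call, here is a toy base-R sketch of padding and truncation on a single sequence. pad_one() is a hypothetical helper written for this illustration, not part of keras3; it assumes the Keras defaults, where both padding and truncation happen at the front of the sequence.

```r
# pad_one() is a hypothetical helper, not a keras3 function
pad_one <- function(x, maxlen) {
  n <- length(x)
  if (n >= maxlen) {
    # Too long: keep only the last maxlen indices
    x[(n - maxlen + 1):n]
  } else {
    # Too short: fill the front with zeros
    c(rep(0, maxlen - n), x)
  }
}

pad_one(c(5, 8, 13), maxlen = 5)                   # 0 0 5 8 13
pad_one(c(5, 8, 13, 21, 34, 55, 89), maxlen = 5)   # 13 21 34 55 89
```

Applied to every review with maxlen = 200, the same idea turns the list of variable-length sequences into a matrix with one row per review and 200 columns.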
maxlen <- 200
# pad_sequences() turns the variable-length integer sequences into a uniform
# matrix: short reviews are padded with zeros, long ones are truncated
x_train <- pad_sequences(x_train_raw, maxlen = maxlen)
x_test <- pad_sequences(x_test_raw, maxlen = maxlen)
# Use a 5,000-review subset (4,000 to train on, 1,000 for validation) so each model trains quickly
set.seed(42)
train_idx <- sample(seq_len(nrow(x_train)), 5000)
x_train_small <- x_train[train_idx, ]
y_train_small <- y_train[train_idx]
For the rest of the lecture we’ll build five different models and
compare them. Rather than copy-pasting nearly-identical code blocks,
let’s write one function that builds, trains, and plots any of our
architectures. Each Part below will then be a single line: call
train_model() with the parameter you want to vary.
Read through the function carefully — it shows you what every model in Parts 2 through 7 actually does.
train_model <- function(arch = "average",
                        embedding_dim = 16,
                        hidden_units = 0,       # 0 = no hidden layer
                        recurrent_units = 32,   # used by "rnn" and "lstm"
                        dropout_rate = 0,       # 0 = no dropout
                        epochs = 5) {
  # Start a stack of layers; every model begins with an embedding
  model <- keras_model_sequential() %>%
    layer_embedding(input_dim = 10000, output_dim = embedding_dim)
  # Choose what happens to the sequence of word embeddings
  if (arch == "average") {
    # Bag-of-embeddings: average all 200 word vectors into one summary vector
    model <- model %>% layer_global_average_pooling_1d()
    if (hidden_units > 0) {
      model <- model %>% layer_dense(units = hidden_units, activation = "relu")
    }
    if (dropout_rate > 0) {
      model <- model %>% layer_dropout(rate = dropout_rate)
    }
  } else if (arch == "rnn") {
    # Simple RNN: read words in order, carry a memory vector forward
    model <- model %>% layer_simple_rnn(units = recurrent_units)
  } else if (arch == "lstm") {
    # LSTM: like RNN but uses gates to prevent gradients from vanishing
    model <- model %>% layer_lstm(units = recurrent_units)
  } else {
    # Fail early on a typo instead of training a malformed model
    stop("Unknown arch: ", arch)
  }
  # Final sigmoid output for binary classification
  model <- model %>% layer_dense(units = 1, activation = "sigmoid")
  # Standard training setup: Adam optimizer, binary cross-entropy loss
  model %>% compile(
    optimizer = "adam",
    loss = "binary_crossentropy",
    metrics = "accuracy"
  )
  # Train. validation_split = 0.2 reserves the last 20% as the validation set.
  history <- model %>% fit(
    x_train_small, y_train_small,
    epochs = epochs,
    batch_size = 64,
    validation_split = 0.2,
    verbose = 2
  )
  # Inside a function a ggplot is not auto-printed, so print it explicitly
  print(plot(history))
  invisible(list(model = model, history = history))
}
Now every experiment below is a one-line call. To rerun with a different setting, just change the relevant argument.
Our first model is the simplest possible neural network for text: an embedding layer, followed by averaging across all word positions, followed by a single sigmoid output unit. With a single output unit and a sigmoid, this is essentially logistic regression, with the twist that the “features” are learned word embeddings instead of hand-crafted variables.
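The forward pass of this baseline fits in a few lines of base R. This is a sketch under stated assumptions, not what keras3 runs internally: the weights are random stand-ins for learned parameters, the review is made up, and because R matrices are 1-indexed the toy ignores index 0 (padding).

```r
set.seed(1)
embedding_dim <- 16
vocab_size <- 10000

# Stand-ins for learned parameters (random here, learned in the real model)
embedding <- matrix(rnorm(vocab_size * embedding_dim, sd = 0.1),
                    nrow = vocab_size)
w <- rnorm(embedding_dim)   # output-layer weights
b <- 0                      # output-layer bias

# A made-up review of five word indices
review <- c(14, 22, 16, 43, 530)

# 1. Embedding lookup: one 16-number vector per word
vectors <- embedding[review, , drop = FALSE]
# 2. Average over word positions: one 16-number summary vector
avg <- colMeans(vectors)
# 3. Logistic regression on the summary vector
p_positive <- 1 / (1 + exp(-(sum(w * avg) + b)))
p_positive
```

Step 3 is exactly the logistic regression formula; the only new ingredient is that the 16 inputs come from a learned table instead of hand-built features.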
In train_model() terms: arch = "average",
hidden_units = 0.
Pick an embedding dimension below, then train the model. Try a few values: 4, 16, 64.
train_model(arch = "average", hidden_units = 0, embedding_dim = ____, epochs = 5)
The baseline averages embeddings and applies a linear classifier. Let’s add a hidden layer with a ReLU activation between the averaging step and the output, and see if the extra capacity helps.
Pick a number of hidden units below, then train. Try 4, 16, 64.
train_model(arch = "average", hidden_units = ____, epochs = 5)
So far we’ve trained for only 5 epochs. What happens if we train much longer? In Chapter 19 we said that gradient descent keeps moving weights to reduce the training loss. If we let it run for too long, the model may start to memorize the training set instead of learning general patterns. That is overfitting, the same disease we met in Chapter 13.
Train the model from Part 3 (use hidden_units = 64) for
many more epochs.
train_model(arch = "average", hidden_units = 64, epochs = ____)
There are several ways to fight overfitting. In Chapter 13 we saw regularization, which penalizes large coefficients. Neural networks have a different favorite tool: dropout.
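Dropout works by randomly “turning off” some fraction of the hidden units during each training step. With dropout rate 0.5, each unit is silenced with probability 0.5, so about half of them are off on any given pass, and a different random set each time. This forces the network to spread its learning across many units rather than relying heavily on a few, and it acts like training many smaller networks at once. Dropout is only active during training; when the model makes predictions, all units are on.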
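A single dropout step can be sketched in base R as follows. The activations are made-up numbers, and the rescaling of the surviving units (often called inverted dropout) reflects how Keras-style dropout layers are usually described; treat it as an illustration rather than the library's exact code.

```r
set.seed(42)
rate <- 0.5
hidden <- c(0.8, 1.5, 0.2, 2.1, 0.0, 0.9, 1.1, 0.4)  # made-up activations

# Each unit is kept with probability 1 - rate
keep <- rbinom(length(hidden), size = 1, prob = 1 - rate)

# Silence the dropped units and rescale the survivors so the expected
# total activation stays the same
dropped <- hidden * keep / (1 - rate)
rbind(hidden, keep, dropped)
```

Rerunning with a different seed silences a different set of units, which is what stops the network from leaning on any single one.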
Pick a dropout rate below, then train. Try 0, 0.2, and 0.5.
train_model(arch = "average", hidden_units = 64, epochs = 20, dropout_rate = ____)
Every model we’ve built so far throws away word order. The averaging step in Parts 2–5 turns the 200 word vectors into one summary vector, which means “the movie was great, not bad” and “the movie was bad, not great” produce identical inputs to the classifier.
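The claim about the two sentences can be checked directly. In this sketch the embedding vectors are random stand-ins (an assumption; any fixed table gives the same conclusion), and punctuation is dropped for simplicity.

```r
a <- c("the", "movie", "was", "great", "not", "bad")
b <- c("the", "movie", "was", "bad", "not", "great")

# Same words, different order
identical(a, b)               # FALSE
identical(sort(a), sort(b))   # TRUE

# Give each distinct word a random stand-in embedding vector
set.seed(1)
words <- unique(a)
embedding <- matrix(rnorm(length(words) * 4), nrow = length(words),
                    dimnames = list(words, NULL))

# Averaging over positions gives the same summary vector for both
all.equal(colMeans(embedding[a, ]), colMeans(embedding[b, ]))   # TRUE
```

Because the average does not depend on the order of the rows, no amount of training can teach an averaging model to tell these two reviews apart.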
In Chapter 19 we sketched a different approach: read the review one word at a time, carrying a memory vector that accumulates context as we go. That’s called a recurrent neural network, or RNN. The same small network is applied at each position, and its output (the updated memory) feeds into the next position.
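The recurrence can be written out in a few lines of base R. This is a minimal sketch, assuming a tanh activation and random stand-in weights; the names W_x, W_h, and b are chosen for this illustration, and the six-word "review" is random numbers rather than real embeddings.

```r
set.seed(1)
embedding_dim <- 4
units <- 3

# Stand-ins for learned weights, shared across all positions
W_x <- matrix(rnorm(units * embedding_dim, sd = 0.5), nrow = units)
W_h <- matrix(rnorm(units * units, sd = 0.5), nrow = units)
b <- rep(0, units)

# A made-up review: 6 words, each already embedded as a 4-number vector
x <- matrix(rnorm(6 * embedding_dim), nrow = 6)

h <- rep(0, units)   # memory starts empty
for (step in seq_len(nrow(x))) {
  # New memory = squashed mix of the current word and the previous memory
  h <- tanh(as.vector(W_x %*% x[step, ] + W_h %*% h + b))
}
h   # final memory vector, passed on to the sigmoid output layer
```

Note that the loop reuses the same W_x and W_h at every position, and that swapping two rows of x would change the final h, which is how this model becomes sensitive to word order.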
Pick a number of RNN units below, then train. Try 32. Compare the validation accuracy to your best averaging model from Parts 2–5.
train_model(arch = "rnn", recurrent_units = ____, epochs = 10)
The simple RNN had a hard time. The problem is vanishing gradients: when the network tries to learn from a 200-word review, the gradient signal has to travel back through 200 timesteps during backpropagation. At each step the gradient is typically multiplied by a factor smaller than 1, and after that many multiplications the signal is effectively zero. The early words in a review then have almost no influence on the weight updates, so the network struggles to learn long-range patterns no matter how long we train it.
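A quick calculation shows how fast repeated multiplication shrinks a signal. The per-step factor of 0.9 is a hypothetical value chosen for illustration; the real factor depends on the weights and varies from step to step.

```r
factor_per_step <- 0.9   # hypothetical shrink factor, for illustration only
steps <- c(1, 10, 50, 100, 200)

# Size of the gradient signal after travelling back this many timesteps
signal <- factor_per_step ^ steps
data.frame(steps, signal = signif(signal, 3))
```

Even with a factor this close to 1, the signal that reaches the first word of a 200-word review is smaller than one part in a billion.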
The LSTM (Long Short-Term Memory) was invented in 1997 specifically to fix this. It has the same conceptual setup as a simple RNN (read one word at a time, carry memory forward), but it uses internal “gates” that decide what to keep, what to forget, and what to output at each step. The gates let useful information flow through many timesteps without being crushed. You don’t need the gate equations for this lecture; just know that when people want a recurrent sequence model in practice, they reach for an LSTM rather than a simple RNN.
Pick a number of LSTM units below, then train. Try 32.
train_model(arch = "lstm", recurrent_units = ____, epochs = 5)
You’ve now seen three architectures and their validation accuracies. Look back at your results, pick the architecture with the best validation accuracy, and retrain that model. We’ll then evaluate it once on the 25,000-review test set we set aside at the very beginning. This is our clean, unbiased estimate of how well the chosen model will generalize to new reviews: the number we report to the world.
# Retrain whichever architecture gave you the best validation accuracy.
# The example below is for the averaging + dropout model from Part 5;
# replace the arguments if a different architecture won.
result <- train_model(arch = "average", hidden_units = 64,
                      epochs = 20, dropout_rate = 0.5)
# evaluate() runs the trained model on the test set and returns loss and accuracy
result$model %>% evaluate(x_test, y_test, verbose = 0)
In your own words, answer the following: