---
title: "20 Deep Learning Part 2"
format:
  html:
    self-contained: TRUE
---

::: {.callout-important}
## Before you start

Install the packages for this chapter once if you haven't already:

```{r, eval = FALSE}
install.packages("keras3")
keras3::install_keras()
```

The second line sets up a Python TensorFlow backend that `keras3` uses under the hood. It takes a few minutes the first time and then never again.
:::

```{r}
#| message: false
#| warning: false
library(tidyverse)
library(keras3)
```

In Chapter 19 we built up the machinery of a neural network on paper. We computed forward passes by hand, watched gradient descent move one weight, and saw why hidden layers and nonlinear activations matter. Today we put that machinery to work on real text data.

Neural networks can't read words directly; they only operate on numbers. Two ideas bridge the gap. First, **tokenization**: text is split into individual words, and each unique word is assigned an integer index from a fixed vocabulary, so a review becomes a sequence of integers like `[14, 22, 16, 43, ...]`. For our dataset, the Keras team has already done this for us: they took 50,000 reviews, split them on whitespace and punctuation, ranked words by how often they appear in the full corpus, and assigned each word its rank as its index ("the" gets 1, "and" gets 2, and so on). Second, **embedding**: the network turns each integer into a short vector of learned numbers (say, length 16). Unlike tokenization, embedding vectors are *not* prepared in advance; they are learned from scratch during training, alongside all the other weights. Words used in similar contexts end up at similar locations in the embedding space; the network discovers this on its own from the labeled data.
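
To make the tokenization step concrete, here is a minimal sketch in base R on three made-up reviews. The names `toy_reviews`, `toy_vocab`, and `toy_encoded` are illustrative only and are not part of `keras3`; the real IMDB preprocessing differs in its details.

```{r}
# Three made-up reviews (hypothetical data, not from IMDB)
toy_reviews <- c("the movie was great",
                 "the movie was bad",
                 "great acting great plot")

# Split each review into lowercase words
toy_tokens <- strsplit(tolower(toy_reviews), "[^a-z]+")

# Rank words by how often they appear across all reviews
toy_counts <- sort(table(unlist(toy_tokens)), decreasing = TRUE)

# The vocabulary maps each word to its frequency rank
toy_vocab <- setNames(seq_along(toy_counts), names(toy_counts))

# Each review becomes a sequence of integers
toy_encoded <- lapply(toy_tokens, function(words) unname(toy_vocab[words]))

toy_vocab
toy_encoded[[3]]
```

The most frequent word ("great" here) gets index 1, just as the most frequent word in the IMDB corpus gets the smallest index.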

What does "similar locations in the embedding space" mean? Imagine each word being placed in a 2D space (real embeddings have more dimensions, but two are easier to draw). Words used in similar contexts cluster together. *Apple* and *orange* both appear in food contexts, so they end up near each other. *Cat* belongs to a different category, so it lives far from *apple*. But *cat* and *orange* share a color, so they're closer along the color dimension than *cat* and *apple*. The network discovers these patterns by adjusting each word's vector until words with similar predictive value land near each other.
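
As a rough illustration of that picture, the sketch below places the three words in a 2D space and measures the distances between them. The coordinates are hand-picked for this example (an assumption, not learned values), and the dimension names `food` and `orange_color` are hypothetical labels; real embedding dimensions rarely have such a clean interpretation.

```{r}
# Hand-picked 2D coordinates (hypothetical, not learned): one row per word
toy_embedding_2d <- rbind(
  apple  = c(food = 0.9, orange_color = 0.4),
  orange = c(food = 0.9, orange_color = 0.8),
  cat    = c(food = 0.1, orange_color = 0.7)
)

# Euclidean distance between every pair of words
toy_distances <- as.matrix(dist(toy_embedding_2d))
round(toy_distances, 2)

# Distance from "cat" to the other two words along the color dimension only
toy_color_gap <- abs(toy_embedding_2d["cat", "orange_color"] -
                     toy_embedding_2d[c("orange", "apple"), "orange_color"])
toy_color_gap
```

With these coordinates, *apple* and *orange* are the closest pair overall, while *cat* sits closer to *orange* than to *apple* on the color axis.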


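The full pipeline from raw text to a sentiment prediction runs in this order: raw review text, then tokenization (a sequence of integers), then embedding (one learned vector per word), then a summary of those vectors (an average, or the final state of a recurrent layer), and finally a sigmoid output between 0 and 1 that we read as the probability that the review is positive.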


A note on where this fits in the bigger picture. Today's lecture learns embeddings from scratch on 5,000 labeled reviews. Modern pre-trained language models, such as BERT and the large language models behind tools like ChatGPT and Claude, do something more powerful: they *pre-train* on billions of words of general web text using a self-supervised task (typically predicting the next word, or filling in missing words), and learn rich representations of language without needing any human labels. When researchers then want to do sentiment classification, they take a pre-trained model and *fine-tune* it on a small labeled dataset. The result is typically much better accuracy than from-scratch training.

## What we're doing today

Our data is 50,000 movie reviews from IMDB, each labeled positive (1) or negative (0). Half (25,000) are training, half are test. The task is sentiment classification --- text in, predicted label out.

We'll build and compare three architectures, increasing in sophistication:

1. **Bag-of-embeddings** (Parts 2--5): the simplest model. Average all the word vectors in a review into one summary vector, then classify. This throws away word order entirely. We'll first build it without a hidden layer (essentially logistic regression on learned features), then with a hidden layer, then watch it overfit, then fix the overfitting with dropout.
2. **Simple RNN** (Part 6): a recurrent network that reads the review word by word, carrying a memory vector forward. This is the architecture sketched in Chapter 19 --- it preserves word order in principle but is hard to train on long sequences.
3. **LSTM** (Part 7): a more sophisticated recurrent network with internal "gates" that fix the simple RNN's training problems. This is what people actually use in practice when they want a sequence model.

**Why we need a validation set.** A model that learns the training data perfectly is not necessarily a good model. Given enough capacity, a network will eventually start fitting the noise in the training set (*overfitting*), and its predictions on data it hasn't seen will get worse. To catch this we need data the model has *not* trained on. We split the reviews we train with into two pieces: the part the model actually trains on (the **training set**, 80% of them) and a held-out piece used only to check how well the model generalizes (the **validation set**, the remaining 20%).

Training happens in passes over the data. Inside one pass, the optimizer goes through the training set in small **batches** (groups of 64 reviews in our setup) and updates the weights once per batch. One full sweep through all 4,000 training reviews --- about 63 weight updates in our case --- is called one **epoch**. When we call `fit(epochs = 5)`, the training loop runs five such sweeps back to back and then stops. The network has no built-in "enough" signal; we choose the number. After each epoch we compute accuracy on both the training and validation sets, and the validation curve tells us whether we picked a sensible number. If training accuracy keeps climbing while validation accuracy plateaus or falls, we know we've trained too long and the model is overfitting --- that's our cue to lower `epochs` next time.
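
The arithmetic behind those numbers can be checked in a few lines. This is a small sketch that assumes the 5,000-review subset with 20% held out for validation used later in this chapter; the variable names are illustrative.

```{r}
n_subset   <- 5000   # reviews in the training subset
val_share  <- 0.2    # fraction held out for validation
batch_size <- 64
n_epochs   <- 5

n_val   <- round(n_subset * val_share)   # validation reviews
n_train <- n_subset - n_val              # reviews the model actually trains on

# The last batch of an epoch is smaller than 64, so round up
updates_per_epoch <- ceiling(n_train / batch_size)
total_updates     <- updates_per_epoch * n_epochs

c(n_train = n_train, n_val = n_val,
  updates_per_epoch = updates_per_epoch, total_updates = total_updates)
```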

The 25,000 **test set** reviews we set aside earlier never get touched during model development. We only look at them once at the very end, after we've made all our architecture and tuning choices, to report a clean final accuracy number.

To keep each training run fast on a laptop, we'll use a 5,000-review subset of the training data (4,000 for training + 1,000 for validation). The complete pipeline runs in a few minutes per architecture.

## Part 1: Loading and preparing the data

We'll keep only the top 10,000 most common words from the vocabulary. Less common words are replaced with a special "unknown" token. This is just a convenience: it caps the size of the embedding table we'll need later.
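
To see why the cap matters, note that the embedding table holds one vector per vocabulary entry, so its size is the vocabulary size times the embedding dimension. A quick sketch of that arithmetic, using the embedding dimensions tried later in this chapter (variable names are illustrative):

```{r}
toy_vocab_size <- 10000          # vocabulary cap used in this chapter
toy_dims       <- c(4, 16, 64)   # embedding dimensions we will try

# Number of weights in the embedding table for each dimension
toy_table_sizes <- toy_vocab_size * toy_dims
data.frame(embedding_dim = toy_dims, embedding_weights = toy_table_sizes)
```

A larger vocabulary would scale every one of these numbers up proportionally.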

```{r}
# dataset_imdb() loads the reviews; num_words caps the vocabulary
imdb <- dataset_imdb(num_words = 10000)

# imdb$train and imdb$test each hold $x (the reviews) and $y (the labels)
x_train_raw <- imdb$train$x
y_train     <- imdb$train$y
x_test_raw  <- imdb$test$x
y_test      <- imdb$test$y

length(x_train_raw)
length(x_test_raw)
```

Each review is a vector of word indices of variable length. Let's look at one:

```{r}
# The first 20 indices of review 1, and the label (1 = positive, 0 = negative)
x_train_raw[[1]][1:20]
y_train[1]
```

That's the integer-encoded form. To read it as English we need to invert the vocabulary, that is, build a lookup from integer back to word. Indices 0, 1, and 2 are reserved for special tokens (padding, start-of-sequence, unknown), so the actual word indices are offset by 3.

```{r}
# Build a reverse lookup: integer -> word
word_index <- dataset_imdb_word_index()
index_word <- setNames(names(word_index), unlist(word_index))

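# A helper to decode one review back to English.
# Reserved tokens (padding, start, unknown) have no dictionary entry, so show "?"
decode <- function(sequence) {
  words <- index_word[as.character(sequence - 3)]
  words[is.na(words)] <- "?"
  paste(words, collapse = " ")
}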
```

Now let's read a positive review (label = 1) in full:

```{r}
# First positive review in the training set
pos_idx <- which(y_train == 1)[1]
decode(x_train_raw[[pos_idx]])
```

And a negative one (label = 0):

```{r}
# First negative review in the training set
neg_idx <- which(y_train == 0)[1]
decode(x_train_raw[[neg_idx]])
```

The pattern you'll see: positive reviews lean on words like "great", "wonderful", "loved"; negative ones lean on "boring", "bad", "waste". That's the signal the network will learn to pick up.

Neural networks need fixed-length input, but reviews vary in length. Let's see how much:

```{r}
review_lengths <- sapply(x_train_raw, length)
summary(review_lengths)
```

We'll set the maximum length to 200 (longer reviews get truncated; shorter ones get padded with zeros) and take a 5,000-review subset to keep training fast.

```{r}
maxlen <- 200

# pad_sequences() turns the variable-length integer sequences into a uniform
# matrix: short reviews are padded with zeros, long ones are truncated
# (by default, both happen at the start of the sequence)
x_train <- pad_sequences(x_train_raw, maxlen = maxlen)
x_test  <- pad_sequences(x_test_raw,  maxlen = maxlen)

# Use only 5,000 reviews for training so each model trains quickly
set.seed(42)
train_idx <- sample(seq_len(nrow(x_train)), 5000)
x_train_small <- x_train[train_idx, ]
y_train_small <- y_train[train_idx]
```
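
To see what padding and truncation do to a single review, here is a small base R sketch. `toy_pad()` is a hypothetical helper written for this illustration; it mimics the default behavior of `pad_sequences()`, which pads and truncates at the start of the sequence, and it is not how `keras3` implements it.

```{r}
# Hypothetical helper: pad or truncate one sequence to exactly maxlen entries
toy_pad <- function(sequence, maxlen) {
  sequence <- tail(sequence, maxlen)               # keep only the last maxlen words
  c(rep(0L, maxlen - length(sequence)), sequence)  # fill the front with zeros
}

toy_pad(c(5L, 8L, 2L), maxlen = 5)   # short review: zeros are added in front
toy_pad(1:7, maxlen = 5)             # long review: the first two words are dropped
```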

## Setup: a single function to train and evaluate any model

For the rest of the lecture we'll build five different models and compare them. Rather than copy-pasting nearly-identical code blocks, let's write one function that builds, trains, and plots any of our architectures. Each Part below will then be a single line: call `train_model()` with the parameter you want to vary.

Read through the function carefully --- it shows you what every model in Parts 2 through 7 actually does.

```{r}
train_model <- function(arch = "average",
                        embedding_dim   = 16,
                        hidden_units    = 0,    # 0 = no hidden layer
                        recurrent_units = 32,   # used by "rnn" and "lstm"
                        dropout_rate    = 0,    # 0 = no dropout
                        epochs          = 5) {

  # Start a stack of layers; every model begins with an embedding
  model <- keras_model_sequential() %>%
    layer_embedding(input_dim = 10000, output_dim = embedding_dim)

  # Choose what happens to the sequence of word embeddings
  if (arch == "average") {
    # Bag-of-embeddings: average all 200 word vectors into one summary vector
    model <- model %>% layer_global_average_pooling_1d()
    if (hidden_units > 0) {
      model <- model %>% layer_dense(units = hidden_units, activation = "relu")
    }
    if (dropout_rate > 0) {
      model <- model %>% layer_dropout(rate = dropout_rate)
    }
  } else if (arch == "rnn") {
    # Simple RNN: read words in order, carry a memory vector forward
    model <- model %>% layer_simple_rnn(units = recurrent_units)
  } else if (arch == "lstm") {
    # LSTM: like RNN but uses gates to prevent gradients from vanishing
    model <- model %>% layer_lstm(units = recurrent_units)
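  } else {
    # Guard against typos in arch, which would otherwise build a broken model
    stop("Unknown arch: ", arch)
  }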

  # Final sigmoid output for binary classification
  model <- model %>% layer_dense(units = 1, activation = "sigmoid")

  # Standard training setup: Adam optimizer, binary cross-entropy loss
  model %>% compile(
    optimizer = "adam",
    loss      = "binary_crossentropy",
    metrics   = "accuracy"
  )

  # Train. validation_split = 0.2 reserves the last 20% as the validation set.
  history <- model %>% fit(
    x_train_small, y_train_small,
    epochs           = epochs,
    batch_size       = 64,
    validation_split = 0.2,
    verbose          = 2
  )

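  # With ggplot2 available (it is loaded by tidyverse), plot() returns a ggplot
  # object; inside a function it has to be printed explicitly to show up
  print(plot(history))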
  invisible(list(model = model, history = history))
}
```

Now every experiment below is a one-line call. To rerun with a different setting, just change the relevant argument.

## Part 2: A baseline --- logistic regression with word embeddings

Our first model is the simplest possible neural network for text: an **embedding layer**, followed by **averaging across all word positions**, followed by a single sigmoid output unit. With a single output unit and a sigmoid, **this is essentially logistic regression**, with the twist that the "features" are learned word embeddings instead of hand-crafted variables.

In `train_model()` terms: `arch = "average"`, `hidden_units = 0`.
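
The forward pass of this baseline is small enough to write out by hand. The sketch below uses a made-up embedding table and made-up output weights (all values are assumptions for illustration; a trained model would learn them), and it uses R's 1-based row indexing rather than the 0-based indices Keras uses internally.

```{r}
# Toy embedding table: 6 vocabulary entries, 3 dimensions (made-up values)
toy_table <- matrix(c( 0.2, -0.1,  0.4,
                       0.9,  0.3, -0.5,
                      -0.7,  0.8,  0.1,
                       0.0, -0.6,  0.6,
                       0.5,  0.5, -0.2,
                      -0.3,  0.1,  0.7),
                    nrow = 6, byrow = TRUE)
toy_w <- c(1.5, -2.0, 0.5)   # output-layer weights (made up)
toy_b <- 0.1                 # output-layer bias (made up)

toy_seq <- c(2, 5, 5, 3)     # a "review" of four word indices

toy_word_rows  <- toy_table[toy_seq, , drop = FALSE]  # embedding lookup: one row per word
toy_review_vec <- colMeans(toy_word_rows)             # average across word positions
toy_score      <- sum(toy_review_vec * toy_w) + toy_b # linear combination, as in logistic regression
toy_prob       <- 1 / (1 + exp(-toy_score))           # sigmoid: predicted P(positive)

toy_prob
```

The last two lines are exactly logistic regression; the only difference is that the inputs in `toy_review_vec` come from a table the network is free to adjust.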

::: {.callout-note}
## Question 2

Pick an embedding dimension below, then train the model. Try a few values: 4, 16, 64.

1.  How does the final validation accuracy change with the embedding dimension?
2.  Does a larger embedding dimension always help? Why might it not?
:::

```{r}
train_model(arch = "average", hidden_units = 0, embedding_dim = ____, epochs = 5)
```

## Part 3: Add a hidden layer

The baseline averages embeddings and applies a linear classifier. Let's add a hidden layer with a ReLU activation between the averaging step and the output, and see if the extra capacity helps.

::: {.callout-note}
## Question 3

Pick a number of hidden units below, then train. Try 4, 16, 64.

1.  How does validation accuracy compare to the no-hidden-layer baseline in Part 2, and how does it change with the number of hidden units?
2.  Look at training accuracy vs. validation accuracy. Are they close, or is one much higher? What does that suggest?
:::

```{r}
train_model(arch = "average", hidden_units = ____, epochs = 5)
```

## Part 4: Watch overfitting happen

So far we've trained for only 5 epochs. What happens if we train much longer? In Chapter 19 we said that gradient descent keeps moving weights to reduce the training loss. If we let it run for too long, the model may start to memorize the training set instead of learning general patterns --- *overfitting*, the same disease we met in Chapter 13.

::: {.callout-note}
## Question 4

Train the model from Part 3 (use `hidden_units = 64`) for many more epochs.

1.  Look at the plot. At roughly what epoch does the validation accuracy stop improving, and what is the training accuracy doing meanwhile?
2.  Why is the gap between training and validation accuracy a problem?
:::

```{r}
train_model(arch = "average", hidden_units = 64, epochs = ____)
```

## Part 5: Fight overfitting with dropout

There are several ways to fight overfitting. In Chapter 13 we saw regularization, which penalizes large coefficients. Neural networks have a different favorite tool: **dropout**.

Dropout works by randomly "turning off" some fraction of the hidden units during each training step. With dropout rate 0.5, each unit is silenced with probability 0.5, so about half of the units are off on any given step. Dropout is active only during training; when the model makes predictions, every unit is used. This forces the network to spread its learning across many units rather than relying heavily on a few, and it acts like training many smaller networks at once.
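
Here is a minimal sketch of one dropout step on a made-up vector of hidden activations. It follows the usual "inverted dropout" convention, where the surviving units are scaled up by `1 / (1 - rate)` during training so that the expected total activation stays the same; the values and variable names are illustrative.

```{r}
set.seed(123)
toy_hidden <- c(0.8, 1.2, 0.0, 2.5, 0.3, 1.7, 0.9, 0.4)  # made-up hidden activations
toy_rate   <- 0.5

# Draw a fresh random mask: 1 = keep the unit, 0 = silence it
toy_keep_mask <- rbinom(length(toy_hidden), size = 1, prob = 1 - toy_rate)

# Silence the dropped units and rescale the survivors
toy_dropped <- toy_hidden * toy_keep_mask / (1 - toy_rate)

rbind(original = toy_hidden, kept = toy_keep_mask, after_dropout = toy_dropped)
```

Rerunning the mask line without `set.seed()` gives a different pattern each time, which is what happens from one training step to the next.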

::: {.callout-note}
## Question 5

Pick a dropout rate below, then train. Try 0, 0.2, and 0.5.

1.  Compare the gap between training and validation accuracy with and without dropout. Did dropout close the gap?
2.  If a gap remained even with a generous dropout rate (say 0.5), what reasons can you think of for why dropout might not fully eliminate overfitting? Think about *which parts* of the network the dropout layer is actually regularizing.
3.  Does a higher dropout rate always help? Why might it not?
:::

```{r}
train_model(arch = "average", hidden_units = 64, epochs = 20, dropout_rate = ____)
```

## Part 6: A recurrent network --- reading reviews word by word

Every model we've built so far throws away word order. The averaging step in Parts 2--5 turns the 200 word vectors into one summary vector, which means "the movie was great, not bad" and "the movie was bad, not great" produce identical inputs to the classifier.
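
This is easy to verify directly. The sketch below assigns a made-up 2D vector to each word (hypothetical values, not learned embeddings) and averages the vectors for both sentences.

```{r}
toy_words <- c("the", "movie", "was", "great", "not", "bad")

# One made-up 2D vector per word (hypothetical values)
toy_word_vecs <- matrix(c( 0.1,  0.0,
                           0.2,  0.1,
                           0.0,  0.1,
                           0.9,  0.8,
                          -0.4,  0.2,
                          -0.8, -0.9),
                        ncol = 2, byrow = TRUE,
                        dimnames = list(toy_words, NULL))

toy_review_a <- c("the", "movie", "was", "great", "not", "bad")
toy_review_b <- c("the", "movie", "was", "bad", "not", "great")

# Average the word vectors of each review
toy_avg_a <- colMeans(toy_word_vecs[toy_review_a, ])
toy_avg_b <- colMeans(toy_word_vecs[toy_review_b, ])

all.equal(toy_avg_a, toy_avg_b)
```

Both reviews contain the same words, so the averages match and the classifier cannot tell the two apart.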

In Chapter 19 we sketched a different approach: read the review one word at a time, carrying a memory vector that accumulates context as we go. That's called a **recurrent neural network**, or RNN. The same small network is applied at each position, and its output (the updated memory) feeds into the next position.
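
The recurrence can be sketched in a few lines of base R. `toy_rnn_forward()` is a hypothetical helper for illustration, with made-up weights and a `tanh` activation; it shows only the forward pass, not training, and is not how `keras3` implements the layer.

```{r}
# Hypothetical helper: run a simple RNN over a sequence and return the final memory
toy_rnn_forward <- function(inputs, W_x, W_h, b) {
  h <- rep(0, nrow(W_h))                       # memory starts at zero
  for (t in seq_len(nrow(inputs))) {
    # The same weights are reused at every position
    h <- tanh(W_x %*% inputs[t, ] + W_h %*% h + b)
  }
  drop(h)
}

# Four word vectors of length 2, one per row (made-up values)
toy_inputs <- matrix(c( 0.9,  0.8,
                       -0.4,  0.2,
                       -0.8, -0.9,
                        0.1,  0.0),
                     ncol = 2, byrow = TRUE)

# Made-up weights for a memory vector of length 3
toy_W_x <- matrix(c(0.5, -0.3, 0.8,  0.1, -0.6, 0.4), nrow = 3)
toy_W_h <- matrix(c(0.2, -0.1, 0.0,  0.3, 0.1, -0.2,  -0.4, 0.0, 0.5), nrow = 3)
toy_b   <- c(0.0, 0.1, -0.1)

toy_h_forward  <- toy_rnn_forward(toy_inputs, toy_W_x, toy_W_h, toy_b)
toy_h_reversed <- toy_rnn_forward(toy_inputs[nrow(toy_inputs):1, ], toy_W_x, toy_W_h, toy_b)

rbind(forward = toy_h_forward, reversed = toy_h_reversed)
```

Feeding the same word vectors in reverse order produces a different final memory, which is precisely the property the averaging models lack.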

::: {.callout-note}
## Question 6

Pick a number of RNN units below, then train. Try 32. Compare the validation accuracy to your best averaging model from Parts 2--5.

1.  Did the RNN improve validation accuracy? Compare both the level and the train/val gap.
2.  RNNs are *supposed* to be better at text because they preserve word order. Why might they not dramatically outperform averaging on IMDB?
:::

```{r}
train_model(arch = "rnn", recurrent_units = ____, epochs = 10)
```

## Part 7: A better sequence model --- the LSTM

The simple RNN had a hard time. The problem is **vanishing gradients**: when the network tries to learn from a 200-word review, the gradient signal has to travel back through 200 timesteps during backpropagation. At each step the gradient is typically multiplied by a factor smaller than 1, and after that many multiplications the signal is effectively zero. The early words in a review then barely influence the model's weights, so the network struggles to learn long-range patterns no matter how long we train it.
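
A quick calculation shows how fast the signal fades. The sketch assumes, purely for illustration, that the gradient shrinks by the same factor of 0.9 at every timestep; in a real network the factor varies from step to step.

```{r}
toy_factor <- 0.9                 # assumed shrink factor per timestep
toy_steps  <- c(1, 10, 50, 200)   # how many timesteps back we look

# Fraction of the gradient signal that survives after that many steps
toy_signal <- toy_factor^toy_steps
data.frame(timesteps_back = toy_steps, remaining_gradient = signif(toy_signal, 3))
```

After 200 steps less than one billionth of the original signal remains, even though each individual step removes only 10%.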

The **LSTM** (Long Short-Term Memory) was invented in 1997 specifically to fix this. It has the same conceptual setup as a simple RNN --- read one word at a time, carry memory forward --- but it uses internal "gates" that decide what to keep, what to forget, and what to output at each step. The gates let useful information flow through many timesteps without being crushed. You don't need the gate equations for this lecture; just know that LSTM is what people actually use in practice when they want a sequence model.

::: {.callout-note}
## Question 7

Pick a number of LSTM units below, then train. Try 32.

1.  How does the LSTM compare to the other two architectures (simple RNN from Part 6, and your best averaging model)? If it didn't dramatically beat them, what could explain that? Recall what kind of words tend to determine a movie review's sentiment.
:::

```{r}
train_model(arch = "lstm", recurrent_units = ____, epochs = 5)
```

## Final evaluation on the test set

You've now seen three architectures and their validation accuracies. Look back at your results, pick the architecture with the best validation accuracy, and retrain that model. We'll then evaluate it once on the 25,000-review test set we set aside at the very beginning. This is our clean, unbiased estimate of how well the chosen model will generalize to new reviews --- the number we report to the world.

```{r}
# Retrain whichever architecture gave you the best validation accuracy.
# The example below is for the averaging + dropout model from Part 5;
# replace the arguments if a different architecture won.
result <- train_model(arch = "average", hidden_units = 64,
                      epochs = 20, dropout_rate = 0.5)

# evaluate() runs the trained model on the test set and returns loss and accuracy
result$model %>% evaluate(x_test, y_test, verbose = 0)
```
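
To demystify the two numbers `evaluate()` reports, the sketch below computes accuracy and binary cross-entropy by hand for four made-up predictions. The probabilities and labels are invented for illustration, and the 0.5 cutoff is the usual convention for turning a probability into a predicted label.

```{r}
toy_probs  <- c(0.9, 0.2, 0.6, 0.4)   # made-up predicted P(positive)
toy_labels <- c(1, 0, 0, 1)           # made-up true labels

# Accuracy: share of reviews where the thresholded prediction matches the label
toy_pred     <- as.integer(toy_probs > 0.5)
toy_accuracy <- mean(toy_pred == toy_labels)

# Binary cross-entropy: average negative log-probability assigned to the true label
toy_loss <- -mean(toy_labels * log(toy_probs) +
                  (1 - toy_labels) * log(1 - toy_probs))

c(accuracy = toy_accuracy, loss = toy_loss)
```

Accuracy only counts right and wrong answers, while the loss also penalizes predictions that are confidently wrong, so the two numbers can move differently.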

## Part 8: Reflection

::: {.callout-note}
## Question 8

In your own words, answer the following:

1.  Compare the three architectures you built today: averaging (Parts 2--5), simple RNN (Part 6), and LSTM (Part 7). What did each do well, and what was each one's main weakness?
2.  Suppose you wanted to apply this same pipeline to **Federal Reserve statements**, trying to predict whether each statement is hawkish or dovish. Which architecture would you reach for, and why? Recall the "but" example from Chapter 19.
:::

