19 Deep Learning Part 1
How one neuron works
Before architecture or math, the picture. A neural network is built from tiny units called neurons. Each one works like a simple switch: it either passes its signal forward or stays silent.
A neuron takes a few numbers as input. It multiplies each input by a weight, adds them up, and adds a bias. That sum is the neuron's signal. Then the neuron applies a rule: if the signal is positive, pass it forward; if not, stay quiet.
That rule is called an activation function. The simplest and most common one is called ReLU: output the signal if it's positive, output zero otherwise. When a neuron's output is nonzero, we say it fires. When it outputs zero, it's silent.
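As a minimal sketch of the description above, here is one neuron in plain Python. The weights, bias, and inputs are hypothetical values chosen only for illustration; a trained network would learn its own.

```python
def relu(z):
    """ReLU: pass positive signals through, output zero otherwise."""
    return max(0.0, z)

def neuron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, followed by ReLU."""
    signal = sum(w * x for w, x in zip(weights, inputs)) + bias
    return relu(signal)

# Hypothetical weights and inputs, chosen only for illustration.
weights = [1.0, -2.0]
bias = 0.5
fires = neuron([3.0, 1.0], weights, bias)   # signal = 3 - 2 + 0.5 = 1.5
silent = neuron([1.0, 2.0], weights, bias)  # signal = 1 - 4 + 0.5 = -2.5
print(fires, silent)
```

The first input produces a positive signal, so the neuron fires and outputs that signal. The second produces a negative signal, so the neuron outputs zero.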
What the neuron is "listening for" depends entirely on its weights. A neuron with large weights on income and credit score fires for low-risk applicants. A neuron with a large weight on square footage and another on bedrooms fires for family-sized houses. Each neuron picks out a specific pattern in the inputs.
That's the whole engine. A network is just many neurons stacked in layers, each one firing or staying silent based on what its inputs look like. The next sections explain why this simple machinery can do things linear regression cannot.
Why we need neural networks: linear models can't discover combinations
The linear models we've used so far, linear regression and logistic regression, share one hidden assumption: each input column has its own independent effect on the prediction. Higher income shifts the model's weighted sum by a fixed amount. Higher debt-to-income shifts it by another fixed amount. Effects add up. (Decision trees, random forests, and KNN relax this assumption in their own ways; Part 3 returns to how they compare.)
Many real predictions don't work that way. Take house prices. Price doesn't depend on square footage, bedrooms, zip code, and neighborhood wealth independently. What really drives price is combinations:
- Square footage and bedrooms together determine how many people can live there. A 3,000-square-foot house with 1 bedroom is worth less than one with 4 bedrooms.
- Zip code and neighborhood wealth together determine school quality. A wealthy neighborhood in a good zip code is worth far more than the sum of its parts.
A linear regression cannot discover these combinations. To use them, a human has to hand-engineer new columns — family_size_capacity, school_quality — and add them to the data. That assumes you already know which combinations matter.
A neural network learns the combinations on its own. Each neuron in a hidden layer settles on a particular combination of inputs through training. One unit might end up with weights like:
$$ \text{signal} = 2.5 \cdot \text{wealth} + 1.8 \cdot \text{zipcode} - 4.0 $$
The weights 2.5 and 1.8 mean this unit pays attention to wealth and zip code. The bias of \(-4\) acts as a threshold: the weighted sum \(2.5 \cdot \text{wealth} + 1.8 \cdot \text{zipcode}\) must exceed 4 before the signal turns positive and the unit fires. So this unit stays silent for poor neighborhoods or low-scoring zip codes, but fires strongly when both are high. The output layer reads this signal and adds, say, $50K to the predicted price whenever the unit fires.
Nobody told the network to combine wealth and zip code. Training discovered that this weighted sum, with that threshold, reduced prediction error on the data. A human looking at the trained weights might label this unit "school quality" — but the unit just learned three numbers that worked.
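The unit above can be sketched in a few lines. The weights and bias come from the example equation. The input scale is an assumption: the sketch treats wealth and zip code as scores rescaled to lie between 0 and 1, which the text does not specify.

```python
def school_quality_unit(wealth, zipcode):
    """ReLU unit with the example weights 2.5 and 1.8 and bias -4.0."""
    signal = 2.5 * wealth + 1.8 * zipcode - 4.0
    return max(0.0, signal)

# Assumption: both inputs are scores rescaled to lie between 0 and 1.
both_high = school_quality_unit(1.0, 1.0)     # signal = 0.3, unit fires
wealth_only = school_quality_unit(1.0, 0.2)   # signal = -1.14, silent
zip_only = school_quality_unit(0.2, 1.0)      # signal = -1.7, silent
print(round(both_high, 2), wealth_only, zip_only)
```

Under this scaling, neither input can clear the threshold alone, so the unit fires only when both are high. On a different input scale a single large input could clear it, so the "both high" reading depends on how the inputs are scaled.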
Other units learn other combinations: family-size capacity from square footage and bedrooms, or patterns no human would have thought to engineer. The final prediction is built from these learned combinations, not from the raw inputs.
Neural networks are useful on tabular data for this reason. They really shine on images, audio, and text — where the number of useful combinations is enormous and no human could engineer them all by hand. Later in this chapter we'll see how a network reads a Federal Reserve statement.
The rest of the chapter builds the apparatus: what a network looks like, how it computes predictions, why the threshold rule is essential, and how the network finds its own weights from training.
The housing-price example above is adapted from Stanford CS229 lecture notes (Chen, Katanforoosh, & Ng).
Part 1: The Architecture of a Neural Network
A feedforward neural network is a stack of layers. Each layer takes a vector of inputs, multiplies it by a matrix of weights, adds a bias vector, and passes the result through a nonlinear activation function. The output of one layer becomes the input to the next.
Here is a small network with 2 inputs, one hidden layer of 3 units, and 1 output:
Question 1: List all the parameters of the network shown above. You can group them as vectors or matrices to keep the answer compact. Once you have them all listed, count up the total number of individual numbers — you should get 13.
For each unit in the hidden layer, we compute a pre-activation value:
$$ z_j = w_{j1} x_1 + w_{j2} x_2 + b_j $$
and then pass it through a nonlinear activation function. A very common choice is the ReLU (rectified linear unit):
$$ \mathrm{ReLU}(z) = \max(0, z) $$
The output layer takes the hidden activations as its inputs, computes another weighted sum, and passes it through a sigmoid function to squash the result into the range \([0,1]\) so we can interpret it as a predicted probability — exactly like in logistic regression from Chapter 6:
$$ \sigma(z) = \frac{1}{1 + e^{-z}} $$
Part 2: The Forward Pass by Hand
Before we go further, let's prove to ourselves that a neural network isn't magic. We're going to compute one full forward pass by hand on a tiny network — and every single operation in it is something you already know from earlier chapters. The only new piece is the ReLU in the middle.
Notation we'll use throughout Part 2:
| Symbol | What it is |
|---|---|
| \(x\) | the input vector (with components \(x_1, x_2\)) |
| \(W_1, b_1\) | hidden-layer weights and biases |
| \(z_1\) | hidden-layer pre-activation — weighted sum before ReLU |
| \(a_1\) | hidden-layer activation — output after ReLU |
| \(W_2, b_2\) | output-layer weights and bias |
| \(z_2\) | output-layer pre-activation — weighted sum before sigmoid |
| \(\hat{y}\) | the final prediction — output after sigmoid |
Let's plug in some numbers. Suppose our input is \( x = (1, 2) \). We will set the weights of the hidden layer to:
$$ W_1 = \begin{bmatrix} 0.5 & -0.3 \\ 0.2 & 0.8 \\ -0.1 & 0.4 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0.1 \\ 0.0 \\ -0.2 \end{bmatrix} $$
Each row of \(W_1\) holds the two input-weights for one hidden unit.
Question 2: Compute the pre-activation vector \( z_1 = W_1 x + b_1 \) for the hidden layer. You should arrive at \(z_1 = (0,\; 1.8,\; 0.5)\). Fill in the blanks:
$$ z_{1,1} = (0.5)(\_\_\_) + (-0.3)(\_\_\_) + 0.1 \;=\; \_\_\_ $$
$$ z_{1,2} = (0.2)(\_\_\_) + (0.8)(\_\_\_) + 0.0 \;=\; \_\_\_ $$
$$ z_{1,3} = (-0.1)(\_\_\_) + (0.4)(\_\_\_) + (-0.2) \;=\; \_\_\_ $$
Now apply ReLU to get the hidden-layer activations \(a_1 = \mathrm{ReLU}(z_1)\). Recall the ReLU rule: keep the value if positive, replace with zero otherwise. Since none of your three values are negative, you should get \(a_1 = (0,\; 1.8,\; 0.5)\) — same as \(z_1\) in this case.
Now finish the forward pass through the output layer. Let \( W_2 = (0.6,\; -0.4,\; 0.5) \) and \( b_2 = 0.1 \).
Question 3: Compute \( z_2 = W_2 \cdot a_1 + b_2 \), then apply the sigmoid to get the final prediction \(\hat{y}\). You should arrive at \(\hat{y} \approx 0.41\). Fill in the blanks:
$$ z_2 \;=\; (0.6)(\_\_\_) \;+\; (-0.4)(\_\_\_) \;+\; (0.5)(\_\_\_) \;+\; 0.1 \;=\; \_\_\_ $$
$$ \hat{y} \;=\; \sigma(z_2) \;=\; \frac{1}{1 + e^{-(\_\_\_)}} \;=\; \_\_\_ $$
Interpret \(\hat{y}\) as a probability: in this case the model predicts about a 41% chance the label is 1. Notice what you just did — a weighted sum, a max-with-zero, another weighted sum, a sigmoid. That's the entire forward pass. Whatever a neural network does, it does with these four operations stacked together.
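To check the hand computation, the forward pass from Part 2 can be sketched in plain Python. The numbers are the ones given above; the helper names (`dot`, `forward`) are illustrative choices, not part of any library.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def forward(x, W1, b1, W2, b2):
    """One hidden ReLU layer followed by a sigmoid output unit."""
    z1 = [dot(row, x) + b for row, b in zip(W1, b1)]  # hidden pre-activations
    a1 = [relu(z) for z in z1]                        # hidden activations
    z2 = dot(W2, a1) + b2                             # output pre-activation
    return z1, a1, z2, sigmoid(z2)

x = [1.0, 2.0]
W1 = [[0.5, -0.3], [0.2, 0.8], [-0.1, 0.4]]
b1 = [0.1, 0.0, -0.2]
W2 = [0.6, -0.4, 0.5]
b2 = 0.1

z1, a1, z2, y_hat = forward(x, W1, b1, W2, b2)
print([round(z, 4) for z in z1])  # rounding hides floating-point noise
print(round(z2, 4), round(y_hat, 4))
```

The values are rounded before printing because floating-point arithmetic leaves tiny residues (the first hidden pre-activation comes out as a number on the order of \(10^{-17}\) rather than exactly zero).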
From linear regression to a neural network
A neural network isn't a different species of model — it sits at the top of a ladder of models you already know. Each rung adds one ingredient:
- Linear regression (Chapter 5): \(\hat{y} = w_1 x_1 + w_2 x_2 + b\). Weighted sum of inputs, output a number.
- Logistic regression (Chapter 6): wrap the output in a sigmoid: \(\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)\). Same weighted sum, now squashed to a probability in \([0, 1]\).
- One-hidden-layer network (this chapter): instead of feeding the inputs directly to the sigmoid, first run them through a hidden layer of ReLU units. Each hidden unit computes its own weighted sum and applies ReLU; the output layer then takes a weighted sum of those and applies sigmoid.
- Deep network: stack more hidden layers between the inputs and the output.
Every step adds exactly one ingredient (sigmoid, hidden layer, depth), and at every step the underlying operation is still just weighted sums and nonlinearities. The first three rungs fit in one line:
$$ \underbrace{\hat{y} = w_1 x_1 + w_2 x_2 + b}_{\text{linear regression}} \;\xrightarrow{\text{add sigmoid}}\; \underbrace{\hat{y} = \sigma(w_1 x_1 + w_2 x_2 + b)}_{\text{logistic regression}} \;\xrightarrow{\text{add hidden ReLU layer}}\; \underbrace{\hat{y} = \sigma\bigl(W_2 \cdot \mathrm{ReLU}(W_1 x + b_1) + b_2\bigr)}_{\text{neural network}} $$
The ladder works in reverse too. Strip the ReLU out of the neural network and you get logistic regression. Strip the sigmoid out of that and you get linear regression. A linear regression is just a neural network with no hidden layers and no activation functions. Everything you learned in Chapters 5 and 6 is a special case of what we are building now.
Here is the same idea drawn out for a minimal network — 2 inputs, 2 hidden units, 1 output, no activation function anywhere:
On the left, the network has six weights (\(w_{11}, w_{12}, w_{21}, w_{22}, v_1, v_2\)) plus three biases — nine parameters in total. With no activation function, all that machinery does the same job as the two-input linear regression on the right. The six weights collapse into just two effective coefficients: \(\beta_1 = v_1 w_{11} + v_2 w_{21}\) and \(\beta_2 = v_1 w_{12} + v_2 w_{22}\). The extra parameters add nothing — the network is just a longer way of writing the same linear formula.
This is why the nonlinearity matters. Without the ReLU, the hidden layer's output is just another linear combination of the inputs, which is just another linear regression. Extra weights, no extra expressive power. The ReLU is what lets each hidden unit construct a genuinely new feature — like the "school quality" unit from the opening — and that's what makes the network capable of patterns a linear model cannot represent.
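The collapse can be checked numerically. The sketch below builds the 2-2-1 network with no activation function and compares it with a linear regression that uses the collapsed coefficients \(\beta_1\) and \(\beta_2\). The parameter values, the output bias name `c`, and the intercept name `beta0` are hypothetical choices made for illustration.

```python
def linear_network(x1, x2, W, b, v, c):
    """2 inputs -> 2 hidden units -> 1 output, with no activation anywhere."""
    h1 = W[0][0] * x1 + W[0][1] * x2 + b[0]
    h2 = W[1][0] * x1 + W[1][1] * x2 + b[1]
    return v[0] * h1 + v[1] * h2 + c

# Hypothetical parameter values, chosen only for illustration.
W = [[1.0, 2.0], [3.0, -1.0]]   # rows: (w11, w12) and (w21, w22)
b = [0.5, -0.5]                 # hidden biases
v = [2.0, 1.0]                  # output weights v1, v2
c = 0.25                        # output bias

# Collapsed coefficients from the text, plus the matching intercept.
beta1 = v[0] * W[0][0] + v[1] * W[1][0]
beta2 = v[0] * W[0][1] + v[1] * W[1][1]
beta0 = v[0] * b[0] + v[1] * b[1] + c

test_points = [(0.0, 0.0), (1.0, 2.0), (-3.0, 4.0)]
for x1, x2 in test_points:
    net = linear_network(x1, x2, W, b, v, c)
    reg = beta1 * x1 + beta2 * x2 + beta0
    print(net, reg)  # the two columns agree at every point
```

The three biases collapse in the same way as the weights: they fold into a single intercept.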
Part 3: Why Hidden Layers? Why "Deep"?
To make the case for hidden layers concrete, here is a famous toy problem that logistic regression cannot solve: the XOR function. There are four input points, each with its own label:
| \(x_1\) | \(x_2\) | label \(y\) |
|---|---|---|
| 0 | 0 | 0 |
| 0 | 1 | 1 |
| 1 | 0 | 1 |
| 1 | 1 | 0 |
The label is 1 if the two inputs differ, and 0 if they match. If you plot these four points on a grid, with \((0,0)\) and \((1,1)\) as one class and \((0,1)\) and \((1,0)\) as the other, you can see that no straight line separates them. Logistic regression, which can only draw straight decision boundaries, can never classify all four points correctly: any straight line gets at most three of the four right.
A network with a single hidden layer of just two ReLU units can solve XOR. Here is one set of weights that works:
$$ W_1 = \begin{bmatrix} 1 & -1 \\ -1 & 1 \end{bmatrix}, \quad b_1 = \begin{bmatrix} 0 \\ 0 \end{bmatrix}, \quad W_2 = \begin{bmatrix} 10 & 10 \end{bmatrix}, \quad b_2 = -5 $$
Question 4: Verify that these weights solve XOR by computing the forward pass for each of the four points. You should arrive at predictions \(\hat{y} \approx (0.007,\; 0.993,\; 0.993,\; 0.007)\) — i.e., the network classifies the four points as \(0, 1, 1, 0\), matching the XOR labels.
For each input \(x = (x_1, x_2)\), apply the forward pass step by step:
$$ z_1 = W_1 x + b_1 = (x_1 - x_2,\; -x_1 + x_2) $$
$$ a_1 = (h_1, h_2) = \mathrm{ReLU}(z_1) $$
$$ z_2 = 10\, h_1 + 10\, h_2 - 5 $$
$$ \hat{y} = \sigma(z_2) = \frac{1}{1 + e^{-z_2}} $$
Fill in the table:
| \(x_1\) | \(x_2\) | \(a_1 = (h_1, h_2)\) | \(z_2\) | \(\hat{y}\) | true \(y\) |
|---|---|---|---|---|---|
| 0 | 0 | (___, ___) | ___ | ___ | 0 |
| 0 | 1 | (___, ___) | ___ | ___ | 1 |
| 1 | 0 | (___, ___) | ___ | ___ | 1 |
| 1 | 1 | (___, ___) | ___ | ___ | 0 |
After filling in the table, look at when each hidden unit "fires" (output > 0). Each one detects a specific pattern — describe in one sentence what pattern each is looking for.
This is the central idea of neural networks: hidden units learn features, and the output layer combines them into a final decision.
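Once the table is filled in by hand, a short script can confirm the final predictions. This sketch hard-codes the XOR weights given above; the 0.5 cutoff for turning a probability into a class is an assumed convention, not something the text fixes.

```python
import math

def relu(z):
    return max(0.0, z)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def xor_net(x1, x2):
    """Forward pass with W1 = [[1, -1], [-1, 1]], b1 = 0, W2 = [10, 10], b2 = -5."""
    h1 = relu(x1 - x2)
    h2 = relu(-x1 + x2)
    z2 = 10.0 * h1 + 10.0 * h2 - 5.0
    return sigmoid(z2)

points = [(0, 0), (0, 1), (1, 0), (1, 1)]
labels = [0, 1, 1, 0]
predictions = [xor_net(x1, x2) for x1, x2 in points]
classes = [1 if p > 0.5 else 0 for p in predictions]  # assumed 0.5 cutoff

for (x1, x2), p, y in zip(points, predictions, labels):
    print(x1, x2, round(p, 3), y)
```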
One hidden layer dramatically expands what the model can represent. Going "deep" — stacking more layers — lets the model build features from raw data. In an image classifier, early layers detect edges, middle layers detect shapes, later layers detect whole digits. No human had to tell the network what an edge is.
But when is deep learning actually worth it? Decision trees solve XOR easily, and a random forest is hard to beat on tabular data. To see where neural networks pull ahead, take an economic application that involves text: predicting how the 10-year Treasury yield moves after a Federal Reserve FOMC statement. The interesting question is how each kind of model processes a sentence — neither can read text directly, but they get the numbers they need in very different ways.
How a tree (or random forest) handles text. A human picks a hawkish word list (elevated, persistent, tightening) and a dovish word list (moderated, accommodative, patient), and counts how many words from each list appear in the statement. The text becomes a small table — one row per statement, a column for each count. The tree splits on these counts just like it splits on income in the mortgage data. The order of the words is thrown away.
How a neural network handles text. Recall the architecture from Part 1: inputs feed through weighted connections into a hidden layer with ReLU, producing an output. For text, we use the same architecture, but with two sources of input — the current word and a memory vector carried from the previous step — and the output becomes the updated memory:
That is one timestep — one application of the network. To read a sentence, we apply that same processor once per word, threading the memory forward. Crucially, the weights inside the processor do not change as we move through the sentence; the same network is reused at every position:
Look at the green arrows. Each step's memory output feeds in as input to the next step. By the time the processor reaches "elevated," the memory already encodes "we saw a positive statement, then a contrastive 'but'" — so "elevated" gets processed in a context that flags it as the meaningful clause. Run the same processor on the reversed sentence and you get a different memory state, because the order of words was different.
Question 5: When are neural networks really worth it? Consider these two sentences, both plausibly from an FOMC statement:
"Inflation has moderated but remains elevated."
"Inflation remains elevated but has moderated."
Both sentences contain the exact same words: one hawkish word (elevated), one dovish word (moderated). A bag-of-words model — the kind a random forest sees — produces the same prediction for both. But markets read them very differently. The first emphasizes that inflation is still a problem (hawkish, yields up). The second emphasizes that inflation is coming down (dovish, yields down). What comes after the "but" is what matters. A neural network reading word-by-word can learn a feature like "what is the sentiment of the clause after the conjunction?" In one or two sentences, explain why a random forest on word-counts cannot learn this feature, no matter how many trees you grow.
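The contrast can be made concrete with a deliberately tiny sketch. Everything in it is hypothetical: each word is reduced to a single number, the memory is a single number, and the weights (0.5 on the memory, 1 on the word) are picked by hand rather than learned. It is meant only to show that word counts ignore order while a processor that carries memory forward does not.

```python
from collections import Counter

# Hypothetical one-number value per word: positive leans hawkish,
# negative leans dovish, anything else is neutral (0.0).
WORD_VALUE = {"moderated": -1.0, "elevated": 1.0}

def tokens(sentence):
    return sentence.lower().rstrip(".").split()

def step(memory, word):
    """One timestep: combine the current word with the carried memory."""
    x = WORD_VALUE.get(word, 0.0)
    return max(0.0, 0.5 * memory + x)  # same weights at every position

def read(sentence):
    memory = 0.0
    for word in tokens(sentence):
        memory = step(memory, word)
    return memory

s1 = "Inflation has moderated but remains elevated."
s2 = "Inflation remains elevated but has moderated."

same_counts = Counter(tokens(s1)) == Counter(tokens(s2))
print(same_counts, read(s1), read(s2))
```

The word counts are identical, yet the final memory differs between the two sentences because the last clause dominates what the memory holds. A real recurrent network would use vectors and learned weights, but the order sensitivity comes from the same mechanism.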
Choosing the architecture
We've seen why hidden layers and nonlinear activations matter. Next question: how many hidden layers, and how many units in each? Three knobs, each with a real tradeoff.
Width — how many units per hidden layer. Each unit learns one pattern (recall the two XOR units). A wider layer can detect more patterns at once. Too few units, the layer cannot represent what the problem needs (underfitting). Too many, the layer memorizes noise (overfitting). Typical: 32–256 units for tabular data; much larger for image classification.
Depth — how many hidden layers. Stacking layers builds hierarchical features. In an image classifier, layer 1 detects edges, layer 2 combines edges into shapes, layer 3 combines shapes into parts. More depth = more abstraction, but harder training. Typical: 2–3 layers for tabular problems; dozens or hundreds for large image and language models.
Activation function. ReLU is the default for hidden layers — cheap, well-behaved, works in practice. For the output: sigmoid for binary classification, softmax for multi-class, no activation for predicting a continuous number.
In practice, architecture choice is part principle, part empirical. The principles above tell you the right shape for your problem. The exact numbers usually come from starting with a known-good baseline and adjusting based on how training goes.
Part 4: How a Network Sets Up to Learn
So far we've chosen weights by hand. A real network has thousands or millions of weights, and nobody picks those by hand. The network has to find them on its own, by looking at training data and adjusting. This section shows how.
The loss function: measuring how wrong the model is
First we need a score for the predictions. The standard choice for binary classification is binary cross-entropy:
$$ L(y, \hat{y}) = -\bigl[\, y \log(\hat{y}) + (1-y)\log(1-\hat{y}) \,\bigr] $$
The loss is small when the model is confidently right, large when it is confidently wrong. If \(y = 1\), it reduces to \(-\log(\hat{y})\): small when \(\hat{y}\) is near 1, explodes near 0. If \(y = 0\), the same logic reversed.
Question 6: Suppose a network predicts \(\hat{y} = 0.7\). Compute the loss for two cases: (a) the true label is \(y = 1\); (b) the true label is \(y = 0\). You should arrive at (a) \(L \approx 0.36\) and (b) \(L \approx 1.20\). Fill in the blanks:
$$ \text{(a)}\quad L = -[\,1 \cdot \log(\_\_\_) + 0 \cdot \log(1 - \_\_\_)\,] = -\log(\_\_\_) = \_\_\_ $$
$$ \text{(b)}\quad L = -[\,0 \cdot \log(\_\_\_) + 1 \cdot \log(1 - \_\_\_)\,] = -\log(\_\_\_) = \_\_\_ $$
Case (b) gives the larger loss. The model "got more wrong": it predicted 0.7 when the true label was 0.
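The two cases can be reproduced with a few lines of Python. The sketch assumes the logarithm is the natural log, which is consistent with the values 0.36 and 1.20 quoted in the question.

```python
import math

def binary_cross_entropy(y, y_hat):
    """Loss for one example; y is 0 or 1, y_hat is strictly between 0 and 1."""
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

loss_a = binary_cross_entropy(1, 0.7)  # true label 1
loss_b = binary_cross_entropy(0, 0.7)  # true label 0
print(round(loss_a, 2), round(loss_b, 2))

# A confidently wrong prediction is punished much harder.
confident_wrong = binary_cross_entropy(1, 0.01)
print(round(confident_wrong, 2))
```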
Gradient descent: the rule for updating one weight
How do we change the weights to make the loss smaller? Imagine standing on a foggy hillside, trying to reach the valley. You can't see the landscape, but you can feel which way the ground slopes. The strategy: step downhill, feel again, step again, repeat.
That's what a network does. For each weight \(w\), it computes the slope of the loss with respect to that weight — the gradient \(\partial L / \partial w\) — and applies:
$$ w_{\text{new}} = w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w} $$
If increasing \(w\) makes the loss go up, the rule decreases \(w\). If increasing \(w\) makes the loss go down, the rule increases \(w\). The learning rate \(\eta\) sets the step size — too large and we overshoot; too small and learning crawls.
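To see the effect of the learning rate, the sketch below applies the update rule to a toy loss \(L(w) = (w - 3)^2\), whose minimum sits at \(w = 3\). The toy loss and the three learning rates are assumptions made for illustration; they are not taken from the network in this chapter.

```python
def descend(w, learning_rate, steps):
    """Gradient descent on the toy loss L(w) = (w - 3)^2, whose slope is 2(w - 3)."""
    for _ in range(steps):
        gradient = 2.0 * (w - 3.0)
        w = w - learning_rate * gradient
    return w

good = descend(w=0.0, learning_rate=0.1, steps=50)     # settles near 3
slow = descend(w=0.0, learning_rate=0.001, steps=50)   # barely moves
wild = descend(w=0.0, learning_rate=1.1, steps=50)     # overshoots further each step
print(round(good, 3), round(slow, 3), abs(wild - 3.0) > 3.0)
```

With the moderate rate the weight settles near the minimum. With the tiny rate it has hardly left its starting point after 50 steps. With the large rate each step jumps past the minimum and lands farther away than it started.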
Computing one gradient by hand
To make this concrete, take the simplest possible network — a single weight \(w\), a single bias \(b\), a single input \(x\), and a sigmoid output. This is just logistic regression from Chapter 6, which is the point: gradient descent for a one-layer network is how you would fit a logistic regression.
The forward pass:
$$ z = wx + b, \qquad \hat{y} = \sigma(z) = \frac{1}{1 + e^{-z}} $$
To get \(\partial L / \partial w\), use the chain rule. The weight \(w\) affects \(z\), which affects \(\hat{y}\), which affects \(L\):
$$ \frac{\partial L}{\partial w} = \underbrace{\frac{\partial L}{\partial \hat{y}}}_{\text{from the loss}} \cdot \underbrace{\frac{\partial \hat{y}}{\partial z}}_{\text{from the sigmoid}} \cdot \underbrace{\frac{\partial z}{\partial w}}_{\text{from } z = wx + b} $$
Each piece takes a few lines of calculus. When you multiply all three together, the messy \(\hat{y}(1-\hat{y})\) terms cancel and the result simplifies dramatically:
$$ \frac{\partial L}{\partial w} = (\hat{y} - y) \cdot x $$
That cancellation is not an accident. Binary cross-entropy pairs naturally with sigmoid outputs because it produces this clean form. Squared error would leave the \(\hat{y}(1-\hat{y})\) factor in place and slow training whenever \(\hat{y}\) was near 0 or 1.
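The gradient is just the prediction error scaled by the input. For a positive input \(x\): if the model overpredicted, the gradient is positive and the weight goes down; if it underpredicted, the gradient is negative and the weight goes up. A negative input flips both signs.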
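One way to gain confidence in the formula \((\hat{y} - y) \cdot x\) is to compare it with a numerical slope. The sketch below nudges \(w\) slightly up and down and measures how the loss changes (a central finite difference). The values of `w`, `b`, `x`, and `y` are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def loss(w, b, x, y):
    """Binary cross-entropy of the one-weight model y_hat = sigmoid(w*x + b)."""
    y_hat = sigmoid(w * x + b)
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical values, chosen only for illustration.
w, b, x, y = 0.5, -0.2, 2.0, 1

analytic = (sigmoid(w * x + b) - y) * x   # (y_hat - y) * x

eps = 1e-6                                # central finite difference
numeric = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
print(round(analytic, 6), round(numeric, 6))
```

The two numbers agree to several decimal places, which is what the chain-rule derivation predicts.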
Question 7: Suppose \(x = 2\), the true label is \(y = 1\), and the model currently predicts \(\hat{y} = 0.7\). With a learning rate \(\eta = 0.1\) and current weight \(w_{\text{old}} = 0.5\), compute the new value of \(w\) after one gradient descent step. You should arrive at \(w_{\text{new}} = 0.56\). Fill in the blanks:
$$ \frac{\partial L}{\partial w} \;=\; (\hat{y} - y) \cdot x \;=\; (\_\_\_ - \_\_\_)(\_\_\_) \;=\; \_\_\_ $$
$$ w_{\text{new}} \;=\; w_{\text{old}} - \eta \cdot \frac{\partial L}{\partial w} \;=\; \_\_\_ - (0.1)(\_\_\_) \;=\; \_\_\_ $$
In one or two sentences, interpret how this update nudged the weight toward a better prediction: did \(w\) go up or down, and why does that direction shrink the loss the next time the model sees this example?
Putting it all together: backpropagation
A deep network has one partial derivative per weight — thousands or millions of them. The full list is the gradient vector \(\nabla L\). Backpropagation is the algorithm that computes the whole vector in one backward sweep, applying the chain rule layer by layer. The training loop is then:
- Forward pass — compute \(\hat{y}\).
- Compute the loss \(L(y, \hat{y})\).
- Backpropagate — get the full gradient vector \(\nabla L\) in one backward sweep.
- Update — apply \(w_i^{\text{new}} = w_i^{\text{old}} - \eta \cdot \partial L / \partial w_i\) to every weight at the same time. Then repeat on the next example.
Every weight updates after every example. After enough iterations on enough data, the weights stop changing meaningfully — the network has learned.
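The four steps can be sketched as a loop on the one-weight model from the previous section. For that model the backward sweep is a single chain-rule product, so no general backpropagation code is needed; a deeper network would repeat the same product layer by layer. The toy data, the starting values, and the number of passes are assumptions for illustration. The bias gradient \(\hat{y} - y\) follows from the same derivation with \(\partial z / \partial b = 1\).

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce(y, y_hat):
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))

# Hypothetical toy data: negative inputs are labelled 0, positive inputs 1.
xs = [-2.0, -1.0, 1.0, 2.0]
ys = [0, 0, 1, 1]

def mean_loss(w, b):
    return sum(bce(y, sigmoid(w * x + b)) for x, y in zip(xs, ys)) / len(xs)

w, b, eta = 0.0, 0.0, 0.1
initial_loss = mean_loss(w, b)

for epoch in range(100):
    for x, y in zip(xs, ys):
        y_hat = sigmoid(w * x + b)      # 1. forward pass
        loss = bce(y, y_hat)            # 2. loss (tracked only for monitoring)
        grad_w = (y_hat - y) * x        # 3. gradients via the chain rule
        grad_b = (y_hat - y)
        w = w - eta * grad_w            # 4. update every parameter
        b = b - eta * grad_b

final_loss = mean_loss(w, b)
print(round(initial_loss, 3), round(final_loss, 3))
```

The mean loss after training is lower than at the start, and the learned weight is positive, which matches the pattern in the toy data.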
Question 8: In your own words, answer two questions: (a) what does the forward pass compute, and why does it need a nonlinear activation function like ReLU? (b) what does training do — describe the role of the loss function and the role of gradient descent.