Backpropagation with asymmetric weights
A number of recent papers have explored learning in deep neural networks without backpropagation, often motivated by the apparent biological implausibility of backpropagation.
Supposedly, the brain's neurons can only transmit electrical impulses in a single direction; neurons have no way of "communicating" error backwards.
So while backpropagation is an effective method for training neural networks, it is not a reasonable analogue for biological learning.
Now I haven't the faintest clue about neuroscience, so I can't comment on whether machine learning does or should replicate biological learning.
But I do find "learning without backpropagation" very intriguing for two practical reasons.
First, since the gradients at $layer_{t-1}$ depend on the gradients at $layer_{t}$, backpropagation isn't an inherently parallelizable algorithm. This is a practical bottleneck for efficiently training larger networks.
Second, backpropagation is still computationally expensive. Even where the backwards pass requires only as many operations as the forward pass, this still doubles the effective training cost for a given network.
So is it possible to train a network with conventional gradient-based methods while minimizing the downsides to backpropagation?
According to the theory of feedback alignment, the answer is yes!
At a high level, feedback alignment means using random weights to communicate an error signal back through a network, rather than reusing the same weights during the forward and backward passes.
I found this pretty counterintuitive when I first came across the idea. How can you train a network if the error signal is essentially random? Not that I didn't believe the paper, but I needed to work through it from scratch to really understand what's going on.
Backpropagation refresher
Let's revisit the theory behind backpropagation. I'm going to assume some basic familiarity - if you need a refresher, there are dozens of existing tutorials. I recommend Chapter 2 of Michael Nielsen's book.
Let's take a basic neural network with $t$ layers and a single real-valued output.
The output layer and loss function can be depicted as follows:
At the end of the forward pass, the network outputs a scalar prediction $\hat{y}$. A loss function $L$ is applied to $\hat{y}$ and $y$, returning a scalar representing the error of the network.
During backpropagation, we first calculate the gradient of this loss with respect to the weights at layer $t-1$ (i.e. $W_{t-1}$).
We'll see shortly that this is used to calculate the error at $layer_{t-2}$.
In other words, we propagate the total network loss back through every intermediate layer in the network - hence the term backpropagation.
But what does this mean mathematically?
The network's output is calculated by multiplying $W_{t-1}$ by the activation outputs from the previous layer ($z_{t-2}$).
$$ \hat{y} = W_{t-1}z_{t-2} $$
For stochastic gradient descent with learning rate $\eta$, the update rule for $W_{t-1}$ is as follows:
$$ W_{t-1} := W_{t-1} - \eta \frac{\partial L}{\partial W_{t-1}} $$
Per the chain rule:
$$ \frac{\partial L}{\partial W_{t-1}} = \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial W_{t-1}} = \frac{\partial L}{\partial \hat{y}} z_{t-2} $$
This follows because the derivative of $(1)$ with respect to $W_{t-1}$ is just $z_{t-2}$.
Now let's look at $z_{t-2}$, the activation outputs from the preceding layer.
$$ z_{t-2} = \sigma(W_{t-2}z_{t-3}) $$
The update rule for $W_{t-2}$ is as follows:
$$ W_{t-2} := W_{t-2} - \eta \frac{\partial L}{\partial W_{t-2}} $$
And again, by the chain rule:
$$ \frac{\partial L}{\partial W_{t-2}}= \frac{\partial L}{\partial \hat{y}} \frac{\partial \hat{y}}{\partial z_{t-2}}\frac{\partial z_{t-2}}{\partial W_{t-2}}$$
Since
$$ z_{t-2} = \sigma(W_{t-2}z_{t-3}) $$
then
$$ \frac{\partial z_{t-2}}{\partial W_{t-2}} = \sigma'(W_{t-2}z_{t-3})z_{t-3} $$
where $\sigma$ is some non-linear activation function and $\sigma'$ is its derivative.
This gives us an expression for the last component of $(6)$. But what about the second component - i.e. $\frac{\partial \hat{y}}{\partial z_{t-2}}$?
During the back pass for the last layer, we calculated $\frac{\partial \hat{y}}{\partial W_{t-1}}$ (i.e. the change in network output with respect to the last layer's weights) but not $\frac{\partial \hat{y}}{\partial z_{t-2}}$ (the change with respect to the penultimate activation output).
Fortunately, this follows directly from $(1)$: since $\hat{y} = W_{t-1}z_{t-2}$, the derivative of the output with respect to $z_{t-2}$ is just the (transposed) weight matrix.
$$ \frac{\partial \hat{y}}{\partial z_{t-2}} = W_{t-1}^\intercal $$
Substituting $(8)$ and $(9)$ into $(6)$, we now have:
$$ \frac{\partial L}{\partial W_{t-2}} = \frac{\partial L}{\partial \hat{y}}W_{t-1}^\intercal\sigma'(W_{t-2}z_{t-3})z_{t-3} $$
This line is the key to understanding feedback alignment.
Conventional backpropagation pushes the error "back" through the network via multiplication with the transpose of that layer's weight matrix.
In other words, the weights used are symmetric between forward and backward passes.
I've only depicted backpropagation from the last (linear) to the penultimate (non-linear) layer, but the same applies at every layer in a deep network - the transpose of the weight matrix is used at each step to propagate the error backwards.
"Weight symmetry" is therefore just a way of saying "the chain rule requires multiplication by the weight matrix".
Now what's interesting is considering whether a network can learn without this weight symmetry. What happens if we replace the transposed weight matrix with purely random values?
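Concretely, feedback alignment swaps $W_{t-1}^\intercal$ in the expression above for a fixed random matrix $B$ of the same shape, so the update for $W_{t-2}$ uses

$$ \frac{\partial L}{\partial \hat{y}}\, B\, \sigma'(W_{t-2}z_{t-3})\, z_{t-3} $$

in place of the true gradient. The forward pass and the local derivative $\sigma'$ stay exactly the same; only the backward path changes.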
Intuitively, you might expect that the network would simply fail to learn. That was certainly my initial reaction.
But you only need to read the title of the feedback alignment paper - "Random feedback weights support learning in deep neural networks". Neural networks can still learn, even when completely random weights are used during the backwards pass!
Implementing asymmetric weight transfer
It's difficult to believe, which is why I had a stab at coding an asymmetric network from scratch, to show that random feedback weights can indeed learn XOR.
Here's a link to the full code, but I'll go through it here line-by-line.
I've used my preferred language (F#), so I'll try to explain the F#-specific syntax where I've used it.
To make sure I really understood what was going on, I tried to keep the code as close as possible to the maths. This meant avoiding automagic DL frameworks like Keras/PyTorch, but I did use the MathNet library for the underlying matrix operations.
As this was only an exercise, it's all hand-coded, so there's no support for custom loss functions, optimizers, operations or even layers.
The objective
Ultimately, we're experimenting to see if a basic neural network can learn an arbitrary function with random weights used during the backpass.
XOR seemed a convenient test function - it's not linearly separable, and therefore can only be learned by a non-linear model.
Let's start with the main invocation code.
{{< highlight fsharp >}}
let nn = NN(10)
let rnd = System.Random()
let inputs = [ 0.0,1.0,1.0; 1.0,0.0,1.0; 1.0,1.0,0.0; 0.0,0.0,0.0 ]
let next () = List.item (rnd.Next inputs.Length) inputs
{{< / highlight >}}
This is just basic setup code. We first initialize a neural network object, the XOR inputs/output tuples and a function that returns a random XOR input sample.
Next, the training/evaluation loop:
{{< highlight fsharp >}}
let iterations = 5000
for i in seq { 0..iterations } do
    let (x1,x2,y1) = next()
    let x = array2D [ [x1]; [x2] ] |> DenseMatrix.ofArray2
{{< / highlight >}}
We run 5000 training iterations, with a single input/output tuple at each iteration.
The last line is just to ensure the input can be matrix-multiplied with the weights in the first layer. Repeating this on every iteration is very wasteful, but for this experiment/problem, I prefer readability over efficiency.
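If you did want to avoid the repeated conversion, one option (not what's done here) is to convert the four XOR tuples to matrices once, before the loop. A quick sketch, reusing the `inputs` list from above (the name `inputsAsMatrices` is just illustrative):

{{< highlight fsharp >}}
// hypothetical tweak: convert each XOR input to a MathNet matrix once, up front
let inputsAsMatrices =
    inputs |> List.map (fun (x1, x2, y1) ->
        (array2D [ [x1]; [x2] ] |> DenseMatrix.ofArray2), y1)
{{< / highlight >}}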
{{< highlight fsharp >}}
nn.Forward x y1 |> ignore
nn.Backward x false
nn.Update 0.01 x
{{< / highlight >}}
Next is the crux of the training loop - one forward pass, one backward pass, and a parameter update. We'll dive into these shortly. The forward pass returns the predicted value, which we explicitly pipe to the ignore function (F# complains if a function returns a value that isn't used anywhere).
{{< highlight fsharp >}}
if i % 100 = 0 then
    let preds = seq {
        for j in seq { 0..20 } do
            let (x1,x2,y1) = next()
            let pred =
                nn.Forward (array2D [ [x1]; [x2] ] |> DenseMatrix.ofArray2) y1
                |> (fun x -> match x > 0.5 with | true -> 1 | _ -> 0)
            if pred = int(y1) then yield true else yield false }
    let accuracy = float(preds |> Seq.where id |> Seq.length) / float(preds |> Seq.length)
    printfn "Accuracy : %f" accuracy
{{< / highlight >}}
To evaluate the network's performance, every 100 iterations we feed a handful of random samples into the network and compare each generated prediction with the actual label.
If the network learns the XOR problem perfectly, the accuracy will converge to 1.0 (i.e. 100% of predictions were correct).
Now let's move to the internals of the neural network implementation.
First, define a type that takes a constructor argument for the dimension of each layer.
{{< highlight fsharp >}}
type NN (dim:int) =
    // first-layer weights: (dim x 2), so w1 * input gives a (dim x 1) vector
    let mutable w1 = Matrix<double>.Build.Random(dim, 2)
    // w2, w3, the activations z1/z2, gradients w1'/w2' and the scalar error err follow
{{< / highlight >}}
Like I said above, I'm rolling everything by hand here with no concern for extensibility.
This means hardcoding three layers - two non-linear layers plus one linear output layer.
To do this, I initialize matrices for each layer's weights, activation outputs and gradients, and the scalar error.
Note that the MathNet Random() function draws from a Gaussian distribution; this is known to be a suboptimal initialization strategy, but that's not important for this particular exercise.
In F#, all variables are immutable by default. These matrices will be updated during every forward/backward pass, so we explicitly denote these matrix variables as mutable.
{{< highlight fsharp >}}
let relu x = max x 0.0
let relu' x = match x > 0.0 with | true -> 1.0 | false -> 0.0
{{< / highlight >}}
The basic network is only going to use hardcoded ReLU activations, so I define two functions - ReLU and its derivative.
{{< highlight fsharp >}}
member x.Forward (input:Matrix<double>) (y:double) =
    z1 <- Matrix.map relu (w1 * input)
    z2 <- Matrix.map relu (w2 * z1)
    let yHat = (w3 * z2).[0, 0]
    err <- yHat - y   // derivative of the MSE loss w.r.t. the output
    yHat
{{< / highlight >}}
This is the forward pass for our 3-layer network to learn XOR, accepting a (2x1) vector as input and outputting a scalar.
We multiply the first layer's weights by the input vector, followed by the relu activation. Note for simplicity, I haven't included any bias parameter.
This gives us $z_{1}$ (the first layer's activation outputs). Next, a matmul between $z_{1}$ and the second layer's weights, followed by the nonlinearity, giving us $z_{2}$.
Finally, a linear multiplication between $z_{2}$ and the 3rd layer's weights, giving us the network's output.
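In the notation from the backpropagation refresher, with $x$ the (2x1) input, $\sigma$ the ReLU, and $W_{1}, W_{2}, W_{3}$ corresponding to w1, w2, w3 in the code, the forward pass computes:

$$ z_{1} = \sigma(W_{1}x), \qquad z_{2} = \sigma(W_{2}z_{1}), \qquad \hat{y} = W_{3}z_{2} $$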
For the loss function, I am using the mean-squared-error, the derivative of which evaluates to $\hat{y} - y$.
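Writing that out, with the conventional factor of one half:

$$ L = \tfrac{1}{2}(\hat{y} - y)^2, \qquad \frac{\partial L}{\partial \hat{y}} = \hat{y} - y $$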
{{< highlight fsharp >}}
member x.Backward (input:Matrix<double>) (random:bool) =
    // symmetric case (random = false): feed the error back through the forward weights
    let r2 = w3
    let r1 = w2.Transpose()
    w2' <- (err * r2.Transpose()).PointwiseMultiply(Matrix.map relu' z2)
    w1' <- (r1 * w2').PointwiseMultiply((Matrix.map relu' z1))
{{< / highlight >}}
For a conventional (i.e. symmetric) backward pass, the error gradient at the output layer is multiplied by $W_{3}^\intercal$, the transpose of the last layer's weight matrix.
This is then pointwise-multiplied by the derivative of the penultimate layer's activation, giving the error gradient at $layer_{2}$.
Likewise, to propagate the error from $layer_{2}$ to $layer_{1}$, we multiply by $W_{2}^\intercal$.
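In the notation of the refresher section, these two lines compute the per-layer error signals (w2' and w1' in the code; I'll write them as $\delta_{2}$ and $\delta_{1}$ here). Applying $\sigma'$ to the post-activation $z$ works because, for ReLU, $\sigma'(\sigma(a)) = \sigma'(a)$:

$$ \delta_{2} = \left(W_{3}^\intercal \frac{\partial L}{\partial \hat{y}}\right) \odot \sigma'(z_{2}), \qquad \delta_{1} = \left(W_{2}^\intercal \delta_{2}\right) \odot \sigma'(z_{1}) $$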
For now, let's skip the case where asymmetric weights are used.
Once we've finished our backwards pass, we perform the weight update:
{{< highlight fsharp >}}
member x.Update (lr:double) (inputs:Matrix<double>) =
    // vanilla SGD: each weight matrix steps against its gradient
    w3 <- w3 - lr * (err * z2.Transpose())
    w2 <- w2 - lr * (w2' * z1.Transpose())
    w1 <- w1 - lr * (w1' * inputs.Transpose())
{{< / highlight >}}
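In the same notation, each weight matrix takes a plain SGD step using the outer product of its layer's error signal with the activations feeding into it, with learning rate $\eta$ (the lr argument, 0.01 in the training loop):

$$ W_{3} := W_{3} - \eta \frac{\partial L}{\partial \hat{y}} z_{2}^\intercal, \qquad W_{2} := W_{2} - \eta\, \delta_{2} z_{1}^\intercal, \qquad W_{1} := W_{1} - \eta\, \delta_{1} x^\intercal $$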
That completes a single training iteration.
If we run this with symmetric weights, the network will quickly converge to 100% accuracy.
{{< highlight fsharp >}}
Accuracy : 0.476190
Accuracy : 0.619048
Accuracy : 0.761905
Accuracy : 1.000000
{{< / highlight >}}
Exactly how many iterations this takes will depend on the weight initialization, which is a topic best reserved for another day. But at least we know our network's implementation is correct!
Let's return to the backwards pass and consider the case of asymmetric weights.
{{< highlight fsharp >}}
// random feedback matrices, shaped like w3 and w2.Transpose(), replace the forward weights
let r2 = Matrix<double>.Build.Random(1, dim)
let r1 = Matrix<double>.Build.Random(dim, dim)
w2' <- (err * r2.Transpose()).PointwiseMultiply(Matrix.map relu' z2)
w1' <- (r1 * w2').PointwiseMultiply((Matrix.map relu' z1))
{{< / highlight >}}
Rather than multiplying by the transpose of each layer's weight matrix, what if we multiply by completely random weights? How does that look?
{{< highlight fsharp >}}
Accuracy : 0.857143
Accuracy : 0.666667
Accuracy : 0.761905
Accuracy : 0.857143
Accuracy : 1.000000
Accuracy : 1.000000
Accuracy : 1.000000
{{< / highlight >}}
The network still reaches 100% accuracy, showing that weight symmetry isn't necessary for the network to learn a non-linear function.
It's important to note that the random backpass weights are fixed - they are not adjusted or learned during the parameter update stage.
As a side-note, another curious result is that re-randomizing the weights during each backwards pass also doesn't prevent the network from learning.
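In code terms the distinction is just where the random matrices are created. A sketch, using the same names as above: bind them once at the class level and they stay fixed; bind them inside Backward and fresh feedback weights are drawn on every pass.

{{< highlight fsharp >}}
// fixed feedback weights: drawn once when the NN object is constructed
let r2 = Matrix<double>.Build.Random(1, dim)
let r1 = Matrix<double>.Build.Random(dim, dim)
// moving these two bindings inside the Backward member would instead
// re-randomize the feedback weights on every backward pass
{{< / highlight >}}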
Now why is this effective?
My own hand-wavy explanation is that training with random backpass weights amounts to following a pseudo-gradient, defined by the random feedback matrices, rather than the true gradient. Because the forward weights are still updated against the true loss, they gradually adjust until this pseudo-gradient points in roughly the same direction as the true gradient, so the network ends up descending the true objective anyway.
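One way to probe this (not something I've added to the experiment above) would be to compute, at each iteration, both the symmetric delta and the random-feedback delta for the same input and measure the angle between them; the feedback alignment paper reports that this angle falls well below 90° as training progresses. A minimal sketch of the measurement with MathNet, assuming both deltas are column vectors; the `cosine` helper is just illustrative:

{{< highlight fsharp >}}
open MathNet.Numerics.LinearAlgebra

// cosine similarity between two column vectors stored as (n x 1) matrices;
// values near 1.0 mean the two updates point in nearly the same direction
let cosine (a:Matrix<double>) (b:Matrix<double>) =
    let dot = (a.Transpose() * b).[0, 0]
    dot / (a.FrobeniusNorm() * b.FrobeniusNorm())
{{< / highlight >}}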
I also speculate that there's a close relationship between feedback alignment and random projections.
Either way, the paper sets out a much more formal/rigorous explanation and proof.
I found it very instructive to go through this line-by-line, so I hope that others find this useful too. If you have any comments, criticisms or errata, please reach out in the comments or via Twitter!