A Brief Introduction to RNN
First, let’s break down an RNN module. Here’s the structure of an RNN node, which has two inputs: $X_t$, the input at the current time step, and $H_{t-1}$, the previous hidden state. After passing through $\phi$, which is a fully connected layer with an activation function (by default, $tanh$), it produces two outputs, $H_t$ and $O_t$; $H_t$ is the current hidden state, which equals the current output $O_t$.
$$H_t=\phi{(X_tW^T_{xh}+B_{xh}+H_{t-1}W^T_{hh}+B_{hh})}\tag{1.1}$$
As the equation above shows, an RNN cell has two weight parameters and two bias parameters, where $W_{xh}/B_{xh}$ are the weight/bias between the input layer and the hidden layer, and $W_{hh}/B_{hh}$ are the weight/bias between the current hidden state and the previous hidden state.
Assume we have an input with sequence length $L$ and input size $H_{in}$, so the full input is $X\in\mathbb{R}^{L \times H_{in}}$ and each time step’s input is $X_t\in\mathbb{R}^{1 \times H_{in}}$. Given a hidden layer with hidden size $H_{out}$, we know that $W_{xh}\in\mathbb{R}^{H_{out} \times H_{in}}$ and $B_{xh}\in\mathbb{R}^{1 \times H_{out}}$. Because we only discuss an RNN cell with a single layer, we have $H_{t-1}\in\mathbb{R}^{1 \times H_{out}}$, $W_{hh}\in\mathbb{R}^{H_{out} \times H_{out}}$ and $B_{hh}\in\mathbb{R}^{1 \times H_{out}}$. After passing through $\phi$, we get the output $O_{t}=H_{t}\in\mathbb{R}^{1 \times H_{out}}$.
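To make these shapes concrete, here is a minimal NumPy sketch of the recurrence $(1.1)$ with randomly initialized parameters; the variable names (Wxh, Bxh, Whh, Bhh, H_in, H_out) simply mirror the symbols above and are not part of the worked example below.
import numpy as np
L, H_in, H_out = 4, 5, 6                                  # sequence length, input size, hidden size
X = np.random.randn(L, H_in)                              # full input sequence, one row per time step
Wxh, Bxh = np.random.randn(H_out, H_in), np.random.randn(H_out)
Whh, Bhh = np.random.randn(H_out, H_out), np.random.randn(H_out)
H = np.zeros(H_out)                                       # initial hidden state H_0
for t in range(L):
    # one step of equation (1.1); the output O_t equals H_t
    H = np.tanh(np.dot(X[t], Wxh.T) + Bxh + np.dot(H, Whh.T) + Bhh)
assert H.shape == (H_out,)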
Build Your Own Network
Now, let’s define an RNN network that can do curve fitting. Here’s the structure of the network’s RNN cell, unrolled over four time steps.
We only need the last time step’s output: we apply a ReLU activation function to it and pass the result into a linear layer, as shown in Figure 3.
$$\hat{Y}={H_{relu}W^T_{linear}+B_{linear}}\tag{2.1}$$
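In NumPy, equation $(2.1)$ together with the ReLU is just the following sketch (assuming H_last holds the last time step’s hidden state and Wlinear, Blinear are the linear layer’s weight and bias):
Hrelu = np.maximum(H_last, 0)                  # ReLU on the last hidden state
Yhat = np.dot(Hrelu, Wlinear.T) + Blinear      # equation (2.1)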
Prepare the Data
Given a real-world example, let’s do the math of the forward propagation together! Here we have an input of shape $4 \times 5$ (sequence length 4, input size 5) and a hidden size of $6$.
First, we initialize the hidden state $H_0$:
$$H_0=\begin{bmatrix}-0.1596&-0.4090&-0.8323&-0.3635&-0.3322&0.0927\end{bmatrix}$$
Then, random initialize the network’s parameter
$$W_{xh}=\begin{bmatrix}0.2527&-0.2321& 0.3096&-0.0289&-0.2006\\-0.3046&-0.2622& 0.3102&-0.1528&-0.0195\\-0.0168&-0.2426& 0.0955&-0.2690&-0.0383\\0.3750& 0.1401& 0.1637&-0.2625& 0.3084\\-0.1525&-0.3558&-0.1511& 0.1539&-0.3454\\0.0534&-0.1614&-0.2251&0.2181&-0.1506\end{bmatrix}$$
$$B_{xh}=\begin{bmatrix}-0.3810&0.1493&-0.3121&-0.2417&0.2487&-0.1374\end{bmatrix}$$
$$W_{hh}=\begin{bmatrix}0.3927& 0.0157&-0.3487& 0.3704& 0.2096& 0.3068\\-0.2433& 0.1280& 0.3538&-0.1511& 0.2574& 0.3486\\0.1637& 0.3119& 0.2691& 0.2975&-0.3956& 0.2391\\-0.1146& 0.3130& 0.2728&-0.0528& 0.0038& 0.1231\\0.1855& 0.2253&-0.2392&-0.1092& 0.2223& 0.3423\\0.2057&-0.3282&-0.2153&-0.1266&-0.3013&-0.4103\end{bmatrix}$$
$$B_{hh}=\begin{bmatrix}0.0633&-0.0063&-0.2946&-0.2806&-0.0986&0.0013\end{bmatrix}$$
$$W_{linear}=\begin{bmatrix}-0.0681&0.2095&0.0617&0.1805&-0.3646&-0.1383\end{bmatrix}$$
$$B_{linear}=\begin{bmatrix}0.0904\end{bmatrix}$$
Prepare the input and the ground truth
$$X=\begin{bmatrix}-0.5921&-0.7245&-0.6794&-0.6360&-0.6841\\-0.7308&-0.7742&-0.7502&-0.7946&-0.8099\\-0.8188&-0.8364&-0.8453&-0.7627&-0.9163\\-0.8330&-0.9045&-0.8427&-0.8257&-0.9480\end{bmatrix}$$
$$Y=\begin{bmatrix}-0.9136\end{bmatrix}$$
Great! Now we have everything we need for the RNN’s forward propagation.
Forward
At each time step, we only need one row of the sequence. So, at the first time step, what we need is the first row of $X$:
$$X_1=\begin{bmatrix}-0.5921&-0.7245&-0.6794&-0.6360&-0.6841\end{bmatrix}$$
We can calculate the hidden state step by step using $(1.1)$:
$$H_1=\begin{bmatrix}-0.2992&0.1064&-0.5517&-0.8639&0.6719&0.4261\end{bmatrix}\tag{T=1}$$
$$H_2=\begin{bmatrix}-0.3287&0.6731&-0.6743&-0.7822&0.8612&-0.1764\end{bmatrix}\tag{T=2}$$
$$H_3=\begin{bmatrix}-0.4079&0.5979&-0.7021&-0.7932&0.8782&-0.1138\end{bmatrix}\tag{T=3}$$
$$H_4=\begin{bmatrix}-0.3934&0.6356&-0.7040&-0.8010&0.8846&-0.1280\end{bmatrix}\tag{T=4}$$
Feeding the last hidden state into the ReLU activation function:
$$H_{relu}=\begin{bmatrix}0&0.6356&0&0&0.8846&0\end{bmatrix}$$
And after the linear layer, we get the final output:
$$\hat{Y}=\begin{bmatrix}-0.0990\end{bmatrix}$$
Using the MSE loss function, we can get the loss value:
$$Loss=\begin{bmatrix}0.6636\end{bmatrix}$$
You can run the following Python code using NumPy:
import numpy as np
Y = np.array([-0.9136])
H0 = np.array([-0.1596, -0.4090, -0.8323, -0.3635, -0.3322, 0.0927])
Wxh = np.array([[ 0.2527, -0.2321, 0.3096, -0.0289, -0.2006],
[-0.3046, -0.2622, 0.3102, -0.1528, -0.0195],
[-0.0168, -0.2426, 0.0955, -0.2690, -0.0383],
[ 0.3750, 0.1401, 0.1637, -0.2625, 0.3084],
[-0.1525, -0.3558, -0.1511, 0.1539, -0.3454],
[ 0.0534, -0.1614, -0.2251, 0.2181, -0.1506]])
Bxh = np.array([-0.3810, 0.1493, -0.3121, -0.2417, 0.2487, -0.1374])
Whh = np.array([[ 0.3927, 0.0157, -0.3487, 0.3704, 0.2096, 0.3068],
[-0.2433, 0.1280, 0.3538, -0.1511, 0.2574, 0.3486],
[ 0.1637, 0.3119, 0.2691, 0.2975, -0.3956, 0.2391],
[-0.1146, 0.3130, 0.2728, -0.0528, 0.0038, 0.1231],
[ 0.1855, 0.2253, -0.2392, -0.1092, 0.2223, 0.3423],
[ 0.2057, -0.3282, -0.2153, -0.1266, -0.3013, -0.4103]])
Bhh = np.array([0.0633, -0.0063, -0.2946, -0.2806, -0.0986, 0.0013])
Wlinear = np.array([[-0.0681, 0.2095, 0.0617, 0.1805, -0.3646, -0.1383]])
Blinear = np.array([0.0904])
X1 = np.array([-0.5921, -0.7245, -0.6794, -0.6360, -0.6841])
H1 = np.tanh(np.dot(X1, Wxh.T) + Bxh + np.dot(H0, Whh.T) + Bhh)  # equation (1.1), t = 1
X2 = np.array([-0.7308, -0.7742, -0.7502, -0.7946, -0.8099])
H2 = np.tanh(np.dot(X2, Wxh.T) + Bxh + np.dot(H1, Whh.T) + Bhh)
X3 = np.array([-0.8188, -0.8364, -0.8453, -0.7627, -0.9163])
H3 = np.tanh(np.dot(X3, Wxh.T) + Bxh + np.dot(H2, Whh.T) + Bhh)
X4 = np.array([-0.8330, -0.9045, -0.8427, -0.8257, -0.9480])
H4 = np.tanh(np.dot(X4, Wxh.T) + Bxh + np.dot(H3, Whh.T) + Bhh)
Hrelu = np.maximum(H4, 0)                  # ReLU on the last hidden state
Yhat = np.dot(Hrelu, Wlinear.T) + Blinear  # equation (2.1)
Loss = (Yhat - Y) ** 2                     # MSE loss for a single sample
You should get the following results:
H1 = [-0.2992, 0.1064, -0.5517, -0.8639, 0.6719, 0.4261]
H2 = [-0.3287, 0.6731, -0.6743, -0.7822, 0.8612, -0.1764]
H3 = [-0.4079, 0.5979, -0.7021, -0.7932, 0.8782, -0.1138]
H4 = [-0.3934, 0.6356, -0.7040, -0.8010, 0.8846, -0.1280]
Hrelu = [0.0000, 0.6356, 0.0000, 0.0000, 0.8846, 0.0000]
Yhat = [-0.0990]
Loss = [0.6636]
Backward
Now comes the important part: backpropagation. As shown in Figure 1, the computation keeps feeding back into the same RNN node; we just unfold it through time in Figure 2. So, to make the gradient flow back through the network in time, backpropagation should look like Figure 4.
To calculate the partial derivatives of the network’s parameters, we can use the chain rule.
$$\begin{split}\begin{aligned}\frac{\partial L}{\partial W_{linear}}&=\frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial W_{linear}}\\&=2\times{(\hat{Y}-Y)}H_{relu}\end{aligned}\end{split}$$
$$\begin{split}\begin{aligned}\frac{\partial L}{\partial B_{linear}}&=\frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial B_{linear}}\\&=2\times{(\hat{Y}-Y)}\end{aligned}\end{split}$$
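As a quick numerical check, these two gradients can be computed directly from the forward pass (reusing Yhat, Y, and Hrelu from the code above; the _grad names are just illustrative):
Wlinear_grad = 2 * (Yhat - Y) * np.expand_dims(Hrelu, 0)  # shape (1, 6), same as Wlinear
Blinear_grad = 2 * (Yhat - Y)                             # shape (1,), same as Blinear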
We can easily calculate the derivatives of the linear layer’s parameters. When it comes to the RNN layer’s parameters, things get a bit more complicated.
Remember how we calculate the derivative of a matrix? Mathematically, if you have a function $\vec{y}=f(\vec{x})$, then the gradient of $\vec{y}$ with respect to $\vec{x}$ is a Jacobian matrix $J$:
$$\begin{split}\begin{aligned}J&=\begin{pmatrix}\frac{\partial \vec{y}}{\partial x_1}&\dots&\frac{\partial \vec{y}}{\partial x_n}\end{pmatrix}\\&=\begin{pmatrix}\frac{\partial y_1}{\partial x_1}&\dots&\frac{\partial y_1}{\partial x_n}\\\vdots&\ddots&\vdots\\\frac{\partial y_m}{\partial x_1}&\dots&\frac{\partial y_m}{\partial x_n}\end{pmatrix}\end{aligned}\end{split}$$
Generally speaking, forward propagation is data flowing forward through time. In backpropagation, the gradient flows through time in the opposite direction. So, we have to calculate the gradient of the last time step first:
$\frac{\partial L}{\partial W_{xh}}$ / $\frac{\partial L}{\partial B_{xh}}$ / $\frac{\partial L}{\partial W_{hh}}$ / $\frac{\partial L}{\partial B_{hh}}$ at time step 4 should be:$\definecolor{a}{RGB}{114,0,172} \definecolor{b}{RGB}{145,208,80} \definecolor{c}{RGB}{255,0,255} \definecolor{d}{RGB}{18,110,213} \definecolor{e}{RGB}{160,65,12} \definecolor{f}{RGB}{243,150,13} \definecolor{g}{RGB}{0,204,153}$
$$\begin{split}\begin{aligned}\frac{\partial L}{\partial W_{xh}}&=\color{a}\frac{\partial L}{\partial \hat{Y}}\color{b}\frac{\partial \hat{Y}}{\partial H_{relu}}\color{c}\frac{\partial H_{relu}}{\partial H_4}\color{d}\frac{\partial H_4}{\partial W_{xh}}\end{aligned}\end{split}$$
$$\begin{split}\begin{aligned}\frac{\partial L}{\partial B_{xh}}&=\color{a}{\frac{\partial L}{\partial \hat{Y}}}\color{b}{\frac{\partial \hat{Y}}{\partial H_{relu}}}\color{c}{\frac{\partial H_{relu}}{\partial H_4}}\color{e}{\frac{\partial H_4}{\partial B_{xh}}}\end{aligned}\end{split}$$
$$\begin{split}\begin{aligned}\frac{\partial L}{\partial W_{hh}}&=\color{a}{\frac{\partial L}{\partial \hat{Y}}}\color{b}{\frac{\partial \hat{Y}}{\partial H_{relu}}}\color{c}{\frac{\partial H_{relu}}{\partial H_4}}\color{f}{\frac{\partial H_4}{\partial W_{hh}}}\end{aligned}\end{split}$$
$$\begin{split}\begin{aligned}\frac{\partial L}{\partial B_{hh}}&=\color{a}{\frac{\partial L}{\partial \hat{Y}}}\color{b}{\frac{\partial \hat{Y}}{\partial H_{relu}}}\color{c}{\frac{\partial H_{relu}}{\partial H_4}}\color{g}{\frac{\partial H_4}{\partial B_{hh}}}\end{aligned}\end{split}$$
Where:
$$\color{a} \frac{\partial L}{\partial \hat{Y}}=2\times{(\hat{Y}-Y)}$$
$$\color{b} \frac{\partial \hat{Y}}{\partial H_{relu}}=W_{linear}$$
$$\color{c} \frac{\partial H_{relu}}{\partial H_4}={\begin{bmatrix}H_4>0\;?\;1:0\end{bmatrix}}$$
$$\color{d} \begin{split}\begin{aligned}\frac{\partial H_4}{\partial W_{xh}}&={\begin{bmatrix}1-tanh^2(X_4W^T_{xh}+B_{xh}+H_3W^T_{hh}+B_{hh})\end{bmatrix}}\begin{pmatrix}\frac{\partial X_{4\left\{1\right\}}W_{xh\left\{11\right\}}}{\partial W_{xh\left\{11\right\}}}&\dots&\frac{\partial X_{4\left\{1\right\}}W_{xh\left\{16\right\}}}{\partial W_{xh\left\{16\right\}}}\\\vdots&\ddots&\vdots\\\frac{\partial X_{4\left\{5\right\}}W_{xh\left\{51\right\}}}{\partial W_{xh\left\{51\right\}}}&\dots&\frac{\partial X_{4\left\{5\right\}}W_{xh\left\{56\right\}}}{\partial W_{xh\left\{56\right\}}}\end{pmatrix}^T\end{aligned}\end{split}$$
$$\color{e} \begin{split}\begin{aligned}\frac{\partial H_4}{\partial B_{xh}}&={\begin{bmatrix}1-tanh^2(X_4W^T_{xh}+B_{xh}+H_3W^T_{hh}+B_{hh})\end{bmatrix}}\end{aligned}\end{split}$$
$$\color{f} \begin{split}\begin{aligned}\frac{\partial H_4}{\partial W_{hh}}&={\begin{bmatrix}1-tanh^2(X_4W^T_{xh}+B_{xh}+H_3W^T_{hh}+B_{hh})\end{bmatrix}}\begin{pmatrix}\frac{\partial H_{3\left\{1\right\}}W_{hh\left\{11\right\}}}{\partial W_{hh\left\{11\right\}}}&\dots&\frac{\partial H_{3\left\{1\right\}}W_{hh\left\{16\right\}}}{\partial W_{hh\left\{16\right\}}}\\\vdots&\ddots&\vdots\\\frac{\partial H_{3\left\{6\right\}}W_{hh\left\{61\right\}}}{\partial W_{hh\left\{61\right\}}}&\dots&\frac{\partial H_{3\left\{6\right\}}W_{hh\left\{66\right\}}}{\partial W_{hh\left\{66\right\}}}\end{pmatrix}^T\end{aligned}\end{split}$$
$$\color{g} \begin{split}\begin{aligned}\frac{\partial H_4}{\partial B_{hh}}&={\begin{bmatrix}1-tanh^2(X_4W^T_{xh}+B_{xh}+H_3W^T_{hh}+B_{hh})\end{bmatrix}}\end{aligned}\end{split}$$
Also, we can run this code to check if the calculation is correct:
Yhat_grad = 2 * (Yhat - Y)                        # dL/dYhat from the MSE loss
Hrelu_grad = 2 * (Yhat - Y) * Wlinear             # dL/dHrelu, shape (1, 6)
Whh_grad = (Hrelu_grad * np.where(H4 > 0, 1, 0) * (1 - H4 ** 2) * np.expand_dims(H3, 0).T).T
Bhh_grad = Hrelu_grad * np.where(H4 > 0, 1, 0) * (1 - H4 ** 2)
Wxh_grad = (Hrelu_grad * np.where(H4 > 0, 1, 0) * (1 - H4 ** 2) * np.expand_dims(X4, 0).T).T
Bxh_grad = Hrelu_grad * np.where(H4 > 0, 1, 0) * (1 - H4 ** 2)
The results are as follows:
Yhat_grad = [1.6293]
Hrelu_grad = [[-0.111 0.3413 0.1005 0.2941 -0.594 -0.2253]]
Whh_grad = [[ 0. -0. 0. 0. -0. 0. ]
[-0.0830 0.1216 -0.1428 -0.1614 0.1787 -0.0232]
[-0. 0. -0. -0. 0. -0. ]
[-0. 0. -0. -0. 0. -0. ]
[ 0.0527 -0.0772 0.0907 0.1025 -0.1135 0.0147]
[ 0. -0. 0. 0. -0. 0. ]]
Bhh_grad = [[-0. 0.2034 0. 0. -0.1292 -0. ]]
Wxh_grad = [[ 0. 0. 0. 0. 0. ]
[-0.1695 -0.1840 -0.1714 -0.1680 -0.1929]
[-0. -0. -0. -0. -0. ]
[-0. -0. -0. -0. -0. ]
[ 0.1076 0.1169 0.1089 0.1067 0.1225]
[ 0. 0. 0. 0. 0. ]]
Bxh_grad = [[-0. 0.2034 0. 0. -0.1292 -0. ]]
Remember, this only calculates the gradients at time step 4! We also need to calculate the gradients of the previous time steps, accumulating them as we go; once we reach the beginning of the propagation, we have the final gradients for the subsequent optimization step.
So, the final gradient should be ($Param$ stands for any of the learnable parameters in the RNN network): $\definecolor{red}{RGB}{255,0,0} \definecolor{h}{RGB}{93,137,13} \definecolor{i}{RGB}{0,192,119} \definecolor{j}{RGB}{86,12,234} \definecolor{k}{RGB}{238,126,166} \definecolor{l}{RGB}{1,190,195} \definecolor{m}{RGB}{190,149,215}$
$$\begin{split}\begin{aligned}\color{red}\frac{\partial L}{\partial Param}&\color{red}=\sum_{t=4}^{1}\frac{\partial L}{\partial \hat{Y}}\frac{\partial \hat{Y}}{\partial H_{relu}}\frac{\partial H_{relu}}{\partial H_t}\frac{\partial H_t}{\partial Param}\\&\color{red}=\color{a}\frac{\partial L}{\partial \hat{Y}}\color{b}\frac{\partial \hat{Y}}{\partial H_{relu}}\color{c}\frac{\partial H_{relu}}{\partial H_4}\color{d}\frac{\partial H_4}{\partial Param}\\&\color{red}+\color{a}\frac{\partial L}{\partial \hat{Y}}\color{b}\frac{\partial \hat{Y}}{\partial H_{relu}}\color{c}\frac{\partial H_{relu}}{\partial H_4}\color{h}\frac{\partial H_4}{\partial H_3}\color{i}\frac{\partial H_3}{\partial Param}\\&\color{red}+\color{a}\frac{\partial L}{\partial \hat{Y}}\color{b}\frac{\partial \hat{Y}}{\partial H_{relu}}\color{c}\frac{\partial H_{relu}}{\partial H_4}\color{h}\frac{\partial H_4}{\partial H_3}\color{j}\frac{\partial H_3}{\partial H_2}\color{k}\frac{\partial H_2}{\partial Param}\\&\color{red}+\color{a}\frac{\partial L}{\partial \hat{Y}}\color{b}\frac{\partial \hat{Y}}{\partial H_{relu}}\color{c}\frac{\partial H_{relu}}{\partial H_4}\color{h}\frac{\partial H_4}{\partial H_3}\color{j}\frac{\partial H_3}{\partial H_2}\color{l}\frac{\partial H_2}{\partial H_1}\color{m}\frac{\partial H_1}{\partial Param}\\\end{aligned}\end{split}$$
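To make the accumulation concrete, here is a minimal sketch of the full backward pass through time, reusing X1–X4, H0–H4, and the parameters from the code above (the Xs, Hs, H_grad, and dtanh names are just illustrative). Each loop iteration adds one term of the sum above and then pushes the gradient one step further back in time:
Xs = [X1, X2, X3, X4]
Hs = [H0, H1, H2, H3, H4]
Wxh_grad = np.zeros_like(Wxh)
Bxh_grad = np.zeros_like(Bxh)
Whh_grad = np.zeros_like(Whh)
Bhh_grad = np.zeros_like(Bhh)
# gradient flowing into H4 through the linear layer and the ReLU
H_grad = (2 * (Yhat - Y) * Wlinear * np.where(H4 > 0, 1, 0)).reshape(-1)
for t in range(4, 0, -1):
    dtanh = (1 - Hs[t] ** 2) * H_grad          # tanh derivative at step t times the incoming gradient
    Wxh_grad += np.outer(dtanh, Xs[t - 1])     # accumulate dL/dWxh, shape (6, 5)
    Bxh_grad += dtanh                          # accumulate dL/dBxh
    Whh_grad += np.outer(dtanh, Hs[t - 1])     # accumulate dL/dWhh, shape (6, 6)
    Bhh_grad += dtanh                          # accumulate dL/dBhh
    H_grad = np.dot(dtanh, Whh)                # push the gradient back to H_{t-1}
After the loop, Wxh_grad, Bxh_grad, Whh_grad, and Bhh_grad hold the gradients accumulated over all four time steps, which is what an optimizer would use to update the parameters.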