2. BP Neural Networks


I. Single-Neuron Network

The single-neuron network is the simplest and most basic neural network. Here the first layer is called the input layer, the second layer is the hidden layer, and the third layer is the output layer.

Expressed as a formula, the figure above becomes:

\[Y = a(Z) = a(Xw) = a(w_0 + w_1 x_1 + \cdots + w_d x_d)\]

In neural networks, \(a(\cdot)\) is usually called the activation function.

In the single-neuron network, choosing a linear activation function is equivalent to the linear regression of the earlier sections, and choosing the sigmoid activation function is equivalent to the earlier logistic regression.
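Written out, the two cases are simply:

\[Y_{\text{linear}} = Xw, \qquad Y_{\text{logistic}} = \frac{1}{1+e^{-Xw}}\]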

Let us rewrite linear regression and logistic regression in neural-network form, to get a first feel for neural networks:

(I) Linear Regression as a BP Neural Network

(1) Activation function

Clearly, for linear regression \(a(Z)=Z\), so:

def ols_activation(Z):
	return Z

(2) Forward propagation

Forward propagation is the computation that takes the input X to the output Y.

def ols_forward(X, w):
	Z = X*w
	a = ols_activation(Z)
	return a

The output here is really just X*w; the extra activation call is only there to keep the overall computation uniform.

Once we have the output, we measure the gap between the predicted and true values and look for the parameters that minimize it. In the linear models we used gradient descent, and gradient descent is used here as well; since later sections involve multi-layer networks with more complicated computations, we refer to the whole process uniformly as backpropagation (BP).

(3) Backpropagation (BP)

Since there is only one layer here, this is essentially the same as gradient descent for the linear model; we just adopt the neural-network style of writing it.

As before, first compute the gradient:

def OLS_dev(X, Y, w):
    n = X.shape[0]  # number of samples
    return 1.0/n * X.T * (X*w - Y)
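For reference, this gradient is just the derivative of the same \(\frac{1}{2n}\) squared-error loss that is written out in the two-layer derivation below:

\[L(w) = \frac{1}{2n}\sum_{i=1}^n (x_i w - y_i)^2, \qquad \frac{\partial L(w)}{\partial w} = \frac{1}{n}X^T(Xw - Y)\]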

Then the backpropagation (gradient descent) step:

def OLS_BP(X, Y, w, lr, epoches):
	for i in range(epoches):
		w -= lr * OLS_dev(X, Y, w)
	return w

To keep things easy to follow, gradient descent is written a bit more simply here than before.

You can also see that a few names have changed: the earlier num_iters becomes epoches, and eta becomes lr (learning rate).

(4) Putting it together

Putting everything together:

class OLS_single_network():
    def __init__(self, epoches, lr):
        self.epoches = epoches
        self.lr = lr
        
    def ols_activation(self, X):
        return X
    
    
    def ols_forward(self, X, w):
        Z = X*w
        a = self.ols_activation(Z)
        return a
    
    def ols_dev(self, X, Y, w, input_num):
        return 1.0/input_num*X.T*(X*w-Y)
    
    def ols_BP(self, X, Y):
        input_num = X.shape[0]
        hidden_num = X.shape[1]
        output_num = 1
        w = np.mat(np.random.random((hidden_num, output_num)))
        for i in range(self.epoches):
            w -= self.lr * self.ols_dev(X, Y, w, input_num)
        
        return w
            
    def fit(self, X, Y):
        w = self.ols_BP(X, Y)
        y_predict = self.ols_forward(X, w)
        input_num = X.shape[0]
        print("train accuracy:{}%".format(100-1./input_num*np.sqrt(np.sum(np.square(y_predict-Y)))*100))
        return y_predict

Run:

from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pyplot as plt

X, Y = make_regression(n_samples=1000, n_features=1, noise=5) 
print(X.shape, Y.shape)
# ((1000, 1), (1000,))

# w_0 is not treated separately from the other w_i here, since there is no regularization
X = np.insert(X, 0, values=1, axis=1)
X = np.mat(X)
Y = np.mat(Y).T
print(X.shape, Y.shape)
# ((1000, 2), (1000, 1))

lr = OLS_single_network(epoches=1000, lr=0.1)
y_predict = lr.fit(X, Y)
# train accuracy:83.6044609236%

xx = X[:,1].tolist()
yy = Y.tolist()
plt.scatter(xx, yy)
plt.plot(xx, y_predict.tolist(), c='r')
plt.show()
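As a quick sanity check (a sketch, not part of the original workflow), we can compare the network's predictions against the closed-form least-squares solution on the same design matrix:

# closed-form least squares on the same X; the two prediction vectors should be close
w_ols, _, _, _ = np.linalg.lstsq(np.asarray(X), np.asarray(Y), rcond=None)
y_closed = np.asarray(X).dot(w_ols)
print(np.abs(np.asarray(y_predict) - y_closed).mean())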

(II) Logistic Regression as a BP Neural Network

(1) Activation function

Here the activation function is the sigmoid function:

def LR_activation(Z):
	return 1./(1.+np.exp(-Z))
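One property of the sigmoid that the later derivations rely on (see the output-layer error in the multi-layer section) is that its derivative can be written in terms of its own output:

\[\sigma^{\prime}(z) = \sigma(z)\,(1-\sigma(z))\]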

(2) Forward propagation

def LR_forward(X, w):
	Z = X*w
	a = LR_activation(Z)
	return a

(3) Backpropagation (BP)

def LR_dev(X, Y, w):
    n = X.shape[0]  # number of samples
    y_predict = LR_forward(X, w)
    return -(1./n) * X.T * (Y - y_predict)

def LR_BP(X, Y, w, lr, epoches):
    for i in range(epoches):
        w -= lr * LR_dev(X, Y, w)
    return w
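For reference, this gradient corresponds to the cross-entropy loss written out in the two-layer section below, with \(a = \frac{1}{1+e^{-Xw}}\):

\[L(w) = -\frac{1}{n}\sum_{i=1}^n [y_i \log a_i + (1-y_i)\log(1-a_i)], \qquad \frac{\partial L(w)}{\partial w} = -\frac{1}{n}X^T(Y - a)\]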

(4) Putting it together

class LR_single_network():
    def __init__(self, epoches, lr):
        self.epoches = epoches
        self.lr = lr
        
    def lr_activation(self, Z):
        return 1./(1.+np.exp(-Z))
    
    
    def lr_forward(self, X, w):
        Z = X*w
        a = self.lr_activation(Z)
        return a
    
    def lr_dev(self, X, Y, w, input_num):
        y_predict = self.lr_forward(X, w)
        return -(1./input_num)*X.T*(Y - y_predict)
    
    def lr_BP(self, X, Y):
        input_num = X.shape[0]
        hidden_num = X.shape[1]
        output_num = 1
        w = np.mat(np.random.random((hidden_num, output_num)))
        for i in range(self.epoches):
            w -= self.lr * self.lr_dev(X, Y, w, input_num)
        return w
            
    def fit(self, X, Y):
        w = self.lr_BP(X, Y)
        y_predict = self.lr_forward(X, w)
        y_predict[y_predict > 0.5] = 1
        y_predict[y_predict <= 0.5] = 0
        input_num = X.shape[0]
        print("train accuracy:{}%".format(np.sum(y_predict==Y)/float(input_num) * 100))
        return y_predict

Run:

from sklearn.datasets import make_classification
import numpy as np
import matplotlib.pyplot as plt

X, Y = make_classification(n_samples=1000, n_features=20,  n_classes=2) 
print(X.shape, Y.shape)

X = np.insert(X, 0, values=1, axis=1)
X = np.mat(X)
Y = np.mat(Y).T
print(X.shape, Y.shape)

lr = LR_single_network(epoches=1000, lr=0.1)
y_predict = lr.fit(X, Y)

# ((1000, 20), (1000,))
# ((1000, 21), (1000, 1))
# train accuracy:92.7%
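As a rough cross-check (a sketch, not part of the post's own method), sklearn's LogisticRegression on the same data should reach a broadly similar training accuracy:

from sklearn.linear_model import LogisticRegression

# sklearn fits its own intercept; the extra column of ones in X is harmless here
clf = LogisticRegression()
clf.fit(np.asarray(X), np.asarray(Y).ravel())
print(clf.score(np.asarray(X), np.asarray(Y).ravel()))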

II. Multi-Neuron Networks

What if we add one more layer to the network above?

(I) Two-Layer Neural Network

This is a two-layer neural network: the first layer is the input layer, the second and third layers are hidden layers, and the fourth layer is the output layer.

If the activation function of the second layer stays the same, then relative to the earlier linear and logistic regressions we have simply added one more round of computation.

(1) Linear Regression as a BP Neural Network

Computation from layer 1 to layer 2:

\[\begin{eqnarray} Z^1_i & = & w^1_{i0} + w^1_{i1} x_1 + \cdots + w^1_{id} x_{d} \\ & = & \begin{pmatrix} w^1_{i0} & w^1_{i1} & \cdots & w^1_{id} \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_{d} \end{pmatrix} \end{eqnarray}\]

Its dimension is (1×n); written in matrix form:

\[\begin{eqnarray} Z^1 & = & w^1_{0} + w^1_{1} x_1 + \cdots + w^1_{d} x_{d} \\ & = & \begin{pmatrix} w^1_{0} & w^1_{1} & \cdots & w^1_{d} \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_{d} \end{pmatrix} \end{eqnarray}\]

Its dimension is \((d_1×n)\).

\[a^1_i = Z^1_i\] \[a^1 = Z^1\]

Computation from layer 2 to layer 3:

\[\begin{eqnarray} Z^2_1 & = & w^2_0 + w^2_1 a^1_1 + \cdots + w^2_{d_1} a^1_{d_1} \\ & = & \begin{pmatrix} w^2_{0} & w^2_{1} & \cdots & w^2_{d_1} \end{pmatrix} \begin{pmatrix} 1 \\ a^1_1 \\ \vdots \\ a^1_{d_1} \end{pmatrix} \end{eqnarray}\]

Its dimension is (1×n).

\[a^2_1 = Z^2_1\]

From layer 3 to the output layer:

\[Y_{predict} = a^2_1\]
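Before writing the code, a quick shape walk-through may help (a sketch with made-up sizes, following the code's convention of one sample per row, whereas the formulas above put samples in columns):

import numpy as np

# n=1000 samples, d=10 raw features plus a bias column, d1=5 hidden units
X  = np.mat(np.random.randn(1000, 11))   # (n, d+1)
w1 = np.mat(np.random.random((11, 5)))   # (d+1, d1)
w2 = np.mat(np.random.random((5, 1)))    # (d1, 1)
Z1 = X * w1;  a1 = Z1                    # (n, d1), identity activation
Z2 = a1 * w2; a2 = Z2                    # (n, 1), this is Y_predict
print(Z1.shape, a2.shape)                # (1000, 5) (1000, 1)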

(1.1) Activation function

Same as before.

(1.2) Forward propagation

def ols_forward(X, w1, w2):
    Z1 = X * w1
    a1 = ols_activation(Z1)
    Z2 = a1 * w2
    a2 = ols_activation(Z2)
    result = {'a1':a1,
              'a2':a2}
    return result

(1.3) Backpropagation (BP)

The biggest change when adding layers is the BP part: computing the gradients now requires the chain rule. The principle is unchanged, though: we still differentiate the loss function with respect to the parameters being updated:

\[\begin{eqnarray} L(w) & = & \frac{1}{2n} \sum_{i=1}^n[a^2_1-y_i]^2 \\ & = & \frac{1}{2n} \sum_{i=1}^n[w^2_0 + w^2_1 a^1_1 + \cdots + w^2_{d_1} a^1_{d_1}-y_i]^2\\ \end{eqnarray}\]

Gradient with respect to \(w^2_k\):

\[\begin{eqnarray} \frac{\partial L(w)}{\partial w^2_k} & = & \frac{\partial L(w)}{\partial a^2_1}\frac{\partial a^2_1}{\partial w^2_k} \\ & = & \frac{1}{n} \sum_{i=1}^{n}(a^2_1-y_i)a^1_k \end{eqnarray}\]

Gradient with respect to \(w^1_{ij}\):

\[\begin{eqnarray} \frac{\partial L(w)}{\partial w^1_{ij}} & = & \frac{\partial L(w)}{\partial a^2_1}\frac{\partial a^2_1}{\partial a^1_i}\frac{\partial a^1_i}{\partial w^1_{ij}} \\ & = & \frac{1}{n} \sum_{i=1}^{n}(a^2_1-y_i)w^2_i x_j \end{eqnarray}\]

def ols_dev(X, Y, w1, w2):
    n = X.shape[0]
    result = ols_forward(X, w1, w2)
    a1 = result['a1']
    a2 = result['a2']
    w2_dev = 1./n * a1.T * (a2 - Y)
    w1_dev = 1./n * X.T * (a2 - Y) * w2.T
    w_dev = {'w1_dev': w1_dev,
             'w2_dev': w2_dev}
    return w_dev

def ols_BP(X, Y, w1, w2, epoches, lr):
    for i in range(epoches):
        w_dev = ols_dev(X, Y, w1, w2)
        w1_dev = w_dev['w1_dev']
        w2_dev = w_dev['w2_dev']
        w1 -= lr*w1_dev
        w2 -= lr*w2_dev

    w = {'w1':w1,
         'w2':w2}
    return w

Looking at the gradients of the two layers, however, notice that the gradient of w2 depends on the value of a1, a1 in turn depends on w1, and the gradient of w1 depends on w2.

So a better approach is: within one iteration, first update w2, then use the updated w2 to compute the gradient of w1, then use the new w1 to compute a new a1, which in turn feeds the next w2 gradient. Updating the code above accordingly:

def ols_dev_w2(X, Y, a1, w1, a2, w2):
    n = X.shape[0]
    w2_dev = 1./n * a1.T * (a2 - Y)
    return w2_dev

def ols_dev_w1(X, Y, a2, w2):
    n = X.shape[0]
    w1_dev = 1./n * X.T * (a2 - Y) * w2.T
    return w1_dev

def ols_BP(X, Y, w1, w2, epoches, lr):
    for i in range(epoches):

        result = ols_forward(X, w1, w2)
        a1 = result['a1']
        a2 = result['a2']
        w2_dev = ols_dev_w2(X, Y, a1, w1, a2, w2)
        w2 -= lr*w2_dev

        result = ols_forward(X, w1, w2)
        a2 = result['a2']
        w1_dev = ols_dev_w1(X, Y, a2, w2)
        w1 -= lr*w1_dev

    w = {'w1':w1,
         'w2':w2}
    return w
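A numerical gradient check is a handy way to verify the chain-rule gradients above. A minimal sketch, assuming X and Y have already been built as np.mat (for example as in the run further down) and initializing small random weights here:

hidden2_num = 5
w1 = np.mat(np.random.random((X.shape[1], hidden2_num)))
w2 = np.mat(np.random.random((hidden2_num, 1)))

def ols_loss(X, Y, w1, w2):
    # the 1/(2n) squared-error loss from the derivation above
    a2 = ols_forward(X, w1, w2)['a2']
    return float(np.sum(np.square(a2 - Y)) / (2. * X.shape[0]))

# finite-difference slope for one entry of w2 vs. the analytic gradient
eps = 1e-6
w2_plus = w2.copy();  w2_plus[0, 0] += eps
w2_minus = w2.copy(); w2_minus[0, 0] -= eps
numeric = (ols_loss(X, Y, w1, w2_plus) - ols_loss(X, Y, w1, w2_minus)) / (2 * eps)
result = ols_forward(X, w1, w2)
analytic = ols_dev_w2(X, Y, result['a1'], w1, result['a2'], w2)[0, 0]
print(numeric, analytic)  # the two numbers should agree closely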

(1.4) Putting it together

class OLS_two_network():
    def __init__(self, epoches, lr, hidden2_num):
        self.lr = lr
        self.epoches = epoches
        self.hidden2_num = hidden2_num
        
    def ols_activation(self, Z):
        return Z
    
    def ols_forward(self, X, w1, w2):
        Z1 = X * w1
        a1 = self.ols_activation(Z1)
        Z2 = a1 * w2
        a2 = self.ols_activation(Z2)
        result = {'a1':a1,
                  'a2':a2}
        return result
    
    def ols_dev_w2(self, X, Y, a1, w1, a2, w2, input_num):
        w2_dev = 1./input_num * a1.T * (a2 - Y)
        return w2_dev
    
    def ols_dev_w1(self, X, Y, a2, w2, input_num):
        w1_dev = 1./input_num * X.T * (a2 - Y) * w2.T
        return w1_dev
    
    def ols_BP(self, X, Y):
        input_num = X.shape[0]
        hidden_num = X.shape[1]
        output_num = 1
        w1 = np.mat(np.random.random((hidden_num, self.hidden2_num)))
        w2 = np.mat(np.random.random((self.hidden2_num, output_num)))
        for i in range(self.epoches):
            result = self.ols_forward(X, w1, w2)
            a1 = result['a1']
            a2 = result['a2']
            w2_dev = self.ols_dev_w2(X, Y, a1, w1, a2, w2, input_num)
            w2 -= self.lr*w2_dev
            
            result = self.ols_forward(X, w1, w2)
            a2 = result['a2']
            w1_dev = self.ols_dev_w1(X, Y, a2, w2, input_num)
            w1 -= self.lr*w1_dev
        
        w = {'w1':w1,
             'w2':w2}
        return w
    
    def fit(self, X, Y):
        w = self.ols_BP(X, Y)
        w1 = w['w1']
        w2 = w['w2']
        result = self.ols_forward(X, w1, w2)
        y_predict = result['a2']
        input_num = X.shape[0]
        print("train accuracy:{}%".format(100-1./input_num*np.sqrt(np.sum(np.square(y_predict-Y)))*100))
        return y_predict

Run:

from sklearn.datasets import make_regression
import numpy as np
import matplotlib.pyplot as plt

X, Y = make_regression(n_samples=1000, n_features=10, noise=5) 
print(X.shape, Y.shape)

X = np.insert(X, 0, values=1, axis=1)
X = np.mat(X)
Y = np.mat(Y).T
print(X.shape, Y.shape)

lr = OLS_two_network(epoches=1000, lr=0.001, hidden2_num=5)
y_predict = lr.fit(X, Y)
# ((1000, 10), (1000,))
# ((1000, 11), (1000, 1))
# train accuracy:84.4941378548%
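Note that because both activations are the identity, this two-layer model is still a purely linear map: \(a^2 = (X w^1) w^2 = X (w^1 w^2)\), so the extra layer adds no expressive power over the single-layer version. A quick check (a sketch; it trains once more because fit() does not expose the learned weights):

w = lr.ols_BP(X, Y)
print(np.allclose(lr.ols_forward(X, w['w1'], w['w2'])['a2'], X * (w['w1'] * w['w2'])))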

(2) Logistic Regression as a BP Neural Network

Computation from layer 1 to layer 2:

\[\begin{eqnarray} Z^1_i & = & w^1_{i0} + w^1_{i1} x_1 + \cdots + w^1_{id} x_{d} \\ & = & \begin{pmatrix} w^1_{i0} & w^1_{i1} & \cdots & w^1_{id} \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_{d} \end{pmatrix} \end{eqnarray}\]

Its dimension is (1×n); written in matrix form:

\[\begin{eqnarray} Z^1 & = & w^1_{0} + w^1_{1} x_1 + \cdots + w^1_{d} x_{d} \\ & = & \begin{pmatrix} w^1_{0} & w^1_{1} & \cdots & w^1_{d} \end{pmatrix} \begin{pmatrix} 1 \\ x_1 \\ \vdots \\ x_{d} \end{pmatrix} \end{eqnarray}\]

Its dimension is \((d_1×n)\).

\[a^1_i = \frac{1}{1+e^{- Z^1_i}}\] \[a^1 = \frac{1}{1+e^{- Z^1}}\]

Computation from layer 2 to layer 3:

\[\begin{eqnarray} Z^2_1 & = & w^2_0 + w^2_1 a^1_1 + \cdots + w^2_{d_1} a^1_{d_1} \\ & = & \begin{pmatrix} w^2_{0} & w^2_{1} & \cdots & w^2_{d_1} \end{pmatrix} \begin{pmatrix} 1 \\ a^1_1 \\ \vdots \\ a^1_{d_1} \end{pmatrix} \end{eqnarray}\]

Its dimension is (1×n).

\[a^2_1 = \frac{1}{1+e^{- Z^2_1}}\]

From layer 3 to the output layer:

\[Y_{predict} = a^2_1\]

(2.1) Activation function

Same as before.

(2.2) Forward propagation

def lr_forward(X, w1, w2):
    Z1 = X * w1
    a1 = lr_activation(Z1)
    Z2 = a1 * w2
    a2 = lr_activation(Z2)

    result = {'a1':a1,
              'a2':a2}
    return result

(2.3) Backpropagation (BP)

\[L(w) = -\frac{1}{n}\sum_{i=1}^n [y_i log a^2_{1i} + (1-y_i)log(1-a^2_{1i})]\] \[\begin{eqnarray} \frac{\partial L(w)}{\partial w^2_j} & = & \frac{\partial L(w)}{\partial a^2_{1i}} \frac{\partial a^2_{1i}}{\partial Z^2_{1i}} \frac{\partial Z^2_{1i}}{\partial w^2_j} \\ & = & -\frac{1}{n}\sum_{i=1}^n [\frac{y_i}{a^2_{1i}} - \frac{1-y_i}{1-a^2_{1i}}]a^2_{1i}(1-a^2_{1i})a^1_{ji} \\ & = & -\frac{1}{n}\sum_{i=1}^n [y_i - a^2_{1i}]a^1_{ji} \\ & = & -\frac{1}{n} \begin{pmatrix} y_1 - a^2_{11} & y_2 - a^2_{12} & \cdots & y_n - a^2_{1n} \end{pmatrix} \begin{pmatrix} a^1_{j1} \\ a^1_{j2} \\ \vdots \\ a^1_{jn} \end{pmatrix} \\ & = & -\frac{1}{n} (Y - a^2_1)^T a^1_j \end{eqnarray}\]

Its dimension is 1×1.

Then:

\[\begin{eqnarray} \frac{\partial L(w)}{\partial w^2} & = & -\frac{1}{n} (Y - a^2_1)^T a^1 \end{eqnarray}\]

Its dimension is \((1×d_1)\).

\[\begin{eqnarray} \frac{\partial L(w)}{\partial w^1_{mj}} & = & \frac{\partial L(w)}{\partial a^2_{1i}} \frac{\partial a^2_{1i}}{\partial Z^2_{1i}} \frac{\partial Z^2_{1i}}{\partial a^1_{mi}} \frac{\partial a^1_{mi}}{\partial w^1_{mj}} \\ & = & -\frac{1}{n}\sum_{i=1}^n [y_i - a^2_{1i}]w^2_{m} x_j \\ & = & -\frac{1}{n} \begin{pmatrix} y_1 - a^2_{11} & y_2 - a^2_{12} & \cdots & y_n - a^2_{1n} \end{pmatrix} w^2_m \begin{pmatrix} x^1_{1j} \\ x^1_{2j} \\ \vdots \\ x^1_{nj} \end{pmatrix} \\ & = & -\frac{1}{n} (Y - a^2_1)^T w^2_m X_{·,j} \end{eqnarray}\]

Its dimension is 1×1.

\[\begin{eqnarray} \frac{\partial L(w)}{\partial w^1_{m}} & = & -\frac{1}{n} (Y - a^2_1)^T w^2_m X \end{eqnarray}\]

Its dimension is \((1×d)\).

\[\begin{eqnarray} \frac{\partial L(w)}{\partial w^1} & = & -\frac{1}{n} X^T (Y - a^2_1) (w^2)^T \end{eqnarray}\]

Its dimension is \((d×d_1)\).

def lr_dev_w1(X, Y, a2, w2):
    n = X.shape[0]
    w1_dev = - 1./n * X.T * (Y - a2) * w2.T
    return w1_dev

def lr_dev_w2(Y, a1, a2):
    n = Y.shape[0]
    w2_dev = - 1./n * a1.T * (Y - a2)
    return w2_dev

def lr_BP(X, Y, w1, w2, lr, epoches):
    for i in range(epoches):
        result = lr_forward(X, w1, w2)
        w2 -= lr * lr_dev_w2(Y, result['a1'], result['a2'])

        result = lr_forward(X, w1, w2)
        w1 -= lr * lr_dev_w1(X, Y, result['a2'], w2)

    w = {'w1':w1,
         'w2':w2}
    return w

(2.4) Putting it together

class LR_two_network():
    def __init__(self, lr, epoches):
        self.lr = lr
        self.epoches = epoches

    def lr_activation(self, Z):
        return 1./ (1 + np.exp(-Z))

    def lr_dev_w1(self, X, Y, a2, w2, input_num):
        return - 1./ input_num * X.T * (Y - a2) * w2.T

    def lr_dev_w2(self, Y, a1, a2, input_num):
        return - 1./ input_num * a1.T * (Y - a2)

    def lr_forward(self, X, w1, w2):
        Z1 = X * w1  
        a1 = self.lr_activation(Z1) 
        Z2 = a1 * w2 
        a2 = self.lr_activation(Z2)
        return a1, a2
    

    def lr_BP(self, X, Y):
        input_num = X.shape[0]
        hidden_num = X.shape[1]
        hidden2_num = int(hidden_num / 2)
        output_num = Y.shape[1]
        w1 = np.random.random((hidden_num, hidden2_num))
        w2 = np.random.random((hidden2_num, output_num))

        for i in range(self.epoches):
            a1, a2 = self.lr_forward(X, w1, w2)
            w2_dev = self.lr_dev_w2(Y, a1, a2, input_num)
            w2 -= self.lr * w2_dev
            a1, a2 = self.lr_forward(X, w1, w2)
            w1_dev = self.lr_dev_w1(X, Y, a2, w2, input_num)
            w1 -= self.lr * w1_dev

        return w1, w2

    def fit(self, X, Y):
        w1, w2 = self.lr_BP(X, Y)
        a1, a2 = self.lr_forward(X, w1, w2)
        y_predict = a2
        y_predict[y_predict < 0.5] = 0
        y_predict[y_predict >= 0.5] = 1
        input_num = X.shape[0]

        print("train accuracy:{}%".format(np.sum(y_predict == Y) / float(input_num) * 100))
        return y_predict

Run:

from sklearn.datasets import make_classification
import numpy as np
import matplotlib.pyplot as plt

X, Y = make_classification(n_samples=1000, n_features=20,  n_classes=2) 
print(X.shape, Y.shape)

X = np.insert(X, 0, values=1, axis=1)
X = np.mat(X)
Y = np.mat(Y).T
print(X.shape, Y.shape)

lr = LR_two_network(epoches=1000, lr=0.01)
y_predict = lr.fit(X, Y)
# ((1000, 20), (1000,))
# ((1000, 21), (1000, 1))
# train accuracy:89.1%
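As a rough cross-check (a sketch, not the post's own method), sklearn's MLPClassifier on the same data gives a ballpark figure to compare against; note its defaults (relu activation, adam optimizer) differ from the sigmoid network above:

from sklearn.neural_network import MLPClassifier

clf = MLPClassifier(hidden_layer_sizes=(10,), max_iter=1000)
clf.fit(np.asarray(X), np.asarray(Y).ravel())
print(clf.score(np.asarray(X), np.asarray(Y).ravel()))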

(II) Multi-Layer Neural Networks

Here we no longer restrict the number of layers, and write a single unified form covering the linear and logistic regressions above.

First, let us organize the neural network in terms of formulas.

The activation function is written as \(\sigma(\cdot)\).

So in the figure, \(a^{[l]}_i = \sigma(z^{[l]}_i)\),

where \(z^{[l]}_j = b^{[l]}_j + w^{[l]}_{1j}a^{[l-1]}_1 + w^{[l]}_{2j}a^{[l-1]}_2 + \cdots + w^{[l]}_{d_{l-1}j}a^{[l-1]}_{d_{l-1}}\).

Here we assume the numbers of neurons in the layers are \((d_1, d_2, \cdots, d_L)\); if Y has only a single output, i.e. Y has dimension (n×1), then \(d_L=1\).

Correspondingly, we write the loss function as \(L(a^{[L]}_j, y_j) = L(\sigma(z^{[L]}_j), y_j)\).

As before, the parameters are updated so as to minimize the loss function, so the gradients we need are \(\frac{\partial L(a^{[L]}_j, y_j)}{\partial w^{[l]}_{ij}}\).

With more layers, however, the computation becomes much more involved, so we compute the gradients step by step:

(1) First, the simplest case: differentiate with respect to the \(w^{[L]}\) weights.

Working step by step, first differentiate with respect to \(z^{[L]}\):

\[\delta^{[L]}_j = \frac{\partial L(a^{[L]}, y)}{\partial z^{[L]}_j} = \frac{\partial L(a^{[L]}, y)}{\partial a^{[L]}_j} \frac{\partial a^{[L]}_j}{\partial z^{[L]}_j} = \frac{\partial L}{\partial a^{[L]}_j} \sigma^{\prime}(z^{[L]}_j)\]

For the linear model above, \(\delta^{[L]}_j = \frac{1}{n} \sum_j (a^{[L]}_j - y_j)\).

Then compute:

\[\frac{\partial z^{[L]}_j}{\partial w^{[L]}_{ij}} = a^{[L-1]}_i\]

So:

\[\frac{\partial L(a^{[L]}_j, y_j)}{\partial w^{[L]}_{ij}} = \delta^{[L]}_j a^{[L-1]}_i\]

(2) Next, differentiate with respect to the \(w^{[L-1]}\) weights.

Again step by step, first differentiate with respect to \(z^{[L-1]}\):

\[\begin{eqnarray} \delta^{[L-1]}_j & = & \frac{\partial L(a^{[L]}, y)}{\partial z^{[L-1]}_j} \\ & = & \frac{\partial L(a^{[L]}, y)}{\partial a^{[L]}_1}\frac{\partial a^{[L]}_1}{\partial z^{[L]}_1}\frac{\partial z^{[L]}_1}{\partial z^{[L-1]}_j} + \frac{\partial L(a^{[L]}, y)}{\partial a^{[L]}_2}\frac{\partial a^{[L]}_2}{\partial z^{[L]}_2}\frac{\partial z^{[L]}_2}{\partial z^{[L-1]}_j} + \cdots + \frac{\partial L(a^{[L]}, y)}{\partial a^{[L]}_{d_L}}\frac{\partial a^{[L]}_{d_L}}{\partial z^{[L]}_{d_L}}\frac{\partial z^{[L]}_{d_L}}{\partial z^{[L-1]}_j} \\ & = & \delta^{[L]}_1 w^{[L]}_{j1} \sigma^{\prime}(z^{[L-1]}_j) + \delta^{[L]}_2 w^{[L]}_{j2} \sigma^{\prime}(z^{[L-1]}_j) + \cdots + \delta^{[L]}_{d_L} w^{[L]}_{j{d_L}} \sigma^{\prime}(z^{[L-1]}_j) \\ & = & \sum_{k=1}^{d_L} \delta^{[L]}_k w^{[L]}_{jk} \sigma^{\prime}(z^{[L-1]}_j) \end{eqnarray}\]

where: \(\begin{eqnarray} \frac{\partial z^{[L]}_1}{\partial z^{[L-1]}_j} & = & \frac{\partial z^{[L]}_1}{\partial a^{[L-1]}_j} \frac{\partial a^{[L-1]}_j}{\partial z^{[L-1]}_j} \\ & = & w^{[L]}_{j1} \sigma^{\prime}(z^{[L-1]}_j) \end{eqnarray}\)

Let us now go through the four important formulas:

(1) Output-layer error

\[\delta^{[L]}_j = \frac{\partial L}{\partial a^{[L]}_j} \sigma^{\prime}(Z^{[L]}_j)\]

In matrix form:

\[\delta^{[L]} = \nabla L \odot \sigma^{\prime}(Z^{[L]})\]

where:

For linear regression, \(\frac{\partial L}{\partial a^{[L]}_j} = \frac{1}{n} \sum_j (a^{[L]}_j - y_j)\) and \(\sigma^{\prime}(Z^{[L]}_j) = 1\); multiplying the two gives \(\delta^{[L]}_j = \frac{1}{n} \sum_j (a^{[L]}_j - y_j)\);

For logistic regression, \(\frac{\partial L}{\partial a^{[L]}_j} = -\frac{1}{n} \sum_j (\frac{y_j}{a^{[L]}_j} - \frac{1 - y_j}{1 - a^{[L]}_j})\) and \(\sigma^{\prime}(Z^{[L]}_j) = a^{[L]}_j(1 - a^{[L]}_j)\); multiplying the two gives \(\delta^{[L]}_j = \frac{1}{n} \sum_j (a^{[L]}_j - y_j)\).

(2) Hidden-layer error

\[\delta^{[l]}_j = \sum_k w^{[l+1]}_{kj} \delta^{[l+1]}_k \sigma^{\prime}(Z^{[l]}_j)\]

In matrix form:

\[\delta^{[l]} = w^{[l+1]} \delta^{[l+1]} \odot \sigma^{\prime}(Z^{[l]})\]

(3) Parameter gradients

\[\frac{\partial L}{\partial b^{[l]}_j} = \delta^{[l]}_j\]

In matrix form:

\[\frac{\partial L}{\partial b^{[l]}} = \delta^{[l]}\] \[\frac{\partial L}{\partial w^{[l]}_{jk}} = a^{[l-1]}_k \delta^{[l]}_j\]

In matrix form:

\[\frac{\partial L}{\partial w^{[l]}} = \delta^{[l]} (a^{[l-1]})^T\]

(4) Parameter update rule

\[b^{[l]}_j = b^{[l]}_j - \alpha \frac{\partial L}{\partial b^{[l]}_j}\]

In matrix form:

\[b^{[l]} = b^{[l]} - \alpha \frac{\partial L}{\partial b^{[l]}}\] \[w^{[l]}_{jk} = w^{[l]}_{jk} - \alpha \frac{\partial L}{\partial w^{[l]}_{jk}}\]

In matrix form:

\[w^{[l]} = w^{[l]} - \alpha \frac{\partial L}{\partial w^{[l]}}\]

We implement the four formulas above in code (here, as in the earlier code, each row of X and of the activations is one sample, so the matrix forms are transposed relative to the column-vector formulas above):

def output_dev(Y, a_L):
    # (1) output-layer error: delta_L = (a_L - Y) / n, one row per sample
    n = Y.shape[0]
    return 1./n * (a_L - Y)

def hidden_dev(a_l, w_l1, delta_l1, model):
    # (2) hidden-layer error: delta_l = (delta_{l+1} * w_{l+1}.T) elementwise sigma'(Z_l)
    if model == 'ols':
        sigma_dev = 1.                          # identity activation
    elif model == 'lr':
        sigma_dev = np.multiply(a_l, 1 - a_l)   # sigmoid derivative
    delta_l = np.multiply(delta_l1 * w_l1.T, sigma_dev)
    return delta_l

def error(Y, a, w, model):
    # back-propagate the error from the output layer down to the first hidden layer
    delta = [output_dev(Y, a[-1])]
    for i in range(len(w) - 1, 0, -1):
        delta.append(hidden_dev(a[i - 1], w[i], delta[-1], model))
    delta.reverse()
    return delta

def BP(X, Y, a, w, b, lr, model):
    # (3) parameter gradients and (4) the update rule, one step for every layer
    delta = error(Y, a, w, model)
    a_prev = [X] + a[:-1]                       # the input seen by each layer
    for i in range(len(w)):
        b_dev = delta[i].sum(axis=0)
        w_dev = a_prev[i].T * delta[i]
        b[i] -= lr * b_dev
        w[i] -= lr * w_dev
    return w, b
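A minimal usage sketch to show how the pieces fit together (assumptions of this sketch: sigmoid activations in every layer, small random initial weights, the layer widths 10 and 5 chosen arbitrarily, and X, Y reused from the classification run above):

def forward(X, w, b):
    # forward pass, keeping every layer's activation for BP
    a, out = [], X
    for wi, bi in zip(w, b):
        out = 1. / (1. + np.exp(-(out * wi + bi)))
        a.append(out)
    return a

np.random.seed(0)
sizes = [X.shape[1], 10, 5, 1]
w = [np.mat(0.01 * np.random.random((sizes[i], sizes[i + 1]))) for i in range(len(sizes) - 1)]
b = [np.mat(np.zeros((1, sizes[i + 1]))) for i in range(len(sizes) - 1)]
for epoch in range(1000):
    a = forward(X, w, b)
    w, b = BP(X, Y, a, w, b, lr=0.5, model='lr')
y_hat = (forward(X, w, b)[-1] >= 0.5).astype(int)
print("train accuracy:{}%".format(np.sum(y_hat == Y) / float(X.shape[0]) * 100))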
		
