Improving NN - Weight Initialization
Purpose
We try to improve a neural network's performance through weight initialization.
Dataset
A binary classification dataset laid out as concentric rings (a donut shape).
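The post does not show the data-loading code; purely as a hypothetical sketch, a dataset of this shape can be generated with sklearn.datasets.make_circles (the load_dataset helper and the noise/sample-count values below are assumptions, not the post's actual settings):

```python
import numpy as np
import sklearn.datasets

def load_dataset(seed=1):
    # Two concentric rings; the ring index is the binary label.
    np.random.seed(seed)
    train_X, train_Y = sklearn.datasets.make_circles(n_samples=300, noise=0.05)
    test_X, test_Y = sklearn.datasets.make_circles(n_samples=100, noise=0.05)
    # Transpose to the (features, examples) layout the model expects.
    return (train_X.T, train_Y.reshape(1, -1),
            test_X.T, test_Y.reshape(1, -1))
```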
Model
The model is the 3-layer NN built in the previous part.
```python
def model(X, Y, learning_rate=0.01, num_iterations=15000, print_cost=True, initialization="he"):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.

    Arguments:
    X -- input data, of shape (2, number of examples)
    Y -- true "label" vector (containing 0 for red dots; 1 for blue dots), of shape (1, number of examples)
    learning_rate -- learning rate for gradient descent
    num_iterations -- number of iterations to run gradient descent
    print_cost -- if True, print the cost every 1000 iterations
    initialization -- flag to choose which initialization to use ("zeros", "random" or "he")

    Returns:
    parameters -- parameters learnt by the model
    """
    grads = {}
    costs = []          # to keep track of the loss
    m = X.shape[1]      # number of examples
    layers_dims = [X.shape[0], 10, 5, 1]

    # Initialize parameters dictionary.
    if initialization == "zeros":
        parameters = initialize_parameters_zeros(layers_dims)
    elif initialization == "random":
        parameters = initialize_parameters_random(layers_dims)
    elif initialization == "he":
        parameters = initialize_parameters_he(layers_dims)

    # Loop (gradient descent)
    for i in range(num_iterations):
        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        a3, cache = forward_propagation(X, parameters)

        # Loss
        cost = compute_loss(a3, Y)

        # Backward propagation.
        grads = backward_propagation(X, Y, cache)

        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)

        # Print the loss every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
            costs.append(cost)

    # plot the loss
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (per thousands)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()

    return parameters
```
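The helpers forward_propagation, compute_loss, backward_propagation, update_parameters, and predict come from the course's utility module and are not shown in this post. For reference, a plausible minimal sketch of compute_loss (not necessarily the utilities' exact implementation), assuming a3 is the sigmoid output:

```python
import numpy as np

def compute_loss(a3, Y):
    # Mean binary cross-entropy over the m examples:
    #   L = -(1/m) * sum( y * log(a3) + (1 - y) * log(1 - a3) )
    # Note: if a3 saturates to exactly 0 or 1, log() diverges --
    # this is what produces the inf cost in the random-init run below.
    m = Y.shape[1]
    logprobs = -np.log(a3) * Y - np.log(1 - a3) * (1 - Y)
    return float(np.sum(logprobs) / m)
```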
Weight Initialization Methods
The following are methods for initializing the weights $W^{[1]}, \dots, W^{[L]}$ and biases $b^{[1]}, \dots, b^{[L]}$.
Zero Initialization
Initialize every parameter to zero with np.zeros(), matching each layer's shape.
```python
def initialize_parameters_zeros(layers_dims):
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = np.zeros((layers_dims[l], layers_dims[l-1]))
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```

```python
parameters = model(train_X, train_Y, initialization="zeros")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```

```
Cost after iteration 0: 0.6931471805599453
Cost after iteration 1000: 0.6931471805599453
Cost after iteration 2000: 0.6931471805599453
Cost after iteration 3000: 0.6931471805599453
Cost after iteration 4000: 0.6931471805599453
Cost after iteration 5000: 0.6931471805599453
Cost after iteration 6000: 0.6931471805599453
Cost after iteration 7000: 0.6931471805599453
Cost after iteration 8000: 0.6931471805599453
Cost after iteration 9000: 0.6931471805599453
Cost after iteration 10000: 0.6931471805599455
Cost after iteration 11000: 0.6931471805599453
Cost after iteration 12000: 0.6931471805599453
Cost after iteration 13000: 0.6931471805599453
Cost after iteration 14000: 0.6931471805599453
On the train set:
Accuracy: 0.5
On the test set:
Accuracy: 0.5
```
Since the weights are initialized to zero, $z = 0$ in every layer, so each ReLU activation is

$$a = \mathrm{ReLU}(z) = \max(0, z) = 0,$$

and the sigmoid at the output produces

$$y_{pred} = \frac{1}{1 + e^{0}} = \frac{1}{2}.$$
Plugging this into the loss function

$$\mathcal{L}(a, y) = -y \ln(y_{pred}) - (1 - y) \ln(1 - y_{pred})$$

gives, if $y = 1$:

$$\mathcal{L} = -1 \cdot \ln(1/2) - 0 \cdot \ln(1 - 1/2) = -\ln(1/2) \approx 0.693$$

and if $y = 0$:

$$\mathcal{L} = -0 \cdot \ln(1/2) - 1 \cdot \ln(1 - 1/2) = -\ln(1/2) \approx 0.693$$
Because the prediction can only ever be 0.5, every example produces exactly the same loss, the gradients are identical for every unit, and the weights never change: the network cannot break symmetry.
→ Never initialize the weights to zero.
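This is easy to verify numerically. A minimal sketch (sigmoid is written inline here, not taken from the model's utilities):

```python
import numpy as np

z = 0.0                              # all-zero weights force z = 0 at the output
y_pred = 1.0 / (1.0 + np.exp(-z))    # sigmoid(0) = 0.5
for y in (0, 1):
    loss = -y * np.log(y_pred) - (1 - y) * np.log(1 - y_pred)
    print(y, loss)                   # both labels give 0.6931471805599453 = ln(2)
```

which matches the constant cost of 0.693 in the training log above.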
Random Initialization
Initialize W randomly using np.random.randn(shape), and initialize b with zeros.
```python
# np.random.randn draws from a standard normal distribution;
# np.random.rand draws from a uniform distribution.
def initialize_parameters_random(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims)
    for l in range(1, L):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * 10
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```

```python
parameters = model(train_X, train_Y, initialization="random")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```

```
Cost after iteration 0: inf
Cost after iteration 1000: 0.6247924745506072
Cost after iteration 2000: 0.5980258056061102
Cost after iteration 3000: 0.5637539062842213
Cost after iteration 4000: 0.5501256393526495
Cost after iteration 5000: 0.5443826306793814
Cost after iteration 6000: 0.5373895855049121
Cost after iteration 7000: 0.47157999220550006
Cost after iteration 8000: 0.39770475516243037
Cost after iteration 9000: 0.3934560146692851
Cost after iteration 10000: 0.3920227137490125
Cost after iteration 11000: 0.38913700035966736
Cost after iteration 12000: 0.3861358766546214
Cost after iteration 13000: 0.38497629552893475
Cost after iteration 14000: 0.38276694641706693
On the train set:
Accuracy: 0.83
On the test set:
Accuracy: 0.86
```

The initial cost is inf because the weights start out very large (the random values are scaled by 10): the sigmoid output saturates to exactly 0 or 1 in floating point, and the logarithm in the cross-entropy loss then diverges to infinity. Large random weights also slow optimization down, so it is important to multiply the random values by a small factor instead.
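The inf can be reproduced directly. In float64, the sigmoid saturates to exactly 1.0 once its input is large enough, and the $\ln(1 - y_{pred})$ term then overflows (a minimal sketch, independent of the model code above):

```python
import numpy as np

z = 40.0                    # pre-activations this large are easy to reach with weights scaled by 10
a = 1.0 / (1.0 + np.exp(-z))
print(a)                    # 1.0 -- the sigmoid has fully saturated in float64
print(-np.log(1.0 - a))     # inf (NumPy also emits a divide-by-zero RuntimeWarning)
```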
He Initialization
A close relative of Xavier initialization (which scales by sqrt(1./layers_dims[l-1]) instead), He initialization multiplies values drawn from a standard normal distribution by sqrt(2./layers_dims[l-1]). The factor of 2 compensates for ReLU zeroing out roughly half of its inputs.
```python
def initialize_parameters_he(layers_dims):
    np.random.seed(3)
    parameters = {}
    L = len(layers_dims) - 1
    for l in range(1, L + 1):
        parameters['W' + str(l)] = np.random.randn(layers_dims[l], layers_dims[l-1]) * np.sqrt(2. / layers_dims[l-1])
        parameters['b' + str(l)] = np.zeros((layers_dims[l], 1))
    return parameters
```

```python
parameters = model(train_X, train_Y, initialization="he")
print("On the train set:")
predictions_train = predict(train_X, train_Y, parameters)
print("On the test set:")
predictions_test = predict(test_X, test_Y, parameters)
```

```
Cost after iteration 0: 0.8830537463419761
Cost after iteration 1000: 0.6879825919728063
Cost after iteration 2000: 0.6751286264523371
Cost after iteration 3000: 0.6526117768893805
Cost after iteration 4000: 0.6082958970572938
Cost after iteration 5000: 0.5304944491717495
Cost after iteration 6000: 0.4138645817071794
Cost after iteration 7000: 0.3117803464844441
Cost after iteration 8000: 0.23696215330322562
Cost after iteration 9000: 0.1859728720920684
Cost after iteration 10000: 0.15015556280371808
Cost after iteration 11000: 0.12325079292273551
Cost after iteration 12000: 0.09917746546525937
Cost after iteration 13000: 0.08457055954024283
Cost after iteration 14000: 0.07357895962677366
On the train set:
Accuracy: 0.9933333333333333
On the test set:
Accuracy: 0.96
```
Conclusion
All three experiments use the same 3-layer NN: ReLU activations in the hidden layers, a sigmoid output layer, and a binary cross-entropy loss.
| Initialization | Train accuracy | Test accuracy | Assessment |
|---|---|---|---|
| Zero init | 50% | 50% | Identical loss on every example; the weights never update |
| Large random init | 83% | 86% | Oversized weights slow down optimization |
| He init | 99% | 96% | The most effective of the three |
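The ranking in the table reflects how each scheme scales the activations. The sketch below (assumed setup: the post's layer sizes, standard-normal inputs, LINEAR->RELU layers only) prints the standard deviation of the activations at each layer under the three schemes; zero init collapses it to 0, the x10 scaling blows it up, and He scaling keeps it on the order of 1.

```python
import numpy as np

def relu(z):
    return np.maximum(0, z)

def activation_std(scale_fn, dims=(2, 10, 5, 1), n=300, seed=3):
    """Forward random data through LINEAR->RELU layers; return the std of each activation."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((dims[0], n))
    stds = []
    for l in range(1, len(dims)):
        W = rng.standard_normal((dims[l], dims[l-1])) * scale_fn(dims[l-1])
        a = relu(W @ a)          # bias is zero in all three schemes
        stds.append(a.std())
    return stds

print("zeros :", activation_std(lambda n_prev: 0.0))
print("x10   :", activation_std(lambda n_prev: 10.0))
print("he    :", activation_std(lambda n_prev: np.sqrt(2.0 / n_prev)))
```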