
Improving NN - Regularization

Purpose

Apply regularization to a neural network.

Dataset

(Figures: visualization of the dataset)

Model

import numpy as np
import matplotlib.pyplot as plt
# Helper functions such as initialize_parameters, forward_propagation, compute_cost,
# update_parameters, relu, sigmoid and predict are assumed to come from the assignment's
# utility code; the regularized and dropout variants are implemented later in this post.

def model(X, Y, learning_rate = 0.3, num_iterations = 30000, print_cost = True, lambd = 0, keep_prob = 1):
    """
    Implements a three-layer neural network: LINEAR->RELU->LINEAR->RELU->LINEAR->SIGMOID.
    
    Arguments:
    X -- input data, of shape (input size, number of examples)
    Y -- true "label" vector (1 for blue dot / 0 for red dot), of shape (output size, number of examples)
    learning_rate -- learning rate of the optimization
    num_iterations -- number of iterations of the optimization loop
    print_cost -- If True, print the cost every 10000 iterations
    lambd -- regularization hyperparameter, scalar
    keep_prob -- probability of keeping a neuron active during drop-out, scalar
    
    Returns:
    parameters -- parameters learned by the model. They can then be used to predict.
    """
        
    grads = {}
    costs = []                            # to keep track of the cost
    m = X.shape[1]                        # number of examples
    layers_dims = [X.shape[0], 20, 3, 1]
    
    # Initialize parameters dictionary.
    parameters = initialize_parameters(layers_dims)

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation: LINEAR -> RELU -> LINEAR -> RELU -> LINEAR -> SIGMOID.
        if keep_prob == 1:
            a3, cache = forward_propagation(X, parameters)
        elif keep_prob < 1:
            a3, cache = forward_propagation_with_dropout(X, parameters, keep_prob)
        
        # Cost function
        if lambd == 0:
            cost = compute_cost(a3, Y)
        else:
            cost = compute_cost_with_regularization(a3, Y, parameters, lambd)
            
        # Backward propagation.
        assert (lambd == 0 or keep_prob == 1)   # it is possible to use both L2 regularization and dropout, 
                                                # but this assignment will only explore one at a time
        if lambd == 0 and keep_prob == 1:
            grads = backward_propagation(X, Y, cache)
        elif lambd != 0:
            grads = backward_propagation_with_regularization(X, Y, cache, lambd)
        elif keep_prob < 1:
            grads = backward_propagation_with_dropout(X, Y, cache, keep_prob)
        
        # Update parameters.
        parameters = update_parameters(parameters, grads, learning_rate)
        
        # Print the loss every 10000 iterations
        if print_cost and i % 10000 == 0:
            print("Cost after iteration {}: {}".format(i, cost))
        if print_cost and i % 1000 == 0:
            costs.append(cost)
    
    # plot the cost
    plt.plot(costs)
    plt.ylabel('cost')
    plt.xlabel('iterations (x1,000)')
    plt.title("Learning rate =" + str(learning_rate))
    plt.show()
    
    return parameters

Regularization

  • Baseline performance without regularization

      parameters = model(train_X, train_Y)
      print ("On the training set:")
      predictions_train = predict(train_X, train_Y, parameters)
      print ("On the test set:")
      predictions_test = predict(test_X, test_Y, parameters)
    
      Cost after iteration 0: 0.6557412523481002
      Cost after iteration 10000: 0.16329987525724204
      Cost after iteration 20000: 0.13851642423234922
        
      On the training set:
      Accuracy: 0.9478672985781991
      On the test set:
      Accuracy: 0.915
    

    (Figures: cost curve and decision boundary of the non-regularized model)

    Without regularization the training accuracy (94.8%) is noticeably higher than the test accuracy (91.5%), i.e., the model overfits the training set.

  • L2 regularization

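    The regularized cost adds an L2 weight penalty to the standard cross-entropy cost; written out for the three weight matrices of this network (a reconstruction matching the code below), it is

    $$J_{regularized} = J_{cross\text{-}entropy} + \frac{\lambda}{2m}\left(\lVert W_1\rVert_F^2 + \lVert W_2\rVert_F^2 + \lVert W_3\rVert_F^2\right)$$

    where $\lVert W\rVert_F^2$ is the sum of the squared entries of $W$.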

      def compute_cost_with_regularization(A3, Y, parameters, lambd):

          m = Y.shape[1]
          W1 = parameters["W1"]
          W2 = parameters["W2"]
          W3 = parameters["W3"]

          # Standard cross-entropy cost
          cross_entropy_cost = compute_cost(A3, Y)

          # L2 penalty: (lambd / 2m) * sum of squared weights over all layers
          L2_regularization_cost = (1/m) * (lambd/2) * (np.sum(np.square(W1)) + np.sum(np.square(W2)) + np.sum(np.square(W3)))

          cost = cross_entropy_cost + L2_regularization_cost

          return cost
    
  • Backward propagation with L2 regularization

    Differentiating the regularized cost above, the penalty term (which is simply added to the cross-entropy term) contributes an extra (lambd/m) * W to the gradient of each layer's weight matrix. This penalizes large weights; an extremely large lambd can shrink the weights so much that the model ends up learning only a nearly linear decision boundary.
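
    Concretely, the derivative of the penalty term with respect to each weight matrix is

    $$\frac{\partial}{\partial W_l}\left(\frac{\lambda}{2m}\lVert W_l\rVert_F^2\right) = \frac{\lambda}{m} W_l,$$

    so each dW in the code below simply gains an extra (lambd/m) * W term.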

      def backward_propagation_with_regularization(X, Y, cache, lambd):

          m = X.shape[1]
          (Z1, A1, W1, b1, Z2, A2, W2, b2, Z3, A3, W3, b3) = cache

          dZ3 = A3 - Y
          # Each dW gains the extra L2 term (lambd/m) * W
          dW3 = 1./m * np.dot(dZ3, A2.T) + ((lambd/m) * W3)
          db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

          dA2 = np.dot(W3.T, dZ3)
          dZ2 = np.multiply(dA2, np.int64(A2 > 0))   # ReLU derivative
          dW2 = 1./m * np.dot(dZ2, A1.T) + ((lambd/m) * W2)
          db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

          dA1 = np.dot(W2.T, dZ2)
          dZ1 = np.multiply(dA1, np.int64(A1 > 0))   # ReLU derivative
          dW1 = 1./m * np.dot(dZ1, X.T) + ((lambd/m) * W1)
          db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

          gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                       "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                       "dZ1": dZ1, "dW1": dW1, "db1": db1}

          return gradients
    
      parameters = model(train_X, train_Y, lambd = 0.7)
      print ("On the train set:")
      predictions_train = predict(train_X, train_Y, parameters)
      print ("On the test set:")
      predictions_test = predict(test_X, test_Y, parameters)
    
      Cost after iteration 0: 0.6974484493131264
      Cost after iteration 10000: 0.2684918873282238
      Cost after iteration 20000: 0.26809163371273004
        
      On the train set:
      Accuracy: 0.9383886255924171
      On the test set:
      Accuracy: 0.93
    

    (Figures: cost curve and decision boundary with L2 regularization, lambd = 0.7)

    It is important to tune the lambd hyperparameter to find the right balance between bias and variance: a lambd that is too large gives the model too much bias and leads to underfitting, while a lambd that is too small can still allow overfitting. A simple sweep is sketched below.
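
    As a rough illustration of such tuning, here is a minimal sketch reusing the model and predict functions above; the candidate lambd values are arbitrary, not tuned:

      # Hypothetical lambd sweep: retrain once per candidate value and compare accuracies.
      for lambd_candidate in [0.01, 0.1, 0.7, 3.0]:   # illustrative values only
          parameters = model(train_X, train_Y, lambd=lambd_candidate, print_cost=False)
          print("lambd =", lambd_candidate)
          print("On the train set:")
          predictions_train = predict(train_X, train_Y, parameters)
          print("On the test set:")
          predictions_test = predict(test_X, test_Y, parameters)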

Dropout

Dropout acts as a regularizer by randomly disconnecting neurons in each layer; the keep_prob parameter controls what fraction of each layer's neurons is kept active.

  • Forward propagation with Dropout

    For each hidden layer, draw a matrix of uniform random values with np.random.rand matching the shape of the activations, threshold it against keep_prob to get a binary mask, and multiply it element-wise with the activations so that only the kept fraction of neurons passes its values forward. To keep the expected value of the activations unchanged, the result is also divided by keep_prob (inverted dropout); a small numerical check of this scaling follows the implementation below.

      def forward_propagation_with_dropout(X, parameters, keep_prob = 0.5):
          np.random.seed(1)
          W1 = parameters["W1"]
          b1 = parameters["b1"]
          W2 = parameters["W2"]
          b2 = parameters["b2"]
          W3 = parameters["W3"]
          b3 = parameters["b3"]

          Z1 = np.dot(W1, X) + b1
          A1 = relu(Z1)
          D1 = np.random.rand(A1.shape[0], A1.shape[1])   # random matrix with the same shape as A1
          D1 = (D1 < keep_prob).astype(int)               # binary mask: 1 = keep the neuron, 0 = drop it
          A1 = A1 * D1                                    # shut down the dropped neurons
          A1 /= keep_prob                                 # inverted dropout: preserve the expected value of A1

          Z2 = np.dot(W2, A1) + b2
          A2 = relu(Z2)
          D2 = np.random.rand(A2.shape[0], A2.shape[1])
          D2 = (D2 < keep_prob).astype(int)
          A2 = A2 * D2
          A2 /= keep_prob

          # No dropout on the output layer
          Z3 = np.dot(W3, A2) + b3
          A3 = sigmoid(Z3)

          cache = (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3)

          return A3, cache
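
    As a quick sanity check of the inverted-dropout scaling (a minimal numpy sketch, not part of the assignment; keep_prob = 0.8 is an arbitrary example), the mean activation stays roughly the same after masking and rescaling:

      import numpy as np

      np.random.seed(0)
      A = np.random.rand(3, 10000)                               # stand-in activations
      keep_prob = 0.8                                            # arbitrary example value
      D = (np.random.rand(*A.shape) < keep_prob).astype(int)     # dropout mask
      A_dropped = (A * D) / keep_prob                            # inverted dropout
      print(A.mean(), A_dropped.mean())                          # the two means should be close (~0.5)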
    
  • Backward propagation with Dropout

    Dropout was applied to the two hidden layers and not to the output layer, so during back-propagation the same masks D1 and D2 are re-applied to dA1 and dA2 (with the same division by keep_prob), and as usual the gradient only flows through units whose ReLU activation is greater than 0.

      def backward_propagation_with_dropout(X, Y, cache, keep_prob):

          m = X.shape[1]
          (Z1, D1, A1, W1, b1, Z2, D2, A2, W2, b2, Z3, A3, W3, b3) = cache

          dZ3 = A3 - Y
          dW3 = 1./m * np.dot(dZ3, A2.T)
          db3 = 1./m * np.sum(dZ3, axis=1, keepdims=True)

          dA2 = np.dot(W3.T, dZ3)
          dA2 = dA2 * D2              # re-apply the mask used in the forward pass
          dA2 /= keep_prob            # re-apply the inverted-dropout scaling
          dZ2 = np.multiply(dA2, np.int64(A2 > 0))
          dW2 = 1./m * np.dot(dZ2, A1.T)
          db2 = 1./m * np.sum(dZ2, axis=1, keepdims=True)

          dA1 = np.dot(W2.T, dZ2)
          dA1 = dA1 * D1
          dA1 /= keep_prob
          dZ1 = np.multiply(dA1, np.int64(A1 > 0))
          dW1 = 1./m * np.dot(dZ1, X.T)
          db1 = 1./m * np.sum(dZ1, axis=1, keepdims=True)

          gradients = {"dZ3": dZ3, "dW3": dW3, "db3": db3, "dA2": dA2,
                       "dZ2": dZ2, "dW2": dW2, "db2": db2, "dA1": dA1,
                       "dZ1": dZ1, "dW1": dW1, "db1": db1}

          return gradients
    
      parameters = model(train_X, train_Y, keep_prob = 0.86, learning_rate = 0.3)
        
      print ("On the train set:")
      predictions_train = predict(train_X, train_Y, parameters)
      print ("On the test set:")
      predictions_test = predict(test_X, test_Y, parameters)
    
      Cost after iteration 0: 0.6543912405149825
      Cost after iteration 10000: 0.0610169865749056
      Cost after iteration 20000: 0.060582435798513114
        
      On the train set:
      Accuracy: 0.9289099526066351
      On the test set:
      Accuracy: 0.95
    

    (Figures: cost curve and decision boundary with dropout, keep_prob = 0.86)

Results

Model                  Train accuracy    Test accuracy
No regularization      95%               91.5%
L2 regularization      94%               93%
Dropout                93%               95%

Regularization helps prevent overfitting, and both L2 regularization and dropout are effective regularization techniques.
