
[BC] Deep Learning for Everyone 2 - DNN

Part-2 DNN

Lab-08-1 Perceptron

  • Perceptron
  • Linear classifier
  • AND, OR, XOR gates

Neuron

  • An artificial neural network is a network modeled after the neurons of the human brain.
  • A neuron outputs a signal (fires) when the sum of its input signals exceeds a threshold (see the short example after the setup code below).
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
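
As a quick illustration of the threshold idea (my own sketch, not part of the lecture code), a single perceptron with hand-picked weights implements the AND gate; the weights and bias below are assumptions chosen for the example and reuse the torch import above.

# AND gate with a single perceptron: it fires only when both inputs are 1
w = torch.tensor([1.0, 1.0])
b = torch.tensor(-1.5)   # threshold of 1.5 expressed as a bias

X_and = torch.tensor([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
print((X_and @ w + b > 0).float())   # tensor([0., 0., 0., 1.])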
# XOR
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)

## nn Layers
linear = torch.nn.Linear(2, 1, bias=True)
sigmoid = torch.nn.Sigmoid()

model = torch.nn.Sequential(linear, sigmoid).to(device)

# define cost/loss and optimizer
criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)

for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)
    
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()
    
    if step%1000==0:
        print(step, cost.item())
    
0 0.7055874466896057
1000 0.6931471824645996
2000 0.6931471824645996
3000 0.6931471824645996
4000 0.6931471824645996
5000 0.6931471824645996
6000 0.6931471824645996
7000 0.6931471824645996
8000 0.6931471824645996
9000 0.6931471824645996
10000 0.6931471824645996

The loss gets stuck at ln 2 ≈ 0.693 and never decreases, so training makes no progress: XOR is not linearly separable and cannot be solved by a single-layer perceptron.
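
A quick check (my own addition, not lecture code) makes the failure concrete: a per-sample BCE of ln 2 means the model outputs about 0.5 for every input, i.e. chance-level predictions.

with torch.no_grad():
    hypothesis = model(X)
    predicted = (hypothesis > 0.5).float()
    accuracy = (predicted == Y).float().mean()
    print(hypothesis)        # every output is close to 0.5
    print(accuracy.item())   # about 0.5 -- no better than guessing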

Lab-08-2 Multi Layer Perceptron

  • Multilayer Perceptron (MLP)
  • Backpropagation

Backpropagation is what makes it possible to train multiple layers: it computes the gradient of the loss with respect to every weight, which is then used to minimize the loss.

# backpropagation
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)

# nn layers (weights and biases initialized randomly; a bare torch.Tensor() would leave them uninitialized)
w1 = torch.randn(2, 2).to(device)
b1 = torch.randn(2).to(device)
w2 = torch.randn(2, 1).to(device)
b2 = torch.randn(1).to(device)

def sigmoid(x):
    return 1.0 / (1.0 + torch.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1-sigmoid(x))
lr = 1

for step in range(10001):
    # forward
    l1 = torch.add(torch.matmul(X, w1), b1)
    a1 = sigmoid(l1)
    l2 = torch.add(torch.matmul(a1, w2), b2)
    Y_pred = sigmoid(l2)
    # BCE
    cost = -torch.mean(Y*torch.log(Y_pred) + (1-Y) * torch.log(1-Y_pred))
    
    # Back prop (chain rule) (backward)
    # loss derivative
    d_Y_pred = (Y_pred-Y) / (Y_pred * (1.0-Y_pred) + 1e-7)
    
    # Layer 2
    d_l2 = d_Y_pred * sigmoid_prime(l2)
    d_b2 = d_l2
    d_w2 = torch.matmul(torch.transpose(a1, 0, 1), d_b2)
    
    # Layer 1
    d_a1 = torch.matmul(d_b2, torch.transpose(w2, 0, 1))
    d_l1 = d_a1 * sigmoid_prime(l1)
    d_b1 = d_l1
    d_w1 = torch.matmul(torch.transpose(X, 0, 1), d_b1)
    
    # weight update (step)
    w1 = w1 - lr*d_w1
    b1 = b1 - lr*torch.mean(d_b1, 0)
    w2 = w2 - lr*d_w2
    b2 = b2 - lr*torch.mean(d_b2, 0)
    
    if step%1000==0:
        print(step, cost.item())
0 0.34671515226364136
1000 0.3467007279396057
2000 0.3466891646385193
3000 0.34667932987213135
4000 0.3466709852218628
5000 0.3466639220714569
6000 0.346657931804657
7000 0.34665271639823914
8000 0.34664785861968994
9000 0.3466435968875885
10000 0.3466399610042572

Code: xor-nn

X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)
# nn Layers
linear1 = torch.nn.Linear(2, 2, bias=True)
linear2 = torch.nn.Linear(2, 1, bias=True)
sigmoid = torch.nn.Sigmoid()
model = torch.nn.Sequential(linear1, sigmoid, linear2, sigmoid).to(device)
# define cost/loss and optimizer
criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)
for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)
    # cost/loss function
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()
    
    if step%300==0:
        print(step, cost.item())
0 0.715934157371521
300 0.596375584602356
600 0.3768468201160431
900 0.35934001207351685
1200 0.3544251024723053
1500 0.35218489170074463
1800 0.3509174585342407
2100 0.35010701417922974
2400 0.34954583644866943
2700 0.34913522005081177
3000 0.34882205724716187
3300 0.3485758304595947
3600 0.3483772873878479
3900 0.348213791847229
4200 0.3480769991874695
4500 0.34796082973480225
4800 0.34786105155944824
5100 0.3477745056152344
5400 0.34769824147224426
5700 0.34763121604919434
6000 0.34757161140441895
6300 0.34751802682876587
6600 0.3474700450897217
6900 0.34742674231529236
7200 0.34738701581954956
7500 0.34735098481178284
7800 0.34731805324554443
8100 0.34728747606277466
8400 0.3472594618797302
8700 0.347233384847641
9000 0.3472091555595398
9300 0.3471869230270386
9600 0.3471660315990448
9900 0.34714627265930176

Code: xor-nn-wide-deep

X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)

# nn Layers
linear1 = torch.nn.Linear(2, 10, bias=True)
linear2 = torch.nn.Linear(10, 10, bias=True)
linear3 = torch.nn.Linear(10, 10, bias=True)
linear4 = torch.nn.Linear(10, 1, bias=True)
sigmoid = torch.nn.Sigmoid()

model = torch.nn.Sequential(linear1, sigmoid, linear2, sigmoid, linear3, sigmoid, linear4, sigmoid).to(device)
# define cost/loss and optimizer
criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)

for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)
    # cost/loss function
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()
    
    if step%100==0:
        print(step, cost.item())
0 0.7164230346679688
100 0.6931554079055786
200 0.6931540966033936
300 0.6931527853012085
400 0.693151593208313
500 0.6931504607200623
600 0.6931492686271667
700 0.693148136138916
800 0.6931470632553101
900 0.6931459307670593
1000 0.6931447982788086
1100 0.6931437253952026
1200 0.6931427121162415
1300 0.6931416392326355
1400 0.6931405067443848
1500 0.6931394338607788
1600 0.6931382417678833
1700 0.6931371688842773
1800 0.6931359767913818
1900 0.6931347846984863
2000 0.6931334733963013
2100 0.6931322813034058
2200 0.6931309103965759
2300 0.6931295394897461
2400 0.6931281089782715
2500 0.6931265592575073
2600 0.6931249499320984
2700 0.6931232213973999
2800 0.6931214332580566
2900 0.6931195855140686
3000 0.6931174397468567
3100 0.693115234375
3200 0.693112850189209
3300 0.6931103467941284
3400 0.6931076049804688
3500 0.6931045055389404
3600 0.6931012272834778
3700 0.6930975914001465
3800 0.6930936574935913
3900 0.6930893659591675
4000 0.6930843591690063
4100 0.6930789947509766
4200 0.6930727958679199
4300 0.693065881729126
4400 0.6930580139160156
4500 0.6930490732192993
4600 0.6930387020111084
4700 0.6930266618728638
4800 0.6930125951766968
4900 0.6929957866668701
5000 0.6929758787155151
5100 0.692951500415802
5200 0.69292151927948
5300 0.6928839087486267
5400 0.6928355693817139
5500 0.692772388458252
5600 0.6926865577697754
5700 0.6925660371780396
5800 0.6923881769180298
5900 0.6921088695526123
6000 0.6916301250457764
6100 0.690697968006134
6200 0.6884748935699463
6300 0.6807767748832703
6400 0.6191476583480835
6500 0.07421649247407913
6600 0.010309841483831406
6700 0.004796213004738092
6800 0.003011580090969801
6900 0.0021596732549369335
7000 0.001668647164478898
7100 0.0013521360233426094
7200 0.0011324126971885562
7300 0.0009715152555145323
7400 0.0008490675827488303
7500 0.000752839376218617
7600 0.0006754109635949135
7700 0.0006118417950347066
7800 0.000558803731109947
7900 0.0005138349952176213
8000 0.00047530914889648557
8100 0.000441973126726225
8200 0.00041282744496129453
8300 0.0003871413064189255
8400 0.0003643627860583365
8500 0.0003439848660491407
8600 0.00032572413329035044
8700 0.0003092226979788393
8800 0.0002942568971775472
8900 0.0002806179691106081
9000 0.0002681269543245435
9100 0.00025669438764452934
9200 0.0002461116237100214
9300 0.0002363934472668916
9400 0.00022736100072506815
9500 0.0002190142695326358
9600 0.00021117438154760748
9700 0.0002038860257016495
9800 0.00019705976592376828
9900 0.00019065086962655187
10000 0.00018467426707502455
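
With the wider and deeper network the loss finally drops to nearly zero, so the model has learned XOR. A short check (my own sketch, not part of the lecture code) makes this explicit:

with torch.no_grad():
    hypothesis = model(X)
    predicted = (hypothesis > 0.5).float()
    accuracy = (predicted == Y).float().mean()
    print(predicted.squeeze())   # expected: 0., 1., 1., 0.
    print(accuracy.item())       # expected: 1.0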

Lab-09-1 ReLU

  • ReLU
  • Sigmoid
  • Optimizer

Problem of Sigmoid

  • Input -> Network -> Output
  • When the gradient is computed by backpropagation, the sigmoid's derivative (at most 0.25) is multiplied in at every layer, so the gradient shrinks toward zero in the earlier layers; this is called gradient vanishing (see the sketch below).
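
A minimal numerical sketch (my own addition) of why this happens: the gradient through a chain of sigmoids shrinks multiplicatively, since each sigmoid contributes a factor of at most 0.25.

import torch

x = torch.tensor([1.0], requires_grad=True)
out = x
for _ in range(10):        # 10 stacked sigmoid "layers"
    out = torch.sigmoid(out)
out.backward()
print(x.grad)              # on the order of 1e-7 -- the gradient has all but vanished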

ReLU

  • $f(x) = \max(0, x)$
torch.nn.Sigmoid()
torch.nn.Tanh()
torch.nn.ReLU()
torch.nn.LeakyReLU()
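
For example, the hidden-layer sigmoids of the earlier XOR model can be swapped for ReLU while keeping a sigmoid at the output for BCELoss (a sketch under that assumption, not the lecture's exact code):

relu = torch.nn.ReLU()
sigmoid = torch.nn.Sigmoid()

model = torch.nn.Sequential(
    torch.nn.Linear(2, 10, bias=True), relu,
    torch.nn.Linear(10, 10, bias=True), relu,
    torch.nn.Linear(10, 1, bias=True), sigmoid   # sigmoid output kept for BCELoss
).to(device)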

Optimizer in PyTorch

torch.optim.SGD()
torch.optim.Adadelta()
torch.optim.RMSprop()
...
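
Any of these is a drop-in replacement in the earlier training loop; the optimizer and learning rate below are an illustrative choice, not one prescribed by the lecture.

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

for step in range(10001):
    optimizer.zero_grad()
    cost = criterion(model(X), Y)
    cost.backward()
    optimizer.step()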

Lab-09-2 Weight initialization

  • Weight initialization
  • Xavier / He initialization

Why good initialization?

  • Good weight initialization helps both training and final performance.
  • Simply initializing every weight to 0 is a problem: during backpropagation everything becomes 0 and nothing is learned.
  • Weights can instead be initialized using an RBM (pre-training).

RBM (Restricted Boltzmann Machine)

  • No connections within a layer
  • Fully connected between adjacent layers

Pre-training

  1. x - (RBM) - h1
  2. x - h1 - (RBM) - h2
  3. x - h1 - h2 - (RBM) - y, h3
  4. Fine-tuning

More recently, initialization methods such as Xavier / He initialization have been shown to work better than RBM pre-training.

Xavier / He initialization

  • RBM pre-training initializes the weights randomly
  • Xavier / He initialization instead chooses the initialization based on each layer's size (number of inputs and outputs)
  • Normal distribution
    • The Gaussian distribution
    • Often used to approximate the distribution of collected data
  • Uniform distribution
    • A continuous probability distribution whose density is constant over a given interval
    • Written with its range [a, b]

Xavier

  1. Normal
    1. $W \sim N(0, Var(W))$
    2. $Var(W) = {2 \over n_{in} + n_{out} }$
  2. Uniform
    1. $W \sim U(-\sqrt {6 \over n_{in} + n_{out} }, + \sqrt {6 \over n_{in} + n_{out} })$

He initialization

  • Same as Xavier, but the $n_{out}$ term is dropped
  1. Normal
    1. $W \sim N(0, Var(W))$
    2. $Var(W) = {2 \over n_{in} }$
  2. Uniform
    1. $W \sim U(-\sqrt{6 \over n_{in} }, +\sqrt{6 \over n_{in} })$
# Official PyTorch implementation of Xavier uniform initialization (torch.nn.init)
def xavier_uniform_(tensor, gain=1):
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)  # number of input / output units of the tensor being initialized
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    a = math.sqrt(3.0) * std  # bound of the uniform distribution U(-a, a)
    with torch.no_grad():
        return tensor.uniform_(-a, a)
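
In practice this is not called directly; the functions in torch.nn.init are applied to each layer's weight tensor. A minimal sketch for two hypothetical layers:

linear1 = torch.nn.Linear(2, 10, bias=True)
linear2 = torch.nn.Linear(10, 1, bias=True)

torch.nn.init.xavier_uniform_(linear1.weight)                         # Xavier, often used with sigmoid/tanh
torch.nn.init.kaiming_uniform_(linear2.weight, nonlinearity='relu')   # He (Kaiming), the usual choice for ReLU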

Lab-09-3 Dropout

  • Overfitting
  • Dropout

Dropout

  • During training, the nodes in each layer are randomly switched off and on, so a different sub-network is trained at every step.
# Dropout
# nn layers (sized for MNIST: 784 input pixels, 10 output classes)
linear1 = torch.nn.Linear(784, 512, bias=True)
linear2 = torch.nn.Linear(512, 512, bias=True)
linear3 = torch.nn.Linear(512, 512, bias=True)
linear4 = torch.nn.Linear(512, 512, bias=True)
linear5 = torch.nn.Linear(512, 10, bias=True)
relu = torch.nn.ReLU()
drop_prob = 0.5  # probability of dropping a node
dropout = torch.nn.Dropout(p=drop_prob)

# model
model = torch.nn.Sequential(
    linear1, relu, dropout,
    linear2, relu, dropout,
    linear3, relu, dropout,
    linear4, relu, dropout,
    linear5
).to(device)

Train & eval mode

model.train()  # training mode: dropout is applied

# evaluation mode: dropout is disabled
with torch.no_grad():
    model.eval()

Lab-09-4 Batch Normalization

  • Batch Normalization
  • Gradient vanishing / exploding

Gradient Vanishing / Exploding

  • Vanishing: with the sigmoid activation, gradients shrink as they are propagated backwards, so the deeper the network, the worse backpropagation works in the early layers
  • Exploding: the opposite problem, where the gradients grow uncontrollably

Solutions

  • Change the activation function
  • Careful weight initialization
  • Use a small learning rate
  • Batch Normalization

Internal Covariate Shift

  • The problem caused by the train and test distributions being different
  • Normalizing the inputs reduces the mismatch between input and output distributions, but the distribution can still shift from layer to layer inside the network

Batch Normalization

  • Input: values of $x$ over a mini-batch: $\mathcal{B} = \{x_1, \dots, x_m\}$
  • Parameters to be learned: $\gamma, \beta$
  • Output: $\{y_i = BN_{\gamma, \beta}(x_i)\}$

  • $\mu_\mathcal{B} \leftarrow {1 \over m} \sum_{i=1}^m x_i$ // mini-batch mean
  • $\sigma^2_\mathcal{B} \leftarrow {1 \over m} \sum_{i=1}^m (x_i - \mu_\mathcal{B})^2$ // mini-batch variance
  • $\hat x_i \leftarrow { {x_i - \mu_\mathcal{B} } \over \sqrt{\sigma^2_\mathcal{B} + \epsilon} }$ // normalize
  • $y_i \leftarrow \gamma \hat x_i + \beta \equiv BN_{\gamma, \beta}(x_i)$ // scale and shift

Multiplying by $\gamma$ and adding $\beta$ (both learned) restores the expressive power that plain normalization takes away, e.g. it keeps the inputs to the activation function from being confined to its nearly linear region.

Train & eval mode

  • Just like dropout, batch normalization behaves differently at train and test time, so model.train() / model.eval() must be set accordingly
  • Batch normalization tends to be placed before the activation function, as in the sketch below
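
A minimal sketch of a model with batch normalization inserted before each activation (the layer sizes are my own illustrative choice):

linear1 = torch.nn.Linear(784, 32, bias=True)
linear2 = torch.nn.Linear(32, 32, bias=True)
linear3 = torch.nn.Linear(32, 10, bias=True)
relu = torch.nn.ReLU()
bn1 = torch.nn.BatchNorm1d(32)
bn2 = torch.nn.BatchNorm1d(32)

# batch norm placed before the activation
bn_model = torch.nn.Sequential(
    linear1, bn1, relu,
    linear2, bn2, relu,
    linear3
).to(device)

bn_model.train()   # training: normalize with mini-batch statistics
# ... training loop ...
bn_model.eval()    # evaluation: normalize with the running statistics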

Project A. DNN
