Part-2 DNN
Lab-08-1 Perceptron
- Perceptron
- Linear Classifier
- AND, OR, XOR gates
Neuron
- An artificial neural network is a network modeled after the neurons of the human brain
- A neuron outputs a signal when the sum of its input signals exceeds a threshold (a tiny worked example follows the setup code below)
import torch
device = "cuda" if torch.cuda.is_available() else "cpu"
# XOR
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)
## nn Layers
linear = torch.nn.Linear(2, 1, bias=True)
sigmoid = torch.nn.Sigmoid()
model = torch.nn.Sequential(linear, sigmoid).to(device)
# define cost/loss and optimizer
criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)
for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()

    if step % 1000 == 0:
        print(step, cost.item())
0 0.7055874466896057
1000 0.6931471824645996
2000 0.6931471824645996
3000 0.6931471824645996
4000 0.6931471824645996
5000 0.6931471824645996
6000 0.6931471824645996
7000 0.6931471824645996
8000 0.6931471824645996
9000 0.6931471824645996
10000 0.6931471824645996
The loss does not decrease -> training is not actually progressing (XOR cannot be solved by a single-layer perceptron, because the four points are not linearly separable)
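A quick check (added here as a sketch, not in the original listing) makes the failure visible: with the loss stuck at $\ln 2 \approx 0.693$, the outputs hover around 0.5 and the thresholded predictions cannot match all four XOR labels.

with torch.no_grad():
    predicted = (model(X) > 0.5).float()
    print(predicted.squeeze().tolist())            # cannot match [0, 1, 1, 0]
    print((predicted == Y).float().mean().item())  # accuracy cannot reach 1.0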
Lab-08-2 Multi Layer Perceptron
- Multilayer Perceptron (MLP)
- Backpropagation
How can multiple layers be trained? -> backpropagation (the chain rule is used to compute the gradient of the loss with respect to every layer's weights, which are then updated to minimize the loss)
# backpropagation
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)
# nn Layers
w1 = torch.Tensor(2, 2).to(device)  # note: torch.Tensor(shape) allocates uninitialized values;
b1 = torch.Tensor(2).to(device)     # a random init such as torch.randn would be a safer starting point
w2 = torch.Tensor(2, 1).to(device)
b2 = torch.Tensor(1).to(device)
def sigmoid(x):
    return 1.0 / (1.0 + torch.exp(-x))

def sigmoid_prime(x):
    return sigmoid(x) * (1 - sigmoid(x))
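As an optional sanity check (not part of the original code), the hand-written derivative can be compared against what autograd computes for the same sigmoid:

# Compare sigmoid_prime with the gradient computed by autograd.
z = torch.randn(4, device=device, requires_grad=True)
sigmoid(z).sum().backward()
print(torch.allclose(z.grad, sigmoid_prime(z).detach()))  # expected: True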
lr = 1
for step in range(10001):
    # forward
    l1 = torch.add(torch.matmul(X, w1), b1)
    a1 = sigmoid(l1)
    l2 = torch.add(torch.matmul(a1, w2), b2)
    Y_pred = sigmoid(l2)

    # BCE
    cost = -torch.mean(Y * torch.log(Y_pred) + (1 - Y) * torch.log(1 - Y_pred))

    # Back prop (chain rule) (backward)
    # loss derivative
    d_Y_pred = (Y_pred - Y) / (Y_pred * (1.0 - Y_pred) + 1e-7)

    # Layer 2
    d_l2 = d_Y_pred * sigmoid_prime(l2)
    d_b2 = d_l2
    d_w2 = torch.matmul(torch.transpose(a1, 0, 1), d_b2)

    # Layer 1
    d_a1 = torch.matmul(d_b2, torch.transpose(w2, 0, 1))
    d_l1 = d_a1 * sigmoid_prime(l1)
    d_b1 = d_l1
    d_w1 = torch.matmul(torch.transpose(X, 0, 1), d_b1)

    # weight update (step)
    w1 = w1 - lr * d_w1
    b1 = b1 - lr * torch.mean(d_b1, 0)
    w2 = w2 - lr * d_w2
    b2 = b2 - lr * torch.mean(d_b2, 0)

    if step % 1000 == 0:
        print(step, cost.item())
0 0.34671515226364136
1000 0.3467007279396057
2000 0.3466891646385193
3000 0.34667932987213135
4000 0.3466709852218628
5000 0.3466639220714569
6000 0.346657931804657
7000 0.34665271639823914
8000 0.34664785861968994
9000 0.3466435968875885
10000 0.3466399610042572
Code: xor-nn
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)
# nn Layers
linear1 = torch.nn.Linear(2, 2, bias=True)
linear2 = torch.nn.Linear(2, 1, bias=True)
sigmoid = torch.nn.Sigmoid()
model = torch.nn.Sequential(linear1, sigmoid, linear2, sigmoid).to(device)
# define cost/loss and optimizer
criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)
for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)

    # cost/loss function
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()

    if step % 300 == 0:
        print(step, cost.item())
0 0.715934157371521
300 0.596375584602356
600 0.3768468201160431
900 0.35934001207351685
1200 0.3544251024723053
1500 0.35218489170074463
1800 0.3509174585342407
2100 0.35010701417922974
2400 0.34954583644866943
2700 0.34913522005081177
3000 0.34882205724716187
3300 0.3485758304595947
3600 0.3483772873878479
3900 0.348213791847229
4200 0.3480769991874695
4500 0.34796082973480225
4800 0.34786105155944824
5100 0.3477745056152344
5400 0.34769824147224426
5700 0.34763121604919434
6000 0.34757161140441895
6300 0.34751802682876587
6600 0.3474700450897217
6900 0.34742674231529236
7200 0.34738701581954956
7500 0.34735098481178284
7800 0.34731805324554443
8100 0.34728747606277466
8400 0.3472594618797302
8700 0.347233384847641
9000 0.3472091555595398
9300 0.3471869230270386
9600 0.3471660315990448
9900 0.34714627265930176
Code: xor-nn-wide-deep
X = torch.FloatTensor([[0, 0], [0, 1], [1, 0], [1, 1]]).to(device)
Y = torch.FloatTensor([[0], [1], [1], [0]]).to(device)
# nn Layers
linear1 = torch.nn.Linear(2, 10, bias=True)
linear2 = torch.nn.Linear(10, 10, bias=True)
linear3 = torch.nn.Linear(10, 10, bias=True)
linear4 = torch.nn.Linear(10, 1, bias=True)
sigmoid = torch.nn.Sigmoid()
model = torch.nn.Sequential(linear1, sigmoid, linear2, sigmoid, linear3, sigmoid, linear4, sigmoid).to(device)
# define cost/loss and optimizer
criterion = torch.nn.BCELoss().to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1)
for step in range(10001):
    optimizer.zero_grad()
    hypothesis = model(X)

    # cost/loss function
    cost = criterion(hypothesis, Y)
    cost.backward()
    optimizer.step()

    if step % 100 == 0:
        print(step, cost.item())
0 0.7164230346679688
100 0.6931554079055786
200 0.6931540966033936
300 0.6931527853012085
400 0.693151593208313
500 0.6931504607200623
600 0.6931492686271667
700 0.693148136138916
800 0.6931470632553101
900 0.6931459307670593
1000 0.6931447982788086
1100 0.6931437253952026
1200 0.6931427121162415
1300 0.6931416392326355
1400 0.6931405067443848
1500 0.6931394338607788
1600 0.6931382417678833
1700 0.6931371688842773
1800 0.6931359767913818
1900 0.6931347846984863
2000 0.6931334733963013
2100 0.6931322813034058
2200 0.6931309103965759
2300 0.6931295394897461
2400 0.6931281089782715
2500 0.6931265592575073
2600 0.6931249499320984
2700 0.6931232213973999
2800 0.6931214332580566
2900 0.6931195855140686
3000 0.6931174397468567
3100 0.693115234375
3200 0.693112850189209
3300 0.6931103467941284
3400 0.6931076049804688
3500 0.6931045055389404
3600 0.6931012272834778
3700 0.6930975914001465
3800 0.6930936574935913
3900 0.6930893659591675
4000 0.6930843591690063
4100 0.6930789947509766
4200 0.6930727958679199
4300 0.693065881729126
4400 0.6930580139160156
4500 0.6930490732192993
4600 0.6930387020111084
4700 0.6930266618728638
4800 0.6930125951766968
4900 0.6929957866668701
5000 0.6929758787155151
5100 0.692951500415802
5200 0.69292151927948
5300 0.6928839087486267
5400 0.6928355693817139
5500 0.692772388458252
5600 0.6926865577697754
5700 0.6925660371780396
5800 0.6923881769180298
5900 0.6921088695526123
6000 0.6916301250457764
6100 0.690697968006134
6200 0.6884748935699463
6300 0.6807767748832703
6400 0.6191476583480835
6500 0.07421649247407913
6600 0.010309841483831406
6700 0.004796213004738092
6800 0.003011580090969801
6900 0.0021596732549369335
7000 0.001668647164478898
7100 0.0013521360233426094
7200 0.0011324126971885562
7300 0.0009715152555145323
7400 0.0008490675827488303
7500 0.000752839376218617
7600 0.0006754109635949135
7700 0.0006118417950347066
7800 0.000558803731109947
7900 0.0005138349952176213
8000 0.00047530914889648557
8100 0.000441973126726225
8200 0.00041282744496129453
8300 0.0003871413064189255
8400 0.0003643627860583365
8500 0.0003439848660491407
8600 0.00032572413329035044
8700 0.0003092226979788393
8800 0.0002942568971775472
8900 0.0002806179691106081
9000 0.0002681269543245435
9100 0.00025669438764452934
9200 0.0002461116237100214
9300 0.0002363934472668916
9400 0.00022736100072506815
9500 0.0002190142695326358
9600 0.00021117438154760748
9700 0.0002038860257016495
9800 0.00019705976592376828
9900 0.00019065086962655187
10000 0.00018467426707502455
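With the loss driven down to roughly 1.8e-4, the wide-and-deep network has effectively learned XOR. A short evaluation snippet (a sketch added here, not in the original post) confirms it:

# Evaluate the trained wide & deep model on the four XOR inputs.
with torch.no_grad():
    hypothesis = model(X)
    predicted = (hypothesis > 0.5).float()
    print('Hypothesis:', hypothesis.squeeze().tolist())
    print('Predicted :', predicted.squeeze().tolist())
    print('Accuracy  :', (predicted == Y).float().mean().item())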
Lab-09-1 ReLU
- ReLU
- Sigmoid
- Optimizer
Problem of Sigmoid
- Input -> Network -> Output
- When gradients are computed for gradient descent, the sigmoid's derivative is a small value (near zero in its saturated regions), and multiplying these small values layer after layer during backpropagation shrinks the gradient toward zero; this is called gradient vanishing (see the sketch after the activation-function list below)
ReLU
- $f(x) = \max(0, x)$
torch.nn.Sigmoid()
torch.nn.Tanh()
torch.nn.ReLU()
torch.nn.LeakyReLU()
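A rough numerical sketch of the vanishing-gradient effect mentioned above (arbitrary depth and layer width, not from the original post): build the same 10-layer stack once with Sigmoid and once with ReLU, and compare the gradient magnitude that reaches the first layer.

def make_stack(activation, depth=10, width=32):
    layers = []
    for _ in range(depth):
        layers += [torch.nn.Linear(width, width), activation]
    return torch.nn.Sequential(*layers)

x = torch.randn(16, 32)
for act in [torch.nn.Sigmoid(), torch.nn.ReLU()]:
    net = make_stack(act)
    net(x).sum().backward()
    # gradient magnitude at the first layer; the Sigmoid stack is typically far smaller
    print(type(act).__name__, net[0].weight.grad.abs().mean().item())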
Optimizer in PyTorch
torch.optim.SGD()
torch.optim.Adadelta()
torch.optim.RMSprop()
...
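All of these optimizers share the same interface, so swapping one in changes only a single line; the zero_grad / backward / step pattern used throughout this post stays the same. A minimal sketch (model, criterion, X, Y, and the learning rate are placeholders from the earlier examples):

optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)   # instead of SGD
# optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-3)

optimizer.zero_grad()
cost = criterion(model(X), Y)
cost.backward()
optimizer.step()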
Lab-09-2 Weight initialization
- Weight Initialization
- Xavier / He initialization
Why good initialization?
- Proper weight initialization improves both training behavior and final performance
- Simply initializing every weight to 0 is a problem: during backpropagation all gradients passing through those weights become 0, so nothing is learned
- (Historically) weights were initialized using an RBM
RBM (Restricted Boltzmann Machine)
- No connections between units within the same layer
- Fully connected (FC) between adjacent layers
Pre-training
- x - (RBM) - h1
- x - h1 - (RBM) - h2
- x - h1 - h2 - (RBM) - y, h3
- Fine-tuning
More recently, initialization methods such as Xavier / He initialization have been shown to work better than RBM pre-training.
Xavier / He initialization
- RBM-based initialization starts from randomly initialized weights and then pre-trains them layer by layer
- The two schemes above instead set the initialization per layer, based on the layer's number of inputs and outputs
- Normal distribution (Gaussian distribution)
- Frequently used to approximate the distribution of collected data
- Uniform distribution (continuous uniform distribution)
- A continuous probability distribution whose values are spread evenly over a fixed interval $[a, b]$
Xavier
- Normal
- $W \sim N(0, Var(W))$
- $Var(W) = \sqrt {2 \over n_{in} + n_{out} }$
- Uniform
- $W \sim U(-\sqrt {6 \over n_{in} + n_{out} }, + \sqrt {6 \over n_{in} + n_{out} })$
He initialization
- The same as Xavier, but with every $n_{out}$ term removed
- Normal
- $W \sim N(0, Var(W))$
- $Var(W) = \sqrt {2 \over n_{in} }$
- Uniform
- $W \sim U(-\sqrt{6 \over n_{in} }, +\sqrt{6 \over n_{in} })$
# PyTorch's official Xavier uniform implementation
def xavier_uniform_(tensor, gain=1):
    fan_in, fan_out = _calculate_fan_in_and_fan_out(tensor)  # number of input / output units of the tensor being initialized
    std = gain * math.sqrt(2.0 / (fan_in + fan_out))
    a = math.sqrt(3.0) * std  # bound so that U(-a, a) has standard deviation `std`
    with torch.no_grad():
        return tensor.uniform_(-a, a)
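In practice you do not call this private helper yourself; torch.nn.init exposes both schemes directly. A short sketch with arbitrary layer sizes:

# Apply Xavier / He initialization to Linear layers.
linear1 = torch.nn.Linear(784, 256, bias=True)
linear2 = torch.nn.Linear(256, 10, bias=True)

torch.nn.init.xavier_uniform_(linear1.weight)                        # Xavier (uniform)
torch.nn.init.kaiming_uniform_(linear2.weight, nonlinearity='relu')  # He (uniform)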
Lab-09-3 Dropout
- Overfitting
- Dropout
Dropout
- During training, the nodes in each layer are randomly switched off and on, so every update is computed with a randomly thinned network
# Dropout
# nn Layers
linear1 = torch.nn.Linear(784, 512, bias=True)
linear2 = torch.nn.Linear(512, 512, bias=True)
linear3 = torch.nn.Linear(512, 512, bias=True)
linear4 = torch.nn.Linear(512, 512, bias=True)
linear5 = torch.nn.Linear(512, 10, bias=True)
relu = torch.nn.ReLU()
drop_prob = 0.5  # dropout probability
dropout = torch.nn.Dropout(p=drop_prob)

# model
model = torch.nn.Sequential(
    linear1, relu, dropout,
    linear2, relu, dropout,
    linear3, relu, dropout,
    linear4, relu, dropout,
    linear5
).to(device)
Train & eval mode
model.train()  # training mode: dropout is applied

with torch.no_grad():
    model.eval()  # evaluation mode: dropout is not applied
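A sketch of how the two modes fit into a full loop (epochs, data_loader, X_test, Y_test, criterion, and optimizer are placeholders not defined in this post):

for epoch in range(epochs):
    model.train()                                  # dropout active during training
    for X_batch, Y_batch in data_loader:
        optimizer.zero_grad()
        cost = criterion(model(X_batch), Y_batch)
        cost.backward()
        optimizer.step()

with torch.no_grad():
    model.eval()                                   # dropout disabled for evaluation
    accuracy = (model(X_test).argmax(1) == Y_test).float().mean()
    print('Accuracy:', accuracy.item())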
Lab-09-4 Batch Normalization
- Batch Normalization
- Gradient Vanishing / Exploding
Gradient Vanishing / Exploding
- Vanishing: when sigmoid activations are used, gradients shrink as they are propagated backward through deeper and deeper layers, so backpropagation barely reaches the early layers
- Exploding: the opposite of vanishing, gradients grow uncontrollably as they propagate backward
Possible solutions
- Change the activation function (e.g., to ReLU)
- Careful weight initialization
- A smaller learning rate
- Batch Normalization
Internal Covariate Shift
- Covariate shift is the problem that arises when the train and test distributions differ
- Normalizing the network input reduces this mismatch, but inside the network the distribution feeding each layer can still shift layer by layer (hence "internal" covariate shift)
Batch Normalization
- Input: values of $x$ over a mini-batch: $\mathcal{B} = \{x_1, \dots , x_m\}$
- Parameters to be learned: $\gamma, \beta$
- Output: $\{ y_i = BN_{\gamma, \beta}(x_i) \}$
- $\mu_\mathcal{B} \leftarrow {1 \over m} \sum_{i=1}^m x_i$ // mini-batch mean
- $\sigma^2_\mathcal{B} \leftarrow {1 \over m} \sum_{i=1}^m (x_i - \mu_\mathcal{B})^2$ // mini-batch variance
- $\hat x_i \leftarrow { {x_i - \mu_\mathcal{B}} \over {\sqrt {\sigma^2_\mathcal{B} + \epsilon} } }$ // normalize
- $y_i \leftarrow \gamma \hat x_i + \beta \equiv BN_{\gamma, \beta}(x_i)$ // scale and shift
Scaling by $\gamma$ and shifting by $\beta$ compensate for what plain normalization would take away at the activation: without them the normalized values would be confined to the sigmoid's nearly linear region, reducing the layer's expressive power.
Train & eval mode
- Just like Dropout, Batch Normalization behaves differently during training and evaluation, so you must switch between model.train() and model.eval()
- Batch Normalization tends to be placed before the activation function
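A minimal sketch putting both points together (layer sizes are arbitrary): BatchNorm1d is inserted before the activation, and the mode is switched just as with Dropout.

# BatchNorm placed before the activation, with the usual train/eval switch.
linear1 = torch.nn.Linear(784, 32, bias=True)
linear2 = torch.nn.Linear(32, 10, bias=True)
bn1 = torch.nn.BatchNorm1d(32)
relu = torch.nn.ReLU()

model = torch.nn.Sequential(linear1, bn1, relu, linear2).to(device)

model.train()  # uses per-batch statistics and updates the running mean / variance
# ... training loop ...
model.eval()   # uses the stored running mean / variance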