
DACON Hospital Open/Closed Classification Prediction Competition with a Neural Network


I revisited a problem I had previously solved with machine learning, this time using deep learning:
DACON Hospital Open/Closed Classification Prediction Competition — machine learning version

I'm still early in my deep learning studies, so I'm not sure everything here is done properly, but the submission scored a respectable 85 points.
As expected, for tabular data, machine learning (87 points) still seems to outperform deep learning.

TensorFlow

Libraries Used

```python
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

import tensorflow as tf
from tensorflow.keras.callbacks import EarlyStopping  # tf.keras, not the standalone keras package

import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")
```

Data Load

I applied the same preprocessing used when I previously worked on this data with machine learning.

```python
train = pd.read_csv("pre_train.csv")
test = pd.read_csv("pre_test.csv")

train.shape, test.shape
```

((301, 76), (127, 72))

Train

```python
label = "open"
feature_names = train.columns.tolist()
feature_names.remove(label)
feature_names.remove("dental_clinic")
feature_names.remove("경상도")
feature_names.remove("광주")
```
```python
x_train, x_test, y_train, y_test = train_test_split(train[feature_names], train[label], test_size=0.2, stratify=train[label])

print(f"x_train: {x_train.shape}\ny_train: {y_train.shape}\nx_test: {x_test.shape}\ny_test: {y_test.shape}")
```

x_train: (240, 72)
y_train: (240,)
x_test: (61, 72)
y_test: (61,)
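As a side note, `stratify=train[label]` keeps the open/closed class ratio identical in both splits, which matters with only 301 rows. A minimal sketch on a made-up frame (not the competition data):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy frame standing in for `train`: 80 open (1) vs. 20 closed (0)
df = pd.DataFrame({"f": range(100), "open": [1] * 80 + [0] * 20})

x_tr, x_te, y_tr, y_te = train_test_split(
    df[["f"]], df["open"], test_size=0.2, stratify=df["open"], random_state=0
)

# Stratification preserves the 80/20 class ratio in both splits
print(y_tr.mean(), y_te.mean())  # 0.8 0.8
```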

Model and Train

```python
model = tf.keras.models.Sequential([
    tf.keras.layers.Dense(72, input_shape=[x_train.shape[1]]),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(144, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(256, activation="swish"),
    tf.keras.layers.Dense(128, activation="swish"),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.BatchNormalization(),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dropout(0.5),
    tf.keras.layers.Dense(1, activation="sigmoid")
])
```
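The final `Dense(1, activation="sigmoid")` squashes a single logit into a probability in (0, 1), which is what the 0.5 threshold later relies on. A quick NumPy illustration (not part of the original notebook):

```python
import numpy as np

def sigmoid(z):
    # 1 / (1 + e^-z): maps any real-valued logit into (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

print(sigmoid(0.0))                   # 0.5 — the decision boundary
print(round(float(sigmoid(3.0)), 4))  # 0.9526
```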
```python
with tf.device("/device:GPU:0"):
    model.compile(optimizer="adam",
                  loss="binary_crossentropy",
                  metrics=["accuracy"])
```
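`binary_crossentropy` is just the mean negative log-likelihood of the sigmoid outputs; computing it by hand (a sketch, not Keras' exact numerically-stabilized code) makes the loss values in the training history easier to read:

```python
import numpy as np

def bce(y_true, y_pred):
    # -mean(y*log(p) + (1-y)*log(1-p)); Keras additionally clips p for stability
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(-np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred)))

# Confident, mostly-correct predictions give a small loss
print(round(bce([1, 0, 1], [0.9, 0.1, 0.8]), 4))  # 0.1446
```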
```python
early_stopping = EarlyStopping(monitor="val_loss", patience=100)

with tf.device("/device:GPU:0"):
    history = model.fit(x_train, y_train, verbose=0, validation_split=0.2, epochs=1000, callbacks=[early_stopping])
```
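`patience=100` means training halts once `val_loss` has gone 100 epochs without a new best (which is why training stopped around epoch 131 instead of running all 1000). The bookkeeping Keras does is simple enough to sketch in plain Python:

```python
# Early stopping by hand: stop once the monitored loss hasn't improved
# for `patience` consecutive epochs. (Toy loss curve, not the real run.)
def early_stop_epoch(val_losses, patience):
    best, best_epoch = float("inf"), 0
    for epoch, loss in enumerate(val_losses):
        if loss < best:
            best, best_epoch = loss, epoch
        elif epoch - best_epoch >= patience:
            return epoch  # training would halt here
    return len(val_losses) - 1

losses = [1.0, 0.8, 0.7, 0.72, 0.71, 0.73, 0.74]
print(early_stop_epoch(losses, patience=3))  # 5 — best was epoch 2, plus 3 epochs of patience
```

Note that without `restore_best_weights=True`, Keras' `EarlyStopping` keeps the weights from the final epoch, not the best one.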
```python
df_hist = pd.DataFrame(history.history)
df_hist.tail()
```

|     | loss     | accuracy | val_loss | val_accuracy |
|-----|----------|----------|----------|--------------|
| 127 | 0.041207 | 0.994792 | 0.593282 | 0.9375       |
| 128 | 0.028480 | 0.989583 | 0.590325 | 0.9375       |
| 129 | 0.031031 | 0.984375 | 0.585470 | 0.9375       |
| 130 | 0.047607 | 0.979167 | 0.588387 | 0.9375       |
| 131 | 0.029272 | 0.994792 | 0.592212 | 0.9375       |
```python
_ = df_hist[["loss", "val_loss"]].plot()
```

[Plot: training and validation loss over epochs]

```python
_ = df_hist[["accuracy", "val_accuracy"]].plot()
```

[Plot: training and validation accuracy over epochs]

Evaluate

```python
y_pred = model.predict(x_test).flatten()
```

2/2 [==============================] - 0s 3ms/step
```python
pred = (y_pred > 0.5).astype(int)

accuracy_score(y_test, pred)  # y_true comes first, then the predictions
```

0.9508196721311475
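For clarity, here is the threshold-then-score step on made-up probabilities (not the model's actual outputs):

```python
import numpy as np
from sklearn.metrics import accuracy_score

probs = np.array([0.12, 0.51, 0.88, 0.49, 0.95])  # hypothetical sigmoid outputs
y_true = np.array([0, 1, 1, 1, 1])

pred = (probs > 0.5).astype(int)      # hard 0/1 labels: [0 1 1 0 1]
print(accuracy_score(y_true, pred))   # 0.8 — 4 of 5 labels match
```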
```python
test_loss, test_acc = model.evaluate(x_test, y_test)
print("test loss :", np.round(test_loss, 4))
print("test acc :", np.round(test_acc, 4))
```

2/2 [==============================] - 0s 6ms/step - loss: 0.2958 - accuracy: 0.9508
test loss : 0.2958
test acc : 0.9508

Submission

```python
from glob import glob

sub = pd.read_csv(glob("data/*")[2])  # the submission template; the index depends on directory contents
```
```python
pred = model.predict(test[feature_names])
pred = (pred > 0.5).astype(int)
sub["OC"] = pred
sub.to_csv("sub_tf.csv", index=False)
```

4/4 [==============================] - 0s 19ms/step

Wrap-up

Scored public: 0.85 / private: 0.84.
That is lower than the machine learning algorithms (RF, XGBoost, LGBM), which reached public: 0.87.

This post is licensed under CC BY 4.0 by the author.
