
Classifying the DACON Wine Dataset with Machine Learning

DACON Wine

Since I couldn't find a suitable wine dataset on Kaggle, I used the DACON Wine dataset instead.
Dataset source

Preprocessing and EDA

Libraries Used

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import koreanize_matplotlib

from sklearn.tree import DecisionTreeClassifier, plot_tree
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from lightgbm import LGBMClassifier
from lightgbm import plot_importance

Data Load

train = pd.read_csv("data/train.csv")
test = pd.read_csv("data/test.csv")

train.shape, test.shape
((5497, 14), (1000, 13))
display(train.sample(3))
display(test.sample(3))
| | index | quality | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2274 | 2274 | 5 | 7.5 | 0.490 | 0.19 | 1.9 | 0.076 | 10.0 | 44.0 | 0.99570 | 3.39 | 0.54 | 9.7 | red |
| 4452 | 4452 | 6 | 7.2 | 0.605 | 0.02 | 1.9 | 0.096 | 10.0 | 31.0 | 0.99500 | 3.46 | 0.53 | 11.8 | red |
| 4500 | 4500 | 7 | 6.8 | 0.180 | 0.30 | 12.8 | 0.062 | 19.0 | 171.0 | 0.99808 | 3.00 | 0.52 | 9.0 | white |

| | index | fixed acidity | volatile acidity | citric acid | residual sugar | chlorides | free sulfur dioxide | total sulfur dioxide | density | pH | sulphates | alcohol | type |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 354 | 354 | 8.3 | 0.18 | 0.30 | 1.1 | 0.033 | 20.0 | 57.0 | 0.99109 | 3.02 | 0.51 | 11.0 | white |
| 795 | 795 | 5.5 | 0.24 | 0.32 | 8.7 | 0.060 | 19.0 | 102.0 | 0.99400 | 3.27 | 0.31 | 10.4 | white |
| 197 | 197 | 7.6 | 0.25 | 0.34 | 1.3 | 0.056 | 34.0 | 176.0 | 0.99434 | 3.10 | 0.51 | 9.5 | white |

Basic Information

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5497 entries, 0 to 5496
Data columns (total 14 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 5497 non-null   int64  
 1   quality               5497 non-null   int64  
 2   fixed acidity         5497 non-null   float64
 3   volatile acidity      5497 non-null   float64
 4   citric acid           5497 non-null   float64
 5   residual sugar        5497 non-null   float64
 6   chlorides             5497 non-null   float64
 7   free sulfur dioxide   5497 non-null   float64
 8   total sulfur dioxide  5497 non-null   float64
 9   density               5497 non-null   float64
 10  pH                    5497 non-null   float64
 11  sulphates             5497 non-null   float64
 12  alcohol               5497 non-null   float64
 13  type                  5497 non-null   object 
dtypes: float64(11), int64(2), object(1)
memory usage: 601.4+ KB
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 13 columns):
 #   Column                Non-Null Count  Dtype  
---  ------                --------------  -----  
 0   index                 1000 non-null   int64  
 1   fixed acidity         1000 non-null   float64
 2   volatile acidity      1000 non-null   float64
 3   citric acid           1000 non-null   float64
 4   residual sugar        1000 non-null   float64
 5   chlorides             1000 non-null   float64
 6   free sulfur dioxide   1000 non-null   float64
 7   total sulfur dioxide  1000 non-null   float64
 8   density               1000 non-null   float64
 9   pH                    1000 non-null   float64
 10  sulphates             1000 non-null   float64
 11  alcohol               1000 non-null   float64
 12  type                  1000 non-null   object 
dtypes: float64(11), int64(1), object(1)
memory usage: 101.7+ KB
  • index : row identifier
  • quality : wine quality (the target)
  • fixed acidity : fixed (non-volatile) acidity
  • volatile acidity : volatile acidity
  • citric acid : citric acid content
  • residual sugar : sugar remaining in the wine after fermentation
  • chlorides : chloride content
  • free sulfur dioxide : free sulfur dioxide
  • total sulfur dioxide : total sulfur dioxide
  • density : density
  • pH : hydrogen ion concentration
  • sulphates : sulphate content
  • alcohol : alcohol content
  • type : wine type (red/white)

quality is the target value we need to classify.
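
Before modeling, it helps to see how the target classes are distributed (a quick supplementary check, not part of the original notebook):

# Class distribution of the target; the quality levels are likely imbalanced.
print(train["quality"].value_counts().sort_index())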

_ = train.hist(bins=50, figsize=(12, 10))

png

Checking for Missing Values

_ = sns.heatmap(train.isnull()).set_title("Train")

png

_ = sns.heatmap(test.isnull()).set_title("Test")

png

There are no missing values in either dataset.
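
The heatmaps give a visual confirmation; a direct numeric check (an equivalent alternative, not in the original post) should print zero for both sets:

# Total number of missing values in each dataframe.
print(train.isnull().sum().sum(), test.isnull().sum().sum())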

Visualization

_ = sns.countplot(data=train, x="type").set_title("와인 종류별 개수")

png

_ = sns.countplot(data=train, x="quality", hue="type").set_title("와인 종류별 품질 분포")

png

_ = sns.scatterplot(data=train, x="pH", y="alcohol", hue="type")

png

In general, the red wines appear to be more acidic than the white wines.
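
That impression from the scatter plot can be checked numerically (a small supplementary check, not in the original post):

# Mean pH and fixed acidity by wine type; note that a lower pH means higher acidity.
print(train.groupby("type")[["pH", "fixed acidity"]].mean())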

Converting Categorical Data to Numeric

Since type is a categorical feature, convert it to numeric:
white=1, red=0

# train
temp = pd.get_dummies(train["type"], drop_first=True)
train = pd.concat([train, temp], axis=1).copy()
train.drop(columns="type", inplace=True)
# test
temp = pd.get_dummies(test["type"], drop_first=True)
test = pd.concat([test, temp], axis=1).copy()
test.drop(columns="type", inplace=True)
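
Because there are only two categories, pd.get_dummies(..., drop_first=True) drops "red" and leaves a single column named white (1 for white, 0 for red). For reference, an explicit mapping gives the same result and makes the encoding obvious; a minimal sketch, assuming the raw type column has not been dropped yet:

# Explicit white=1, red=0 encoding (sketch; assumes "type" still exists).
mapping = {"red": 0, "white": 1}
train["white"] = train["type"].map(mapping)
test["white"] = test["type"].map(mapping)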
label_name = "quality"

features_names = train.columns.tolist()
features_names.remove("index")
features_names.remove(label_name)
X_train, X_test, y_train, y_test = train_test_split(train[features_names], train[label_name], test_size=0.2)

print(f"X_train: {X_train.shape}\ny_train: {y_train.shape}\nX_test: {X_test.shape}\ny_test: {y_test.shape}")
X_train: (4397, 12)
y_train: (4397,)
X_test: (1100, 12)
y_test: (1100,)
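
One caveat: train_test_split is called without random_state, so the split (and the accuracy numbers below) will change from run to run. A reproducible, stratified variant could look like this (a sketch, not the code used in this post):

# Fix the seed and keep the quality class ratios identical in both splits.
X_train, X_test, y_train, y_test = train_test_split(
    train[features_names],
    train[label_name],
    test_size=0.2,
    random_state=42,
    stratify=train[label_name],
)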

Train - Decision Tree

model_dt = DecisionTreeClassifier()

model_dt.fit(X_train, y_train)

pred_dt = model_dt.predict(X_test)
accuracy_score(pred_dt, y_test)
0.61
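
Accuracy alone hides how the tree does on the rare quality levels; a per-class breakdown (supplementary, not in the original post) is easy to get:

from sklearn.metrics import classification_report

# Per-class precision/recall/F1; zero_division=0 silences warnings for
# quality levels the model never predicts.
print(classification_report(y_test, pred_dt, zero_division=0))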

Decision Tree Analysis

model_dt.feature_importances_
array([0.07201916, 0.11946309, 0.06761771, 0.08113314, 0.08731109,
       0.08870931, 0.11023723, 0.07477896, 0.07667382, 0.08779586,
       0.13392146, 0.00033917])
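
The raw array is hard to read on its own; pairing it with the feature names (a small convenience snippet, not in the original notebook) shows which features the tree relies on most:

# Match importances to feature names and sort in descending order.
importances = pd.Series(model_dt.feature_importances_, index=features_names)
print(importances.sort_values(ascending=False))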
plt.figure(figsize=(12, 8))
_ = plot_tree(model_dt, max_depth=4, feature_names=features_names, filled=True)
plt.show()

png

_ = sns.barplot(x=model_dt.feature_importances_, y=model_dt.feature_names_in_)

png

Train - Random Forest

model_rf = RandomForestClassifier()

model_rf.fit(X_train, y_train)

pred_rf = model_rf.predict(X_test)
accuracy_score(pred_rf, y_test)
0.6663636363636364

Random Forest Analysis

model_rf.feature_importances_
array([0.07451421, 0.09884907, 0.08039004, 0.08426116, 0.08436724,
       0.08455709, 0.09202456, 0.10306862, 0.08115849, 0.08710219,
       0.1264419 , 0.00326544])
plt.figure(figsize=(12, 8))
_ = sns.barplot(x=model_rf.feature_importances_, y=model_rf.feature_names_in_)
plt.show()

png
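
The Random Forest above uses default hyperparameters. A light cross-validated search might squeeze out a bit more accuracy; the grid below is purely illustrative (the values are assumptions, not tuned results):

from sklearn.model_selection import GridSearchCV

# Small illustrative grid; expand or adjust as needed.
param_grid = {
    "n_estimators": [100, 300, 500],
    "max_depth": [None, 10, 20],
}
grid = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3, n_jobs=-1)
grid.fit(X_train, y_train)
print(grid.best_params_, grid.best_score_)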

Train - LightGBM

lgbm_wrapper = LGBMClassifier(n_estimators=400)
evals = [(X_test, y_test)]

lgbm_wrapper.fit(X_train, y_train, eval_metric="logloss", eval_set=evals, verbose=False)
LGBMClassifier(n_estimators=400)
preds = lgbm_wrapper.predict(X_test)

accuracy_score(y_test, preds)
0.6436363636363637

LightGBM Analysis

fig, ax = plt.subplots(figsize=(10, 12))
_ = plot_importance(lgbm_wrapper, ax=ax)

png
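
Since an eval_set is already passed to fit, early stopping is a natural next step. A sketch using LightGBM's callback API (available in recent LightGBM versions; the 50-round patience is an arbitrary choice, not something tried in this post):

from lightgbm import early_stopping

# Stop adding trees once the validation metric has not improved for 50 rounds.
lgbm_es = LGBMClassifier(n_estimators=400)
lgbm_es.fit(
    X_train, y_train,
    eval_set=[(X_test, y_test)],
    callbacks=[early_stopping(stopping_rounds=50)],
)
print(accuracy_score(y_test, lgbm_es.predict(X_test)))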

Submitting to DACON, the LightGBM model achieved 0.671.
In local testing the Random Forest model scored higher, but on the DACON leaderboard LightGBM performed better.

img
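
For completeness, a submission file can be built by predicting on the test set. The sketch below assumes the usual DACON sample_submission.csv format with an index and a quality column, which is not shown in the post:

# Build the submission file (file name and column name are assumptions).
submission = pd.read_csv("data/sample_submission.csv")
submission["quality"] = lgbm_wrapper.predict(test[features_names])
submission.to_csv("submit.csv", index=False)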
