
Kaggle Santander Customer 기본 모델(XGBoost, LightGBM)

Kaggle Santander Customer Satisfaction

Most customers are satisfied and only a small fraction are not, so ROC-AUC is a more appropriate metric here than raw accuracy.
Without any special preprocessing, baseline classifiers were built with two popular boosting algorithms.
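Why ROC-AUC rather than accuracy can be seen in a tiny sketch (hypothetical labels roughly matching the competition's imbalance, not the actual data):

```python
import numpy as np
from sklearn.metrics import accuracy_score, roc_auc_score

# Hypothetical imbalanced labels: 96% satisfied (0), 4% unsatisfied (1)
y_true = np.array([0] * 96 + [1] * 4)
# A degenerate "model" that always predicts the majority class
always_zero = np.zeros(100, dtype=int)

acc = accuracy_score(y_true, always_zero)  # 0.96 — looks impressive
auc = roc_auc_score(y_true, always_zero)   # 0.5 — no discrimination at all
```

Accuracy rewards the degenerate model; AUC exposes it, which is why the competition (and this post) scores on ROC-AUC.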

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

Data Load

import glob

path = glob.glob("santander-customer-satisfaction/*")
path
['santander-customer-satisfaction\\sample_submission.csv',
 'santander-customer-satisfaction\\test.csv',
 'santander-customer-satisfaction\\train.csv']
train = pd.read_csv(path[2])
test = pd.read_csv(path[1])

train.shape, test.shape
((76020, 371), (75818, 370))

EDA

display(train.head())
display(test.head())
(train.head() output truncated — 5 rows × 371 columns: ID, var3, var15, imp_ent_var16_ult1, ..., var38, TARGET)

(test.head() output truncated — 5 rows × 370 columns: the same columns without TARGET)

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76020 entries, 0 to 76019
Columns: 371 entries, ID to TARGET
dtypes: float64(111), int64(260)
memory usage: 215.2 MB
train.describe()
(train.describe() output truncated — 8 rows × 371 columns. The notable entry: var3 has a minimum of -999999, far outside the range of its other values.)

The -999999 values in var3 look like missing values encoded as a sentinel.

train["var3"].value_counts()[:5]
 2         74165
 8           138
-999999      116
 9           110
 3           108
Name: var3, dtype: int64
# Replace the sentinel with the most common value (2) and drop the ID column
train["var3"].replace(-999999, 2, inplace=True)
train.drop(columns="ID", inplace=True)
test["var3"].replace(-999999, 2, inplace=True)
test.drop(columns="ID", inplace=True)
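The same sentinel check can be generalized to every column. A minimal sketch on a hypothetical mini-frame (column-wise mode replacement is an assumption for illustration, not the post's exact approach):

```python
import pandas as pd

# Hypothetical mini-frame standing in for train: scan every column for the
# -999999 sentinel and replace it with that column's most common clean value.
df = pd.DataFrame({"var3": [2, -999999, 2, 8], "var15": [23, 34, 23, 37]})

sentinel_cols = [c for c in df.columns if (df[c] == -999999).any()]
for c in sentinel_cols:
    mode_val = df.loc[df[c] != -999999, c].mode()[0]
    df[c] = df[c].replace(-999999, mode_val)
```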

Building the Training Data

X_features = train.iloc[:, :-1]
y_labels = train.iloc[:, -1]

X_features.shape, y_labels.shape
((76020, 369), (76020,))
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels, test_size=0.2)

print(f"X_train: {X_train.shape}\ny_train: {y_train.shape}\nX_test: {X_test.shape}\ny_test: {y_test.shape}")
X_train: (60816, 369)
y_train: (60816,)
X_test: (15204, 369)
y_test: (15204,)
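One caveat with the split above: without `stratify`, the minority-class share can drift between the train and test pieces. A small sketch on synthetic 90/10 labels (not the Santander data):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy labels with a 90/10 imbalance
X = np.arange(100).reshape(-1, 1)
y = np.array([0] * 90 + [1] * 10)

# stratify=y keeps the 10% minority share identical in both pieces
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)
```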

XGBoost

from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score

xgb_clf = XGBClassifier(n_estimators=500, use_label_encoder=False)

xgb_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)

xgb_clf_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1], average="macro")

xgb_clf_score
0.8407743805528098

Hyperparameter Tuning with GridSearchCV

from sklearn.model_selection import GridSearchCV

xgb_clf = XGBClassifier(n_estimators=100, use_label_encoder=False)

params = {"max_depth": [5, 7],
          "min_child_weight": [1, 3],
          "colsample_bytree": [0.5, 0.75]
          }

gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)
GridSearchCV(cv=3,
             estimator=XGBClassifier(base_score=None, booster=None,
                                     colsample_bylevel=None,
                                     colsample_bynode=None,
                                     colsample_bytree=None,
                                     enable_categorical=False, gamma=None,
                                     gpu_id=None, importance_type=None,
                                     interaction_constraints=None,
                                     learning_rate=None, max_delta_step=None,
                                     max_depth=None, min_child_weight=None,
                                     missing=nan, monotone_constraints=None,
                                     n_estimators=100, n_jobs=None,
                                     num_parallel_tree=None, predictor=None,
                                     random_state=None, reg_alpha=None,
                                     reg_lambda=None, scale_pos_weight=None,
                                     subsample=None, tree_method=None,
                                     use_label_encoder=False,
                                     validate_parameters=None, verbosity=None),
             param_grid={'colsample_bytree': [0.5, 0.75], 'max_depth': [5, 7],
                         'min_child_weight': [1, 3]})
gridcv.best_estimator_
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
              colsample_bynode=1, colsample_bytree=0.75,
              enable_categorical=False, gamma=0, gpu_id=-1,
              importance_type=None, interaction_constraints='',
              learning_rate=0.300000012, max_delta_step=0, max_depth=5,
              min_child_weight=3, missing=nan, monotone_constraints='()',
              n_estimators=100, n_jobs=12, num_parallel_tree=1,
              predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1,
              scale_pos_weight=1, subsample=1, tree_method='exact',
              use_label_encoder=False, validate_parameters=1, verbosity=None)
gridcv.best_params_
{'colsample_bytree': 0.75, 'max_depth': 5, 'min_child_weight': 3}
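A detail worth noting about the grid search above: without a `scoring` argument, GridSearchCV ranks candidates by accuracy, not AUC. A minimal sketch on toy data (a DecisionTree stand-in, not the Santander frame) that aligns selection with the competition metric:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced data; scoring="roc_auc" selects on the competition metric
X, y = make_classification(n_samples=300, weights=[0.9], random_state=0)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [3, 5], "min_samples_leaf": [1, 5]},
    scoring="roc_auc",  # rank candidates by AUC, not default accuracy
    cv=3,               # 4 combinations x 3 folds = 12 fits, then a refit
)
grid.fit(X, y)
grid.best_params_  # the winning combination
```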
xgb_clf = XGBClassifier(n_estimators=1000, learning_rate=0.02, max_depth=5, min_child_weight=3, colsample_bytree=0.75, reg_alpha=0.03, use_label_encoder=False)

xgb_clf.fit(X_train, y_train, early_stopping_rounds=200, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)

xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1], average="macro")

xgb_roc_score
0.8401977778587405
from xgboost import plot_importance

_ = plot_importance(xgb_clf, max_num_features=20, height=0.4)

[Feature importance plot — top 20 features]
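If the plot is not needed, the same ranking can be read directly from `feature_importances_`. Sketched here with a RandomForest stand-in on synthetic data rather than the fitted xgb_clf:

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data; any fitted tree ensemble exposes feature_importances_
X, y = make_classification(n_samples=200, n_features=10, random_state=0)
model = RandomForestClassifier(random_state=0).fit(X, y)

# Importances as a named Series, then take the strongest features
imp = pd.Series(model.feature_importances_, index=[f"f{i}" for i in range(10)])
top = imp.sort_values(ascending=False).head(5)
```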

sub = pd.read_csv(path[0])
sub.head()
   ID  TARGET
0   2       0
1   5       0
2   6       0
3   7       0
4   9       0
# Submit probabilities, not hard labels — the competition metric is ROC-AUC
sub["TARGET"] = xgb_clf.predict_proba(test)[:, 1]
file_name = f"sub_XGB_roc_{xgb_roc_score:.4f}.csv"
sub.to_csv(file_name, index=False)

LightGBM

from lightgbm import LGBMClassifier

lgbm_clf = LGBMClassifier(n_estimators=500)

evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals, verbose=False)

lgbm_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1], average="macro")

lgbm_score
0.8340050803020832

Hyperparameter Tuning with GridSearchCV

from lightgbm import early_stopping
from sklearn.model_selection import GridSearchCV

lgbm_clf = LGBMClassifier(n_estimators=200)

params = {"num_leaves": [32, 64],
          "max_depth": [128, 160],
          "min_child_samples": [60, 100],
          "subsample": [0.8, 1]
          }

gridcv = GridSearchCV(lgbm_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, eval_metric="auc", callbacks=[early_stopping(30)], eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)

lgbm_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average="macro")

lgbm_roc_score
0.8375396920779556
lgbm_clf = gridcv.best_estimator_
lgbm_clf
LGBMClassifier(max_depth=128, min_child_samples=100, n_estimators=200,
               num_leaves=32, subsample=0.8)
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=[(X_test, y_test)], verbose=0)

lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:, 1], average="macro")

lgbm_roc_score
0.8375396920779556
# Again submit the probability scores rather than hard labels
sub["TARGET"] = lgbm_clf.predict_proba(test)[:, 1]
file_name = f"sub_lgbm_roc_{lgbm_roc_score:.4f}.csv"
sub.to_csv(file_name, index=False)

Since the focus here was on building the boosting models rather than on EDA, the Kaggle score initially came out at 0.5. (A competition scored on ROC-AUC expects the predict_proba scores to be submitted, not hard 0/1 labels from predict — submitting labels collapses the leaderboard AUC toward 0.5.)

This post is licensed under CC BY 4.0 by the author.
