Kaggle Santander Customer Satisfaction
Since the vast majority of customers in this dataset are satisfied and only a small fraction are dissatisfied, ROC-AUC is a more appropriate metric than raw accuracy.
With no special preprocessing, I built basic classification models using two popular boosting-family algorithms.
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
Data Load
import glob
path = glob.glob("santander-customer-satisfaction/*")
path
['santander-customer-satisfaction\\sample_submission.csv',
'santander-customer-satisfaction\\test.csv',
'santander-customer-satisfaction\\train.csv']
train = pd.read_csv(path[2])
test = pd.read_csv(path[1])
train.shape, test.shape
((76020, 371), (75818, 370))
EDA
display(train.head())
display(test.head())
 | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 39205.170000 | 0 |
1 | 3 | 2 | 34 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 49278.030000 | 0 |
2 | 4 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 67333.770000 | 0 |
3 | 8 | 2 | 37 | 0.0 | 195.0 | 195.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 64007.970000 | 0 |
4 | 10 | 2 | 39 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 117310.979016 | 0 |
5 rows × 371 columns
 | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var29_ult3 | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2 | 2 | 32 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 40532.10 |
1 | 5 | 2 | 35 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 45486.72 |
2 | 6 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 46993.95 |
3 | 7 | 2 | 24 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 187898.61 |
4 | 9 | 2 | 23 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | ... | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 73649.73 |
5 rows × 370 columns
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 76020 entries, 0 to 76019
Columns: 371 entries, ID to TARGET
dtypes: float64(111), int64(260)
memory usage: 215.2 MB
train.describe()
 | ID | var3 | var15 | imp_ent_var16_ult1 | imp_op_var39_comer_ult1 | imp_op_var39_comer_ult3 | imp_op_var40_comer_ult1 | imp_op_var40_comer_ult3 | imp_op_var40_efect_ult1 | imp_op_var40_efect_ult3 | ... | saldo_medio_var33_hace2 | saldo_medio_var33_hace3 | saldo_medio_var33_ult1 | saldo_medio_var33_ult3 | saldo_medio_var44_hace2 | saldo_medio_var44_hace3 | saldo_medio_var44_ult1 | saldo_medio_var44_ult3 | var38 | TARGET
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | ... | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 76020.000000 | 7.602000e+04 | 76020.000000 |
mean | 75964.050723 | -1523.199277 | 33.212865 | 86.208265 | 72.363067 | 119.529632 | 3.559130 | 6.472698 | 0.412946 | 0.567352 | ... | 7.935824 | 1.365146 | 12.215580 | 8.784074 | 31.505324 | 1.858575 | 76.026165 | 56.614351 | 1.172358e+05 | 0.039569 |
std | 43781.947379 | 39033.462364 | 12.956486 | 1614.757313 | 339.315831 | 546.266294 | 93.155749 | 153.737066 | 30.604864 | 36.513513 | ... | 455.887218 | 113.959637 | 783.207399 | 538.439211 | 2013.125393 | 147.786584 | 4040.337842 | 2852.579397 | 1.826646e+05 | 0.194945 |
min | 1.000000 | -999999.000000 | 5.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 5.163750e+03 | 0.000000 |
25% | 38104.750000 | 2.000000 | 23.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 6.787061e+04 | 0.000000 |
50% | 76043.000000 | 2.000000 | 28.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.064092e+05 | 0.000000 |
75% | 113748.750000 | 2.000000 | 40.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | ... | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 0.000000 | 1.187563e+05 | 0.000000 |
max | 151838.000000 | 238.000000 | 105.000000 | 210000.000000 | 12888.030000 | 21024.810000 | 8237.820000 | 11073.570000 | 6600.000000 | 6600.000000 | ... | 50003.880000 | 20385.720000 | 138831.630000 | 91778.730000 | 438329.220000 | 24650.010000 | 681462.900000 | 397884.300000 | 2.203474e+07 | 1.000000 |
8 rows × 371 columns
For var3, the value -999999 was presumably used to encode missing values.
train["var3"].value_counts()[:5]
2 74165
8 138
-999999 116
9 110
3 108
Name: var3, dtype: int64
# Replace the sentinel with the most frequent value (2) and drop the ID column
train["var3"] = train["var3"].replace(-999999, 2)
train.drop(columns="ID", inplace=True)
test["var3"].replace(-999999, 2, inplace=True)
test.drop(columns="ID", axis=1, inplace=True)
Building the Training Data
X_features = train.iloc[:, :-1]
y_labels = train.iloc[:, -1]
X_features.shape, y_labels.shape
((76020, 369), (76020,))
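As a quick check of the imbalance noted at the top, the label distribution can be printed directly. A minimal sketch using the y_labels Series defined above; the roughly 4% positive rate matches the TARGET mean of about 0.0396 in describe():

# About 96% satisfied (0) vs. 4% dissatisfied (1), hence ROC-AUC over accuracy
print(y_labels.value_counts(normalize=True))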
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_features, y_labels, test_size=0.2)
print(f"X_train: {X_train.shape}\ny_train: {y_train.shape}\nX_test: {X_test.shape}\ny_test: {y_test.shape}")
X_train: (60816, 369)
y_train: (60816,)
X_test: (15204, 369)
y_test: (15204,)
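Note that the split above is neither seeded nor stratified. With only about 4% positives, stratification keeps the class ratio identical in both folds. A minimal variant (stratify and random_state are standard train_test_split parameters; 42 is an arbitrary seed):

# Stratified, reproducible variant of the split above
X_train, X_test, y_train, y_test = train_test_split(
    X_features, y_labels, test_size=0.2, stratify=y_labels, random_state=42)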
XGBoost
from xgboost import XGBClassifier
from sklearn.metrics import roc_auc_score
# use_label_encoder=False and fit-time early_stopping_rounds follow the XGBoost 1.x API
xgb_clf = XGBClassifier(n_estimators=500, use_label_encoder=False)
xgb_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)
xgb_clf_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:,1], average="macro")
xgb_clf_score
0.8407743805528098
Hyperparameter Tuning with GridSearch
from sklearn.model_selection import GridSearchCV
xgb_clf = XGBClassifier(n_estimators=100, use_label_encoder=False)
params = {"max_depth": [5, 7],
          "min_child_weight": [1, 3],
          "colsample_bytree": [0.5, 0.75]}
gridcv = GridSearchCV(xgb_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, early_stopping_rounds=30, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)
GridSearchCV(cv=3,
estimator=XGBClassifier(base_score=None, booster=None,
colsample_bylevel=None,
colsample_bynode=None,
colsample_bytree=None,
enable_categorical=False, gamma=None,
gpu_id=None, importance_type=None,
interaction_constraints=None,
learning_rate=None, max_delta_step=None,
max_depth=None, min_child_weight=None,
missing=nan, monotone_constraints=None,
n_estimators=100, n_jobs=None,
num_parallel_tree=None, predictor=None,
random_state=None, reg_alpha=None,
reg_lambda=None, scale_pos_weight=None,
subsample=None, tree_method=None,
use_label_encoder=False,
validate_parameters=None, verbosity=None),
param_grid={'colsample_bytree': [0.5, 0.75], 'max_depth': [5, 7],
'min_child_weight': [1, 3]})
gridcv.best_estimator_
XGBClassifier(base_score=0.5, booster='gbtree', colsample_bylevel=1,
colsample_bynode=1, colsample_bytree=0.75,
enable_categorical=False, gamma=0, gpu_id=-1,
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_delta_step=0, max_depth=5,
min_child_weight=3, missing=nan, monotone_constraints='()',
n_estimators=100, n_jobs=12, num_parallel_tree=1,
predictor='auto', random_state=0, reg_alpha=0, reg_lambda=1,
scale_pos_weight=1, subsample=1, tree_method='exact',
use_label_encoder=False, validate_parameters=1, verbosity=None)
gridcv.best_params_
{'colsample_bytree': 0.75, 'max_depth': 5, 'min_child_weight': 3}
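Rather than retyping the tuned values, best_params_ can be unpacked straight into a fresh estimator. A small sketch; the n_estimators, learning_rate, and reg_alpha values are the hand-picked ones used in the next cell, not outputs of the search:

# Build the final model directly from the grid's best parameters
xgb_clf = XGBClassifier(n_estimators=1000, learning_rate=0.02, reg_alpha=0.03,
                        use_label_encoder=False, **gridcv.best_params_)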
xgb_clf = XGBClassifier(n_estimators=1000, learning_rate=0.02, max_depth=5, min_child_weight=3, colsample_bytree=0.75, reg_alpha=0.03, use_label_encoder=False)
xgb_clf.fit(X_train, y_train, early_stopping_rounds=200, eval_metric="auc", eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)
xgb_roc_score = roc_auc_score(y_test, xgb_clf.predict_proba(X_test)[:, 1], average="macro")
xgb_roc_score
0.8401977778587405
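With a roughly 24-to-1 negative-to-positive ratio, XGBoost's scale_pos_weight is another lever worth trying, since it upweights the positive class in the loss. A hedged sketch that was not part of the original run, so its score is unknown:

# neg/pos ratio, roughly 0.96 / 0.04, i.e. about 24
ratio = (y_train == 0).sum() / (y_train == 1).sum()
xgb_weighted = XGBClassifier(n_estimators=500, scale_pos_weight=ratio, use_label_encoder=False)
xgb_weighted.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc",
                 eval_set=[(X_test, y_test)], verbose=0)
print(roc_auc_score(y_test, xgb_weighted.predict_proba(X_test)[:, 1]))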
from xgboost import plot_importance
_ = plot_importance(xgb_clf, max_num_features=20, height=0.4)
sub = pd.read_csv(path[0])
sub.head()
 | ID | TARGET
---|---|---|
0 | 2 | 0 |
1 | 5 | 0 |
2 | 6 | 0 |
3 | 7 | 0 |
4 | 9 | 0 |
sub["TARGET"] = xgb_clf.predict(test)
file_name = f"sub_XGB_roc_{xgb_roc_score:.4f}.csv"
sub.to_csv(file_name, index=False)
LightGBM
from lightgbm import LGBMClassifier
lgbm_clf = LGBMClassifier(n_estimators=500)
evals = [(X_test, y_test)]
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=evals, verbose=False)
lgbm_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:,1], average="macro")
lgbm_score
0.8340050803020832
Hyperparameter Tuning with GridSearch
from lightgbm import early_stopping
from sklearn.model_selection import GridSearchCV
lgbm_clf = LGBMClassifier(n_estimators=200)
params = {"num_leaves": [32, 64],
          "max_depth": [128, 160],
          "min_child_samples": [60, 100],
          "subsample": [0.8, 1]}
gridcv = GridSearchCV(lgbm_clf, param_grid=params, cv=3)
gridcv.fit(X_train, y_train, eval_metric="auc", callbacks=[early_stopping(30)], eval_set=[(X_train, y_train), (X_test, y_test)], verbose=0)
lgbm_roc_score = roc_auc_score(y_test, gridcv.predict_proba(X_test)[:,1], average="macro")
lgbm_roc_score
0.8375396920779556
lgbm_clf = gridcv.best_estimator_
lgbm_clf
LGBMClassifier(max_depth=128, min_child_samples=100, n_estimators=200,
num_leaves=32, subsample=0.8)
lgbm_clf.fit(X_train, y_train, early_stopping_rounds=100, eval_metric="auc", eval_set=[(X_test, y_test)], verbose=0)
lgbm_roc_score = roc_auc_score(y_test, lgbm_clf.predict_proba(X_test)[:, 1], average="macro")
lgbm_roc_score
0.8375396920779556
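For reference, the fit-time early_stopping_rounds and verbose keywords were removed in LightGBM 4.x; the callback form already used in the grid search is the forward-compatible spelling. A sketch assuming LightGBM 3.3 or newer, where log_evaluation(0) silences per-iteration logging:

from lightgbm import early_stopping, log_evaluation

# Equivalent refit using callbacks instead of fit-time keywords
lgbm_clf.fit(X_train, y_train, eval_metric="auc", eval_set=[(X_test, y_test)],
             callbacks=[early_stopping(100), log_evaluation(0)])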
sub["TARGET"] = lgbm_clf.predict(test)
file_name = f"sub_lgbm_roc_{lgbm_roc_score:.4f}.csv"
sub.to_csv(file_name, index=False)
Since I focused on building the boosting models rather than on EDA, the Kaggle score initially came out at 0.5.
An AUC of 0.5 is the telltale result of submitting hard 0/1 labels from predict(): with this imbalance the default threshold marks nearly everyone as 0. That is why the submission cells above pass predict_proba() probabilities, which is what an ROC-AUC leaderboard expects.