
A Taste of PyCaret

The Pima Indians dataset with PyCaret

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import koreanize_matplotlib
```

Data Load

The Pima Indians diabetes dataset

```python
df_pima = pd.read_csv("http://bit.ly/data-diabetes-csv")
df_pima.shape
```

```
(768, 9)
```
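Before modeling, it is worth checking the target balance, since the Pima dataset is imbalanced (roughly 65% non-diabetic to 35% diabetic). A minimal sketch with an illustrative stand-in for `df_pima["Outcome"]` (the values below are made up for demonstration):

```python
import pandas as pd

# Stand-in for df_pima["Outcome"]; the real column holds 0/1 diabetes labels.
outcome = pd.Series([0, 0, 0, 1, 0, 1, 0, 1, 0, 0])

# Relative class frequencies; on the real data this is ~0.65 / ~0.35,
# which is why a stratified fold strategy is chosen later in setup().
ratio = outcome.value_counts(normalize=True)
print(ratio)
```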

PyCaret

Applying PyCaret to the binary classification of diabetes

```python
from pycaret.classification import *
```

setup

This is where the train data, test data, label, target, and so on are configured; preprocessing techniques can also be applied to the data here.

```python
pycaret_models = setup(
    session_id=42,              # random seed
    data=df_pima,               # input data
    target="Outcome",           # target column
    normalize=True,             # whether to normalize
    normalize_method="minmax",  # normalization method
    transformation=True,        # transform features closer to a normal distribution
    fold_strategy="stratifiedkfold",
    use_gpu=True
)
```
| | Description | Value |
|---:|---|---|
| 0 | session_id | 42 |
| 1 | Target | Outcome |
| 2 | Target Type | Binary |
| 3 | Label Encoded | None |
| 4 | Original Data | (768, 9) |
| 5 | Missing Values | False |
| 6 | Numeric Features | 7 |
| 7 | Categorical Features | 1 |
| 8 | Ordinal Features | False |
| 9 | High Cardinality Features | False |
| 10 | High Cardinality Method | None |
| 11 | Transformed Train Set | (537, 24) |
| 12 | Transformed Test Set | (231, 24) |
| 13 | Shuffle Train-Test | True |
| 14 | Stratify Train-Test | False |
| 15 | Fold Generator | StratifiedKFold |
| 16 | Fold Number | 10 |
| 17 | CPU Jobs | -1 |
| 18 | Use GPU | True |
| 19 | Log Experiment | False |
| 20 | Experiment Name | clf-default-name |
| 21 | USI | d7e1 |
| 22 | Imputation Type | simple |
| 23 | Iterative Imputation Iteration | None |
| 24 | Numeric Imputer | mean |
| 25 | Iterative Imputation Numeric Model | None |
| 26 | Categorical Imputer | constant |
| 27 | Iterative Imputation Categorical Model | None |
| 28 | Unknown Categoricals Handling | least_frequent |
| 29 | Normalize | True |
| 30 | Normalize Method | minmax |
| 31 | Transformation | True |
| 32 | Transformation Method | yeo-johnson |
| 33 | PCA | False |
| 34 | PCA Method | None |
| 35 | PCA Components | None |
| 36 | Ignore Low Variance | False |
| 37 | Combine Rare Levels | False |
| 38 | Rare Level Threshold | None |
| 39 | Numeric Binning | False |
| 40 | Remove Outliers | False |
| 41 | Outliers Threshold | None |
| 42 | Remove Multicollinearity | False |
| 43 | Multicollinearity Threshold | None |
| 44 | Remove Perfect Collinearity | True |
| 45 | Clustering | False |
| 46 | Clustering Iteration | None |
| 47 | Polynomial Features | False |
| 48 | Polynomial Degree | None |
| 49 | Trignometry Features | False |
| 50 | Polynomial Threshold | None |
| 51 | Group Features | False |
| 52 | Feature Selection | False |
| 53 | Feature Selection Method | classic |
| 54 | Features Selection Threshold | None |
| 55 | Feature Interaction | False |
| 56 | Feature Ratio | False |
| 57 | Interaction Threshold | None |
| 58 | Fix Imbalance | False |
| 59 | Fix Imbalance Method | SMOTE |
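The `normalize_method="minmax"` setting above rescales each numeric feature to [0, 1], and `transformation=True` additionally applies a Yeo-Johnson power transform (row 32). The min-max step itself is simple to sketch in NumPy (the feature values below are illustrative):

```python
import numpy as np

# One feature column, e.g. Glucose-like values (illustrative numbers).
x = np.array([85.0, 168.0, 99.0, 140.0, 122.0])

# Min-max scaling: (x - min) / (max - min), mapping the column onto [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled.min(), x_scaled.max())  # 0.0 1.0
```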

models

```python
models_list = models()
models_list
```
| ID | Name | Reference | Turbo |
|---|---|---|---|
| lr | Logistic Regression | sklearn.linear_model._logistic.LogisticRegression | True |
| knn | K Neighbors Classifier | sklearn.neighbors._classification.KNeighborsCl... | True |
| nb | Naive Bayes | sklearn.naive_bayes.GaussianNB | True |
| dt | Decision Tree Classifier | sklearn.tree._classes.DecisionTreeClassifier | True |
| svm | SVM - Linear Kernel | sklearn.linear_model._stochastic_gradient.SGDC... | True |
| rbfsvm | SVM - Radial Kernel | sklearn.svm._classes.SVC | False |
| gpc | Gaussian Process Classifier | sklearn.gaussian_process._gpc.GaussianProcessC... | False |
| mlp | MLP Classifier | sklearn.neural_network._multilayer_perceptron.... | False |
| ridge | Ridge Classifier | sklearn.linear_model._ridge.RidgeClassifier | True |
| rf | Random Forest Classifier | sklearn.ensemble._forest.RandomForestClassifier | True |
| qda | Quadratic Discriminant Analysis | sklearn.discriminant_analysis.QuadraticDiscrim... | True |
| ada | Ada Boost Classifier | sklearn.ensemble._weight_boosting.AdaBoostClas... | True |
| gbc | Gradient Boosting Classifier | sklearn.ensemble._gb.GradientBoostingClassifier | True |
| lda | Linear Discriminant Analysis | sklearn.discriminant_analysis.LinearDiscrimina... | True |
| et | Extra Trees Classifier | sklearn.ensemble._forest.ExtraTreesClassifier | True |
| lightgbm | Light Gradient Boosting Machine | lightgbm.sklearn.LGBMClassifier | True |
| dummy | Dummy Classifier | sklearn.dummy.DummyClassifier | True |

This shows the list of models available in PyCaret.
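Since `models()` returns a pandas DataFrame indexed by model ID, it can be filtered like any other frame, for example to keep only the fast "Turbo" models. A sketch using a small stand-in frame with the same columns as above:

```python
import pandas as pd

# Stand-in for models(); the real frame has the same columns and ID index.
models_list = pd.DataFrame(
    {
        "Name": [
            "Logistic Regression",
            "SVM - Radial Kernel",
            "Light Gradient Boosting Machine",
        ],
        "Reference": [
            "sklearn.linear_model._logistic.LogisticRegression",
            "sklearn.svm._classes.SVC",
            "lightgbm.sklearn.LGBMClassifier",
        ],
        "Turbo": [True, False, True],
    },
    index=pd.Index(["lr", "rbfsvm", "lightgbm"], name="ID"),
)

# IDs of the Turbo models only; slow models such as rbfsvm drop out.
turbo_ids = models_list[models_list["Turbo"]].index.tolist()
print(turbo_ids)  # ['lr', 'lightgbm']
```

A list like this can be passed straight to `compare_models(include=...)`.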

compare_models

```python
pc_clf_models = compare_models(
    n_select=25,  # number of models to return
    include=models_list.index.tolist()
)
```
| | Model | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC | TT (Sec) |
|---|---|---|---|---|---|---|---|---|---|
| gpc | Gaussian Process Classifier | 0.7710 | 0.8098 | 0.5485 | 0.7305 | 0.6223 | 0.4645 | 0.4764 | 0.1060 |
| et | Extra Trees Classifier | 0.7691 | 0.8185 | 0.5643 | 0.7206 | 0.6279 | 0.4654 | 0.4755 | 0.4720 |
| lr | Logistic Regression | 0.7653 | 0.8368 | 0.5801 | 0.7055 | 0.6320 | 0.4632 | 0.4704 | 0.0260 |
| rf | Random Forest Classifier | 0.7615 | 0.8406 | 0.5693 | 0.6962 | 0.6202 | 0.4511 | 0.4591 | 0.4840 |
| ada | Ada Boost Classifier | 0.7597 | 0.8199 | 0.6061 | 0.6741 | 0.6360 | 0.4580 | 0.4609 | 0.0820 |
| lightgbm | Light Gradient Boosting Machine | 0.7597 | 0.8174 | 0.6333 | 0.6677 | 0.6437 | 0.4643 | 0.4691 | 0.9150 |
| lda | Linear Discriminant Analysis | 0.7596 | 0.8319 | 0.5798 | 0.6866 | 0.6223 | 0.4501 | 0.4565 | 0.0140 |
| rbfsvm | SVM - Radial Kernel | 0.7577 | 0.8418 | 0.5263 | 0.7080 | 0.6004 | 0.4331 | 0.4445 | 0.0300 |
| ridge | Ridge Classifier | 0.7559 | 0.0000 | 0.5693 | 0.6790 | 0.6151 | 0.4405 | 0.4457 | 0.0090 |
| gbc | Gradient Boosting Classifier | 0.7541 | 0.8396 | 0.6225 | 0.6566 | 0.6332 | 0.4502 | 0.4544 | 0.1010 |
| knn | K Neighbors Classifier | 0.7466 | 0.7800 | 0.5424 | 0.6789 | 0.6000 | 0.4183 | 0.4262 | 0.3870 |
| mlp | MLP Classifier | 0.7411 | 0.8044 | 0.5860 | 0.6483 | 0.6105 | 0.4186 | 0.4230 | 1.6220 |
| svm | SVM - Linear Kernel | 0.7299 | 0.0000 | 0.6284 | 0.6121 | 0.6119 | 0.4073 | 0.4127 | 0.0090 |
| dt | Decision Tree Classifier | 0.7187 | 0.6909 | 0.5965 | 0.6013 | 0.5919 | 0.3799 | 0.3848 | 0.0100 |
| nb | Naive Bayes | 0.6610 | 0.7666 | 0.1123 | 0.4499 | 0.1719 | 0.0824 | 0.1127 | 0.0090 |
| dummy | Dummy Classifier | 0.6499 | 0.5000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 0.0060 |
| qda | Quadratic Discriminant Analysis | 0.5529 | 0.5573 | 0.5865 | 0.4949 | 0.4345 | 0.1167 | 0.1724 | 0.0090 |
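Note that ridge and svm report an AUC of 0.0000: neither estimator exposes probability estimates, so ROC-AUC cannot be computed for them. This can be checked directly in scikit-learn (as the model list above shows, PyCaret's `svm` ID maps to `SGDClassifier`):

```python
from sklearn.linear_model import LogisticRegression, RidgeClassifier, SGDClassifier

# RidgeClassifier has no predict_proba at all, and SGDClassifier only
# exposes one for probabilistic losses, not its default hinge loss.
for est in (LogisticRegression(), RidgeClassifier(), SGDClassifier()):
    print(type(est).__name__, hasattr(est, "predict_proba"))
```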

create_model

Trains a single model (rather than many) with the `setup` configuration and reports its results.

```python
clf_lgbm = create_model("lightgbm")
```
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8148 | 0.8932 | 0.7895 | 0.7143 | 0.7500 | 0.6035 | 0.6054 |
| 1 | 0.7593 | 0.8045 | 0.4737 | 0.7500 | 0.5806 | 0.4236 | 0.4456 |
| 2 | 0.7222 | 0.8466 | 0.6316 | 0.6000 | 0.6154 | 0.3982 | 0.3985 |
| 3 | 0.6852 | 0.7278 | 0.6316 | 0.5455 | 0.5854 | 0.3338 | 0.3361 |
| 4 | 0.7778 | 0.8451 | 0.7368 | 0.6667 | 0.7000 | 0.5242 | 0.5259 |
| 5 | 0.8519 | 0.9023 | 0.7895 | 0.7895 | 0.7895 | 0.6752 | 0.6752 |
| 6 | 0.7222 | 0.7158 | 0.4737 | 0.6429 | 0.5455 | 0.3520 | 0.3605 |
| 7 | 0.7358 | 0.8079 | 0.5556 | 0.6250 | 0.5882 | 0.3948 | 0.3963 |
| 8 | 0.8113 | 0.8492 | 0.7778 | 0.7000 | 0.7368 | 0.5904 | 0.5924 |
| 9 | 0.7170 | 0.7786 | 0.4737 | 0.6429 | 0.5455 | 0.3468 | 0.3553 |
| Mean | 0.7597 | 0.8171 | 0.6333 | 0.6677 | 0.6437 | 0.4643 | 0.4691 |
| Std | 0.0503 | 0.0597 | 0.1276 | 0.0688 | 0.0866 | 0.1173 | 0.1152 |

tune_model

A method that helps with hyperparameter tuning.

```python
tuned_clf_lgbm = tune_model(clf_lgbm, n_iter=10, optimize="Accuracy")
```
| Fold | Accuracy | AUC | Recall | Prec. | F1 | Kappa | MCC |
|---|---|---|---|---|---|---|---|
| 0 | 0.8333 | 0.9308 | 0.8421 | 0.7273 | 0.7805 | 0.6473 | 0.6518 |
| 1 | 0.8148 | 0.8782 | 0.6316 | 0.8000 | 0.7059 | 0.5735 | 0.5820 |
| 2 | 0.8148 | 0.8496 | 0.7368 | 0.7368 | 0.7368 | 0.5940 | 0.5940 |
| 3 | 0.6852 | 0.7353 | 0.4211 | 0.5714 | 0.4848 | 0.2656 | 0.2720 |
| 4 | 0.7222 | 0.8226 | 0.6316 | 0.6000 | 0.6154 | 0.3982 | 0.3985 |
| 5 | 0.8333 | 0.8977 | 0.6842 | 0.8125 | 0.7429 | 0.6209 | 0.6259 |
| 6 | 0.7593 | 0.7805 | 0.5263 | 0.7143 | 0.6061 | 0.4384 | 0.4490 |
| 7 | 0.7170 | 0.8500 | 0.5556 | 0.5882 | 0.5714 | 0.3604 | 0.3607 |
| 8 | 0.7547 | 0.8492 | 0.5556 | 0.6667 | 0.6061 | 0.4301 | 0.4339 |
| 9 | 0.7170 | 0.7724 | 0.5263 | 0.6250 | 0.5714 | 0.3625 | 0.3655 |
| Mean | 0.7652 | 0.8366 | 0.6111 | 0.6842 | 0.6421 | 0.4691 | 0.4733 |
| Std | 0.0522 | 0.0571 | 0.1149 | 0.0826 | 0.0897 | 0.1238 | 0.1241 |
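Under the hood, `tune_model` draws `n_iter` candidates from a predefined hyperparameter grid and scores them with cross-validation. A rough stand-alone equivalent using scikit-learn's `RandomizedSearchCV` (the model, grid, and data here are illustrative, not PyCaret's actual search space):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold

# Synthetic stand-in data for the transformed train set.
X, y = make_classification(n_samples=200, n_features=8, random_state=42)

# n_iter random draws from the grid, scored by stratified 10-fold CV,
# mirroring tune_model(clf_lgbm, n_iter=10, optimize="Accuracy").
search = RandomizedSearchCV(
    LogisticRegression(max_iter=1000),
    param_distributions={"C": [0.01, 0.1, 1.0, 10.0]},
    n_iter=4,
    scoring="accuracy",
    cv=StratifiedKFold(n_splits=10),
    random_state=42,
)
search.fit(X, y)
print(search.best_params_, round(search.best_score_, 4))
```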

save_model

Saves the trained model.

```python
save_model(tuned_clf_lgbm, "./tuned_clf_lgbm")
```
```
Transformation Pipeline and Model Successfully Saved

(Pipeline(memory=None,
          steps=[('dtypes',
                  DataTypes_Auto_infer(categorical_features=[],
                                       display_types=True, features_todrop=[],
                                       id_columns=[],
                                       ml_usecase='classification',
                                       numerical_features=[], target='Outcome',
                                       time_features=[])),
                 ('imputer',
                  Simple_Imputer(categorical_strategy='not_available',
                                 fill_value_categorical=None,
                                 fill_value_numerical=None,
                                 numeric_stra...
                                 colsample_bytree=1.0, device='gpu',
                                 feature_fraction=1.0, importance_type='split',
                                 learning_rate=0.1, max_depth=-1,
                                 min_child_samples=71, min_child_weight=0.001,
                                 min_split_gain=0.6, n_estimators=130, n_jobs=-1,
                                 num_leaves=4, objective=None, random_state=42,
                                 reg_alpha=0.3, reg_lambda=4, silent='warn',
                                 subsample=1.0, subsample_for_bin=200000,
                                 subsample_freq=0)]],
          verbose=False),
 './tuned_clf_lgbm.pkl')
```
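As the output shows, `save_model` pickles the whole preprocessing pipeline together with the model and appends `.pkl` to the given path. The round-trip behaves like a plain joblib dump/load; a minimal sketch with a toy scikit-learn pipeline standing in for the PyCaret bundle (the step names here are illustrative):

```python
import os
import tempfile

import joblib
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Toy pipeline standing in for PyCaret's preprocessing + model bundle.
pipe = Pipeline([("scale", MinMaxScaler()), ("trained_model", LogisticRegression())])

path = os.path.join(tempfile.mkdtemp(), "tuned_clf_lgbm.pkl")
joblib.dump(pipe, path)       # save_model(...) does roughly this, plus bookkeeping
restored = joblib.load(path)  # load_model(...) reverses it
print(type(restored.named_steps["trained_model"]).__name__)  # LogisticRegression
```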

load_model

```python
clf_lgbm = load_model("./tuned_clf_lgbm")
```

```
Transformation Pipeline and Model Successfully Loaded
```

```python
clf_lgbm["trained_model"]
```
```
LGBMClassifier(bagging_fraction=0.6, bagging_freq=5, boosting_type='gbdt',
               class_weight=None, colsample_bytree=1.0, device='gpu',
               feature_fraction=1.0, importance_type='split', learning_rate=0.1,
               max_depth=-1, min_child_samples=71, min_child_weight=0.001,
               min_split_gain=0.6, n_estimators=130, n_jobs=-1, num_leaves=4,
               objective=None, random_state=42, reg_alpha=0.3, reg_lambda=4,
               silent='warn', subsample=1.0, subsample_for_bin=200000,
               subsample_freq=0)
```

As shown above, the tuned hyperparameters can be inspected.
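The loaded object behaves like a scikit-learn `Pipeline`, so `clf_lgbm["trained_model"]` is ordinary step lookup by name, and `get_params()` exposes the fitted hyperparameters. A sketch with a stand-in pipeline (the step name and parameter value below are illustrative):

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import MinMaxScaler

# Stand-in for the loaded PyCaret pipeline; the real one ends in the LGBM step.
pipe = Pipeline([("scale", MinMaxScaler()), ("trained_model", LogisticRegression(C=0.5))])

# Step lookup by name, as in clf_lgbm["trained_model"] above.
model = pipe["trained_model"]
print(model.get_params()["C"])  # 0.5
```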

This post is licensed under CC BY 4.0 by the author.
