DACON Basic Travel Product Analysis Visualization Competition

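This is just EDA and visualization, but coming up with hypotheses turned out to be really hard..
I spent a lot of time on it and nothing in particular came to mind, so I will probably look at what the other people in my study group bring and add to this afterwards.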

Libraries

  
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import koreanize_matplotlib

Data Load

  
df = pd.read_csv("./train.csv")

df.head()

	id	Age	TypeofContact	CityTier	DurationOfPitch	Occupation	Gender	NumberOfPersonVisiting	NumberOfFollowups	ProductPitched	PreferredPropertyStar	MaritalStatus	NumberOfTrips	Passport	PitchSatisfactionScore	OwnCar	NumberOfChildrenVisiting	Designation	MonthlyIncome	ProdTaken
0	1	28.0	Company Invited	1	10.0	Small Business	Male	3	4.0	Basic	3.0	Married	3.0	0	1	0	1.0	Executive	20384.0	0
1	2	34.0	Self Enquiry	3	NaN	Small Business	Female	2	4.0	Deluxe	4.0	Single	1.0	1	5	1	0.0	Manager	19599.0	1
2	3	45.0	Company Invited	1	NaN	Salaried	Male	2	3.0	Deluxe	4.0	Married	2.0	0	4	1	0.0	Manager	NaN	0
3	4	29.0	Company Invited	1	7.0	Small Business	Male	3	5.0	Basic	4.0	Married	3.0	0	4	0	1.0	Executive	21274.0	1
4	5	42.0	Self Enquiry	3	6.0	Salaried	Male	2	3.0	Deluxe	3.0	Divorced	2.0	0	3	1	0.0	Manager	19907.0	0

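id : sample ID
Age : age
TypeofContact : how the customer learned about the product (company invitation or self enquiry)
CityTier : tier of the city of residence, based on population, facilities, and standard of living (Tier 1 > Tier 2 > Tier 3)
DurationOfPitch : duration of the salesperson's pitch to the customer
Occupation : occupation
Gender : gender
NumberOfPersonVisiting : total number of people planning the trip together with the customer
NumberOfFollowups : number of follow-ups made by the salesperson after the pitch
ProductPitched : product pitched by the salesperson
PreferredPropertyStar : preferred hotel/accommodation star rating
MaritalStatus : marital status
NumberOfTrips : average number of trips per year
Passport : whether the customer holds a passport (0: no, 1: yes)
PitchSatisfactionScore : satisfaction score for the salesperson's pitch
OwnCar : whether the customer owns a car (0: no, 1: yes)
NumberOfChildrenVisiting : number of children under 5 planning the trip together
Designation : job title (rank within the occupation)
MonthlyIncome : monthly income
ProdTaken : whether the travel package was purchased (0: not purchased, 1: purchased)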

Drop id

  
df.drop(columns=["id"], inplace=True)

Checking missing values

  
plt.figure(figsize=(12, 7))
sns.heatmap(df.isna())
plt.show()

Since the goal is visualization, every row with a missing value is dropped

  
df.dropna(inplace=True)
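
Before dropping everything, it can help to see how much each column is missing and how many rows survive. The snippet below is only a sketch on a tiny made-up frame (`toy` and its values are illustrative assumptions, not the competition data); the same calls should work on `df` as loaded above.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv (illustrative values, not real data)
toy = pd.DataFrame({
    "Age": [28.0, 34.0, np.nan, 29.0],
    "DurationOfPitch": [10.0, np.nan, np.nan, 7.0],
    "MonthlyIncome": [20384.0, 19599.0, np.nan, 21274.0],
})

# Count and percentage of missing values per column
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean() * 100,
})
print(missing)

# How many rows remain after dropping any row with a missing value
rows_before = len(toy)
rows_after = len(toy.dropna())
print(f"{rows_before} -> {rows_after} rows")
```

If a single column accounts for most of the loss, dropping only on the columns a given plot needs (`dropna(subset=[...])`) might keep more rows.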

Age

  
plt.figure(figsize=(15, 5))
sns.countplot(data=df, x="Age")
plt.show()

Band Age

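For prediction or classification, binning according to the actual distribution of the data is said to help model performance, but since the goal here is visualization, I ignore the distribution and split ages into the usual decade bands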

  
# Floor each age to its decade and label it, e.g. 28.0 -> "20s"
df["Band Age"] = (df["Age"] // 10 * 10).astype(int).astype(str) + "s"
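
As an alternative sketch of the same decade banding, `pd.cut` makes the bin edges explicit and returns an ordered categorical, so plots can follow age order rather than alphabetical or frequency order. The sample ages, `bins`, and `labels` below are assumptions chosen for illustration.

```python
import pandas as pd

# Illustrative ages only (not taken from the dataset)
ages = pd.Series([18.0, 28.0, 34.0, 45.0, 59.0, 61.0], name="Age")

# Left-closed decade bins: [10, 20), [20, 30), ...
bins = [10, 20, 30, 40, 50, 60, 70]
labels = ["10s", "20s", "30s", "40s", "50s", "60s"]
age_band = pd.cut(ages, bins=bins, labels=labels, right=False)

print(age_band.tolist())
```

Ages outside the listed edges would come back as NaN, so the bin range has to cover the data.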

  
plt.figure(figsize=(10, 5))
plt.subplot(221)
sns.countplot(data=df, y="Band Age", order=df["Band Age"].value_counts().index)
plt.subplot(222)
plt.pie(df["Band Age"].value_counts().values, labels=df["Band Age"].value_counts().index, autopct="%.2f%%")
plt.show()

TypeofContact: how the customer learned about the product

  
plt.figure(figsize=(12, 7))
plt.subplot(221)
sns.countplot(data=df, x="TypeofContact")
plt.subplot(222)
plt.pie(df["TypeofContact"].value_counts().values, labels=df["TypeofContact"].value_counts().index, autopct="%.2f%%")
plt.show()

CityTier: tier of the city of residence

  
sns.countplot(data=df, x="CityTier");

DurationOfPitch: duration of the pitch given to the customer

  
plt.figure(figsize=(15, 4))
sns.countplot(data=df, x="DurationOfPitch");

Gender: gender

Aren't "Fe Male" and "Female" the same thing..?

  
df["Gender"] = df["Gender"].str.replace("Fe Male", "Female")
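
A quick way to confirm that the label cleanup did what was intended is to compare the distinct values before and after. This is a small sketch on a made-up Series (`gender` is an illustrative assumption); it uses an exact-match `Series.replace` rather than the substring-based `str.replace` above.

```python
import pandas as pd

# Illustrative labels only, mimicking the inconsistent spelling
gender = pd.Series(["Male", "Fe Male", "Female", "Male"])
print(gender.value_counts())

# Exact-match replacement of the misspelled label
cleaned = gender.replace({"Fe Male": "Female"})
print(cleaned.value_counts())
```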

  
sns.countplot(data=df, x="Gender");

NumberOfPersonVisiting: total number of people planning the trip together with the customer

  
sns.countplot(data=df, x="NumberOfPersonVisiting");

NumberOfFollowups: number of follow-ups made by the salesperson after the pitch

Not sure exactly what this means

  
sns.countplot(data=df, x="NumberOfFollowups");

ProductPitched: product pitched by the salesperson

  
plt.figure(figsize=(12, 7), facecolor="white")
plt.subplot(221)
plt.pie(df["ProductPitched"].value_counts().values, labels=df["ProductPitched"].value_counts().index, autopct="%.2f%%")
plt.subplot(222)
sns.countplot(data=df, x="ProductPitched")
plt.show()

PreferredPropertyStar: preferred hotel/accommodation star rating

  
plt.figure(figsize=(12, 7), facecolor="white")
plt.subplot(221)
plt.pie(df["PreferredPropertyStar"].value_counts().values, labels=df["PreferredPropertyStar"].value_counts().index, autopct="%.2f%%")
plt.subplot(222)
sns.countplot(data=df, x="PreferredPropertyStar")
plt.show()

MaritalStatus: marital status

  
plt.figure(figsize=(12, 7))
plt.subplot(221)
plt.pie(df["MaritalStatus"].value_counts().values, labels=df["MaritalStatus"].value_counts().index, autopct="%.2f%%")
plt.subplot(222)
sns.countplot(data=df, y="MaritalStatus")
plt.show()

NumberOfTrips: average number of trips per year

  
sns.countplot(data=df, x="NumberOfTrips");

Passport: whether the customer holds a passport (0: no, 1: yes)

  
sns.countplot(data=df, x="Passport");

PitchSatisfactionScore: satisfaction score for the salesperson's pitch

  
plt.figure(figsize=(12, 7))
plt.subplot(221)
plt.pie(df["PitchSatisfactionScore"].value_counts().values, labels=df["PitchSatisfactionScore"].value_counts().index, autopct="%.2f%%")
plt.subplot(222)
sns.countplot(data=df, y="PitchSatisfactionScore")
plt.show()

OwnCar: whether the customer owns a car (0: no, 1: yes)

  
sns.countplot(data=df, x="OwnCar");

NumberOfChildrenVisiting: number of children under 5 planning the trip together

  
sns.countplot(data=df, y="NumberOfChildrenVisiting");

Designation: job title (rank within the occupation)

  
plt.figure(figsize=(12, 7))
plt.subplot(221)
plt.pie(df["Designation"].value_counts().values, labels=df["Designation"].value_counts().index, autopct="%.2f%%")
plt.subplot(222)
sns.countplot(data=df, x="Designation")
plt.show()

MonthlyIncome: monthly income

  
sns.scatterplot(data=df, x="Designation", y="MonthlyIncome");

Visualizing correlations

  
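# id was already dropped in place above, so dropping it again would raise a KeyError;
# select_dtypes keeps only the numeric columns that Pearson correlation can use
df.select_dtypes("number").corr(method="pearson").style.background_gradient()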

	Age	CityTier	DurationOfPitch	NumberOfPersonVisiting	NumberOfFollowups	PreferredPropertyStar	NumberOfTrips	Passport	PitchSatisfactionScore	OwnCar	NumberOfChildrenVisiting	MonthlyIncome	ProdTaken
Age	1.000000	0.007875	0.025779	0.010795	0.009834	-0.026789	0.178143	0.030162	0.032860	0.060298	0.039495	0.440733	-0.135832
CityTier	0.007875	1.000000	0.056010	0.018071	0.023532	-0.011882	-0.020887	0.013665	-0.028168	0.014177	0.025359	0.057705	0.085583
DurationOfPitch	0.025779	0.056010	1.000000	0.096268	0.039485	-0.004448	0.022236	0.043478	0.011926	-0.015087	0.047770	0.016011	0.072899
NumberOfPersonVisiting	0.010795	0.018071	0.096268	1.000000	0.333738	0.017057	0.214895	0.023638	-0.012981	0.018545	0.610193	0.168701	0.006483
NumberOfFollowups	0.009834	0.023532	0.039485	0.333738	1.000000	-0.049151	0.135183	-0.005332	-0.007195	0.051920	0.293942	0.194668	0.105038
PreferredPropertyStar	-0.026789	-0.011882	-0.004448	0.017057	-0.049151	1.000000	0.035064	0.014701	-0.019620	0.031355	0.027038	-0.024338	0.114923
NumberOfTrips	0.178143	-0.020887	0.022236	0.214895	0.135183	0.035064	1.000000	0.004418	0.034816	0.005982	0.189517	0.137093	0.044922
Passport	0.030162	0.013665	0.043478	0.023638	-0.005332	0.014701	0.004418	1.000000	0.018526	-0.045133	0.030512	0.017044	0.293726
PitchSatisfactionScore	0.032860	-0.028168	0.011926	-0.012981	-0.007195	-0.019620	0.034816	0.018526	1.000000	0.073097	0.023842	-0.005497	0.067736
OwnCar	0.060298	0.014177	-0.015087	0.018545	0.051920	0.031355	0.005982	-0.045133	0.073097	1.000000	0.036416	0.109662	-0.040465
NumberOfChildrenVisiting	0.039495	0.025359	0.047770	0.610193	0.293942	0.027038	0.189517	0.030512	0.023842	0.036416	1.000000	0.179255	0.006060
MonthlyIncome	0.440733	0.057705	0.016011	0.168701	0.194668	-0.024338	0.137093	0.017044	-0.005497	0.109662	0.179255	1.000000	-0.140617
ProdTaken	-0.135832	0.085583	0.072899	0.006483	0.105038	0.114923	0.044922	0.293726	0.067736	-0.040465	0.006060	-0.140617	1.000000

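Correlation coefficients are read as follows, using the absolute value |r|:

0.8 <= |r| : strong correlation
0.6 <= |r| < 0.8 : correlation
0.4 <= |r| < 0.6 : weak correlation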
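
The thresholds above can be written as a small helper. This is a hypothetical function of my own (`correlation_strength` and its label strings are not from any library), applied to three coefficients copied from the correlation table above; it compares on the absolute value so that negative correlations are graded the same way.

```python
def correlation_strength(r: float) -> str:
    """Grade a correlation coefficient by its absolute value."""
    magnitude = abs(r)
    if magnitude >= 0.8:
        return "strong"
    if magnitude >= 0.6:
        return "moderate"
    if magnitude >= 0.4:
        return "weak"
    return "below threshold"


# Coefficients taken from the table above
pairs = {
    ("NumberOfPersonVisiting", "NumberOfChildrenVisiting"): 0.610193,
    ("Age", "MonthlyIncome"): 0.440733,
    ("NumberOfPersonVisiting", "NumberOfFollowups"): 0.333738,
}

for (left, right), r in pairs.items():
    print(f"{left} vs {right}: r={r:.3f} ({correlation_strength(r)})")
```

Under these cut-offs only the first two pairs clear the 0.4 bar, which is why they are the natural candidates for hypotheses.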

  
# id is already gone, so only ProdTaken is dropped here; keep numeric columns only
corr = df.drop(columns=["ProdTaken"]).select_dtypes("number").corr(method="pearson")

plt.figure(figsize=(12, 5))

mask = np.zeros_like(corr, dtype=np.bool_)
mask[np.triu_indices_from(mask)] = True
sns.heatmap(corr,
            cmap="RdYlBu_r",
            annot=True,
            mask=mask,
            linewidths=.5,
            cbar_kws={"shrink": .5},
            vmin=-1,
            vmax=1
);

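Setting up hypotheses

Middle-aged and older customers will have higher income (Age-MonthlyIncome)
Trips with more people in total will be the ones with more children (NumberOfPersonVisiting-NumberOfChildrenVisiting)

[Hypothesis] Middle-aged and older customers will have higher income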

  
sns.scatterplot(data=df, x="Age", y="MonthlyIncome");

Income appears to rise from the late 30s and then follows a similar trend at older ages
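
The reading of the scatter plot can be cross-checked numerically by summarizing income per age band. The snippet below is a sketch on made-up rows (`toy` and its values are illustrative assumptions and say nothing about the real data); on the actual `df`, the same `groupby` would give the per-band medians.

```python
import pandas as pd

# Illustrative rows only (not the competition data)
toy = pd.DataFrame({
    "Age": [25.0, 28.0, 36.0, 39.0, 45.0, 52.0],
    "MonthlyIncome": [18000.0, 20000.0, 21000.0, 25000.0, 26000.0, 27000.0],
})

# Decade label for each age, e.g. 36.0 -> "30s"
toy["Band Age"] = (toy["Age"] // 10 * 10).astype(int).astype(str) + "s"

# Number of customers and median income per band
income_by_band = toy.groupby("Band Age")["MonthlyIncome"].agg(["count", "median"])
print(income_by_band)
```

The median is used here rather than the mean so that a few very high incomes do not dominate a band.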

[Hypothesis] Trips with more people in total will be the ones with more children

  
sns.scatterplot(data=df, x="NumberOfChildrenVisiting", y="NumberOfPersonVisiting");
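
Both variables are small integers, so points in a scatter plot land on top of each other and hide how many customers sit at each combination. A count table is one way around that. This is a sketch on made-up rows (`toy` is an illustrative assumption); `pd.crosstab` on the real `df` columns would work the same way.

```python
import pandas as pd

# Illustrative rows only (not the competition data)
toy = pd.DataFrame({
    "NumberOfPersonVisiting": [2, 2, 3, 3, 3, 4, 4],
    "NumberOfChildrenVisiting": [0, 0, 1, 1, 2, 2, 3],
})

# Rows: number of children, columns: total people, cells: number of customers
table = pd.crosstab(toy["NumberOfChildrenVisiting"], toy["NumberOfPersonVisiting"])
print(table)
```

The resulting table could also be passed to `sns.heatmap(table, annot=True)` to see the counts as colors.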
