
Movie Audience Prediction Competition

DACON - Movie Audience Prediction Competition

I was feeling lazy and rushed through this one, so the performance is, well... I did not put much focus into the EDA.

EDA and Preprocessing

Libraries Used

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import koreanize_matplotlib
import re
import glob
import warnings
warnings.filterwarnings("ignore")

from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, accuracy_score

import xgboost as xgb
from xgboost import XGBRegressor

import lightgbm as lgbm
from lightgbm import LGBMRegressor

Data Load

paths = glob.glob("./data/*")
paths
['./data\\movies_test.csv',
 './data\\movies_train.csv',
 './data\\submission.csv']
train, test = pd.read_csv(paths[1]), pd.read_csv(paths[0])

train.shape, test.shape
((600, 12), (243, 11))
display(train.head(3))
display(test.head(3))
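
| | title | distributor | genre | release_time | time | screening_rat | director | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 개들의 전쟁 | 롯데엔터테인먼트 | 액션 | 2012-11-22 | 96 | 청소년 관람불가 | 조병옥 | NaN | 0 | 91 | 2 | 23398 |
| 1 | 내부자들 | (주)쇼박스 | 느와르 | 2015-11-19 | 130 | 청소년 관람불가 | 우민호 | 1161602.50 | 2 | 387 | 3 | 7072501 |
| 2 | 은밀하게 위대하게 | (주)쇼박스 | 액션 | 2013-06-05 | 123 | 15세 관람가 | 장철수 | 220775.25 | 4 | 343 | 4 | 6959083 |

| | title | distributor | genre | release_time | time | screening_rat | director | dir_prev_bfnum | dir_prev_num | num_staff | num_actor |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 용서는 없다 | 시네마서비스 | 느와르 | 2010-01-07 | 125 | 청소년 관람불가 | 김형준 | 3.005290e+05 | 2 | 304 | 3 |
| 1 | 아빠가 여자를 좋아해 | (주)쇼박스 | 멜로/로맨스 | 2010-01-14 | 113 | 12세 관람가 | 이광재 | 3.427002e+05 | 4 | 275 | 3 |
| 2 | 하모니 | CJ 엔터테인먼트 | 드라마 | 2010-01-28 | 115 | 12세 관람가 | 강대규 | 4.206611e+06 | 3 | 419 | 7 |
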
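  • title : title of the film
  • distributor : distributor
  • genre : genre
  • release_time : release date
  • time : running time (minutes)
  • screening_rat : age rating
  • director : name of the director
  • dir_prev_bfnum : average audience of the films the director worked on before this one (excluding films whose audience is unknown)
  • dir_prev_num : number of films the director worked on before this one (excluding films whose audience is unknown)
  • num_staff : number of staff
  • num_actor : number of lead actors
  • box_off_num : audience count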

EDA and Preprocessing

Basic Information

train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 600 entries, 0 to 599
Data columns (total 12 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           600 non-null    object 
 1   distributor     600 non-null    object 
 2   genre           600 non-null    object 
 3   release_time    600 non-null    object 
 4   time            600 non-null    int64  
 5   screening_rat   600 non-null    object 
 6   director        600 non-null    object 
 7   dir_prev_bfnum  270 non-null    float64
 8   dir_prev_num    600 non-null    int64  
 9   num_staff       600 non-null    int64  
 10  num_actor       600 non-null    int64  
 11  box_off_num     600 non-null    int64  
dtypes: float64(1), int64(5), object(6)
memory usage: 56.4+ KB
train.describe().round(2)
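
| | time | dir_prev_bfnum | dir_prev_num | num_staff | num_actor | box_off_num |
|---|---|---|---|---|---|---|
| count | 600.00 | 270.00 | 600.00 | 600.00 | 600.00 | 600.00 |
| mean | 100.86 | 1050442.89 | 0.88 | 151.12 | 3.71 | 708181.75 |
| std | 18.10 | 1791408.30 | 1.18 | 165.65 | 2.45 | 1828005.85 |
| min | 45.00 | 1.00 | 0.00 | 0.00 | 0.00 | 1.00 |
| 25% | 89.00 | 20380.00 | 0.00 | 17.00 | 2.00 | 1297.25 |
| 50% | 100.00 | 478423.62 | 0.00 | 82.50 | 3.00 | 12591.00 |
| 75% | 114.00 | 1286568.62 | 2.00 | 264.00 | 4.00 | 479886.75 |
| max | 180.00 | 17615314.00 | 5.00 | 869.00 | 25.00 | 14262766.00 |
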
test.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 243 entries, 0 to 242
Data columns (total 11 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   title           243 non-null    object 
 1   distributor     243 non-null    object 
 2   genre           243 non-null    object 
 3   release_time    243 non-null    object 
 4   time            243 non-null    int64  
 5   screening_rat   243 non-null    object 
 6   director        243 non-null    object 
 7   dir_prev_bfnum  107 non-null    float64
 8   dir_prev_num    243 non-null    int64  
 9   num_staff       243 non-null    int64  
 10  num_actor       243 non-null    int64  
dtypes: float64(1), int64(4), object(6)
memory usage: 21.0+ KB
test.describe().round(2)
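
| | time | dir_prev_bfnum | dir_prev_num | num_staff | num_actor |
|---|---|---|---|---|---|
| count | 243.00 | 107.00 | 243.00 | 243.00 | 243.00 |
| mean | 109.80 | 891669.52 | 0.85 | 159.32 | 3.48 |
| std | 124.02 | 1217341.45 | 1.20 | 162.98 | 2.11 |
| min | 40.00 | 34.00 | 0.00 | 0.00 | 0.00 |
| 25% | 91.00 | 62502.00 | 0.00 | 18.00 | 2.00 |
| 50% | 104.00 | 493120.00 | 0.00 | 105.00 | 3.00 |
| 75% | 114.50 | 1080849.58 | 1.00 | 282.00 | 4.00 |
| max | 2015.00 | 6173099.50 | 6.00 | 776.00 | 16.00 |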

Checking Missing Values

train.isnull().sum()
title               0
distributor         0
genre               0
release_time        0
time                0
screening_rat       0
director            0
dir_prev_bfnum    330
dir_prev_num        0
num_staff           0
num_actor           0
box_off_num         0
dtype: int64
test.isnull().sum()
title               0
distributor         0
genre               0
release_time        0
time                0
screening_rat       0
director            0
dir_prev_bfnum    136
dir_prev_num        0
num_staff           0
num_actor           0
dtype: int64
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
_ = sns.heatmap(train.isnull(), ax=ax[0]).set_title("Train - Missing")
_ = sns.heatmap(test.isnull(), ax=ax[1]).set_title("Test - Missing")

[Figure: missing-value heatmaps, "Train - Missing" and "Test - Missing"]

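dir_prev_bfnum, the average audience of the films the director worked on before this one, contains missing values.
The values are missing because the audience counts are unknown, so the fact that no information is available could probably be used as information in its own right.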
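One way to keep "no information" as information is an explicit indicator column. The sketch below only illustrates the idea on a toy frame; the column name `dir_prev_missing` is hypothetical and does not appear in the original notebook.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train; the values echo the first three train rows.
toy = pd.DataFrame({"dir_prev_bfnum": [np.nan, 1161602.50, 220775.25]})

# Hypothetical flag column: 1 where the director's prior audience is unknown.
toy["dir_prev_missing"] = toy["dir_prev_bfnum"].isnull().astype(int)

# Fill the gap with 0 afterwards; the flag still records that it was missing.
toy["dir_prev_bfnum"] = toy["dir_prev_bfnum"].fillna(0)

print(toy)
```

With the flag in place, a tree model can tell a director whose history is unknown apart from a value of 0 that was simply filled in.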

distributor: Distributor

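# Strip "(", "주", and ")" so that "(주)쇼박스" becomes "쇼박스" (this removes every "주", not only the "(주)" prefix)
train["distributor"] = train["distributor"].str.replace(r"\(|주|\)", "", regex=True)
test["distributor"] = test["distributor"].str.replace(r"\(|주|\)", "", regex=True)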
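
To see what the alternation pattern matches, the sketch below runs it through Python's `re` module next to a stricter alternative. The stricter pattern is an assumption for comparison, not something the notebook uses, and the second name is a hypothetical string chosen because it contains "주" in the middle.

```python
import re

loose = r"\(|주|\)"   # the alternation used above: "(", "주", or ")" anywhere
strict = r"\(주\)"    # alternative (assumption): only the literal token "(주)"

names = ["(주)쇼박스", "우주필름"]  # the second name is hypothetical

loose_out = [re.sub(loose, "", n) for n in names]
strict_out = [re.sub(strict, "", n) for n in names]

print(loose_out)   # ['쇼박스', '우필름']
print(strict_out)  # ['쇼박스', '우주필름']
```

Whether the difference matters depends on whether any distributor name actually contains "주" outside the "(주)" marker, which the post does not check.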
fig, ax = plt.subplots(1, 2, figsize=(15, 5))
sns.countplot(x=train["distributor"], ax=ax[0]).set_title("Train - distributor")
sns.countplot(x=test["distributor"], ax=ax[1]).set_title("Test - distributor")
plt.show()

[Figure: count plots, "Train - distributor" and "Test - distributor"]

# Keep only letters and digits, using a regular expression
train["distributor"] = [re.sub(r'[^0-9a-zA-Z가-힣]', '', x) for x in train.distributor]
test['distributor'] = [re.sub(r'[^0-9a-zA-Z가-힣]', '', x) for x in test.distributor]
_ = train["distributor"].value_counts().hist()

[Figure: histogram of the number of films per distributor in train]

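# Film count per distributor in train, computed once so later overwrites of the column do not affect it
dist_counts = train["distributor"].value_counts().to_dict()

def distributor_band(name, counts=dist_counts):
    # 15 films or fewer (or unseen in train) -> "소형" (small); otherwise "중대형" (mid/large)
    return "소형" if counts.get(name, 0) <= 15 else "중대형"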
train["distributor"].apply(distributor_band).value_counts()
소형     357
중대형    243
Name: distributor, dtype: int64
train["distributor"] = train["distributor"].apply(distributor_band)
test["distributor"] = test["distributor"].apply(distributor_band)
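
The same banding can be written without `apply`, which makes the handling of distributors that never appear in train explicit. This is a sketch on placeholder data; the helper name `band` and the single-letter distributor names are made up for illustration.

```python
import pandas as pd

train_dist = pd.Series(["A"] * 16 + ["B"] * 3)   # placeholder distributor names
test_dist = pd.Series(["A", "B", "C"])           # "C" never appears in train

counts = train_dist.value_counts()

def band(series, counts, threshold=15):
    # Distributors unseen in train get a count of 0 and fall into the small band.
    n_films = series.map(counts).fillna(0)
    return n_films.gt(threshold).map({True: "중대형", False: "소형"})

test_band = band(test_dist, counts)
print(test_band.tolist())
```

Because the counts always come from train, test rows are banded by how often their distributor appeared in the training data, which matches what the function above does.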
fig, ax = plt.subplots(1, 2, figsize=(12, 5))
ax[0].pie(train["distributor"].value_counts().values, labels=train["distributor"].value_counts().index, autopct="%.2f%%")
ax[1].pie(test["distributor"].value_counts().values, labels=test["distributor"].value_counts().index, autopct="%.2f%%")
plt.show()

[Figure: pie charts of the distributor size bands in train and test]

genre: Genre

train.groupby("genre")["box_off_num"].mean().sort_values()
genre
뮤지컬       6.627000e+03
다큐멘터리     6.717226e+04
서스펜스      8.261100e+04
애니메이션     1.819267e+05
멜로/로맨스    4.259680e+05
미스터리      5.275482e+05
공포        5.908325e+05
드라마       6.256898e+05
코미디       1.193914e+06
SF        1.788346e+06
액션        2.203974e+06
느와르       2.263695e+06
Name: box_off_num, dtype: float64
rank = {'뮤지컬' : 1, '다큐멘터리' : 2, '서스펜스' : 3, '애니메이션' : 4, '멜로/로맨스' : 5, '미스터리' : 6, '공포' : 7, '드라마' : 8, '코미디' : 9, 'SF' : 10, '액션' : 11, '느와르' : 12}
train["rank_genre"] = train["genre"].apply(lambda x: rank[x])
test["rank_genre"] = test["genre"].apply(lambda x: rank[x])
train.drop(columns="genre", axis=1, inplace=True)
test.drop(columns="genre", axis=1, inplace=True)
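
The hand-typed `rank` dictionary follows the order of the mean audience per genre, so it could also be derived from the groupby result. The sketch below shows this on made-up numbers; `genre_rank` is a hypothetical name and the audience values are not from the dataset.

```python
import pandas as pd

toy = pd.DataFrame({
    "genre": ["액션", "액션", "드라마", "뮤지컬"],
    "box_off_num": [300, 100, 50, 10],   # made-up numbers for illustration
})

mean_by_genre = toy.groupby("genre")["box_off_num"].mean()

# rank() assigns 1 to the smallest mean, mirroring the ordering of the dict above.
genre_rank = mean_by_genre.rank(method="first").astype(int).to_dict()

print(genre_rank)
```

Since the ranks are computed from the target, they should be derived from the training rows only and then applied to test.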

release_time: Release date

train["release_time"] = pd.to_datetime(train["release_time"])
test["release_time"] = pd.to_datetime(test["release_time"])
target = [train, test]

for t in target:
    t["year"] = t["release_time"].dt.year
    t["month"] = t["release_time"].dt.month
    t["day"] = t["release_time"].dt.day
    t["dayofweek"] = t["release_time"].dt.dayofweek
train.drop(columns="release_time", axis=1, inplace=True)
test.drop(columns="release_time", axis=1, inplace=True)
date = ["year", "month", "day", "dayofweek"]

fig, axes = plt.subplots(1, len(date), figsize=(24, 7))
for col_name, ax in zip(date, axes):
    sns.countplot(data=train, x=col_name, ax=ax).set_title(col_name)
plt.show()

[Figure: count plots of year, month, day, and dayofweek in train]

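Many films were released on Wednesday and Thursday (dayofweek 2 and 3, since pandas counts Monday as 0), and Thursday releases are especially common.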
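
Reading weekdays off the plot depends on how pandas numbers them. As a quick check, the sketch below converts the three release dates from the first train rows and prints both the numeric code and the weekday name.

```python
import pandas as pd

# Release dates of the first three train rows.
dates = pd.to_datetime(pd.Series(["2012-11-22", "2015-11-19", "2013-06-05"]))

codes = dates.dt.dayofweek.tolist()   # Monday=0 ... Sunday=6
names = dates.dt.day_name().tolist()

print(list(zip(codes, names)))
```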

Train

train.drop(columns=["title", "director"], axis=1, inplace=True)
test.drop(columns=["title", "director"], axis=1, inplace=True)
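# Assign the result back; fillna(inplace=True) on a selected column triggers a chained-assignment warning in recent pandas
train["dir_prev_bfnum"] = train["dir_prev_bfnum"].fillna(0)
test["dir_prev_bfnum"] = test["dir_prev_bfnum"].fillna(0)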
train = pd.get_dummies(train, columns=["distributor", "screening_rat"])
test = pd.get_dummies(test, columns=["distributor", "screening_rat"])
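
Encoding train and test separately only works when both contain the same categories. Here the shapes happen to match, but as a precaution the test columns can be aligned to train. The sketch below uses toy frames, not the real data.

```python
import pandas as pd

train_toy = pd.DataFrame({"screening_rat": ["12세 관람가", "15세 관람가", "전체 관람가"]})
test_toy = pd.DataFrame({"screening_rat": ["12세 관람가", "12세 관람가"]})

train_d = pd.get_dummies(train_toy, columns=["screening_rat"])
test_d = pd.get_dummies(test_toy, columns=["screening_rat"])

# Reindex test to train's columns; categories absent from test become all-zero columns.
test_d = test_d.reindex(columns=train_d.columns, fill_value=0)

print(test_d.columns.tolist())
```

On the real frames the label column would have to be excluded before reindexing, since test has no `box_off_num`.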
label = "box_off_num"
features = train.columns.tolist()
features.remove(label)
features
['time',
 'dir_prev_bfnum',
 'dir_prev_num',
 'num_staff',
 'num_actor',
 'rank_genre',
 'year',
 'month',
 'day',
 'dayofweek',
 'distributor_소형',
 'distributor_중대형',
 'screening_rat_12세 관람가',
 'screening_rat_15세 관람가',
 'screening_rat_전체 관람가',
 'screening_rat_청소년 관람불가']

Log Scale

train["num_actor"] = np.log1p(train["num_actor"])
test["num_actor"] = np.log1p(test["num_actor"])
X_train, X_val, y_train, y_val = train_test_split(train[features], train[label], test_size=0.15)

print(f"X_train: {X_train.shape}\ny_train: {y_train.shape}\nX_val: {X_val.shape}\ny_val: {y_val.shape}")
X_train: (510, 16)
y_train: (510,)
X_val: (90, 16)
y_val: (90,)
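
The split above does not fix a seed, so the validation rows, and with them the RMSE values, change from run to run. A minimal sketch of a reproducible split on toy arrays follows; `random_state=42` is an arbitrary choice, not a value from the post.

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(40).reshape(20, 2)
y = np.arange(20)

# The same random_state yields the same partition on every call.
split_a = train_test_split(X, y, test_size=0.25, random_state=42)
split_b = train_test_split(X, y, test_size=0.25, random_state=42)

same = all(np.array_equal(a, b) for a, b in zip(split_a, split_b))
print(same, split_a[0].shape, split_a[1].shape)
```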
test.shape
(243, 16)

Random Forest

reg_rf = RandomForestRegressor()

pred_rf = reg_rf.fit(X_train, y_train).predict(X_val)

print(f"rmse: {np.sqrt(mean_squared_error(y_val, pred_rf))}")

rmse: 1377636.6138653848
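
The score printed here is the root mean squared error: the square root of the mean of the squared differences between actual and predicted values. A small NumPy version, equivalent to `np.sqrt(mean_squared_error(...))` for one-dimensional inputs, is sketched below with toy numbers.

```python
import numpy as np

def rmse(y_true, y_pred):
    # Square root of the mean squared difference between targets and predictions.
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

score = rmse([0, 0, 0, 0], [3, 3, 3, 3])
print(score)  # 3.0
```

Because the errors are squared, a few blockbusters with audiences in the millions dominate the score, which is one reason the values in this post are so large.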
_ = sns.barplot(x=reg_rf.feature_importances_, y=reg_rf.feature_names_in_)

[Figure: Random Forest feature importances]

XGBoost

reg_xgb = XGBRegressor()

pred_xgb = reg_xgb.fit(X_train, y_train).predict(X_val)

print(f"rmse: {np.sqrt(mean_squared_error(y_val, pred_xgb))}")

rmse: 1444161.2032999645
_ = xgb.plot_importance(reg_xgb)

[Figure: XGBoost feature importances]

LightGBM

reg_lgbm = LGBMRegressor()

pred_lgbm = reg_lgbm.fit(X_train, y_train).predict(X_val)

print(f"rmse: {np.sqrt(mean_squared_error(y_val, pred_lgbm))}")

rmse: 1444161.2032999645
_ = lgbm.plot_importance(reg_lgbm)

[Figure: LightGBM feature importances]

Submit

sub = pd.read_csv(paths[2])
sub.head()

| | title | box_off_num |
|---|---|---|
| 0 | 용서는 없다 | 0 |
| 1 | 아빠가 여자를 좋아해 | 0 |
| 2 | 하모니 | 0 |
| 3 | 의형제 | 0 |
| 4 | 평행 이론 | 0 |

pred = (0.6*reg_rf.predict(X_val)) + (0.2*reg_xgb.predict(X_val)) + (0.2*reg_lgbm.predict(X_val))
print(f"rmse: {np.sqrt(mean_squared_error(y_val, pred))}")
rmse: 1368430.7286695272
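
The blend above is a fixed weighted average of the three models' predictions (0.6, 0.2, 0.2). A small helper makes the weights explicit and checks that they sum to 1. The name `blend` and the prediction values are placeholders, not part of the notebook.

```python
import numpy as np

def blend(preds, weights):
    preds = np.asarray(preds, dtype=float)      # shape: (n_models, n_samples)
    weights = np.asarray(weights, dtype=float)  # shape: (n_models,)
    if not np.isclose(weights.sum(), 1.0):
        raise ValueError("weights should sum to 1")
    # Weighted sum over the model axis gives one prediction per sample.
    return weights @ preds

# Placeholder predictions from three models for two samples.
blended = blend([[100.0, 200.0], [200.0, 100.0], [300.0, 300.0]], [0.6, 0.2, 0.2])
print(blended)
```

The weights here were chosen by hand. Choosing them against the same validation split that reports the score tends to make that score optimistic.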
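# Select the test columns in the same order the models were trained on
pred = (0.6*reg_rf.predict(test[features])) + (0.2*reg_xgb.predict(test[features])) + (0.2*reg_lgbm.predict(test[features]))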
sub["box_off_num"] = pred
sub.to_csv("sub_rmse_1368.csv", index=False)
This post is licensed under CC BY 4.0 by the author.
