
Kaggle - New York City Taxi Trip Duration


EDA and Preprocessing

Data

Kaggle - New York City Taxi Trip Duration

Libraries Used

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import glob
import warnings
warnings.filterwarnings("ignore")

from sklearn.metrics import mean_squared_log_error
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestRegressor

import xgboost as xgb
from xgboost import XGBRegressor
import lightgbm as lgbm
from lightgbm import LGBMRegressor

Data Load

path = glob.glob("data/*")
path
['data\\pre_test.csv',
 'data\\pre_train.csv',
 'data\\sample_submission.zip',
 'data\\test.zip',
 'data\\train.zip']
train, test = pd.read_csv(path[-1]), pd.read_csv(path[-2])

train.shape, test.shape
((1458644, 11), (625134, 9))
display(train.head())
display(test.head())
| | id | vendor_id | pickup_datetime | dropoff_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag | trip_duration |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | id2875421 | 2 | 2016-03-14 17:24:55 | 2016-03-14 17:32:30 | 1 | -73.982155 | 40.767937 | -73.964630 | 40.765602 | N | 455 |
| 1 | id2377394 | 1 | 2016-06-12 00:43:35 | 2016-06-12 00:54:38 | 1 | -73.980415 | 40.738564 | -73.999481 | 40.731152 | N | 663 |
| 2 | id3858529 | 2 | 2016-01-19 11:35:24 | 2016-01-19 12:10:48 | 1 | -73.979027 | 40.763939 | -74.005333 | 40.710087 | N | 2124 |
| 3 | id3504673 | 2 | 2016-04-06 19:32:31 | 2016-04-06 19:39:40 | 1 | -74.010040 | 40.719971 | -74.012268 | 40.706718 | N | 429 |
| 4 | id2181028 | 2 | 2016-03-26 13:30:55 | 2016-03-26 13:38:10 | 1 | -73.973053 | 40.793209 | -73.972923 | 40.782520 | N | 435 |
| | id | vendor_id | pickup_datetime | passenger_count | pickup_longitude | pickup_latitude | dropoff_longitude | dropoff_latitude | store_and_fwd_flag |
|---|---|---|---|---|---|---|---|---|---|
| 0 | id3004672 | 1 | 2016-06-30 23:59:58 | 1 | -73.988129 | 40.732029 | -73.990173 | 40.756680 | N |
| 1 | id3505355 | 1 | 2016-06-30 23:59:53 | 1 | -73.964203 | 40.679993 | -73.959808 | 40.655403 | N |
| 2 | id1217141 | 1 | 2016-06-30 23:59:47 | 1 | -73.997437 | 40.737583 | -73.986160 | 40.729523 | N |
| 3 | id2150126 | 2 | 2016-06-30 23:59:41 | 1 | -73.956070 | 40.771900 | -73.986427 | 40.730469 | N |
| 4 | id1598245 | 1 | 2016-06-30 23:59:33 | 1 | -73.970215 | 40.761475 | -73.961510 | 40.755890 | N |
  • vendor_id: the company that provided the record
  • store_and_fwd_flag: whether the trip record was stored in the vehicle and forwarded to the vendor later, rather than sent immediately

EDA and Preprocessing

Basic Information

display(train.dtypes)
display(test.dtypes)
id                     object
vendor_id               int64
pickup_datetime        object
dropoff_datetime       object
passenger_count         int64
pickup_longitude      float64
pickup_latitude       float64
dropoff_longitude     float64
dropoff_latitude      float64
store_and_fwd_flag     object
trip_duration           int64
dtype: object



id                     object
vendor_id               int64
pickup_datetime        object
passenger_count         int64
pickup_longitude      float64
pickup_latitude       float64
dropoff_longitude     float64
dropoff_latitude      float64
store_and_fwd_flag     object
dtype: object

Missing Values

fig, ax = plt.subplots(1, 2, figsize=(12, 7))
sns.heatmap(train.isnull(), ax=ax[0]).set_title("Train - Missing")
sns.heatmap(test.isnull(), ax=ax[1]).set_title("Test - Missing")
plt.show()

[Figure: missing-value heatmaps for Train and Test]

The data has no missing values.
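
As a quick numeric cross-check of the heatmaps (a small addition, not in the original notebook), the null flags can be summed directly; both totals should come out to 0 here:

train.isnull().sum().sum(), test.isnull().sum().sum()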

Duplicates

train[train.duplicated()]
(empty result: no rows, only the column headers id, vendor_id, pickup_datetime, dropoff_datetime, passenger_count, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, store_and_fwd_flag, trip_duration)
test[test.duplicated()]
(empty result: no rows, only the column headers id, vendor_id, pickup_datetime, passenger_count, pickup_longitude, pickup_latitude, dropoff_longitude, dropoff_latitude, store_and_fwd_flag)

No duplicate rows either.

vendor_id

fig, ax = plt.subplots(1, 2, figsize=(12, 5))
sns.countplot(x=train["vendor_id"], ax=ax[0]).set_title("vendor_id (Train)")
sns.countplot(x=test["vendor_id"], ax=ax[1]).set_title("vendor_id (Test)")
plt.show()

[Figure: vendor_id count plots for Train and Test]

pickup_datetime / dropoff_datetime

train["pickup_datetime"] = pd.to_datetime(train["pickup_datetime"])
train["dropoff_datetime"] = pd.to_datetime(train["dropoff_datetime"])

test["pickup_datetime"] = pd.to_datetime(test["pickup_datetime"])
# train - pickup
train["p_year"] = train["pickup_datetime"].dt.year
train["p_month"] = train["pickup_datetime"].dt.month
train["p_day"] = train["pickup_datetime"].dt.day
train["p_dow"] = train["pickup_datetime"].dt.dayofweek
train["p_hour"] = train["pickup_datetime"].dt.hour
train["p_min"] = train["pickup_datetime"].dt.minute
# # train - dropoff
# train["d_year"] = train["dropoff_datetime"].dt.year
# train["d_month"] = train["dropoff_datetime"].dt.month
# train["d_day"] = train["dropoff_datetime"].dt.day
# train["d_dow"] = train["dropoff_datetime"].dt.dayofweek
# train["d_hour"] = train["dropoff_datetime"].dt.hour
# train["d_min"] = train["dropoff_datetime"].dt.minute

dropoff_datetime is not present in the test set, so it is not used.
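
Since the target is simply the gap between the two timestamps, a quick sanity check (a sketch, not part of the original notebook) compares trip_duration against the pickup/dropoff difference in seconds; the fraction of matching rows should be close to 1:

diff_sec = (train["dropoff_datetime"] - train["pickup_datetime"]).dt.total_seconds()
(diff_sec == train["trip_duration"]).mean()  # share of rows where the timestamps explain the target exactly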

# test - pickup
test["p_year"] = test["pickup_datetime"].dt.year
test["p_month"] = test["pickup_datetime"].dt.month
test["p_day"] = test["pickup_datetime"].dt.day
test["p_dow"] = test["pickup_datetime"].dt.dayofweek
test["p_hour"] = test["pickup_datetime"].dt.hour
test["p_min"] = test["pickup_datetime"].dt.minute
date = ["p_year", "p_month", "p_day", "p_dow", "p_hour", "p_min"]
fig, axes = plt.subplots(1, len(date), figsize=(30, 5))
for col, ax in zip(date, axes):
    sns.countplot(x=train[col], ax=ax).set_title(f"{col} - Train")
plt.show()

[Figure: count plots of p_year, p_month, p_day, p_dow, p_hour, p_min (Train)]

fig, axes = plt.subplots(1, len(date), figsize=(30, 5))
for col, ax in zip(date, axes):
    sns.countplot(x=test[col], ax=ax).set_title(f"{col} - Test")
plt.show()

[Figure: count plots of p_year, p_month, p_day, p_dow, p_hour, p_min (Test)]

The data covers the first half of 2016 (January through June).
Since the task is to predict the trip duration, it makes sense that the test set has no dropoff_datetime.
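
The date range can be verified directly from the timestamps (a minimal check, not in the original notebook):

print(train["pickup_datetime"].min(), train["pickup_datetime"].max())
print(test["pickup_datetime"].min(), test["pickup_datetime"].max())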

longitude / latitude

cor = ["pickup_longitude", "pickup_latitude", "dropoff_longitude", "dropoff_latitude"]

Haversine formula

$ \operatorname{haversine}(\theta) = \sin^2\frac{\theta}{2} $

$ a = \sin^2\!\left(\frac{\varphi_B - \varphi_A}{2}\right) + \cos\varphi_A \cdot \cos\varphi_B \cdot \sin^2\!\left(\frac{\lambda_B - \lambda_A}{2}\right) $
$ c = 2 \cdot \operatorname{atan2}\!\left(\sqrt{a},\ \sqrt{1-a}\right) $
$ d = R \cdot c $

where $d$ is the Haversine distance, $R$ is the Earth's radius, $\varphi$ is latitude, and $\lambda$ is longitude.

def haversine_distance(lat1, long1, lat2, long2):
    data = [train, test]
    for df in data:
        R = 6371  # Earth's radius in km
        phi1 = np.radians(df[lat1])
        phi2 = np.radians(df[lat2])

        delta_phi = np.radians(df[lat2] - df[lat1])
        delta_lambda = np.radians(df[long2] - df[long1])

        a = np.sin(delta_phi / 2.0) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2.0) ** 2
        c = 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))
        d = R * c
        df["H_Distance"] = d
    return d
haversine_distance('pickup_latitude', 'pickup_longitude', 'dropoff_latitude', 'dropoff_longitude')
0          2.741019
1          2.734232
2          0.896282
3          4.606964
4          0.620992
            ...    
625129     0.949304
625130     4.301558
625131     1.245378
625132    17.593930
625133     5.837496
Length: 625134, dtype: float64
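
Note that haversine_distance writes H_Distance into the global train and test and returns only the Series computed last. A more reusable variant (a sketch with the hypothetical name haversine_km, not the original code) takes a DataFrame and returns the distances without side effects:

def haversine_km(df,
                 lat1="pickup_latitude", long1="pickup_longitude",
                 lat2="dropoff_latitude", long2="dropoff_longitude"):
    # Same Haversine formula, applied to one DataFrame and returned as a Series
    R = 6371  # Earth's radius in km
    phi1, phi2 = np.radians(df[lat1]), np.radians(df[lat2])
    delta_phi = np.radians(df[lat2] - df[lat1])
    delta_lambda = np.radians(df[long2] - df[long1])
    a = np.sin(delta_phi / 2) ** 2 + np.cos(phi1) * np.cos(phi2) * np.sin(delta_lambda / 2) ** 2
    return R * 2 * np.arctan2(np.sqrt(a), np.sqrt(1 - a))

train["H_Distance"] = haversine_km(train)
test["H_Distance"] = haversine_km(test)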

passenger_count

_ = (train["passenger_count"].value_counts().sort_index()).plot.bar()

[Figure: bar plot of passenger_count value counts]

train["passenger_count"].value_counts().sort_index()
0         60
1    1033540
2     210318
3      59896
4      28404
5      78088
6      48333
7          3
8          1
9          1
Name: passenger_count, dtype: int64

Trips with 0, 7, 8, or 9 passengers are outliers.
Five or more passengers also looks a little unusual, but those are probably larger (van-type) taxis.

cond1 = train["passenger_count"]==0
cond2 = train["passenger_count"]==7
cond3 = train["passenger_count"]==8
cond4 = train["passenger_count"]==9

cond = cond1 | cond2 | cond3 | cond4
train = train[~cond]
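
The same filter can be written more compactly with isin (an equivalent alternative, shown only as a sketch):

cond = train["passenger_count"].isin([0, 7, 8, 9])
train = train[~cond]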

store_and_fwd_flag

_ = sns.countplot(x=train["store_and_fwd_flag"])

[Figure: count plot of store_and_fwd_flag]

Model - Regressor

Train Data Split

label = "trip_duration"
features = train.columns.tolist()
features.remove(label)
features.remove("id")
features.remove("pickup_datetime")
features.remove("dropoff_datetime")
features
['vendor_id',
 'passenger_count',
 'pickup_longitude',
 'pickup_latitude',
 'dropoff_longitude',
 'dropoff_latitude',
 'store_and_fwd_flag',
 'p_year',
 'p_month',
 'p_day',
 'p_dow',
 'p_hour',
 'p_min',
 'H_Distance']
train["store_and_fwd_flag"] = pd.get_dummies(train["store_and_fwd_flag"], drop_first=True)
test["store_and_fwd_flag"] = pd.get_dummies(test["store_and_fwd_flag"], drop_first=True)
X_train, X_test, y_train, y_test = train_test_split(train[features], train[label], test_size=0.3)

print(f"X_train: {X_train.shape}\ny_train: {y_train.shape}\nX_test: {X_test.shape}\ny_test: {y_test.shape}")
X_train: (1021005, 14)
y_train: (1021005,)
X_test: (437574, 14)
y_test: (437574,)

Random Forest

reg_rf = RandomForestRegressor()

pred_rf = reg_rf.fit(X_train, y_train).predict(X_test)

mean_squared_log_error(y_test, pred_rf)
0.3483171563954924
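
The competition leaderboard is scored with RMSLE, i.e. the square root of the MSLE reported above (about 0.59 here); a minimal conversion sketch, not in the original notebook:

rmsle_rf = np.sqrt(mean_squared_log_error(y_test, pred_rf))
rmsle_rf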
_ = sns.barplot(x=reg_rf.feature_importances_, y=reg_rf.feature_names_in_)

[Figure: Random Forest feature importances]

XGBoost

reg_xgb = XGBRegressor()

pred_xgb = reg_xgb.fit(X_train, y_train).predict(X_test)

# abs(): mean_squared_log_error cannot take negative predictions
mean_squared_log_error(y_test, abs(pred_xgb))
0.3297185297357314
_ = xgb.plot_importance(reg_xgb)

[Figure: XGBoost feature importances (plot_importance)]

LGBM

reg_lgbm = LGBMRegressor()

pred_lgbm = reg_lgbm.fit(X_train, y_train).predict(X_test)

mean_squared_log_error(y_test, abs(pred_lgbm))
0.3620231060763725
_ = lgbm.plot_importance(reg_lgbm)

[Figure: LightGBM feature importances (plot_importance)]

This post is licensed under CC BY 4.0 by the author.
