Home 판다스 요약 정리
Post
Cancel

판다스 요약 정리

Pandas Summary

모듈

1
2
3
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

1. 객체 생성 (Object Creation)

Pandas는 기본적으로 2개의 데이터 형식을 가지고 있음

  1. Series
    • 벡터 형식의 데이터
  2. DataFrame
    • 행렬 형식의 데이터

Series와 DataFrame

Pandas는 값을 가지고 있는 리스트를 통해 Series를 만들고, 정수 인덱스를 기본으로 제공

1
2
s = pd.Series([1, 3, 5, np.nan, 6, 8])
print(s)
1
2
3
4
5
6
7
0    1.0
1    3.0
2    5.0
3    NaN
4    6.0
5    8.0
dtype: float64
1
2
dates = pd.date_range("20220513", periods=6)
print(dates)
1
2
3
DatetimeIndex(['2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16',
               '2022-05-17', '2022-05-18'],
              dtype='datetime64[ns]', freq='D')

DataFrame은 행과 열을 가진 데이터 형식으로, 인덱스와 레이블이 있는 열을 가지고 있는 np 배열을 전달해 생성 할 수 있다.

1
2
df = pd.DataFrame(np.random.randn(6,4), index=dates, columns=list("ABCD"))
print(df)
1
2
3
4
5
6
7
                   A         B         C         D
2022-05-13 -0.364352 -1.806739 -0.605339 -0.259275
2022-05-14  1.792022 -0.109226 -0.782558  0.959711
2022-05-15 -0.035533 -1.695418 -0.262033 -0.739229
2022-05-16 -0.938873  0.246531 -1.118899 -0.017989
2022-05-17  0.335735 -0.935445 -0.217757  0.198702
2022-05-18 -0.099656 -0.011014  0.577307 -1.973498
1
2
3
4
5
6
7
df2 = pd.DataFrame({'A': 1.,
              'B': pd.Timestamp("20220513"),
              'C': pd.Series(1, index=list(range(4)), dtype='float32'),
              'D': np.array([3]*4, dtype='int32'),
              'E': pd.Categorical(["test", "train", "test", "train"]),
              'F': "foo" })
print(df2)
1
2
3
4
5
     A          B    C  D      E    F
0  1.0 2022-05-13  1.0  3   test  foo
1  1.0 2022-05-13  1.0  3  train  foo
2  1.0 2022-05-13  1.0  3   test  foo
3  1.0 2022-05-13  1.0  3  train  foo
  • pd.Timestamp()
    • para: year, month, day, hour, minute, second
    • 하나의 문자열로 넣어도 변환되는거 같음
1
print(df2.dtypes)
1
2
3
4
5
6
7
A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object

2. 데이터 확인하기 (Viewing Data)

tail과 head

tailhead 함수의 경우 인자로 숫자를 넣지 않을 경우, 기본값인 5로 처리됨

1
print(df.tail())
1
2
3
4
5
6
                   A         B         C         D
2022-05-14  1.792022 -0.109226 -0.782558  0.959711
2022-05-15 -0.035533 -1.695418 -0.262033 -0.739229
2022-05-16 -0.938873  0.246531 -1.118899 -0.017989
2022-05-17  0.335735 -0.935445 -0.217757  0.198702
2022-05-18 -0.099656 -0.011014  0.577307 -1.973498
1
print(df.head())
1
2
3
4
5
6
                   A         B         C         D
2022-05-13 -0.364352 -1.806739 -0.605339 -0.259275
2022-05-14  1.792022 -0.109226 -0.782558  0.959711
2022-05-15 -0.035533 -1.695418 -0.262033 -0.739229
2022-05-16 -0.938873  0.246531 -1.118899 -0.017989
2022-05-17  0.335735 -0.935445 -0.217757  0.198702

인덱스(index), 열(column), numpy 데이터에 대한 세부 정보

1
print(df.index)
1
2
3
DatetimeIndex(['2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16',
               '2022-05-17', '2022-05-18'],
              dtype='datetime64[ns]', freq='D')
1
print(df.columns)
1
Index(['A', 'B', 'C', 'D'], dtype='object')
1
print(df.values)
1
2
3
4
5
6
[[-0.36435205 -1.80673941 -0.60533949 -0.25927545]
 [ 1.79202234 -0.10922598 -0.78255824  0.95971119]
 [-0.03553265 -1.69541806 -0.26203292 -0.73922933]
 [-0.93887282  0.24653065 -1.11889869 -0.01798904]
 [ 0.33573461 -0.9354455  -0.21775696  0.198702  ]
 [-0.09965588 -0.01101404  0.5773068  -1.97349815]]

대략적인 통계적 정보 요약

1
print(df.describe())
1
2
3
4
5
6
7
8
9
              A         B         C         D
count  6.000000  6.000000  6.000000  6.000000
mean   0.114891 -0.718552 -0.401547 -0.305263
std    0.924153  0.893167  0.585268  0.990971
min   -0.938873 -1.806739 -1.118899 -1.973498
25%   -0.298178 -1.505425 -0.738254 -0.619241
50%   -0.067594 -0.522336 -0.433686 -0.138632
75%    0.242918 -0.035567 -0.228826  0.144529
max    1.792022  0.246531  0.577307  0.959711

데이터 전치

1
print(df.T)
1
2
3
4
5
   2022-05-13  2022-05-14  2022-05-15  2022-05-16  2022-05-17  2022-05-18
A   -0.364352    1.792022   -0.035533   -0.938873    0.335735   -0.099656
B   -1.806739   -0.109226   -1.695418    0.246531   -0.935445   -0.011014
C   -0.605339   -0.782558   -0.262033   -1.118899   -0.217757    0.577307
D   -0.259275    0.959711   -0.739229   -0.017989    0.198702   -1.973498

축 별로 정렬

1
df.sort_index(axis=0, ascending=False)
ABCD
2022-05-18-0.099656-0.0110140.577307-1.973498
2022-05-170.335735-0.935445-0.2177570.198702
2022-05-16-0.9388730.246531-1.118899-0.017989
2022-05-15-0.035533-1.695418-0.262033-0.739229
2022-05-141.792022-0.109226-0.7825580.959711
2022-05-13-0.364352-1.806739-0.605339-0.259275
1
df.sort_index(axis=1, ascending=False)
DCBA
2022-05-13-0.259275-0.605339-1.806739-0.364352
2022-05-140.959711-0.782558-0.1092261.792022
2022-05-15-0.739229-0.262033-1.695418-0.035533
2022-05-16-0.017989-1.1188990.246531-0.938873
2022-05-170.198702-0.217757-0.9354450.335735
2022-05-18-1.9734980.577307-0.011014-0.099656

값 별로 정렬

1
df.sort_values(by='B')
ABCD
2022-05-13-0.364352-1.806739-0.605339-0.259275
2022-05-15-0.035533-1.695418-0.262033-0.739229
2022-05-170.335735-0.935445-0.2177570.198702
2022-05-141.792022-0.109226-0.7825580.959711
2022-05-18-0.099656-0.0110140.577307-1.973498
2022-05-16-0.9388730.246531-1.118899-0.017989

3. 선택 (Selection)

데이터 얻기 (Getting)

1
df
ABCD
2022-05-13-0.364352-1.806739-0.605339-0.259275
2022-05-141.792022-0.109226-0.7825580.959711
2022-05-15-0.035533-1.695418-0.262033-0.739229
2022-05-16-0.9388730.246531-1.118899-0.017989
2022-05-170.335735-0.935445-0.2177570.198702
2022-05-18-0.099656-0.0110140.577307-1.973498
1
df['A']
1
2
3
4
5
6
7
2022-05-13   -0.364352
2022-05-14    1.792022
2022-05-15   -0.035533
2022-05-16   -0.938873
2022-05-17    0.335735
2022-05-18   -0.099656
Freq: D, Name: A, dtype: float64

행을 분할 할수도 있음

1
df[0:3]
ABCD
2022-05-13-0.364352-1.806739-0.605339-0.259275
2022-05-141.792022-0.109226-0.7825580.959711
2022-05-15-0.035533-1.695418-0.262033-0.739229
1
df['2022-05-13':'2022-05-15']
ABCD
2022-05-13-0.364352-1.806739-0.605339-0.259275
2022-05-141.792022-0.109226-0.7825580.959711
2022-05-15-0.035533-1.695418-0.262033-0.739229

라벨을 통한 선택

1
print(dates)
1
2
3
DatetimeIndex(['2022-05-13', '2022-05-14', '2022-05-15', '2022-05-16',
               '2022-05-17', '2022-05-18'],
              dtype='datetime64[ns]', freq='D')
1
df.loc[dates[0]]
1
2
3
4
5
A   -0.364352
B   -1.806739
C   -0.605339
D   -0.259275
Name: 2022-05-13 00:00:00, dtype: float64
1
df.loc[:,['A','B']]
AB
2022-05-13-0.364352-1.806739
2022-05-141.792022-0.109226
2022-05-15-0.035533-1.695418
2022-05-16-0.9388730.246531
2022-05-170.335735-0.935445
2022-05-18-0.099656-0.011014
1
df.loc['2022-05-14':'2022-05-17', ['B','D']]
BD
2022-05-14-0.1092260.959711
2022-05-15-1.695418-0.739229
2022-05-160.246531-0.017989
2022-05-17-0.9354450.198702
1
df.loc[dates[0], 'A']
1
-0.3643520548458078

위치를 통한 선택

1
df.iloc[3]
1
2
3
4
5
A   -0.938873
B    0.246531
C   -1.118899
D   -0.017989
Name: 2022-05-16 00:00:00, dtype: float64
1
df.iloc[3:5, 0:2]
AB
2022-05-16-0.9388730.246531
2022-05-170.335735-0.935445
1
df.iloc[[1,2,4],[0,2]]
AC
2022-05-141.792022-0.782558
2022-05-15-0.035533-0.262033
2022-05-170.335735-0.217757
  • loc: 인덱스 명칭을 기준으로 값을 색인
  • iloc: 인덱스 순서를 기준으로 값을 색인, range 값을 기준으로

df.set_index[키워드] 형식으로 인덱스를 설정할 수 있음

특정한 값 얻기

1
df.iloc[1,1]
1
-0.10922598205957576

Boolean Indexing

1
df[df['A']>0]
ABCD
2022-05-141.792022-0.109226-0.7825580.959711
2022-05-170.335735-0.935445-0.2177570.198702
1
df[df>0]
ABCD
2022-05-13NaNNaNNaNNaN
2022-05-141.792022NaNNaN0.959711
2022-05-15NaNNaNNaNNaN
2022-05-16NaN0.246531NaNNaN
2022-05-170.335735NaNNaN0.198702
2022-05-18NaNNaN0.577307NaN
1
2
3
df2 = df.copy()
df2['E'] = ["one", "one", "two", "three", "four", "three"]
print(df2)
1
2
3
4
5
6
7
                   A         B         C         D      E
2022-05-13 -0.364352 -1.806739 -0.605339 -0.259275    one
2022-05-14  1.792022 -0.109226 -0.782558  0.959711    one
2022-05-15 -0.035533 -1.695418 -0.262033 -0.739229    two
2022-05-16 -0.938873  0.246531 -1.118899 -0.017989  three
2022-05-17  0.335735 -0.935445 -0.217757  0.198702   four
2022-05-18 -0.099656 -0.011014  0.577307 -1.973498  three
1
df2[df2['E'].isin(["two", "four"])]
ABCDE
2022-05-15-0.035533-1.695418-0.262033-0.739229two
2022-05-170.335735-0.935445-0.2177570.198702four

Setting

새 열을 설정하면 데이터가 인덱스 별로 자동 정렬

1
2
s1 = pd.Series([1, 2, 3, 4, 5, 6], index=pd.date_range("20220513", periods=6))
print(s1)
1
2
3
4
5
6
7
2022-05-13    1
2022-05-14    2
2022-05-15    3
2022-05-16    4
2022-05-17    5
2022-05-18    6
Freq: D, dtype: int64
1
2
df['F'] = s1
print(df)
1
2
3
4
5
6
7
                   A         B         C         D  F
2022-05-13 -0.364352 -1.806739 -0.605339 -0.259275  1
2022-05-14  1.792022 -0.109226 -0.782558  0.959711  2
2022-05-15 -0.035533 -1.695418 -0.262033 -0.739229  3
2022-05-16 -0.938873  0.246531 -1.118899 -0.017989  4
2022-05-17  0.335735 -0.935445 -0.217757  0.198702  5
2022-05-18 -0.099656 -0.011014  0.577307 -1.973498  6
1
2
df.at[dates[0], 'A'] = 0
print(df)
1
2
3
4
5
6
7
                   A         B         C         D  F
2022-05-13  0.000000 -1.806739 -0.605339 -0.259275  1
2022-05-14  1.792022 -0.109226 -0.782558  0.959711  2
2022-05-15 -0.035533 -1.695418 -0.262033 -0.739229  3
2022-05-16 -0.938873  0.246531 -1.118899 -0.017989  4
2022-05-17  0.335735 -0.935445 -0.217757  0.198702  5
2022-05-18 -0.099656 -0.011014  0.577307 -1.973498  6
1
2
df.iat[0,1] = 0
print(df)
1
2
3
4
5
6
7
                   A         B         C         D  F
2022-05-13  0.000000  0.000000 -0.605339 -0.259275  1
2022-05-14  1.792022 -0.109226 -0.782558  0.959711  2
2022-05-15 -0.035533 -1.695418 -0.262033 -0.739229  3
2022-05-16 -0.938873  0.246531 -1.118899 -0.017989  4
2022-05-17  0.335735 -0.935445 -0.217757  0.198702  5
2022-05-18 -0.099656 -0.011014  0.577307 -1.973498  6
1
2
df.loc[:, 'D'] = np.array([5]*len(df))
print(df)
1
2
3
4
5
6
7
                   A         B         C  D  F
2022-05-13  0.000000  0.000000 -0.605339  5  1
2022-05-14  1.792022 -0.109226 -0.782558  5  2
2022-05-15 -0.035533 -1.695418 -0.262033  5  3
2022-05-16 -0.938873  0.246531 -1.118899  5  4
2022-05-17  0.335735 -0.935445 -0.217757  5  5
2022-05-18 -0.099656 -0.011014  0.577307  5  6
1
2
3
df2 = df.copy()
df2[df2>0] = -df2
print(df2)
1
2
3
4
5
6
7
                   A         B         C  D  F
2022-05-13  0.000000  0.000000 -0.605339 -5 -1
2022-05-14 -1.792022 -0.109226 -0.782558 -5 -2
2022-05-15 -0.035533 -1.695418 -0.262033 -5 -3
2022-05-16 -0.938873 -0.246531 -1.118899 -5 -4
2022-05-17 -0.335735 -0.935445 -0.217757 -5 -5
2022-05-18 -0.099656 -0.011014 -0.577307 -5 -6

4. 결측치 (Missing Data)

Pandas는 결측치를 표한하기 위해 주로 np.nan 값을 사용

Reindexing

지정된 축 상의 인덱스를 변경 / 추가 / 삭제 할 수 있고 데이터의 복사본은 반환

1
2
df1 = df.reindex(index=dates[0:4], columns=list(df.columns)+['E'])
print(df1)
1
2
3
4
5
                   A         B         C  D  F   E
2022-05-13  0.000000  0.000000 -0.605339  5  1 NaN
2022-05-14  1.792022 -0.109226 -0.782558  5  2 NaN
2022-05-15 -0.035533 -1.695418 -0.262033  5  3 NaN
2022-05-16 -0.938873  0.246531 -1.118899  5  4 NaN
1
2
df1.loc[dates[0]:dates[1], 'E'] = 1
print(df1)
1
2
3
4
5
                   A         B         C  D  F    E
2022-05-13  0.000000  0.000000 -0.605339  5  1  1.0
2022-05-14  1.792022 -0.109226 -0.782558  5  2  1.0
2022-05-15 -0.035533 -1.695418 -0.262033  5  3  NaN
2022-05-16 -0.938873  0.246531 -1.118899  5  4  NaN

결측치를 가지고 있는 행들을 지움

1
df1.dropna(how='any')
ABCDFE
2022-05-130.0000000.000000-0.605339511.0
2022-05-141.792022-0.109226-0.782558521.0

결측치를 채워 넣음

1
df1.fillna(value=5)
ABCDFE
2022-05-130.0000000.000000-0.605339511.0
2022-05-141.792022-0.109226-0.782558521.0
2022-05-15-0.035533-1.695418-0.262033535.0
2022-05-16-0.9388730.246531-1.118899545.0

nan인 값에 boolean을 통한 표식

  • 데이터프레임의 모든 값이 boolean 형태로 표시
  • nan인 값에만 True가 표시되게 함
1
pd.isna(df1)
ABCDFE
2022-05-13FalseFalseFalseFalseFalseFalse
2022-05-14FalseFalseFalseFalseFalseFalse
2022-05-15FalseFalseFalseFalseFalseTrue
2022-05-16FalseFalseFalseFalseFalseTrue

5. 연산 (Operation)

통계 (Stats)

  • 일반적으로 결측치를 제외한 후 연산

    평균

    축 선택이 가능함

1
df.mean() # default: 열(row)
1
2
3
4
5
6
A    0.175616
B   -0.417429
C   -0.401547
D    5.000000
F    3.500000
dtype: float64
1
df.mean(1) # 행(col)
1
2
3
4
5
6
7
2022-05-13    1.078932
2022-05-14    1.580048
2022-05-15    1.201403
2022-05-16    1.437752
2022-05-17    1.836506
2022-05-18    2.293327
Freq: D, dtype: float64

pd.shift()

행의 위치를 일정 칸수씩 이동

1
2
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates)
s
1
2
3
4
5
6
7
2022-05-13    1.0
2022-05-14    3.0
2022-05-15    5.0
2022-05-16    NaN
2022-05-17    6.0
2022-05-18    8.0
Freq: D, dtype: float64
1
2
s = pd.Series([1, 3, 5, np.nan, 6, 8], index=dates).shift(2)
s
1
2
3
4
5
6
7
2022-05-13    NaN
2022-05-14    NaN
2022-05-15    1.0
2022-05-16    3.0
2022-05-17    5.0
2022-05-18    NaN
Freq: D, dtype: float64
1
df.sub(s, axis='index')
ABCDF
2022-05-13NaNNaNNaNNaNNaN
2022-05-14NaNNaNNaNNaNNaN
2022-05-15-1.035533-2.695418-1.2620334.02.0
2022-05-16-3.938873-2.753469-4.1188992.01.0
2022-05-17-4.664265-5.935445-5.2177570.00.0
2022-05-18NaNNaNNaNNaNNaN

Apply

데이터에 함수를 적용

1
df.apply(np.cumsum)
ABCDF
2022-05-130.0000000.000000-0.60533951
2022-05-141.792022-0.109226-1.387898103
2022-05-151.756490-1.804644-1.649931156
2022-05-160.817617-1.558113-2.7688292010
2022-05-171.153351-2.493559-2.9865862515
2022-05-181.053696-2.504573-2.4092793021
1
df.apply(lambda x: x.max()-x.min())
1
2
3
4
5
6
A    2.730895
B    1.941949
C    1.696205
D    0.000000
F    5.000000
dtype: float64

히스토그래밍

1
2
s = pd.Series(np.random.randint(0, 7, size=10))
s
1
2
3
4
5
6
7
8
9
10
11
0    6
1    5
2    5
3    5
4    1
5    0
6    5
7    1
8    1
9    0
dtype: int32
1
print(s.value_counts())
1
2
3
4
5
5    4
1    3
0    2
6    1
dtype: int64

문자열 메소드 (String Methods)

  • 문자열의 패턴 일치 확인은 기본적으로 정규 표현식을 사용
  • 몇몇 경우에는 항상 정규 표현식을 사용
  • str 메서드를 이용
  • 참고 문서
1
2
s = pd.Series(['A', 'B', 'C', 'Aaba', 'Baca', np.nan, 'CABA', 'dog', 'cat'])
print(s.str.lower())
1
2
3
4
5
6
7
8
9
10
0       a
1       b
2       c
3    aaba
4    baca
5     NaN
6    caba
7     dog
8     cat
dtype: object

6. 병합 (Merge)

연결 (Concat)

결합 (join) / 병합 (merge) 형태의 연산에 대한 인덱스, 관계 대수 기능을 위한 다양한 형태의 논리를 포함한 객체를 손쉽게 결합할 수 있음

1
2
df = pd.DataFrame(np.random.randn(10, 4))
df
0123
00.4349561.4435640.6444351.496020
10.837197-0.2171020.3417050.325993
2-0.8209440.776307-1.0416410.713703
3-0.405343-1.151217-0.5084350.075641
40.4870230.4463581.2974041.569944
50.2714870.6303230.196226-0.552806
60.2710900.5586620.716677-1.491536
7-1.0544621.934392-0.031558-0.740327
80.8406780.8471080.349140-1.254350
90.857202-0.7089430.179937-0.259415
1
2
pieces = [df[:3], df[3:7], df[7:]]
pd.concat(pieces)
0123
00.4349561.4435640.6444351.496020
10.837197-0.2171020.3417050.325993
2-0.8209440.776307-1.0416410.713703
3-0.405343-1.151217-0.5084350.075641
40.4870230.4463581.2974041.569944
50.2714870.6303230.196226-0.552806
60.2710900.5586620.716677-1.491536
7-1.0544621.934392-0.031558-0.740327
80.8406780.8471080.349140-1.254350
90.857202-0.7089430.179937-0.259415

결합 (Join)

SQL 방식으로 병합

1
2
left = pd.DataFrame({'key':['foo', 'foo'], 'lval':[1, 2]})
right = pd.DataFrame({'key':['foo', 'foo'], 'rval':[4,5]})
1
left
keylval
0foo1
1foo2
1
right
keyrval
0foo4
1foo5
1
pd.merge(left, right, on='key')
keylvalrval
0foo14
1foo15
2foo24
3foo25
1
2
left = pd.DataFrame({'key':['foo', 'bar'], 'lval':[1, 2]})
right = pd.DataFrame({'key':['foo', 'bar'], 'rval':[4,5]})
1
pd.merge(left, right, on='key')
keylvalrval
0foo14
1bar25

추가 (Append)

1
2
df = pd.DataFrame(np.random.randn(8, 4), columns=['A', 'B', 'C', 'D'])
df
ABCD
00.9253611.590198-0.046855-1.486980
10.970544-0.441473-0.058223-1.217569
21.8229000.2924000.607530-1.003081
30.1194980.7068621.1783072.322978
4-0.5321402.189385-0.852177-0.113575
50.595338-2.4150041.3311141.422960
60.103675-0.2708800.539115-0.082935
70.3190830.996896-0.6033170.223083
1
2
s = df.iloc[3]
s
1
2
3
4
5
A    0.119498
B    0.706862
C    1.178307
D    2.322978
Name: 3, dtype: float64
1
df.append(s, ignore_index=True)
1
2
C:\Users\spec3\AppData\Local\Temp/ipykernel_10056/4011806271.py:1: FutureWarning: The frame.append method is deprecated and will be removed from pandas in a future version. Use pandas.concat instead.
  df.append(s, ignore_index=True)
ABCD
00.9253611.590198-0.046855-1.486980
10.970544-0.441473-0.058223-1.217569
21.8229000.2924000.607530-1.003081
30.1194980.7068621.1783072.322978
4-0.5321402.189385-0.852177-0.113575
50.595338-2.4150041.3311141.422960
60.103675-0.2708800.539115-0.082935
70.3190830.996896-0.6033170.223083
80.1194980.7068621.1783072.322978

append는 추후 삭제될 예정이므로 concat을 사용하라고 함

7. 그룹화 (Grouping)

그룹화는 다음 단계 중 하나 이상을 포함하는 과정

  1. 몇몇 기준에 따라 여러 그룹을 데이터를 분할 -> splitting
  2. 각 그룹에 독립적으로 함수를 적용 -> applying
  3. 결과물들을 하나의 데이터 구조로 결합 -> combining
1
2
3
4
5
6
7
8
9
10
df = pd.DataFrame(
    {
        'A': ['foo', 'bar', 'foo', 'bar', 'foo', 'bar', 'foo', 'foo'],
        'B': ['one', 'one', 'two', 'three', 'two', 'two','one', 'three'],
        'C': np.random.randn(8),
        'D': np.random.randn(8)
    }
)

print(df)
1
2
3
4
5
6
7
8
9
     A      B         C         D
0  foo    one -0.641923  0.427765
1  bar    one -0.643603  0.494289
2  foo    two -0.186318 -0.767165
3  bar  three  0.588941 -0.066441
4  foo    two -1.536313  0.290155
5  bar    two  0.449030 -0.116639
6  foo    one -0.956890  0.661999
7  foo  three -0.016754  0.907288
1
df.groupby('A').sum()
CD
A
bar0.3943680.311210
foo-3.3381991.520043
1
df.groupby(['A', 'B']).sum()
CD
AB
barone-0.6436030.494289
three0.588941-0.066441
two0.449030-0.116639
fooone-1.5988131.089764
three-0.0167540.907288
two-1.722632-0.477010

8. 변형 (Reshaping)

스택 (Stack)

1
2
3
4
5
6
tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))

print(tuples)
1
[('bar', 'one'), ('bar', 'two'), ('baz', 'one'), ('baz', 'two'), ('foo', 'one'), ('foo', 'two'), ('qux', 'one'), ('qux', 'two')]
1
2
3
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])

print(index)
1
2
3
4
5
6
7
8
9
MultiIndex([('bar', 'one'),
            ('bar', 'two'),
            ('baz', 'one'),
            ('baz', 'two'),
            ('foo', 'one'),
            ('foo', 'two'),
            ('qux', 'one'),
            ('qux', 'two')],
           names=['first', 'second'])
1
2
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
print(df)
1
2
3
4
5
6
7
8
9
10
                     A         B
first second                    
bar   one     0.288745 -0.503898
      two    -0.024742  0.001157
baz   one    -2.595053 -0.804304
      two    -0.405564 -0.110970
foo   one     1.143463  0.600136
      two    -1.193027 -0.706129
qux   one    -1.632288  0.230009
      two     0.391425 -0.299005
1
2
df2 = df[:4]
df2
AB
firstsecond
barone0.288745-0.503898
two-0.0247420.001157
bazone-2.595053-0.804304
two-0.405564-0.110970

stack() 메소드는 데이터프레임 열들의 계층을 압축

1
2
stacked = df2.stack()
stacked
1
2
3
4
5
6
7
8
9
10
first  second   
bar    one     A    0.288745
               B   -0.503898
       two     A   -0.024742
               B    0.001157
baz    one     A   -2.595053
               B   -0.804304
       two     A   -0.405564
               B   -0.110970
dtype: float64
1
stacked.unstack()
AB
firstsecond
barone0.288745-0.503898
two-0.0247420.001157
bazone-2.595053-0.804304
two-0.405564-0.110970

피봇 테이블 (Pivot Tables)

1
2
3
4
5
6
7
df = pd.DataFrame({'A': ['one', 'one', 'two', 'three']*3,
                   'B': ['A', 'B', 'C']*4,
                   'C': ['foo', 'foo', 'foo', 'bar', 'bar', 'bar']*2,
                   'D': np.random.randn(12),
                   'E': np.random.randn(12)})

df
ABCDE
0oneAfoo-0.5629200.665956
1oneBfoo0.4616040.009933
2twoCfoo0.3555070.423206
3threeAbar-0.3886220.509214
4oneBbar-0.6550120.015648
5oneCbar0.2855310.566449
6twoAfoo-0.149651-0.964291
7threeBfoo0.0622041.159678
8oneCfoo-0.7104241.138363
9oneAbar-0.3835730.615988
10twoBbar0.220584-1.644652
11threeCbar-0.0053310.041637
1
pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
Cbarfoo
AB
oneA-0.383573-0.562920
B-0.6550120.461604
C0.285531-0.710424
threeA-0.388622NaN
BNaN0.062204
C-0.005331NaN
twoANaN-0.149651
B0.220584NaN
CNaN0.355507

9. 범주화 (Categoricals)

1
2
df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']})
df
idraw_grade
01a
12b
23b
34a
45a
56e
1
2
df["grade"] = df["raw_grade"].astype("category")
df
idraw_gradegrade
01aa
12bb
23bb
34aa
45aa
56ee
1
df["grade"]
1
2
3
4
5
6
7
8
0    a
1    b
2    b
3    a
4    a
5    e
Name: grade, dtype: category
Categories (3, object): ['a', 'b', 'e']
1
2
3
# 의미 있는 이름 붙이기
df["grade"].cat.categories = ["very good", "good", "very bad"]
df
idraw_gradegrade
01avery good
12bgood
23bgood
34avery good
45avery good
56every bad
1
2
3
# 순서를 바꾸고 누락된 범주를 추가
df["grade"] = df["grade"].cat.set_categories(["very bad", "bad", "medium", "good", "very good"])
df
idraw_gradegrade
01avery good
12bgood
23bgood
34avery good
45avery good
56every bad
1
df["grade"]
1
2
3
4
5
6
7
8
0    very good
1         good
2         good
3    very good
4    very good
5     very bad
Name: grade, dtype: category
Categories (5, object): ['very bad', 'bad', 'medium', 'good', 'very good']

정렬은, 해당 범주에서 지정된 순서대로 배열됨

1
df.sort_values(by="grade")
idraw_gradegrade
56every bad
12bgood
23bgood
01avery good
34avery good
45avery good

10. 입/출력

CSV 파일

  • 쓰기: df.to_csv("파일이름.csv")
  • 읽기: pd.read_csv("파일이름.csv")

    HDF5

  • 쓰기: df.to_hdf("파일이름.h5", 'df')
  • 읽기: pd.read_hdf("파일이름.h5", 'df)

    Excel

  • 쓰기: df.to_excel('파일이름.xlsx', sheet_name="시트 이름")
  • 읽기: pd.read_excel("파일이름.xlsx", "시트 이름", index_col=None, na_values['NA'])
This post is licensed under CC BY 4.0 by the author.

기초 SQL 정리 2

백준 2475 검증수

Comments powered by Disqus.