- Pandas [Panel Data + Python Data Analysis]
- Built on NumPy, with plotting that uses Matplotlib
- Datasets can be merged
- Provides basic descriptive statistics
- Handles NaN values automatically
- Loads numeric and string data automatically
- Used for data analysis
- Cleans up messy datasets so they are readable and relevant
- Advanced => correlation, plotting
Installation: pip install pandas
In pandas, 1-dimensional data is called a Series.
In pandas, 2-dimensional data is called a DataFrame.
1-dimensional data: Series
import pandas as pd

index = ['eggs', 'apples', 'milk', 'bread']
data = [30, 6, 'Yes', 'No']
x = pd.Series(data = data, index = index)
x
eggs 30
apples 6
milk Yes
bread No
dtype: object
x.values
array([30, 6, 'Yes', 'No'], dtype=object)
x.index
Index(['eggs', 'apples', 'milk', 'bread'], dtype='object')
x.ndim  # check the number of dimensions
1
x[['eggs','bread']]  # to access two or more items, pass the labels as a list inside []
eggs 30
bread No
dtype: object
x[1:3+1]  # slicing by position (the end position is excluded, hence 3+1)
apples 6
milk Yes
bread No
dtype: object
x['apples' : 'bread']  # slicing by label (the end label is included)
apples 6
milk Yes
bread No
dtype: object
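The two slices above return the same rows for different reasons. A minimal sketch of the difference, using the explicit `.iloc` and `.loc` accessors instead of plain brackets (that substitution is a choice made for this example, not something the original code does):

```python
import pandas as pd

# Same data as the Series above.
x = pd.Series(data=[30, 6, 'Yes', 'No'],
              index=['eggs', 'apples', 'milk', 'bread'])

# Position-based slice: the end position is excluded,
# so 1:4 covers positions 1, 2 and 3.
by_position = x.iloc[1:4]

# Label-based slice: both endpoints are included.
by_label = x.loc['apples':'bread']

print(by_position.equals(by_label))  # True
```

Spelling out `.iloc` or `.loc` makes it obvious which rule applies when the index labels are themselves integers.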
2-dimensional data: DataFrame
In pandas, 2-dimensional data is handled with a DataFrame.
In real data analysis, a CSV file is typically read into a pandas DataFrame before working on it.
df.describe()
: reports summary statistics such as count, mean, std, min, and max
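As a rough sketch of the CSV workflow just described: the CSV text, its column names, and the variable `df_csv` are all made up for this example, and an in-memory string stands in for a real file path.

```python
import io
import pandas as pd

# Hypothetical CSV content, standing in for a real file on disk.
csv_text = """store,bikes,pants
store 1,20,30
store 2,15,5
store 3,20,30
"""

# pd.read_csv accepts a file path or a file-like object.
df_csv = pd.read_csv(io.StringIO(csv_text), index_col='store')

# describe() summarizes each numeric column.
summary = df_csv.describe()
print(summary.loc[['count', 'mean', 'min', 'max']])
```

With a real file, the `io.StringIO(...)` argument would simply be replaced by the path to the CSV.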
items = {'Bob' : pd.Series(data = [245, 25, 55], index = ['bike', 'pants', 'watch']),
         'Alice' : pd.Series(data = [40, 110, 500, 45], index = ['book', 'glasses', 'bike', 'pants'])}
df = pd.DataFrame(items)
df
        | Bob   | Alice |
bike    | 245.0 | 500.0 |
book    | NaN   | 40.0  |
glasses | NaN   | 110.0 |
pants   | 25.0  | 45.0  |
watch   | 55.0  | NaN   |
Bold labels on the left: index
Bold labels across the top: columns
Data inside the table: values
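The three parts named above are available as attributes of the DataFrame. A small sketch, assuming the same `items` dictionary; the variable name `shopping` is chosen only for this example.

```python
import pandas as pd

items = {
    'Bob': pd.Series(data=[245, 25, 55], index=['bike', 'pants', 'watch']),
    'Alice': pd.Series(data=[40, 110, 500, 45],
                       index=['book', 'glasses', 'bike', 'pants']),
}
shopping = pd.DataFrame(items)

print(shopping.index.tolist())    # row labels
print(shopping.columns.tolist())  # column labels
print(shopping.values.shape)      # underlying 2-D array: (rows, columns)

# Labels present in only one Series are filled with NaN in the other column.
print(shopping.loc['book', 'Bob'])
```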
Creating a new column, renaming columns
# Create a new column called name and fill it with A, B, C.
        | bikes | pants | shirts | suits |
store 1 | 20    | 30    | 15.0   | 45.0  |
store 2 | 15    | 5     | 2.0    | 7.0   |
store 3 | 20    | 30    | NaN    | NaN   |
df['name'] = ['A' , 'B' , 'C']
df
        | bikes | pants | shirts | suits | name |
store 1 | 20    | 30    | 15.0   | 45.0  | A    |
store 2 | 15    | 5     | 2.0    | 7.0   | B    |
store 3 | 20    | 30    | NaN    | NaN   | C    |
# Rename columns: bikes => hat, suits => shoes
        | bikes | pants | shirts | suits |
store 1 | 20    | 30    | 15.0   | 45.0  |
store 2 | 15    | 5     | 2.0    | 7.0   |
store 3 | 20    | 30    | NaN    | NaN   |
df.rename(columns = {'bikes' : 'hat', 'suits' : 'shoes'}, inplace = True)
df
        | hat | pants | shirts | shoes |
store 1 | 20  | 30    | 15.0   | 45.0  |
store 2 | 15  | 5     | 2.0    | 7.0   |
store 3 | 20  | 30    | NaN    | NaN   |
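One thing worth knowing about `rename`: a key that does not match any column is skipped without warning, so a typo in a column name can go unnoticed. The sketch below illustrates this on a small frame built for the example (`df_sales` is a hypothetical name), and uses the `errors='raise'` option to surface the mismatch.

```python
import pandas as pd

df_sales = pd.DataFrame(
    {'bikes': [20, 15, 20], 'pants': [30, 5, 30], 'suits': [45.0, 7.0, None]},
    index=['store 1', 'store 2', 'store 3'],
)

# rename returns a new DataFrame unless inplace=True is passed.
renamed = df_sales.rename(columns={'bikes': 'hat', 'suits': 'shoes'})
print(renamed.columns.tolist())

# By default, a key that matches no column ('bike' here) is ignored silently.
unchanged = df_sales.rename(columns={'bike': 'hat'})
print(unchanged.columns.tolist())

# errors='raise' turns the mismatch into a KeyError.
try:
    df_sales.rename(columns={'bike': 'hat'}, errors='raise')
    raised = False
except KeyError:
    raised = True
print(raised)
```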
Accessing data
How to access the data you want from a DataFrame
# 1. Getting a column's data: use square brackets right after the variable name.
# 2. Getting data by row and column (1): .loc[ , ] accesses data by the human-readable index and column labels.
# 3. Getting data by row and column (2): .iloc[ , ] accesses data by the integer position (offset) that the computer assigns.
        | bikes | pants | watches | glasses |
store 1 | 20    | 30    | 35      | NaN     |
store 2 | 15    | 5     | 10      | 50.0    |
df['watches']
store 1 35
store 2 10
Name: watches, dtype: int64
# Get the bikes and watches columns together
df[['bikes','watches']]
        | bikes | watches |
store 1 | 20    | 35      |
store 2 | 15    | 10      |
# Get the pants value for store 1
df.loc['store 1','pants']
30
# Get the bikes and watches values for store 2
df.loc['store 2',['bikes','watches']]
bikes 15.0
watches 10.0
Name: store 2, dtype: float64
# For store 2, get the values from pants through glasses
df.loc['store 2','pants':'glasses']
pants 5.0
watches 10.0
glasses 50.0
Name: store 2, dtype: float64
# Get the pants value for store 1 (by position)
df.iloc [ 0, 1 ]
30
# Get the bikes and watches values for store 2 (by position)
df.iloc[ 1 , [0,2] ]
bikes 15.0
watches 10.0
Name: store 2, dtype: float64
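To put the `.loc` and `.iloc` examples side by side, here is a self-contained sketch that rebuilds the two-store table (under the hypothetical name `df_stores`) and checks that label-based and position-based access reach the same cells.

```python
import numpy as np
import pandas as pd

df_stores = pd.DataFrame(
    {'bikes': [20, 15], 'pants': [30, 5],
     'watches': [35, 10], 'glasses': [np.nan, 50.0]},
    index=['store 1', 'store 2'],
)

# Same cells, addressed by label and by integer position.
by_label = df_stores.loc['store 2', ['bikes', 'watches']]
by_offset = df_stores.iloc[1, [0, 2]]
print((by_label == by_offset).all())

# Label slices include the end label; position slices exclude the end position.
label_slice = df_stores.loc['store 2', 'pants':'glasses']
offset_slice = df_stores.iloc[1, 1:4]
print(label_slice.index.tolist())
print(offset_slice.index.tolist())
```

The slice pair shows the main trap: `'pants':'glasses'` stops at glasses inclusive, while the positional equivalent needs `1:4` to reach column 3.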
Data cleaning
: Fixing bad data in a dataset
=> cleaning empty cells, fixing data in the wrong format, fixing wrong data (or changing data values), removing duplicates (drop_duplicates)
(1) Cleaning empty cells
: Empty cells can lead to wrong results when the data is analyzed.
Counting missing values
df.isna().sum()  # number of NaN values in each column
df.isna().sum().sum()  # total number of NaN values
Removing rows that contain empty cells : df.dropna() : deletes every row that contains a NaN
Filling empty cells : df.fillna(0) : fill with 0
df.fillna("a") : fill with the string "a"
        | bikes | pants | watches | shirts | shoes | suits | glassess |
store 1 | 20    | 30    | 35      | 15.0   | 8     | 45.0  | NaN      |
store 2 | 15    | 5     | 10      | 2.0    | 5     | 7.0   | 50.0     |
store 3 | 20    | 30    | 35      | NaN    | 10    | NaN   | 4.0      |
# Delete the rows that contain a NaN
df.dropna()
        | bikes | pants | watches | shirts | shoes | suits | glassess |
store 2 | 15    | 5     | 10      | 2.0    | 5     | 7.0   | 50.0     |
# Fill with 0
df.fillna(0)
        | bikes | pants | watches | shirts | shoes | suits | glassess |
store 1 | 20    | 30    | 35      | 15.0   | 8     | 45.0  | 0.0      |
store 2 | 15    | 5     | 10      | 2.0    | 5     | 7.0   | 50.0     |
store 3 | 20    | 30    | 35      | 0.0    | 10    | 0.0   | 4.0      |
# Fill with each column's mean
df.fillna(df.mean())
        | bikes | pants | watches | shirts | shoes | suits | glassess |
store 1 | 20    | 30    | 35      | 15.0   | 8     | 45.0  | 27.0     |
store 2 | 15    | 5     | 10      | 2.0    | 5     | 7.0   | 50.0     |
store 3 | 20    | 30    | 35      | 8.5    | 10    | 26.0  | 4.0      |
# numeric_only=True computes the mean over the numeric columns only
  | Book Title          | Author              | User1 | User2 | User3 | User4 |
0 | Great Expectations  | Charles Dickens     | 3.2   | 5.0   | 2.0   | 4.0   |
1 | Of Mice and Men     | John Steinbeck      | NaN   | 1.3   | 2.3   | 3.5   |
2 | Romeo and Juliet    | William Shakespeare | 2.5   | 4.0   | NaN   | 4.0   |
3 | The Time Machine    | H. G. Wells         | NaN   | 3.8   | 4.0   | 5.0   |
4 | Alice in Wonderland | Lewis Carroll       | NaN   | NaN   | NaN   | 4.2   |
df.fillna(df.mean(numeric_only=True))
  | Book Title          | Author              | User1 | User2 | User3    | User4 |
0 | Great Expectations  | Charles Dickens     | 3.2   | 5.0   | 2.0      | 4.0   |
1 | Of Mice and Men     | John Steinbeck      | 2.85  | 1.3   | 2.3      | 3.5   |
2 | Romeo and Juliet    | William Shakespeare | 2.5   | 4.0   | 2.766667 | 4.0   |
3 | The Time Machine    | H. G. Wells         | 2.85  | 3.8   | 4.0      | 5.0   |
4 | Alice in Wonderland | Lewis Carroll       | 2.85  | 3.525 | 2.766667 | 4.2   |
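The missing-value steps above can be tried end to end on a reduced version of the ratings table. This is only a sketch: the three-row frame and the name `ratings` are chosen for the example.

```python
import numpy as np
import pandas as pd

ratings = pd.DataFrame({
    'Book Title': ['Great Expectations', 'Of Mice and Men', 'Romeo and Juliet'],
    'User1': [3.2, np.nan, 2.5],
    'User2': [5.0, 1.3, 4.0],
})

# Count missing values per column, then in total.
nan_per_column = ratings.isna().sum()
nan_total = int(nan_per_column.sum())

# Option 1: drop every row that has at least one NaN.
complete_rows = ratings.dropna()

# Option 2: fill each numeric column with its own mean.
# numeric_only=True keeps the text column out of the mean calculation.
filled = ratings.fillna(ratings.mean(numeric_only=True))

print(nan_total)
print(len(complete_rows))
print(filled)
```

Dropping rows loses the other ratings in that row, while filling keeps them at the cost of inventing a value, so which option fits depends on the analysis.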
(2) Fixing data in the wrong format
- Convert the data to the appropriate format
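The post does not show how the conversion is done, so the following is only one possible sketch. It assumes `pd.to_numeric` and `pd.to_datetime` are suitable tools for the job, and the frame `raw` with its `price` and `date` columns is invented for the example.

```python
import pandas as pd

# Hypothetical data where numbers and dates arrived as strings.
raw = pd.DataFrame({
    'price': ['20', '15', 'unknown'],
    'date': ['2024-04-01', '2024-04-02', '2024-04-03'],
})

# errors='coerce' turns entries that cannot be parsed into NaN
# instead of raising an exception.
raw['price'] = pd.to_numeric(raw['price'], errors='coerce')

# Parse the date strings into datetime values.
raw['date'] = pd.to_datetime(raw['date'])

print(raw.dtypes)
print(raw)
```

After the conversion, the unparseable entry shows up as a NaN and can be handled with the empty-cell techniques from step (1).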
(3) Fixing wrong data, or changing data values
Replacing Values
Removing Rows
# Change the watches value for store 2 to 20
        | bikes | pants | watches | glasses |
store 1 | 20    | 30    | 35      | NaN     |
store 2 | 15    | 5     | 10      | 50.0    |
df.loc['store 2','watches'] = 20
df
        | bikes | pants | watches | glasses |
store 1 | 20    | 30    | 35      | NaN     |
store 2 | 15    | 5     | 20      | 50.0    |
-----------------------------------------------------------------------------------------------------------------------------------
If you want to access the non-null data without cleaning it, use
.notna()
To check whether values are null
.isna()
To count the nulls in each column
.isna().sum()
To count the nulls in total
.isna().sum().sum()
-----------------------------------------------------------------------------------------------------------------------------------
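As a sketch of the `.notna()` use mentioned in the note above: the boolean result can serve as a row filter, which leaves the original frame untouched. The frame `df_stores` is rebuilt here so the example runs on its own.

```python
import numpy as np
import pandas as pd

df_stores = pd.DataFrame(
    {'bikes': [20, 15], 'pants': [30, 5],
     'watches': [35, 10], 'glasses': [np.nan, 50.0]},
    index=['store 1', 'store 2'],
)

# True where the glasses value is present, False where it is NaN.
has_glasses = df_stores['glasses'].notna()

# Use the boolean Series as a row filter; df_stores itself is not modified.
stores_with_glasses = df_stores[has_glasses]
print(stores_with_glasses)

# The counts from the note above, for comparison.
print(df_stores.isna().sum())
print(df_stores.isna().sum().sum())
```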
# How to delete data
- Delete rows, delete columns
- Use the drop() function and set axis (axis=0 for rows, axis=1 for columns)
df.drop('watches', axis=1)
        | bikes | pants | glasses |
store 1 | 20    | 30    | NaN     |
store 2 | 15    | 5     | 50.0    |
df.drop('store 2' , axis=0)
        | bikes | pants | watches | glasses |
store 1 | 20    | 30    | 35      | NaN     |
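`drop` also accepts `columns=` and `index=` keywords, which some readers find easier to remember than the axis numbers. A short sketch on the same two-store table (rebuilt here as `df_stores`), which also shows that `drop` returns a new DataFrame rather than modifying the original:

```python
import numpy as np
import pandas as pd

df_stores = pd.DataFrame(
    {'bikes': [20, 15], 'pants': [30, 5],
     'watches': [35, 10], 'glasses': [np.nan, 50.0]},
    index=['store 1', 'store 2'],
)

# Equivalent to df_stores.drop('watches', axis=1)
without_watches = df_stores.drop(columns='watches')

# Equivalent to df_stores.drop('store 2', axis=0)
without_store2 = df_stores.drop(index='store 2')

print(without_watches.columns.tolist())
print(without_store2.index.tolist())
print(df_stores.shape)  # the original still has all rows and columns
```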
(4) Removing duplicates (drop_duplicates)
   Year     Name Department  Age  Salary
0  1990    Alice         HR   25   50000
1  1990      Bob         RD   30   48000
2  1990  Charlie      Admin   45   55000
3  1991    Alice         HR   26   52000
4  1991      Bob         RD   31   50000
5  1991  Charlie      Admin   46   60000
6  1992    Alice         HR   27   60000
7  1992      Bob         RD   32   52000
8  1992  Charlie      Admin   47   62000
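The heading names `drop_duplicates`, but the post shows no call to it. The following is a sketch of how it might be applied to this table; the choice of `subset` columns and the variable names `people` and `latest` are assumptions made for illustration.

```python
import pandas as pd

df4 = pd.DataFrame({
    'Year': [1990] * 3 + [1991] * 3 + [1992] * 3,
    'Name': ['Alice', 'Bob', 'Charlie'] * 3,
    'Department': ['HR', 'RD', 'Admin'] * 3,
    'Age': [25, 30, 45, 26, 31, 46, 27, 32, 47],
    'Salary': [50000, 48000, 55000, 52000, 50000, 60000, 60000, 52000, 62000],
})

# No row repeats in every column, so nothing is removed here.
print(len(df4.drop_duplicates()))

# subset= compares only the listed columns; the first match is kept by default.
people = df4.drop_duplicates(subset=['Name', 'Department'])
print(people)

# keep='last' keeps the final occurrence instead (the 1992 rows here).
latest = df4.drop_duplicates(subset='Name', keep='last')
print(latest)
```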
< Categorical Data >
- Data in which the same values repeat
- Work with the unique values
- With categorical data, rows can be grouped by category and each group analyzed separately
df4['Department'].unique()
array(['HR', 'RD', 'Admin'], dtype=object)
df4['Year'].unique()
array([1990, 1991, 1992], dtype=int64)
df4['Year'].nunique()
3
Grouping data with .groupby()
- Compute the total salary paid in each year.
df4.groupby('Year')['Salary'].sum()
Year
1990 153000
1991 162000
1992 174000
Name: Salary, dtype: int64
* df4.groupby('Year')['Salary'].count() tells you how many rows each group contains.
Finding how many times each value repeats
# In the Name column, how many rows are there for each name?
df4.groupby('Name')['Name'].count()
Name
Alice 3
Bob 3
Charlie 3
Name: Name, dtype: int64
df4['Name'].value_counts()
Name
Alice 3
Bob 3
Charlie 3
Name: count, dtype: int64
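The grouping calls above can be combined. As a sketch, `agg` computes several statistics per group in one pass; the reduced `df4` below keeps only the columns needed for the example.

```python
import pandas as pd

df4 = pd.DataFrame({
    'Year': [1990] * 3 + [1991] * 3 + [1992] * 3,
    'Name': ['Alice', 'Bob', 'Charlie'] * 3,
    'Salary': [50000, 48000, 55000, 52000, 50000, 60000, 60000, 52000, 62000],
})

# Total salary per year, as in the example above.
total_by_year = df4.groupby('Year')['Salary'].sum()

# Several statistics per year at once.
stats = df4.groupby('Year')['Salary'].agg(['sum', 'mean', 'count'])
print(stats)

# Two ways to count rows per name; both give the same numbers.
counts_groupby = df4.groupby('Name')['Name'].count()
counts_direct = df4['Name'].value_counts()
print(counts_direct)
```

`value_counts()` additionally sorts by frequency, which is convenient when the categories are not evenly distributed.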