Prediction
- regressor (regression model)
A machine learning regressor is a supervised learning model: trained on past data, it returns predicted values for new inputs.
Regression models are mainly used when the value to be predicted is continuous. They learn the patterns in a given dataset and use those patterns to predict values for new inputs.
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
Problem) We want to predict the Purchased value with machine learning.
df
 | Country | Age | Salary | Purchased |
0 | France | 44.000000 | 72000.000000 | No |
1 | Spain | 27.000000 | 48000.000000 | Yes |
2 | Germany | 30.000000 | 54000.000000 | No |
3 | Spain | 38.000000 | 61000.000000 | No |
4 | Germany | 40.000000 | NaN | Yes |
5 | France | 35.000000 | 58000.000000 | Yes |
6 | Spain | NaN | 52000.000000 | No |
7 | France | 48.000000 | 79000.000000 | Yes |
8 | Germany | 50.000000 | 83000.000000 | No |
9 | France | 37.000000 | 67000.000000 | Yes |
First, handle the NaN values. There are two strategies:
1) Deletion strategy: dropna
2) Filling strategy: fillna
Handling the NaN values with strategy 1, deletion, gives the result below.
df = df.dropna()  # dropna() returns a new DataFrame, so assign the result back
df
 | Country | Age | Salary | Purchased |
0 | France | 44.000000 | 72000.000000 | No |
1 | Spain | 27.000000 | 48000.000000 | Yes |
2 | Germany | 30.000000 | 54000.000000 | No |
3 | Spain | 38.000000 | 61000.000000 | No |
5 | France | 35.000000 | 58000.000000 | Yes |
7 | France | 48.000000 | 79000.000000 | Yes |
8 | Germany | 50.000000 | 83000.000000 | No |
9 | France | 37.000000 | 67000.000000 | Yes |
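Strategy 2 (fillna) can be sketched in the same way. This is a minimal sketch that assumes the gaps are filled with each column's mean, which is one common choice among several; `df_demo` is a hypothetical stand-in for the Age and Salary columns, not the data above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame with one gap per column, mirroring the Age/Salary gaps
df_demo = pd.DataFrame({
    "Age": [44.0, 27.0, np.nan, 38.0],
    "Salary": [72000.0, 48000.0, 54000.0, np.nan],
})

# fillna with a Series fills each column's NaN with that column's own mean
df_filled = df_demo.fillna(df_demo.mean())

print(df_filled)
print(df_filled.isna().sum().sum())  # number of NaN values left
```

Filling keeps every row, which matters for a dataset this small, at the cost of inserting values that were never observed.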
Splitting into training data X and result data y
First, split the data into the features X that the model learns from and the target y that it should predict.
y
 | Purchased |
0 | No |
1 | Yes |
2 | No |
3 | No |
5 | Yes |
7 | Yes |
8 | No |
9 | Yes |
X
 | Country | Age | Salary |
0 | France | 44.000000 | 72000.000000 |
1 | Spain | 27.000000 | 48000.000000 |
2 | Germany | 30.000000 | 54000.000000 |
3 | Spain | 38.000000 | 61000.000000 |
5 | France | 35.000000 | 58000.000000 |
7 | France | 48.000000 | 79000.000000 |
8 | Germany | 50.000000 | 83000.000000 |
9 | France | 37.000000 | 67000.000000 |
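The code that produces X and y is not shown. One way to do the split, sketched under the assumption that Purchased is the target and every other column is a feature; `df_demo`, `X_demo` and `y_demo` are hypothetical names used only for this illustration.

```python
import pandas as pd

# Hypothetical three-row sample with the same columns as the data above
df_demo = pd.DataFrame({
    "Country": ["France", "Spain", "Germany"],
    "Age": [44.0, 27.0, 30.0],
    "Salary": [72000.0, 48000.0, 54000.0],
    "Purchased": ["No", "Yes", "No"],
})

# Features: every column from Country through Salary (label slices are inclusive)
X_demo = df_demo.loc[:, "Country":"Salary"]

# Target: the single column to be predicted
y_demo = df_demo["Purchased"]

print(X_demo.columns.tolist())
print(y_demo.tolist())
```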
Converting strings to numbers
- String data must be converted to numbers so that the computer can work with it. One way to do this for a categorical column is one-hot encoding.
The expected result of one-hot encoding is written out below.
# France Germany Spain Age Salary
# 1 0 0 44 72000
# 0 0 1 27 48000
# 0 1 0 30 54000
# 0 0 1 38 61000
# 1 0 0 35 58000
# 1 0 0 48 79000
# 0 1 0 50 83000
# 1 0 0 37 67000
from sklearn.preprocessing import LabelEncoder, OneHotEncoder
from sklearn.compose import ColumnTransformer
<One-hot encoding template>
ColumnTransformer( [ ('encoder', OneHotEncoder() , [] ) ] , remainder= 'passthrough' )
The columns whose index numbers are listed inside [] are one-hot encoded. remainder='passthrough' leaves the remaining columns untransformed and passes them through as-is.
ct = ColumnTransformer( [ ('encoder', OneHotEncoder() , [0] ) ] , remainder= 'passthrough' )
X = ct.fit_transform(X)
array([[1.0e+00, 0.0e+00, 0.0e+00, 4.4e+01, 7.2e+04],
[0.0e+00, 0.0e+00, 1.0e+00, 2.7e+01, 4.8e+04],
[0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
[0.0e+00, 0.0e+00, 1.0e+00, 3.8e+01, 6.1e+04],
[1.0e+00, 0.0e+00, 0.0e+00, 3.5e+01, 5.8e+04],
[1.0e+00, 0.0e+00, 0.0e+00, 4.8e+01, 7.9e+04],
[0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04],
[1.0e+00, 0.0e+00, 0.0e+00, 3.7e+01, 6.7e+04]])
Note) 4.4e+01 means 4.4 * 10 = 44, and 7.2e+04 means 7.2 * 10000 = 72000.
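The scientific notation affects only how the array is printed, not the stored values. A small sketch of how this could be checked, using NumPy's `suppress` print option; the `row` array is a hand-written copy of the first encoded row.

```python
import numpy as np

# One row of the encoded matrix, written in plain decimal form
row = np.array([1.0, 0.0, 0.0, 44.0, 72000.0])

# Scientific notation is only a display format; the values are identical
same_age = float("4.4e+01") == 44.0
same_salary = float("7.2e+04") == 72000.0
print(same_age, same_salary)

# suppress=True asks NumPy to print in fixed-point notation where possible
with np.printoptions(suppress=True):
    text = np.array2string(row)
print(text)
```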
y contains only No and Yes, which can be represented by the two values 0 and 1, so it is simply label-encoded.
 | Purchased |
0 | No |
1 | Yes |
2 | No |
3 | No |
5 | Yes |
7 | Yes |
8 | No |
9 | Yes |
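sorted(y.unique())  # classes in the order LabelEncoder numbers them: No, then Yes
encoder = LabelEncoder()  # create the label encoder before using it
y = encoder.fit_transform(y)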
array([0, 1, 0, 0, 1, 1, 0, 1])
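To see which number stands for which label, the fitted encoder can be inspected. A minimal sketch using the same eight labels as above; `encoder_demo` and `labels` are names chosen for this illustration.

```python
from sklearn.preprocessing import LabelEncoder

# The eight Purchased labels that remain after dropping the NaN rows
labels = ["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"]

encoder_demo = LabelEncoder()
encoded = encoder_demo.fit_transform(labels)

# classes_ lists the labels in sorted order; a label's position is its code
print(encoder_demo.classes_)
print(encoded)

# inverse_transform maps the codes back to the original strings
restored = encoder_demo.inverse_transform(encoded)
print(restored)
```

Because the classes are sorted alphabetically, No becomes 0 and Yes becomes 1.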
Feature Scaling
For models such as deep learning, feature scaling is used to bring the features onto a comparable scale, which improves the stability and performance of the model.
There are two methods. (Choose one of the two.)
- Standardization: rescales each value by how far it lies from the mean, so negative values occur. Used when the maximum and minimum of the data are not known. ( StandardScaler )
- Normalization: rescales values into the range 0 to 1, which makes relative positions in the data comparable. Used when the maximum and minimum of the data are known. ( MinMaxScaler )
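The two methods can be written out by hand to show what each scaler computes. This sketch applies the standard formulas to the Salary column with plain NumPy; it illustrates the arithmetic and is not a replacement for the scikit-learn classes.

```python
import numpy as np

# Salary values of the eight rows that remain after dropna
salary = np.array([72000.0, 48000.0, 54000.0, 61000.0,
                   58000.0, 79000.0, 83000.0, 67000.0])

# Standardization: (x - mean) / std, using the population std (ddof=0)
standardized = (salary - salary.mean()) / salary.std()

# Normalization: (x - min) / (max - min), which maps the data onto [0, 1]
normalized = (salary - salary.min()) / (salary.max() - salary.min())

print(standardized.round(8))
print(normalized.round(8))
```

The first entries, about 0.58989097 and 0.68571429, agree with the last column of the StandardScaler and MinMaxScaler outputs in this section.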
from sklearn.preprocessing import StandardScaler, MinMaxScaler
# Method 1. Standardization
X_scaler = StandardScaler()
X_scaler.fit_transform( X )
array([[ 1. , -0.57735027, -0.57735027, 0.69985807, 0.58989097],
[-1. , -0.57735027, 1.73205081, -1.51364653, -1.50749915],
[-1. , 1.73205081, -0.57735027, -1.12302807, -0.98315162],
[-1. , -0.57735027, 1.73205081, -0.08137885, -0.37141284],
[ 1. , -0.57735027, -0.57735027, -0.47199731, -0.6335866 ],
[ 1. , -0.57735027, -0.57735027, 1.22068269, 1.20162976],
[-1. , 1.73205081, -0.57735027, 1.48109499, 1.55119478],
[ 1. , -0.57735027, -0.57735027, -0.211585 , 0.1529347 ]])
y # y contains only 0 and 1, so it does not need to be scaled
array([0, 1, 0, 0, 1, 1, 0, 1])
# Method 2. Normalization
X_scaler = MinMaxScaler()
X = X_scaler.fit_transform( X )
array([[1. , 0. , 0. , 0.73913043, 0.68571429],
[0. , 0. , 1. , 0. , 0. ],
[0. , 1. , 0. , 0.13043478, 0.17142857],
[0. , 0. , 1. , 0.47826087, 0.37142857],
[1. , 0. , 0. , 0.34782609, 0.28571429],
[1. , 0. , 0. , 0.91304348, 0.88571429],
[0. , 1. , 0. , 1. , 1. ],
[1. , 0. , 0. , 0.43478261, 0.54285714]])
y
array([0, 1, 0, 0, 1, 1, 0, 1])
Split the dataset into a training set and a test set.
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=32)
# Use 80% for training and 20% for testing (0.2 or 0.25 is the usual choice)
# random_state works like a random seed: the same value reproduces the same split
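The effect of test_size and random_state can be checked on a small array. A minimal sketch with hypothetical data (`X_demo`, `y_demo`) of ten rows, so that a 0.2 test size gives two test rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 10 samples with 2 features each
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Two calls with the same random_state give the same shuffle
first = train_test_split(X_demo, y_demo, test_size=0.2, random_state=32)
second = train_test_split(X_demo, y_demo, test_size=0.2, random_state=32)

X_tr, X_te, y_tr, y_te = first
print(X_tr.shape, X_te.shape)  # 8 training rows and 2 test rows

same_split = np.array_equal(first[1], second[1])
print(same_split)
```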
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
Validation: computing the MSE (mean squared error)
y_pred = regressor.predict(X_test)
# MSE
error = y_test - y_pred
(error ** 2).mean()
# To measure performance, square the errors first to remove their signs, then take the mean
# RMSE (the square root of the MSE)
np.sqrt((error ** 2).mean())
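The manual calculation can be cross-checked against scikit-learn's `mean_squared_error`. A small worked sketch with made-up values (`y_true` and `y_hat` are hypothetical, not results from the model above).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical true labels and predictions
y_true = np.array([0, 1, 0, 1])
y_hat = np.array([0.2, 0.8, 0.4, 0.6])

# Errors are -0.2, 0.2, -0.4, 0.4; their squares are 0.04, 0.04, 0.16, 0.16
manual_mse = ((y_true - y_hat) ** 2).mean()

# The library function computes the same mean of squared errors
library_mse = mean_squared_error(y_true, y_hat)

# RMSE brings the error back to the units of the target
rmse = np.sqrt(manual_mse)

print(manual_mse, library_mse, rmse)
```

Squaring keeps positive and negative errors from cancelling out, which is why the signed errors are not averaged directly.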
# y_test is a NumPy array after label encoding, so build the DataFrame directly
df_test = pd.DataFrame({'y_test': y_test, 'y_pred': y_pred})
df_test.plot(kind='bar')
plt.savefig('test.jpg')
plt.show()