ML (MachineLearning)

How to Create a Regressor (Regression Model) and Compute the MSE (Mean Squared Error)

567Rabbit 2024. 4. 12. 16:09

Prediction

- regressor (regression model)

 

A machine learning regressor is a supervised learning model: it is trained on past data and then returns a predicted value for a given input.

Regression models are mainly used when the value to predict is continuous. They learn a pattern from the data and use that pattern to predict a value for new inputs.

 

 

 

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

 

 

Problem) We want to use machine learning to predict the Purchased value.

 

df

  Country Age Salary Purchased
0 France 44.000000 72000.000000 No
1 Spain 27.000000 48000.000000 Yes
2 Germany 30.000000 54000.000000 No
3 Spain 38.000000 61000.000000 No
4 Germany 40.000000 NaN Yes
5 France 35.000000 58000.000000 Yes
6 Spain NaN 52000.000000 No
7 France 48.000000 79000.000000 Yes
8 Germany 50.000000 83000.000000 No
9 France 37.000000 67000.000000 Yes

 

First, handle the NaN values. There are two strategies:

1) Drop strategy: dropna

2) Fill strategy: fillna

 

Of these, handling the NaN values with strategy 1 (dropping) gives the following:

 

df = df.dropna()   # dropna() returns a new DataFrame, so assign the result back to df

 

df

  Country Age Salary Purchased
0 France 44.000000 72000.000000 No
1 Spain 27.000000 48000.000000 Yes
2 Germany 30.000000 54000.000000 No
3 Spain 38.000000 61000.000000 No
5 France 35.000000 58000.000000 Yes
7 France 48.000000 79000.000000 Yes
8 Germany 50.000000 83000.000000 No
9 France 37.000000 67000.000000 Yes
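
The two strategies can be compared side by side on a small frame. The sketch below rebuilds only the Age and Salary columns for rows 3 to 6 of the original table. The names toy, dropped, and filled are illustrative, and filling with the column mean is just one assumed choice for fillna, not something the post prescribes.

```python
import numpy as np
import pandas as pd

# Rows 3-6 of the original table, including the two missing values
toy = pd.DataFrame(
    {"Age": [38.0, 40.0, 35.0, np.nan],
     "Salary": [61000.0, np.nan, 58000.0, 52000.0]},
    index=[3, 4, 5, 6],
)

# Strategy 1: drop every row that contains a NaN.
# dropna() returns a new DataFrame; toy itself is left unchanged.
dropped = toy.dropna()

# Strategy 2: fill each NaN, here with the mean of its column (an assumed choice).
filled = toy.fillna(toy.mean())

print(dropped.index.tolist())
print(filled)
```

Dropping keeps the original index labels, which is why the cleaned table above skips 4 and 6.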

 

 

Splitting into feature data X and target data y

First, split the data into X, the features the model learns from, and y, the target values the model should predict.

 

y

  Purchased
0 No
1 Yes
2 No
3 No
5 Yes
7 Yes
8 No
9 Yes

 

 

 

X

  Country Age Salary
0 France 44.000000 72000.000000
1 Spain 27.000000 48000.000000
2 Germany 30.000000 54000.000000
3 Spain 38.000000 61000.000000
5 France 35.000000 58000.000000
7 France 48.000000 79000.000000
8 Germany 50.000000 83000.000000
9 France 37.000000 67000.000000
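
The post does not show the code for this split. One way to do it, assuming the target is the last column of df, is sketched below. The use of iloc is an assumption for illustration and may not be what the author ran.

```python
import pandas as pd

# The first three rows of the cleaned table
df = pd.DataFrame(
    {"Country": ["France", "Spain", "Germany"],
     "Age": [44.0, 27.0, 30.0],
     "Salary": [72000.0, 48000.0, 54000.0],
     "Purchased": ["No", "Yes", "No"]},
)

X = df.iloc[:, :-1]      # every column except the last: the features
y = df["Purchased"]      # the last column: the target, as a Series

print(X.columns.tolist())
print(y.tolist())
```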

 

 

Converting strings to numbers

- Text data has to be converted to numbers before the model can use it. For a categorical column such as Country, this is done with one-hot encoding, which creates one 0/1 column per category.

 

The expected result of one-hot encoding is written out below:

 

# France  Germany  Spain  Age  Salary
#   1        0       0    44   72000
#   0        0       1    27   48000
#   0        1       0    30   54000
#   0        0       1    38   61000
#   1        0       0    35   58000
#   1        0       0    48   79000
#   0        1       0    50   83000
#   1        0       0    37   67000

 

 

from sklearn.preprocessing import LabelEncoder, OneHotEncoder

from sklearn.compose import ColumnTransformer

 

<One-hot encoding pattern>

ColumnTransformer( [ ('encoder', OneHotEncoder() , [] ) ] , remainder= 'passthrough' )

The column indices listed inside [] are one-hot encoded. remainder='passthrough' means the remaining columns are passed through as they are, without being transformed.

 

 

ct = ColumnTransformer( [ ('encoder', OneHotEncoder() , [0] ) ] , remainder= 'passthrough' ) 

X = ct.fit_transform(X)   

array([[1.0e+00, 0.0e+00, 0.0e+00, 4.4e+01, 7.2e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 2.7e+01, 4.8e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 3.0e+01, 5.4e+04],
       [0.0e+00, 0.0e+00, 1.0e+00, 3.8e+01, 6.1e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.5e+01, 5.8e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 4.8e+01, 7.9e+04],
       [0.0e+00, 1.0e+00, 0.0e+00, 5.0e+01, 8.3e+04],
       [1.0e+00, 0.0e+00, 0.0e+00, 3.7e+01, 6.7e+04]])

 

Note) 4.4e+01 means 4.4 * 10 = 44, and 7.2e+04 means 7.2 * 10000 = 72000.
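
To see why the columns come out in the order France, Germany, Spain, here is a NumPy-only sketch of the same idea. It is not how OneHotEncoder is implemented, only an illustration. It assumes the categories are ordered alphabetically, which matches the encoder output shown above.

```python
import numpy as np

countries = np.array(["France", "Spain", "Germany", "Spain"])

# np.unique returns the sorted categories and, for each row,
# the position of its category in that sorted list
categories, codes = np.unique(countries, return_inverse=True)

# Row i of the identity matrix is the one-hot vector for category i
one_hot = np.eye(len(categories))[codes]

print(categories)   # the sorted categories give the column order
print(one_hot)
```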

 

 

 

y contains only No and Yes, which can be represented as 0 and 1, so a plain label encoding is enough.

  Purchased
0 No
1 Yes
2 No
3 No
5 Yes
7 Yes
8 No
9 Yes

 

sorted(y.unique())

 

encoder = LabelEncoder()   # classes are sorted, so No becomes 0 and Yes becomes 1

y = encoder.fit_transform(y)

array([0, 1, 0, 0, 1, 1, 0, 1])
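
A self-contained version of this step, using the eight labels from the cleaned table, might look like the following. The variable names labels, encoded, and restored are illustrative.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

labels = np.array(["No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes"])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)

print(encoder.classes_)   # sorted class labels; the position is the encoded value
print(encoded)

# inverse_transform maps the integers back to the original labels
restored = encoder.inverse_transform(encoded)
```

Keeping the fitted encoder around is useful later, because inverse_transform turns 0/1 back into No/Yes.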

 

 

Feature Scaling

For models such as deep learning models, feature scaling should be applied to bring the features onto a comparable scale, which improves both the stability and the performance of the model.

 

There are two methods. (Pick one of the two.)

 

  • Standardization: rescales each feature by how far it lies from the mean, so that all features share the same basis. Negative values occur. Use it when the maximum and minimum of the data are unknown. ( StandardScaler )
  • Normalization: rescales each feature into the range 0 to 1, which makes the relative position of values comparable. Use it when the maximum and minimum of the data are known. ( MinMaxScaler )

from sklearn.preprocessing import StandardScaler, MinMaxScaler

 

 

 

# Method 1. Standardization
X_scaler = StandardScaler()

X_scaler.fit_transform( X )

array([[ 1.        , -0.57735027, -0.57735027,  0.69985807,  0.58989097],
       [-1.        , -0.57735027,  1.73205081, -1.51364653, -1.50749915],
       [-1.        ,  1.73205081, -0.57735027, -1.12302807, -0.98315162],
       [-1.        , -0.57735027,  1.73205081, -0.08137885, -0.37141284],
       [ 1.        , -0.57735027, -0.57735027, -0.47199731, -0.6335866 ],
       [ 1.        , -0.57735027, -0.57735027,  1.22068269,  1.20162976],
       [-1.        ,  1.73205081, -0.57735027,  1.48109499,  1.55119478],
       [ 1.        , -0.57735027, -0.57735027, -0.211585  ,  0.1529347 ]])

 

y         # y contains only 0 and 1, so it does not need to be scaled

array([0, 1, 0, 0, 1, 1, 0, 1])
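
The standardized Age column can be reproduced by hand, which makes the formula concrete. This is a sketch using NumPy only. It assumes the population standard deviation (ddof=0), which is consistent with the StandardScaler output shown above.

```python
import numpy as np

age = np.array([44.0, 27.0, 30.0, 38.0, 35.0, 48.0, 50.0, 37.0])

# Standardization: subtract the mean, then divide by the standard deviation.
# np.std uses the population standard deviation (ddof=0) by default.
z = (age - age.mean()) / age.std()

print(z.round(8))
```

After this transform the column has mean 0 and standard deviation 1.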

 

 

# Method 2. Normalization

X_scaler = MinMaxScaler()

X = X_scaler.fit_transform( X )

array([[1.        , 0.        , 0.        , 0.73913043, 0.68571429],
       [0.        , 0.        , 1.        , 0.        , 0.        ],
       [0.        , 1.        , 0.        , 0.13043478, 0.17142857],
       [0.        , 0.        , 1.        , 0.47826087, 0.37142857],
       [1.        , 0.        , 0.        , 0.34782609, 0.28571429],
       [1.        , 0.        , 0.        , 0.91304348, 0.88571429],
       [0.        , 1.        , 0.        , 1.        , 1.        ],
       [1.        , 0.        , 0.        , 0.43478261, 0.54285714]])

 

y

array([0, 1, 0, 0, 1, 1, 0, 1])
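
The normalized Salary column can be checked by hand in the same way. A minimal NumPy sketch of the min-max formula:

```python
import numpy as np

salary = np.array([72000.0, 48000.0, 54000.0, 61000.0,
                   58000.0, 79000.0, 83000.0, 67000.0])

# Min-max normalization: the smallest value maps to 0, the largest to 1
scaled = (salary - salary.min()) / (salary.max() - salary.min())

print(scaled.round(8))
```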

 

 

Split the dataset into a training set and a test set.

 

from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size= 0.2, random_state=32)

 

# Use 80% for training and 20% for testing (0.2 or 0.25 is the usual choice)
# random_state works like a random seed: fixing it makes the split reproducible
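
The post scales X before splitting. A common alternative is to split first and fit the scaler on the training rows only, so that the test rows stay unseen during preprocessing. The sketch below uses synthetic stand-in data rather than the real table.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler

# Synthetic stand-in for the 8 encoded rows (8 samples, 5 features)
rng = np.random.default_rng(0)
X = rng.random((8, 5))
y = np.array([0, 1, 0, 0, 1, 1, 0, 1])

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=32
)
print(X_train.shape, X_test.shape)   # 8 rows at test_size=0.2 give 6 train, 2 test

scaler = MinMaxScaler()
X_train_scaled = scaler.fit_transform(X_train)   # learn min/max from training rows only
X_test_scaled = scaler.transform(X_test)         # reuse those same min/max values
```

With this order, test values can fall slightly outside 0 to 1, which is expected because the scaler never saw them.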

 

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X_train, y_train)
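
After fitting, the learned parameters can be inspected through coef_ and intercept_. The sketch below uses synthetic data that follows y = 2x + 1 exactly, so the expected parameters are known in advance. The names X_demo, y_demo, and model are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic data that follows y = 2x + 1 exactly
X_demo = np.array([[0.0], [1.0], [2.0], [3.0]])
y_demo = 2.0 * X_demo.ravel() + 1.0

model = LinearRegression()
model.fit(X_demo, y_demo)

print(model.coef_, model.intercept_)      # learned slope(s) and intercept
print(model.predict(np.array([[4.0]])))   # prediction for an unseen input
```

Note that a linear regressor returns continuous values, so with a 0/1 target such as Purchased the predictions are scores rather than exact 0 or 1 labels.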

 

 

 

 

Validation: computing the MSE (mean squared error)

 

y_pred = regressor.predict(X_test)

 

# MSE

error = y_test - y_pred

(error ** 2).mean()

 

# To measure performance, square each error first to remove its sign, then take the mean

 

# RMSE (the square root of the MSE)
np.sqrt((error ** 2).mean())
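
As a worked example of both formulas, the sketch below uses two made-up predictions and cross-checks the manual MSE against scikit-learn's mean_squared_error. The arrays y_true and y_hat are hypothetical values, not results from the model above.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

y_true = np.array([0.0, 1.0])
y_hat = np.array([0.25, 0.5])   # made-up predictions for illustration

error = y_true - y_hat          # [-0.25, 0.5]
mse = (error ** 2).mean()       # (0.0625 + 0.25) / 2
rmse = np.sqrt(mse)

# The library helper should agree with the manual formula
sk_mse = mean_squared_error(y_true, y_hat)

print(mse, rmse, sk_mse)
```

RMSE is in the same unit as the target, which often makes it easier to interpret than MSE.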

 

df_test = pd.DataFrame({'y_test': y_test})   # y_test is a NumPy array after label encoding, so build the frame directly

 

df_test['y_pred'] = y_pred

 

df_test.plot(kind='bar')
plt.savefig('test.jpg')
plt.show()