Simple linear regression
Predicts y from a single variable: X -> y
Multiple linear regression
Predicts y from several variables: X1, X2, X3 ... -> y
In this post we use multiple linear regression to make data-driven predictions.
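To make the difference concrete, here is a minimal sketch on synthetic, noise-free data (the values and the names X_multi, y_simple, y_multi are made up for illustration and are not part of the dataset used below). The only difference between the two cases is how many feature columns are passed to the model.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Synthetic, noise-free data (illustrative values only)
rng = np.random.default_rng(0)
X_multi = rng.uniform(0, 10, size=(20, 3))             # three features X1, X2, X3
y_simple = 2.0 * X_multi[:, 0] + 1.0                   # depends on X1 only
y_multi = X_multi @ np.array([2.0, -1.0, 0.5]) + 1.0   # depends on all three

# Simple linear regression: a single feature column
simple = LinearRegression().fit(X_multi[:, [0]], y_simple)
# Multiple linear regression: all three feature columns
multi = LinearRegression().fit(X_multi, y_multi)

print(simple.coef_, simple.intercept_)   # one coefficient
print(multi.coef_, multi.intercept_)     # three coefficients
```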
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
We want to predict Profit from several columns of data.
df
| R&D Spend | Administration | Marketing Spend | State | Profit |
0 | 165349.20 | 136897.80 | 471784.10 | New York | 192261.83 |
1 | 162597.70 | 151377.59 | 443898.53 | California | 191792.06 |
2 | 153441.51 | 101145.55 | 407934.54 | Florida | 191050.39 |
3 | 144372.41 | 118671.85 | 383199.62 | New York | 182901.99 |
4 | 142107.34 | 91391.77 | 366168.42 | Florida | 166187.94 |
5 | 131876.90 | 99814.71 | 362861.36 | New York | 156991.12 |
6 | 134615.46 | 147198.87 | 127716.82 | California | 156122.51 |
7 | 130298.13 | 145530.06 | 323876.68 | Florida | 155752.60 |
8 | 120542.52 | 148718.95 | 311613.29 | New York | 152211.77 |
9 | 123334.88 | 108679.17 | 304981.62 | California | 149759.96 |
10 | 101913.08 | 110594.11 | 229160.95 | Florida | 146121.95 |
11 | 100671.96 | 91790.61 | 249744.55 | California | 144259.40 |
12 | 93863.75 | 127320.38 | 249839.44 | Florida | 141585.52 |
13 | 91992.39 | 135495.07 | 252664.93 | California | 134307.35 |
14 | 119943.24 | 156547.42 | 256512.92 | Florida | 132602.65 |
15 | 114523.61 | 122616.84 | 261776.23 | New York | 129917.04 |
16 | 78013.11 | 121597.55 | 264346.06 | California | 126992.93 |
17 | 94657.16 | 145077.58 | 282574.31 | New York | 125370.37 |
18 | 91749.16 | 114175.79 | 294919.57 | Florida | 124266.90 |
19 | 86419.70 | 153514.11 | 0.00 | New York | 122776.86 |
20 | 76253.86 | 113867.30 | 298664.47 | California | 118474.03 |
21 | 78389.47 | 153773.43 | 299737.29 | New York | 111313.02 |
22 | 73994.56 | 122782.75 | 303319.26 | Florida | 110352.25 |
23 | 67532.53 | 105751.03 | 304768.73 | Florida | 108733.99 |
24 | 77044.01 | 99281.34 | 140574.81 | New York | 108552.04 |
25 | 64664.71 | 139553.16 | 137962.62 | California | 107404.34 |
26 | 75328.87 | 144135.98 | 134050.07 | Florida | 105733.54 |
27 | 72107.60 | 127864.55 | 353183.81 | New York | 105008.31 |
28 | 66051.52 | 182645.56 | 118148.20 | Florida | 103282.38 |
29 | 65605.48 | 153032.06 | 107138.38 | New York | 101004.64 |
30 | 61994.48 | 115641.28 | 91131.24 | Florida | 99937.59 |
31 | 61136.38 | 152701.92 | 88218.23 | New York | 97483.56 |
32 | 63408.86 | 129219.61 | 46085.25 | California | 97427.84 |
33 | 55493.95 | 103057.49 | 214634.81 | Florida | 96778.92 |
34 | 46426.07 | 157693.92 | 210797.67 | California | 96712.80 |
35 | 46014.02 | 85047.44 | 205517.64 | New York | 96479.51 |
36 | 28663.76 | 127056.21 | 201126.82 | Florida | 90708.19 |
37 | 44069.95 | 51283.14 | 197029.42 | California | 89949.14 |
38 | 20229.59 | 65947.93 | 185265.10 | New York | 81229.06 |
39 | 38558.51 | 82982.09 | 174999.30 | California | 81005.76 |
40 | 28754.33 | 118546.05 | 172795.67 | California | 78239.91 |
41 | 27892.92 | 84710.77 | 164470.71 | Florida | 77798.83 |
42 | 23640.93 | 96189.63 | 148001.11 | California | 71498.49 |
43 | 15505.73 | 127382.30 | 35534.17 | New York | 69758.98 |
44 | 22177.74 | 154806.14 | 28334.72 | California | 65200.33 |
45 | 1000.23 | 124153.04 | 1903.93 | New York | 64926.08 |
46 | 1315.46 | 115816.21 | 297114.46 | Florida | 49490.75 |
47 | 0.00 | 135426.92 | 0.00 | California | 42559.73 |
48 | 542.05 | 51743.15 | 0.00 | New York | 35673.41 |
49 | 0.00 | 116983.80 | 45173.06 | California | 14681.40 |
1. Handle NaN values
df.isna().sum()
R&D Spend 0
Administration 0
Marketing Spend 0
State 0
Profit 0
dtype: int64
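This dataset has no missing values, so nothing needs to be done here. If there were any, two common options would be dropping the affected rows or filling them in. The sketch below uses a hypothetical toy frame (toy, dropped, and filled are illustrative names), not the real data.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one missing value per column
toy = pd.DataFrame({
    'R&D Spend': [100.0, np.nan, 300.0],
    'State': ['New York', 'Florida', None],
})

print(toy.isna().sum())   # number of missing values per column

# Option 1: drop every row that contains a NaN
dropped = toy.dropna()

# Option 2: fill the numeric column with its mean (State is left as-is)
filled = toy.fillna({'R&D Spend': toy['R&D Spend'].mean()})
print(filled)
```

Which option is appropriate depends on how much data would be lost by dropping rows.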
2. Split into X and y
y = df['Profit']
X = df.loc[ : , 'R&D Spend':'State' ]
3. Convert string data to numbers
df['State'].unique()
array(['New York', 'California', 'Florida'], dtype=object)
# One-hot encoding
- After one-hot encoding, the encoded columns are always placed at the far left.
from sklearn.preprocessing import OneHotEncoder
from sklearn.compose import ColumnTransformer
ct = ColumnTransformer( [ ('encoder', OneHotEncoder() , [3] ) ] , remainder= 'passthrough' )
X = ct.fit_transform(X)
# Feature scaling is skipped because the model is linear
(If you do need feature scaling, please refer to the previous post.)
4. Split into training and test sets and build the regression model
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y , test_size=0.2, random_state=65)
# The return values must always be unpacked in this order
test_size=0.2 means the data is split test 2 : training 8. In practice, test_size is usually 0.2 or 0.25.
random_state is similar to a random seed: the same value reproduces the same split.
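The effect of test_size and random_state can be checked on dummy data. This is a sketch with made-up arrays (demo_X and demo_y are illustrative names) that have 50 samples, the same count as the dataset above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Dummy data: 50 samples with 2 features each
demo_X = np.arange(100).reshape(50, 2)
demo_y = np.arange(50)

# Split twice with the same random_state
a_train, a_test, ya_train, ya_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=65)
b_train, b_test, yb_train, yb_test = train_test_split(
    demo_X, demo_y, test_size=0.2, random_state=65)

print(a_train.shape, a_test.shape)      # (40, 2) (10, 2)
print(np.array_equal(a_test, b_test))   # True: same seed, same split
```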
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
5. Taking a closer look at the regression model
Multiple linear regression uses several variables. In this example the variables are as follows.
R&D Spend | Administration | Marketing Spend | State |
However, State was split by one-hot encoding into three columns (New York, California, Florida),
so the model effectively has the six variables below, listed in the column order of X (the encoded columns come first).
California (1, 0, 0)
Florida (0, 1, 0)
New York (0, 0, 1)
R&D Spend
Administration
Marketing Spend
Since X has six variables in total,
the model can be written as y = aX1 + bX2 + cX3 + dX4 + eX5 + fX6 + g.
Here,
regressor.coef_
array([ 8.29736108e+00, 1.35646415e+03, -1.36476151e+03, 8.24637324e-01,
-1.12195852e-02, 2.80920611e-02])
coef_ holds the coefficients, which correspond to a, b, c, d, e, f.
regressor.intercept_
46989.22920268966
intercept_ is the constant term, which corresponds to g.
In other words, the fitted equation is
y = 8.29736108e+00 X1 + 1.35646415e+03 X2 - 1.36476151e+03 X3 + 8.24637324e-01 X4
- 1.12195852e-02 X5 + 2.80920611e-02 X6 + 46989.22920268966
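To read the equation more easily, the coefficients can be paired with column names. In the sketch below the order of feature_names is an assumption: it relies on the encoded State columns being placed first and on OneHotEncoder sorting the categories alphabetically. The names feature_names and named_coef are illustrative.

```python
# Coefficients copied from the regressor.coef_ output above
coef = [8.29736108e+00, 1.35646415e+03, -1.36476151e+03,
        8.24637324e-01, -1.12195852e-02, 2.80920611e-02]

# Assumed column order of X after one-hot encoding
feature_names = ['California', 'Florida', 'New York',
                 'R&D Spend', 'Administration', 'Marketing Spend']

# Pair each column name with its coefficient
named_coef = dict(zip(feature_names, coef))
for name, value in named_coef.items():
    print(f'{name:>16}: {value:12.4f}')
```

Read this way, each additional dollar of R&D Spend adds roughly 0.82 to the predicted profit.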
6. Before accepting real input, evaluate on the test set and compute the error
y_pred = regressor.predict(X_test)
# MSE (mean squared error)
((y_test - y_pred) ** 2).mean()
df_test = y_test.to_frame()
df_test.reset_index(drop=True, inplace=True)
df_test['y_pred'] = y_pred
# Check with a chart
df_test.plot(kind='bar')
plt.savefig('test.jpg')
plt.show()
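The manual MSE calculation above can be cross-checked against scikit-learn's mean_squared_error. This sketch uses hypothetical values (actual and predicted are made-up arrays, not results from this model).

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Hypothetical actual and predicted values
actual = np.array([100.0, 200.0, 300.0])
predicted = np.array([110.0, 190.0, 330.0])

# Manual MSE: mean of the squared errors
mse_manual = ((actual - predicted) ** 2).mean()
# Same metric from scikit-learn
mse_sklearn = mean_squared_error(actual, predicted)
# RMSE is in the same unit as the target, which is often easier to read
rmse = np.sqrt(mse_manual)

print(mse_manual, mse_sklearn, rmse)
```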
7. Save regressor and ct so they can be deployed to a production server that receives data
import joblib
joblib.dump( regressor , 'regressor.pkl' )
joblib.dump(ct, 'ct.pkl' )
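On the server side, the saved objects would be restored with joblib.load. The sketch below shows the round trip with a toy model and a temporary file path (model, path, and restored are illustrative names), so it does not touch regressor.pkl or ct.pkl.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Toy model fitted on y = 2x + 1
model = LinearRegression().fit(np.array([[0.0], [1.0], [2.0]]),
                               np.array([1.0, 3.0, 5.0]))

# Save to a temporary location, then load it back
path = os.path.join(tempfile.mkdtemp(), 'toy_regressor.pkl')
joblib.dump(model, path)
restored = joblib.load(path)

# The restored model predicts the same values as the original
print(restored.predict(np.array([[3.0]])))
```

Both the model and the ColumnTransformer have to be saved, because new input must be encoded the same way as the training data before it is passed to predict.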
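8. Take X as input and predict y
After saving the objects above to files and deploying them to the production server,
we receive new data like the following and use it to predict the company's profit.
Administration is $150,000, Marketing Spend is $400,000, R&D Spend is $130,000, and the company is located in Florida. (The code below also includes a second company: R&D Spend $150,000, Administration $110,000, Marketing Spend $600,000, located in New York.)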
data = {'R&D Spend' : [130000,150000], 'Administration':[150000,110000],'Marketing Spend':[400000,600000], 'State':['Florida','New York']}
new_data = pd.DataFrame(data) # must be converted to a DataFrame!
new_data
| R&D Spend | Administration | Marketing Spend | State |
0 | 130000 | 150000 | 400000 | Florida |
1 | 150000 | 110000 | 600000 | New York |
new_data = ct.transform( new_data )
regressor.predict( new_data )
array([165102.43212675, 184941.14856592])
From this, the predicted profit for
| R&D Spend | Administration | Marketing Spend | State |
0 | 130000 | 150000 | 400000 | Florida |
is 165102.43212675, and the predicted profit for
| R&D Spend | Administration | Marketing Spend | State |
1 | 150000 | 110000 | 600000 | New York |
is 184941.14856592.
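As a sanity check, the same predictions can be reproduced by hand from the equation in step 5. This sketch assumes the column order after encoding is California, Florida, New York, R&D Spend, Administration, Marketing Spend; rows and manual_pred are illustrative names.

```python
import numpy as np

# Coefficients and intercept copied from the regressor.coef_ / regressor.intercept_ output
coef = np.array([8.29736108e+00, 1.35646415e+03, -1.36476151e+03,
                 8.24637324e-01, -1.12195852e-02, 2.80920611e-02])
intercept = 46989.22920268966

# Assumed column order: California, Florida, New York,
# R&D Spend, Administration, Marketing Spend
rows = np.array([
    [0, 1, 0, 130000, 150000, 400000],   # Florida company
    [0, 0, 1, 150000, 110000, 600000],   # New York company
])

# y = coef . x + intercept for each row
manual_pred = rows @ coef + intercept
print(manual_pred)
```

The results agree with the output of regressor.predict up to the rounding of the printed coefficients.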