
Logistic Regression

567Rabbit 2024. 4. 15. 12:18

๋จธ์‹ ๋Ÿฌ๋‹์˜ ์ง€๋„ํ•™์Šต์— ์†ํ•˜๋Š”

Classfication(๋ถ„๋ฅ˜)

- Logistic Regression (๋กœ์ง€์Šคํ‹ฑ ํšŒ๊ท€)

- KNN(K nearest neighbor) ์•Œ๊ณ ๋ฆฌ์ฆ˜, 

- SVC(Support Vector Machine) ์•Œ๊ณ ๋ฆฌ์ฆ˜,

- DT(Decision Tree) ์•Œ๊ณ ๋ฆฌ์ฆ˜

 

Of these four methods, we pick and use whichever algorithm gives the higher accuracy.


What is logistic regression?

Its goal is to solve classification problems.
- Unlike linear regression, which predicts continuous outcomes, it predicts categorical outcomes.
- The simplest case is the binary one, with two possible outcomes; a typical example is predicting whether a tumor is malignant or benign.
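
As a rough sketch of the underlying idea (not spelled out in the post itself): logistic regression passes a linear combination of the features through the sigmoid function, which squashes it into a probability between 0 and 1, and the class is then chosen by thresholding that probability (usually at 0.5). The coefficients and input below are made-up numbers, purely for illustration.

import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1 / (1 + np.exp(-z))

b0, b1 = -3.0, 1.5        # hypothetical intercept and weight for a single feature
x = 2.5                   # hypothetical feature value
p = sigmoid(b0 + b1 * x)  # predicted probability of class 1
label = 1 if p >= 0.5 else 0
print(p, label)           # about 0.679 -> class 1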

 

 

Confusion Matrix

 

๋Œ€๋ถ€๋ถ„์˜ ๊ฒฝ์šฐ, C๊ฐ€ ๋” ์ค‘์š”ํ•˜๋‹ค

์˜คํƒ์ง€์˜ ๊ฒฝ์šฐ ๋งž์€๊ฒƒ์„ ๋ชป์ฐพ์€ ๊ฒƒ์ด์ง€๋งŒ

๋ฏธํƒ์ง€์˜ ๊ฒฝ์šฐ ํ‹€๋ฆฐ๊ฒƒ์„ ๋ชป์ฐพ์€ ๊ฒƒ์ด๋ฏ€๋กœ,

ํ‹€๋ฆฐ๊ฒƒ์„ ๋ชป์ฐพ์€ ๊ฒƒ์ด ๋Œ€๋ถ€๋ถ„ ๋ฌธ์ œ๋ฅผ ๋” ๋งŽ์ด ๋ฐœ์ƒ์‹œํ‚ค๊ณ  ์†์‹ค์„ ๋” ๋งŽ์ด ๊ฐ€์ ธ์˜ค๊ธฐ ๋•Œ๋ฌธ์ด๋‹ค

 

 

 

 

Example

# Using age and salary, classify whether a person will buy the product or not (1 = purchases, 0 = does not purchase)

 

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

import seaborn as sb
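
The line that actually loads the DataFrame is not shown in the post. The columns match the widely used Social_Network_Ads.csv dataset, so the loading step presumably looks something like this (the file name and path are an assumption):

# assumed loading step -- not shown in the original post; the file name is hypothetical
df = pd.read_csv('Social_Network_Ads.csv')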

 

 

df

       User ID  Gender  Age  EstimatedSalary  Purchased
0     15624510    Male   19            19000          0
1     15810944    Male   35            20000          0
2     15668575  Female   26            43000          0
3     15603246  Female   27            57000          0
4     15804002    Male   19            76000          0
..         ...     ...  ...              ...        ...
395   15691863  Female   46            41000          1
396   15706071    Male   51            23000          1
397   15654296  Female   50            20000          1
398   15755018    Male   36            33000          0
399   15594041  Female   49            36000          1

 

 

Checking for missing values

df.isna().sum()

User ID            0
Gender             0
Age                0
EstimatedSalary    0
Purchased          0
dtype: int64

 

 

 

 

Splitting into feature columns and a target column

 

The feature columns (X) contain the explanatory variables that describe each observation in the dataset.
The target column (y) is the column holding the value we want to predict.

 

y = df['Purchased']

X = df.iloc[ : , 2:3+1]   # columns 2 and 3: Age and EstimatedSalary

 

 

 

 

ํ”ผ์ณ์Šค์ผ€์ผ๋ง ํ•˜๊ธฐ

๋กœ์ง€์Šคํ‹ฑ ๋ฆฌ๊ทธ๋ ˆ์…˜์€ ํ”ผ์ณ์Šค์ผ€์ผ๋ง์„ ํ•ด์•ผํ•œ๋‹ค

 

from sklearn.preprocessing import StandardScaler, MinMaxScaler  (์ฒซ๋ฒˆ์งธ ๊ฒŒ์‹œ๊ธ€์— ์ž์„ธํžˆ ๋‚˜์™€์žˆ๋‹ค.)

 

X_scaler = StandardScaler()
X = X_scaler.fit_transform( X )   # standardize Age and EstimatedSalary to mean 0 and standard deviation 1
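
As an optional sanity check (not in the original post), after StandardScaler each column of X should have roughly zero mean and unit standard deviation:

print(X.mean(axis=0))   # approximately [0, 0]
print(X.std(axis=0))    # approximately [1, 1]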

 

y      # y consists only of 0s and 1s, so it does not need separate feature scaling

 

 

 

Splitting into train and test sets

from sklearn.model_selection import train_test_split

 

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
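
A quick check on the split sizes (added here for reference): test_size=0.25 holds out a quarter of the 400 rows, which is why the classification report further down shows a total support of 100.

print(X_train.shape, X_test.shape)   # (300, 2) (100, 2)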

 

 

 

 

๋ชจ๋ธ๋งํ•˜๊ธฐ

from sklearn.linear_model import LogisticRegression

 

classifier = LogisticRegression()

 

classifier.fit(X_train, y_train)

 

classifier.predict(X_test)        # predict on X_test

classifier.predict_proba(X_test)  # view the predictions as probabilities
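
predict_proba returns one row per test sample with the probabilities of class 0 and class 1 (in the order given by classifier.classes_), and predict simply picks the more probable class; for a binary problem that is the same as thresholding the class-1 probability at 0.5. A minimal illustration of that relationship:

proba = classifier.predict_proba(X_test)          # shape (n_samples, 2): P(class 0), P(class 1)
manual_pred = (proba[:, 1] >= 0.5).astype(int)    # threshold the class-1 probability ourselves

print(np.array_equal(manual_pred, classifier.predict(X_test)))   # True (barring exact 0.5 ties)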

 

y_pred = classifier.predict(X_test)

df_test = y_test.to_frame()

df_test['y_pred'] = y_pred

df_test

 

 Purchased  y_pred
         0       0
         0       0
         1       1
         1       1
         0       0
       ...     ...
         1       1
         1       1
         1       0
         0       0
         0       0

 

 

 

 

Confusion Matrix 

from sklearn.metrics import confusion_matrix

cm = confusion_matrix(y_test, y_pred)

 

 

cm

array([[52,  6],
       [11, 31]], dtype=int64)
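
scikit-learn lays the matrix out with rows as the actual class and columns as the predicted class, so the four cells can be unpacked like this (the variable names are just for readability):

tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)   # 52 6 11 31
# 52 true negatives, 6 false positives, 11 false negatives (misses), 31 true positives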

 

 

# ์ •ํ™•๋„ ๊ณ„์‚ฐ : accuracy
(52+31) / cm.sum()

0.83

 

 

from sklearn.metrics import accuracy_score

accuracy_score(y_test, y_pred)

0.83

 

 

 

Producing the classification report

- It is used to evaluate the performance of a classification model.

- It provides several metrics such as accuracy, precision, recall, and F1 score,

- which help in understanding and comparing the model's performance.

 

 

from sklearn.metrics import classification_report

print(classification_report(y_test, y_pred))

 

              precision    recall  f1-score   support

           0       0.83      0.90      0.86        58
           1       0.84      0.74      0.78        42

    accuracy                           0.83       100
   macro avg       0.83      0.82      0.82       100
weighted avg       0.83      0.83      0.83       100

 

 

  • precision: the proportion of predicted positives that are actually positive, i.e. out of everything the model predicted as positive, how much really is. Here the precision for classes 0 and 1 is 0.83 and 0.84, meaning the model predicts class 0 with 83% precision and class 1 with 84% precision.
  • recall: the proportion of actual positives that the model predicted as positive, i.e. how many of the real positives the model manages to identify. Here the recall for classes 0 and 1 is 0.90 and 0.74: for class 0 the model identifies most of the actual cases (90% recall), while for class 1 it misses a relatively larger share (74% recall).
  • f1-score: the harmonic mean of precision and recall. Because it accounts for both, it gives a more stable measure of performance when class sizes are imbalanced. Here the f1-scores for classes 0 and 1 are 0.86 and 0.78.
  • support: the number of samples in each class. Here classes 0 and 1 have 58 and 42 samples, respectively.
  • accuracy: the proportion of all samples the model classified correctly. Here the accuracy is 0.83.
  • macro avg: the simple (unweighted) average of each metric across the classes.
  • weighted avg: the average of each metric across the classes, weighted by each class's support.
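
As a cross-check (added here for illustration), the per-class numbers in the report follow directly from the confusion matrix above: precision = TP / (TP + FP), recall = TP / (TP + FN), and the f1-score is their harmonic mean.

tn, fp, fn, tp = cm.ravel()                  # 52, 6, 11, 31

# class 1 ("purchased") as the positive class
precision_1 = tp / (tp + fp)                 # 31 / 37 ≈ 0.84
recall_1    = tp / (tp + fn)                 # 31 / 42 ≈ 0.74
f1_1 = 2 * precision_1 * recall_1 / (precision_1 + recall_1)   # ≈ 0.78

# class 0 ("did not purchase") as the positive class
precision_0 = tn / (tn + fn)                 # 52 / 63 ≈ 0.83
recall_0    = tn / (tn + fp)                 # 52 / 58 ≈ 0.90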

 

 

 

 

confusion matrix์„ ํžˆํŠธ๋งต์œผ๋กœ ๋ณด๊ธฐ

 

sb.heatmap(data=cm, cmap='RdPu', annot=True)

plt.show()

 

# annot์€ ๋„ํ‘œ ์•ˆ์˜ ์ฃผ์„(๋ฐ์ดํ„ฐ๋ฅผ ์ˆซ์ž๋กœ ํ‘œํ˜„)์„ ๋‚˜ํƒ€๋‚ธ๋‹ค