K-Means Algorithm

567Rabbit 2024. 4. 16. 11:06

Unsupervised learning in machine learning

1. Partition-based (flat) clustering

- Groups data points that share similar features.

- Forms a predefined number of clusters for the given data and assigns each data point to one of those clusters.

ex ) K-Means Clustering

 

2. Hierarchical clustering

- An algorithm that groups data sequentially or hierarchically.

- Builds a hierarchy from the distance or similarity between data points and forms clusters from it.

- Because the result is hierarchical, the clustering can be examined at several levels and is easy to visualize.

- Convenient because the number of clusters does not have to be specified in advance.

- On large datasets the computational cost can grow, and the clustering at a particular level can be hard to interpret.

ex ) Agglomerative clustering, divisive clustering
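The idea of looking at one hierarchy at several levels can be sketched with SciPy. This is only an illustration on made-up points, not part of the original walkthrough; the array `points` and the choice of Ward linkage are assumptions.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Toy data (hypothetical): three small blobs, two of them close to each other
points = np.array([
    [1.0, 1.0], [1.2, 0.9], [0.8, 1.1],
    [5.0, 5.0], [5.1, 4.9], [4.9, 5.2],
    [20.0, 20.0], [20.1, 19.9], [19.9, 20.2],
])

# Agglomerative (bottom-up) merging; Z records the full merge hierarchy
Z = linkage(points, method="ward")

# The same hierarchy can be cut at different levels after the fact
labels_3 = fcluster(Z, t=3, criterion="maxclust")
labels_2 = fcluster(Z, t=2, criterion="maxclust")

print(labels_3)
print(labels_2)
```

Cutting at two clusters merges the two nearby blobs, while cutting at three keeps them apart, which is the "multiple levels" property described above.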

 

 

 

 

K-Means

- A machine learning algorithm in which the AI finds patterns and correlations in an input set.

- One of the most widely used unsupervised learning methods.

- It finds y (the cluster label of each sample) on its own by repeatedly computing cluster means.

- The number of clusters has to be chosen by a person.
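To make "finding y by repeatedly computing means" concrete, here is a minimal NumPy sketch of the usual assign-then-average loop. It is a simplified illustration, not scikit-learn's implementation; the function name `kmeans_sketch` and the toy array `X_toy` are hypothetical.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Assign each point to its nearest centroid, then move centroids to the means."""
    rng = np.random.default_rng(seed)
    # Start from k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Squared distance from every point to every centroid: shape (n, k)
        dists = ((X[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = dists.argmin(axis=1)
        # New centroid = mean of its assigned points (kept as-is if the cluster is empty)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    return labels, centroids

# Toy data (made up): two obvious groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                  [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])
labels, centroids = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
```

Note that `k` is passed in by the caller: the algorithm discovers the labels, but not how many clusters there should be.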

 

 

import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

 

 

 

# Goal: form groups and find the one whose Spending Score (1-100) is high relative to Annual Income (k$)

 

df

     CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0             1    Male   19                  15                      39
1             2    Male   21                  15                      81
2             3  Female   20                  16                       6
3             4  Female   23                  16                      77
4             5  Female   31                  17                      40
...         ...     ...  ...                 ...                     ...
195         196  Female   35                 120                      79
196         197  Female   45                 126                      28
197         198    Male   32                 126                      74
198         199    Male   32                 137                      18
199         200    Male   30                 137                      83

 

 

 

 

Checking for NaN values

 

df.isna().sum()

CustomerID                0
Genre                     0
Age                       0
Annual Income (k$)        0
Spending Score (1-100)    0
dtype: int64
 
 
 

 

 

Choosing X

- Unsupervised learning does not train the model on y values, so only X is needed

# Drop CustomerID (column 0); .copy() so that later edits to X do not touch df
X = df.iloc[:, 1:].copy()
 
 
 
X
      Genre  Age  Annual Income (k$)  Spending Score (1-100)
0      Male   19                  15                      39
1      Male   21                  15                      81
2    Female   20                  16                       6
3    Female   23                  16                      77
4    Female   31                  17                      40
...     ...  ...                 ...                     ...
195  Female   35                 120                      79
196  Female   45                 126                      28
197    Male   32                 126                      74
198    Male   32                 137                      18
199    Male   30                 137                      83

 

 

 

 

 

Converting strings to numbers

 

X['Genre'].unique()

array(['Male', 'Female'], dtype=object)

 

There are only two values, 'Male' and 'Female', so they just need to be mapped to 0 and 1. Label encoding is therefore enough; one-hot encoding (used for 3 or more categories) is not needed.

 

 

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

X['Genre'] = encoder.fit_transform(X['Genre'])

 

 

X

     Genre  Age  Annual Income (k$)  Spending Score (1-100)
0        1   19                  15                      39
1        1   21                  15                      81
2        0   20                  16                       6
3        0   23                  16                      77
4        0   31                  17                      40
...    ...  ...                 ...                     ...
195      0   35                 120                      79
196      0   45                 126                      28
197      1   32                 126                      74
198      1   32                 137                      18
199      1   30                 137                      83

 

 

 

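# LabelEncoder assigns codes in sorted label order, so sort the original labels in df
# (X['Genre'] already holds 0/1 at this point)
sorted(df['Genre'].unique())

['Female', 'Male']

This shows that 'Female' is 0 and 'Male' is 1.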
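The mapping can also be read directly from the fitted encoder instead of being inferred from sort order. A small sketch, using a toy Series `genre` as a stand-in for the real column (the data here is made up):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy stand-in for df['Genre']
genre = pd.Series(['Male', 'Male', 'Female', 'Female'])

encoder = LabelEncoder()
codes = encoder.fit_transform(genre)

# classes_ holds the original labels; the position of a label is its encoded value
mapping = {label: code for code, label in enumerate(encoder.classes_)}
print(mapping)

# inverse_transform turns codes back into the original labels
print(encoder.inverse_transform([0, 1]))
```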

 

 

 

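This example runs K-means without a feature scaling step. Keep in mind that K-means is based on distances, so columns with larger numeric ranges weigh more heavily in the result.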
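The walkthrough here does not scale the features. If scaling were wanted, one possible sketch uses `StandardScaler`; this is an optional variation and not what the post does, and the array `X_toy` is made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy stand-in (made-up values): an age-like column and a much larger income-like column
X_toy = np.array([[19, 15000.0],
                  [21, 15500.0],
                  [45, 126000.0],
                  [30, 137000.0]])

scaler = StandardScaler()
X_scaled = scaler.fit_transform(X_toy)

# After scaling, each column has mean ~0 and standard deviation ~1,
# so no single column dominates the distance computation
print(X_scaled.mean(axis=0))
print(X_scaled.std(axis=0))
```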

 

 

 

 

WCSS

from sklearn.cluster import KMeans

 

KMeans(n_clusters = ???, random_state=10)  

 

n_clusters = ?  We do not know in advance how many groups to use, so the ??? has to be determined. The measure used to choose it is WCSS (within-cluster sum of squares).

 

 

WCSS is computed for each candidate number of clusters as follows

wcss = []
for i in range(1, 10 + 1):
    # Fit K-means with i clusters and record its WCSS (inertia_)
    kmeans = KMeans(n_clusters=i, random_state=10)
    kmeans.fit(X)
    wcss.append(kmeans.inertia_)

i runs from 1 to 10, and the resulting WCSS values are plotted to see which number of clusters works well.

random_state plays the same role as a random seed: use any number you like, or the number your company has agreed on.

(Fixing it makes the clustering result reproducible from run to run.)

 

 

inertia ?
# kmeans.inertia_ is the sum of squared distances from each sample to its nearest cluster center
# This is the WCSS value: the smaller it is, the more tightly the points sit around their centers
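To see what `inertia_` measures, the within-cluster sum of squares can be recomputed by hand and compared with the fitted model. A small check on made-up data (`X_toy` is hypothetical, and `n_init=10` is set explicitly only to keep the behaviour the same across scikit-learn versions):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy data (made up): two obvious groups
X_toy = np.array([[1.0, 1.0], [1.5, 2.0], [1.0, 1.5],
                  [8.0, 8.0], [9.0, 8.5], [8.5, 9.0]])

kmeans = KMeans(n_clusters=2, n_init=10, random_state=10).fit(X_toy)

# Look up, for every sample, the center of the cluster it was assigned to
own_centers = kmeans.cluster_centers_[kmeans.labels_]

# WCSS: squared distance of each sample to its own center, summed over all samples
manual_wcss = ((X_toy - own_centers) ** 2).sum()

print(manual_wcss, kmeans.inertia_)
```

The two printed numbers should agree up to floating-point rounding.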

 

 

# Put k = 1..10 on the x-axis so the elbow can be read off directly
plt.plot(range(1, 10 + 1), wcss)
plt.show()

 

 

# Plotting the inertia against k, as in the graph below, is called the elbow method.

 

 

Here we can see that the error is large when there are few clusters and small when there are many.

So a person has to pick an efficient number of clusters: the error should be low, but the number of clusters should not be too large.

Looking at the graph, 5 clusters seemed appropriate.
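Reading the elbow by eye is subjective, so it can help to print how much WCSS falls with each extra cluster. The sketch below uses made-up WCSS values, not the results of this post, and the 10% cut-off is an arbitrary illustrative heuristic rather than a standard rule.

```python
# Hypothetical WCSS values for k = 1..8 (made-up numbers)
wcss_toy = [1000.0, 600.0, 350.0, 200.0, 120.0, 105.0, 95.0, 88.0]

# Improvement gained by going from k to k+1 clusters
drops = [wcss_toy[i] - wcss_toy[i + 1] for i in range(len(wcss_toy) - 1)]

for k, drop in enumerate(drops, start=1):
    print(f"k={k} -> k={k + 1}: WCSS falls by {drop:.1f}")

# Crude heuristic (an assumption): stop at the first k where adding one more
# cluster improves WCSS by less than 10% of the very first improvement
threshold = 0.1 * drops[0]
elbow_k = next(k for k, drop in enumerate(drops, start=1) if drop < threshold)
print("suggested k:", elbow_k)
```

A number like this is only a starting point; the final choice still depends on whether the resulting groups are useful.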

 

 

 

 

Predicting

 

# Assign the new model; otherwise kmeans would still be the 10-cluster model from the loop
kmeans = KMeans(n_clusters=5, random_state=10)

y_pred = kmeans.fit_predict(X)

y_pred

 

 

df['Group'] = y_pred

 

df

     CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  Group
0             1    Male   19                  15                      39      3
1             2    Male   21                  15                      81      1
2             3  Female   20                  16                       6      3
3             4  Female   23                  16                      77      1
4             5  Female   31                  17                      40      3
...         ...     ...  ...                 ...                     ...    ...
195         196  Female   35                 120                      79      2
196         197  Female   45                 126                      28      0
197         198    Male   32                 126                      74      2
198         199    Male   32                 137                      18      0
199         200    Male   30                 137                      83      2

 

 

 

# Fetched group 4, whose spending score is high relative to annual income

     CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  Group
46           47  Female   50                  40                      55      4
47           48  Female   27                  40                      47      4
48           49  Female   29                  40                      42      4
49           50  Female   31                  40                      42      4
50           51  Female   49                  42                      52      4
...         ...     ...  ...                 ...                     ...    ...
120         121    Male   27                  67                      56      4
121         122  Female   38                  67                      40      4
122         123  Female   40                  69                      58      4
126         127    Male   43                  71                      35      4
142         143  Female   28                  76                      40      4
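The post does not show the code that pulled out group 4. One way it could be done is with boolean indexing on the Group column; the sketch below uses a small stand-in DataFrame `df_toy` built from a few rows of the tables above, so the variable name and the row selection are assumptions.

```python
import pandas as pd

# Toy stand-in for the clustered DataFrame (a few rows taken from the tables above)
df_toy = pd.DataFrame({
    'CustomerID': [1, 2, 47, 48, 196],
    'Annual Income (k$)': [15, 15, 40, 40, 120],
    'Spending Score (1-100)': [39, 81, 55, 47, 79],
    'Group': [3, 1, 4, 4, 2],
})

# Boolean indexing keeps only the rows assigned to cluster 4
group4 = df_toy.loc[df_toy['Group'] == 4]
print(group4)

# Per-cluster averages give a quick profile of every group at once
profile = df_toy.groupby('Group')[['Annual Income (k$)', 'Spending Score (1-100)']].mean()
print(profile)
```

Comparing the per-group averages is a simple way to check which cluster actually has the highest spending score for its income level.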