ML (MachineLearning)

Hierarchical Clustering

567Rabbit 2024. 4. 16. 12:28

Unsupervised learning in machine learning

 

1. Partition-based Clustering

- Groups data points with similar characteristics together

It forms a predefined number of clusters for the given data and works by assigning each data point to one of those clusters

ex ) K-Means Clustering
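
To make the "predefined number of clusters" point concrete, here is a minimal sketch using scikit-learn's KMeans. The toy points are hypothetical (two synthetic blobs), not the dataset used later in this post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated blobs of 2-D points.
rng = np.random.default_rng(0)
blob_a = rng.normal(loc=[0.0, 0.0], scale=0.3, size=(20, 2))
blob_b = rng.normal(loc=[5.0, 5.0], scale=0.3, size=(20, 2))
points = np.vstack([blob_a, blob_b])

# The number of clusters has to be chosen up front (here: 2).
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
kmeans_labels = kmeans.fit_predict(points)

# Every point is assigned to exactly one of the two clusters.
print(kmeans_labels)
```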

 

 

2. Hierarchical Clustering

- An algorithm that groups data sequentially or hierarchically

- Builds a hierarchy from the distance or similarity between data points and forms clusters from it

Because the result is hierarchical, the clustering can be examined at several levels and is easy to visualize

- Convenient because the number of clusters does not have to be specified in advance

- Computation cost can grow for large datasets, and the clustering at a particular level can be hard to interpret

ex ) Agglomerative Clustering , Divisive Clustering
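
As a rough illustration of examining one hierarchy at several levels, the sketch below builds a single merge tree with SciPy and then cuts it into 2 and 3 clusters. The six one-dimensional points are hypothetical; linkage and fcluster are SciPy functions, and the choice of 'ward' here is only for the example.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three tight pairs on a line.
toy = np.array([[0.0], [0.2], [5.0], [5.2], [20.0], [20.2]])

# Build the hierarchy once.
merge_tree = linkage(toy, method='ward')

# Cut the same tree at two different levels.
labels_k2 = fcluster(merge_tree, t=2, criterion='maxclust')
labels_k3 = fcluster(merge_tree, t=3, criterion='maxclust')

print(labels_k2)
print(labels_k3)
```

The tree is computed only once; choosing a different number of clusters just means cutting it at a different height.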

 

 

 

 

 

Hierarchical Clustering

 

- Builds clusters by gradually widening the scope and grouping similar objects together

- Forms a hierarchical tree structure based on the distance or similarity between data points

- Merges the two clusters with the shortest distance between them into a larger cluster
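
The merge loop described above can be sketched in plain Python. This is a deliberately naive illustration, not how SciPy or scikit-learn implement it: the function name agglomerate is hypothetical, and it uses single linkage (distance between the closest members) rather than the Ward linkage used later in this post.

```python
def agglomerate(points, target_clusters):
    """Naive bottom-up clustering with single linkage (illustrative only)."""
    # Start with every point in its own cluster (stored as index lists).
    clusters = [[i] for i in range(len(points))]

    def point_distance(a, b):
        # Euclidean distance between points a and b.
        return sum((x - y) ** 2 for x, y in zip(points[a], points[b])) ** 0.5

    def cluster_distance(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(point_distance(a, b) for a in c1 for b in c2)

    while len(clusters) > target_clusters:
        # Find the pair of clusters with the shortest distance.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda pair: cluster_distance(clusters[pair[0]], clusters[pair[1]]),
        )
        # Merge cluster j into cluster i (j > i, so deleting j is safe).
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters


sample_points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
result = agglomerate(sample_points, target_clusters=3)
print(result)  # [[0, 1], [2, 3], [4]]
```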

 

 

 

df

     CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)
0             1    Male   19                  15                      39
1             2    Male   21                  15                      81
2             3  Female   20                  16                       6
3             4  Female   23                  16                      77
4             5  Female   31                  17                      40
...         ...     ...  ...                 ...                     ...
195         196  Female   35                 120                      79
196         197  Female   45                 126                      28
197         198    Male   32                 126                      74
198         199    Male   32                 137                      18
199         200    Male   30                 137                      83

 

 

 

X = df.iloc[:, 1:].copy()  # drop CustomerID; copy so edits to X do not touch df

X
      Genre  Age  Annual Income (k$)  Spending Score (1-100)
0      Male   19                  15                      39
1      Male   21                  15                      81
2    Female   20                  16                       6
3    Female   23                  16                      77
4    Female   31                  17                      40
...     ...  ...                 ...                     ...
195  Female   35                 120                      79
196  Female   45                 126                      28
197    Male   32                 126                      74
198    Male   32                 137                      18
199    Male   30                 137                      83

 

 

 

 

 

Converting strings to numbers

 

X['Genre'].unique()

array(['Male', 'Female'], dtype=object)

 

There are only two values, 'Male' and 'Female', so mapping them to 0 and 1 is enough; label encoding will do, rather than one-hot encoding (used for 3 or more categories)

 

from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()

X['Genre'] = encoder.fit_transform(X['Genre'])

 

 

X

     Genre  Age  Annual Income (k$)  Spending Score (1-100)
0        1   19                  15                      39
1        1   21                  15                      81
2        0   20                  16                       6
3        0   23                  16                      77
4        0   31                  17                      40
...    ...  ...                 ...                     ...
195      0   35                 120                      79
196      0   45                 126                      28
197      1   32                 126                      74
198      1   32                 137                      18
199      1   30                 137                      83

 

 

 

encoder.classes_

array(['Female', 'Male'], dtype=object)

 

LabelEncoder assigns codes in the sorted order of the original labels, so 'Female' is 0 and 'Male' is 1.
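
For comparison, here is a small sketch of label encoding next to one-hot encoding. The sample values and the three-valued city column are hypothetical and not part of this dataset; pd.get_dummies is used here as one common way to one-hot encode.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample values, not the dataset from this post.
genre = pd.Series(['Male', 'Female', 'Female', 'Male'])
city = pd.Series(['Seoul', 'Busan', 'Daegu', 'Seoul'])

# Label encoding: a single column of integer codes.
genre_encoder = LabelEncoder()
genre_codes = genre_encoder.fit_transform(genre)

# classes_ is sorted; the label at position i is encoded as i.
genre_mapping = {label: code for code, label in enumerate(genre_encoder.classes_)}

# One-hot encoding: one indicator column per category.
city_onehot = pd.get_dummies(city)

print(genre_mapping)
print(city_onehot)
```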

 

 

 

 

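In this example, hierarchical clustering is run without a feature scaling step. Keep in mind that the method is distance-based, so when scaling is skipped, features with larger ranges (such as annual income here) carry more weight in the result.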
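
If you want to see what scaling would do to the input, a minimal sketch with scikit-learn's StandardScaler follows. This step is not part of the original walkthrough; the four rows are copied from the encoded table above purely for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# A few rows from the encoded table:
# [Genre (0/1), Age, Annual Income (k$), Spending Score (1-100)]
rows = np.array([
    [1, 19, 15, 39],
    [0, 20, 16, 6],
    [0, 45, 126, 28],
    [1, 30, 137, 83],
], dtype=float)

# Rescale each column to zero mean and unit variance.
scaled_rows = StandardScaler().fit_transform(rows)

column_means = scaled_rows.mean(axis=0)
column_stds = scaled_rows.std(axis=0)
print(scaled_rows.round(2))
```

After scaling, no single column dominates the Euclidean distance just because it is measured in larger units.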

 

 

 

 

 

 

Drawing the Dendrogram

 

Compute the Ward linkage using Euclidean distance and visualize it with a dendrogram
A dendrogram is a diagram that depicts the hierarchical structure

 

import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch

# linkage: how should clusters be joined? The 'ward' method is the most common choice
sch.dendrogram(sch.linkage(X, method='ward'))

plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Distance')
plt.show()

 

The ward method is one of the linkage criteria (ways of measuring the distance between clusters) used in hierarchical clustering, and the code above is the Python that runs hierarchical clustering with it. At each step, Ward merges the pair of clusters whose union increases the total within-cluster variance the least. It is relatively stable and tends to produce clusters of comparable size, which is why it is widely used. Once the distances for the given dataset have been computed, they are used to carry out the hierarchical clustering.
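
The values behind a dendrogram can also be inspected directly. The sketch below, on hypothetical toy points, looks at the merge distances returned by sch.linkage and applies one common heuristic for picking the number of clusters: cut just before the largest jump in merge distance. The heuristic is an assumption added for illustration, not something this post prescribes.

```python
import numpy as np
import scipy.cluster.hierarchy as sch

# Hypothetical toy data: three pairs of nearby points.
toy = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
    [20.0, 0.0], [20.0, 1.0],
])

# Each row of the result: [cluster a, cluster b, merge distance, new cluster size]
merges = sch.linkage(toy, method='ward')

# Merge distances are the heights drawn in the dendrogram.
heights = merges[:, 2]
gaps = np.diff(heights)

# Stopping just before the largest jump leaves this many clusters.
suggested_k = len(toy) - (int(np.argmax(gaps)) + 1)
print(heights.round(2), suggested_k)
```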

 

from sklearn.cluster import AgglomerativeClustering

# AgglomerativeClustering performs agglomerative (bottom-up) clustering


hc = AgglomerativeClustering(n_clusters=5)

 

y_pred = hc.fit_predict(X)
df['Group'] = y_pred
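
In scikit-learn, AgglomerativeClustering uses linkage='ward' by default, so the model above applies the same criterion as the dendrogram. The sketch below spells that parameter out on hypothetical toy points and counts how many points land in each cluster.

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering

# Hypothetical toy data: three pairs of nearby points.
toy = np.array([
    [0.0, 0.0], [0.0, 1.0],
    [10.0, 0.0], [10.0, 1.0],
    [20.0, 0.0], [20.0, 1.0],
])

# Same call shape as above, with the default linkage written explicitly.
model = AgglomerativeClustering(n_clusters=3, linkage='ward')
toy_labels = model.fit_predict(toy)

# Number of points assigned to each cluster label.
cluster_sizes = np.bincount(toy_labels)
print(toy_labels, cluster_sizes)
```

The label numbers themselves are arbitrary; only which points share a label matters.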

 

 

df

     CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  Group
0             1    Male   19                  15                      39      4
1             2    Male   21                  15                      81      3
2             3  Female   20                  16                       6      4
3             4  Female   23                  16                      77      3
4             5  Female   31                  17                      40      4
...         ...     ...  ...                 ...                     ...    ...
195         196  Female   35                 120                      79      2
196         197  Female   45                 126                      28      1
197         198    Male   32                 126                      74      2
198         199    Male   32                 137                      18      1
199         200    Male   30                 137                      83      2

 

 

 

 

df.loc[df['Group'] == 4, ]

 

    CustomerID   Genre  Age  Annual Income (k$)  Spending Score (1-100)  Group
0            1    Male   19                  15                      39      4
2            3  Female   20                  16                       6      4
4            5  Female   31                  17                      40      4
6            7  Female   35                  18                       6      4
8            9    Male   64                  19                       3      4
10          11    Male   67                  19                      14      4
12          13  Female   58                  20                      15      4
14          15    Male   37                  20                      13      4
16          17  Female   35                  21                      35      4
18          19    Male   52                  23                      29      4
20          21    Male   35                  24                      35      4
22          23  Female   46                  25                       5      4
24          25  Female   54                  28                      14      4
26          27  Female   45                  28                      32      4
28          29  Female   40                  29                      31      4
30          31    Male   60                  30                       4      4
32          33    Male   53                  33                       4      4
34          35  Female   49                  33                      14      4
36          37  Female   42                  34                      17      4
38          39  Female   36                  37                      26      4
40          41  Female   65                  38                      35      4
42          43    Male   48                  39                      36      4
44          45  Female   49                  39                      28      4