Unsupervised Learning in Machine Learning
1. Flat / Partition-based Clustering
- Groups together data points that share similar characteristics
- Forms a predefined number of clusters for the given data and assigns each data point to one of them
e.g. K-Means Clustering
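As a rough illustration of partition-based clustering, the sketch below runs scikit-learn's KMeans on six made-up 2-D points, with the number of clusters fixed in advance at 2. The points and variable names are assumptions for illustration and are not data from this post.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two clearly separated blobs of three points each.
points = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                   [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])

# The number of clusters has to be chosen before fitting.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)

print(labels)                   # one cluster id per point
print(kmeans.cluster_centers_)  # one centroid per cluster
```

Which blob receives id 0 and which receives id 1 is arbitrary; only the grouping itself is meaningful.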
2. Hierarchical Clustering
- An algorithm that groups data sequentially or hierarchically
- Builds a hierarchy of clusters from the distance or similarity between data points
- Because the result is hierarchical, the clustering can be inspected at several levels and is easy to visualize
- Convenient because the number of clusters does not have to be specified in advance
- Computational cost can grow on large datasets, and the clustering at a particular level can be hard to interpret
e.g. Agglomerative Clustering, Divisive Clustering
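To make the "inspect at several levels" point concrete, here is a minimal sketch that builds one merge tree with SciPy and then cuts it into 2 and into 3 clusters without refitting. The six points are made up for illustration.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: three pairs of points, the third pair far from the others.
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [5.0, 0.0], [5.0, 1.0],
                   [20.0, 0.0], [20.0, 1.0]])

# Build the merge tree once (Ward linkage on Euclidean distances).
Z = linkage(points, method='ward')

# Cut the same tree at two different levels.
labels_2 = fcluster(Z, t=2, criterion='maxclust')
labels_3 = fcluster(Z, t=3, criterion='maxclust')

print(labels_2)  # the two nearby pairs share a label
print(labels_3)  # each pair gets its own label
```

The number of clusters is only chosen at the cutting step, which is why it does not need to be known when the tree is built.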
Hierarchical Clustering
- Builds clusters step by step by progressively adding similar objects together
- Forms a hierarchical tree structure based on the distance or similarity between data points
- Repeatedly merges the two clusters with the shortest distance between them into a larger cluster
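The merge loop described above can be sketched in plain Python. This version uses single linkage (the gap between two clusters is the distance between their closest members), chosen only because it is the simplest to write; it is not the Ward linkage used later in this post, and the function name and toy points are hypothetical.

```python
from math import dist

def agglomerate(points, n_clusters):
    """Naive single-linkage agglomerative clustering (illustrative only)."""
    # Start with every point in its own cluster.
    clusters = [[p] for p in points]
    while len(clusters) > n_clusters:
        best = None
        # Find the pair of clusters with the shortest gap between any two members.
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                gap = min(dist(a, b) for a in clusters[i] for b in clusters[j])
                if best is None or gap < best[0]:
                    best = (gap, i, j)
        _, i, j = best
        # Merge the closest pair into one larger cluster.
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

# Hypothetical toy data.
toy = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
result = agglomerate(toy, 3)
print(result)
```

The nested search over all cluster pairs is what makes the naive approach expensive on large datasets.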


df
| CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) |
0 | 1 | Male | 19 | 15 | 39 |
1 | 2 | Male | 21 | 15 | 81 |
2 | 3 | Female | 20 | 16 | 6 |
3 | 4 | Female | 23 | 16 | 77 |
4 | 5 | Female | 31 | 17 | 40 |
... | ... | ... | ... | ... | ... |
195 | 196 | Female | 35 | 120 | 79 |
196 | 197 | Female | 45 | 126 | 28 |
197 | 198 | Male | 32 | 126 | 74 |
198 | 199 | Male | 32 | 137 | 18 |
199 | 200 | Male | 30 | 137 | 83 |
X = df.iloc[:, 1:].copy()  # drop CustomerID; copy so that editing X does not touch df
| Genre | Age | Annual Income (k$) | Spending Score (1-100) |
0 | Male | 19 | 15 | 39 |
1 | Male | 21 | 15 | 81 |
2 | Female | 20 | 16 | 6 |
3 | Female | 23 | 16 | 77 |
4 | Female | 31 | 17 | 40 |
... | ... | ... | ... | ... |
195 | Female | 35 | 120 | 79 |
196 | Female | 45 | 126 | 28 |
197 | Male | 32 | 126 | 74 |
198 | Male | 32 | 137 | 18 |
199 | Male | 30 | 137 | 83 |
Converting strings to numbers
X['Genre'].unique()
array(['Male', 'Female'], dtype=object)
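'Genre' has only two values, 'Male' and 'Female', so mapping them to 0 and 1 is enough. Label encoding can therefore be used here instead of one-hot encoding, which is meant for columns with three or more categories.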
from sklearn.preprocessing import LabelEncoder
encoder = LabelEncoder()
X['Genre'] = encoder.fit_transform(X['Genre'])
X
| Genre | Age | Annual Income (k$) | Spending Score (1-100) |
0 | 1 | 19 | 15 | 39 |
1 | 1 | 21 | 15 | 81 |
2 | 0 | 20 | 16 | 6 |
3 | 0 | 23 | 16 | 77 |
4 | 0 | 31 | 17 | 40 |
... | ... | ... | ... | ... |
195 | 0 | 35 | 120 | 79 |
196 | 0 | 45 | 126 | 28 |
197 | 1 | 32 | 126 | 74 |
198 | 1 | 32 | 137 | 18 |
199 | 1 | 30 | 137 | 83 |
sorted(df['Genre'].unique())
['Female', 'Male']
LabelEncoder assigns codes in sorted label order, so 'Female' is 0 and 'Male' is 1.
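The same mapping can also be read from the fitted encoder itself through its `classes_` attribute. A small self-contained sketch, using a short stand-in list in place of the real 'Genre' column:

```python
from sklearn.preprocessing import LabelEncoder

# Stand-in values for the 'Genre' column (assumption for illustration).
genres = ['Male', 'Male', 'Female', 'Female', 'Female']

encoder = LabelEncoder()
codes = encoder.fit_transform(genres)

# classes_ holds the original labels in sorted order; the position is the integer code.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
print(codes.tolist())
```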
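This example does not apply a feature scaling step before hierarchical clustering. Note that the clustering is based on Euclidean distance, so columns with a wider range of values (here, annual income) carry more weight in the result.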
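To see how much the column ranges matter for Euclidean distance, the sketch below compares raw and min-max scaled distances for three hypothetical customers. The value ranges in `lows` and `highs` are assumptions picked for illustration, not statistics computed from the dataset.

```python
from math import dist

# Assumed value ranges (low, high) for Age, Annual Income (k$), Spending Score.
lows = (18, 15, 1)
highs = (70, 137, 100)

def min_max(point):
    """Rescale each feature to the 0-1 interval using the assumed ranges."""
    return tuple((v - lo) / (hi - lo) for v, lo, hi in zip(point, lows, highs))

a = (20, 15, 39)
b = (60, 15, 39)   # differs from a only in age
c = (20, 137, 39)  # differs from a only in income

raw_ab, raw_ac = dist(a, b), dist(a, c)
scaled_ab = dist(min_max(a), min_max(b))
scaled_ac = dist(min_max(a), min_max(c))

print(raw_ab, raw_ac)        # the income gap dominates on the raw scale
print(scaled_ab, scaled_ac)  # after scaling the two gaps are comparable
```

Whether to scale is a modeling choice; the post's own results further down are computed on the unscaled columns.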
Drawing the Dendrogram
Compute the Ward linkage using Euclidean distance and visualize it with a dendrogram
A dendrogram is a diagram that depicts the hierarchical structure of the clusters
import matplotlib.pyplot as plt
import scipy.cluster.hierarchy as sch
# linkage: decides how clusters are joined; 'ward' is one of the most commonly used methods
sch.dendrogram(sch.linkage(X, method='ward'))
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Distance')
plt.show()
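One common way to read a cluster count off a dendrogram is to look for the largest jump between consecutive merge distances and cut the tree there. The following is a minimal sketch of that heuristic on made-up points; it is a rule of thumb under the assumption of well-separated groups, not a guaranteed way to pick the number of clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: three tight pairs placed far apart from each other.
points = np.array([[0.0, 0.0], [0.0, 1.0],
                   [10.0, 0.0], [10.0, 1.0],
                   [0.0, 10.0], [0.0, 11.0]])

Z = linkage(points, method='ward')

heights = Z[:, 2]          # merge distances, in increasing order
gaps = np.diff(heights)    # jump between consecutive merges
k = int(np.argmax(gaps))   # the largest jump follows merge number k

# Stopping just before the big jump leaves this many clusters.
n_clusters = len(points) - (k + 1)
print(n_clusters)
```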
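Ward's method is one of the linkage criteria used in hierarchical clustering: at each step it merges the pair of clusters whose union gives the smallest increase in within-cluster variance. The call sch.linkage(X, method='ward') computes this linkage from the Euclidean distances between the rows of X, and the resulting linkage matrix is what the dendrogram draws. The method is relatively stable and widely used, and it tends to produce compact clusters of comparable size.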
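The quantity that Ward's criterion compares can be written out directly as the increase in the within-cluster sum of squared errors (SSE) caused by a merge. The helper names and points below are hypothetical, and the sketch only evaluates the merge cost; it does not reproduce SciPy's internal implementation.

```python
def centroid(cluster):
    """Mean position of the points in a cluster."""
    n = len(cluster)
    return tuple(sum(p[d] for p in cluster) / n for d in range(len(cluster[0])))

def sse(cluster):
    """Sum of squared distances from each point to the cluster centroid."""
    c = centroid(cluster)
    return sum(sum((x - m) ** 2 for x, m in zip(p, c)) for p in cluster)

def ward_cost(a, b):
    """Increase in total within-cluster SSE if clusters a and b are merged."""
    return sse(a + b) - sse(a) - sse(b)

left = [(0.0, 0.0), (0.0, 2.0)]
near = [(1.0, 1.0)]
far = [(10.0, 1.0)]

cost_near = ward_cost(left, near)
cost_far = ward_cost(left, far)
print(cost_near, cost_far)  # the nearer cluster is the cheaper merge
```

At each step Ward linkage picks the pair with the smallest such cost, which is why it favors compact clusters.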

from sklearn.cluster import AgglomerativeClustering
# AgglomerativeClustering performs agglomerative (bottom-up) clustering; its default linkage is 'ward'
hc = AgglomerativeClustering(n_clusters=5)
y_pred = hc.fit_predict(X)
df['Group'] = y_pred
df
| CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) | Group |
0 | 1 | Male | 19 | 15 | 39 | 4 |
1 | 2 | Male | 21 | 15 | 81 | 3 |
2 | 3 | Female | 20 | 16 | 6 | 4 |
3 | 4 | Female | 23 | 16 | 77 | 3 |
4 | 5 | Female | 31 | 17 | 40 | 4 |
... | ... | ... | ... | ... | ... | ... |
195 | 196 | Female | 35 | 120 | 79 | 2 |
196 | 197 | Female | 45 | 126 | 28 | 1 |
197 | 198 | Male | 32 | 126 | 74 | 2 |
198 | 199 | Male | 32 | 137 | 18 | 1 |
199 | 200 | Male | 30 | 137 | 83 | 2 |
df.loc[df['Group'] == 4]
| CustomerID | Genre | Age | Annual Income (k$) | Spending Score (1-100) | Group |
0 | 1 | Male | 19 | 15 | 39 | 4 |
2 | 3 | Female | 20 | 16 | 6 | 4 |
4 | 5 | Female | 31 | 17 | 40 | 4 |
6 | 7 | Female | 35 | 18 | 6 | 4 |
8 | 9 | Male | 64 | 19 | 3 | 4 |
10 | 11 | Male | 67 | 19 | 14 | 4 |
12 | 13 | Female | 58 | 20 | 15 | 4 |
14 | 15 | Male | 37 | 20 | 13 | 4 |
16 | 17 | Female | 35 | 21 | 35 | 4 |
18 | 19 | Male | 52 | 23 | 29 | 4 |
20 | 21 | Male | 35 | 24 | 35 | 4 |
22 | 23 | Female | 46 | 25 | 5 | 4 |
24 | 25 | Female | 54 | 28 | 14 | 4 |
26 | 27 | Female | 45 | 28 | 32 | 4 |
28 | 29 | Female | 40 | 29 | 31 | 4 |
30 | 31 | Male | 60 | 30 | 4 | 4 |
32 | 33 | Male | 53 | 33 | 4 | 4 |
34 | 35 | Female | 49 | 33 | 14 | 4 |
36 | 37 | Female | 42 | 34 | 17 | 4 |
38 | 39 | Female | 36 | 37 | 26 | 4 |
40 | 41 | Female | 65 | 38 | 35 | 4 |
42 | 43 | Male | 48 | 39 | 36 | 4 |
44 | 45 | Female | 49 | 39 | 28 | 4 |
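Beyond filtering one group at a time, a per-group summary can make the clusters easier to compare. The sketch below uses only the first four rows of the labeled table shown earlier, rebuilt by hand so the block is self-contained; the variable names are illustrative.

```python
import pandas as pd

# First four rows of the labeled table, re-entered by hand (numeric columns only).
toy = pd.DataFrame({
    'Age': [19, 21, 20, 23],
    'Annual Income (k$)': [15, 15, 16, 16],
    'Spending Score (1-100)': [39, 81, 6, 77],
    'Group': [4, 3, 4, 3],
})

summary = toy.groupby('Group').mean()  # average profile of each group
sizes = toy['Group'].value_counts()    # number of customers per group

print(summary)
print(sizes)
```

On the full DataFrame the same two calls would apply to `df` after selecting the numeric columns.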