Resampling data when the classes are imbalanced

ML (Machine Learning)

567Rabbit 2024. 4. 15. 14:51

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sb

# A model that classifies whether a patient has diabetes

What does df['class'] mean?

(1: diabetic, 0: not diabetic)

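The DataFrame df contains no NaN values, but that is only because missing measurements were recorded as 0 instead of being left empty.

So, for the columns from Plas through mass, the 0 values are converted to NaN so that those rows can be dropped.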

df.loc[ : , 'Plas' : 'mass' ] = df.loc[ : , 'Plas' : 'mass' ].replace(0,np.nan)

df = df.dropna()
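The two lines above can be tried on a small made-up table to see what they do. This is only a sketch: the toy values are invented, and it assigns through a column list rather than df.loc, which is an assumption made here to avoid dtype warnings when integer columns receive NaN in newer pandas versions.

```python
import numpy as np
import pandas as pd

# Toy data (made up for illustration); a 0 in 'Plas' or 'mass' means "not measured".
toy = pd.DataFrame({
    "Plas": [148, 0, 183, 89],
    "mass": [33.6, 26.6, 0.0, 28.1],
    "age": [50, 31, 32, 21],
    "class": [1, 0, 1, 0],
})

# Columns from 'Plas' through 'mass' (label slices include both ends).
cols = toy.loc[:, "Plas":"mass"].columns

# Turn the placeholder zeros into NaN, count them, then drop rows with any NaN.
toy[cols] = toy[cols].replace(0, np.nan)
n_missing = int(toy[cols].isna().sum().sum())
cleaned = toy.dropna()

print(n_missing)     # number of zeros that became NaN
print(len(cleaned))  # rows left after dropna()
```

Note that the 0 values in the class column are left alone, because only the selected columns are touched.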

y = df['class']

X = df.loc[ : , 'Plas' : 'age' ]

y.value_counts()

class
0    262
1    130
Name: count, dtype: int64

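There are 262 people without diabetes and 130 people with diabetes, so the negative class is about twice as large as the positive class.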
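To put a number on the imbalance, the class shares and the majority-to-minority ratio can be computed from the counts. The sketch below rebuilds a label column with the same counts as the output above; y_demo is a hypothetical name used only for this illustration.

```python
import pandas as pd

# Rebuild a label column with the reported counts: 262 zeros and 130 ones.
y_demo = pd.Series([0] * 262 + [1] * 130, name="class")

counts = y_demo.value_counts()
shares = y_demo.value_counts(normalize=True)   # fraction of rows per class
imbalance_ratio = counts.max() / counts.min()  # majority count / minority count

print(shares.round(3))
print(round(imbalance_ratio, 2))
```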

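To even out these counts, the minority class can be oversampled with SMOTE. The imblearn module is distributed on PyPI as imbalanced-learn:

! pip install imbalanced-learn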

from imblearn.over_sampling import SMOTE

sm = SMOTE(random_state=5)

X, y = sm.fit_resample(X, y)

y.value_counts()

class
0    262
1    262
Name: count, dtype: int64
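After resampling, both classes have 262 rows, which means 132 diabetic rows were synthesized. SMOTE creates each new row by picking a minority sample, picking one of its nearest minority neighbors (5 by default in imblearn), and placing a point at a random position on the line between them. The function below is a simplified NumPy sketch of that idea, not imblearn's actual implementation; smote_sketch and the random input data are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_new, k=5, seed=5):
    """Create n_new synthetic rows by interpolating between minority-class neighbors."""
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)

    # Pairwise Euclidean distances among the minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor
    neighbors = np.argsort(dists, axis=1)[:, :k]

    base = rng.integers(0, n, size=n_new)                    # random minority rows
    picks = neighbors[base, rng.integers(0, k, size=n_new)]  # one random neighbor each
    gap = rng.random((n_new, 1))                             # position on the segment, in [0, 1)
    return X_min[base] + gap * (X_min[picks] - X_min[base])

# 130 made-up minority rows with 3 features, standing in for the diabetic class.
X_minority = np.random.default_rng(0).normal(size=(130, 3))
synthetic = smote_sketch(X_minority, n_new=262 - 130)
print(synthetic.shape)  # (132, 3)
```

Because every synthetic row lies between two existing minority rows, the new points stay inside the range already covered by the minority class instead of being exact duplicates.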
