Lab Session 1: ZeroR, OneR, Naive Bayes Classifier

진진리 2023. 10. 11. 11:25

*Run on Colab

import numpy as np

import pandas as pd

import sklearn

print(sklearn.__version__) #1.2.2

# Load the data

url = "https://raw.githubusercontent.com/inikoreaackr/ml_datasets/main/playgolf.csv"

df = pd.read_csv(url)

# Check the first five instances

df.head()

OUTLOOK	TEMPERATURE	HUMIDITY	WINDY	PLAY GOLF
Rainy	Hot	High	False	No
Rainy	Hot	High	True	No
Overcast	Hot	High	False	Yes
Sunny	Mild	High	False	Yes
Sunny	Cool	Normal	False	Yes

# Check the data types

df.dtypes

OUTLOOK object

TEMPERATURE object

HUMIDITY object

WINDY bool

PLAY GOLF object

dtype: object

# Convert every column (object and bool alike) to the category dtype

for col in df.columns:
    df[col] = df[col].astype('category')

# Confirm that the conversion worked

df.dtypes

OUTLOOK category

TEMPERATURE category

HUMIDITY category

WINDY category

PLAY GOLF category

dtype: object

ZeroR

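ZeroR is the simplest classification method: it ignores every feature and relies on the label alone.

The ZeroR classifier simply predicts the majority class for every instance.

ZeroR has no predictive power of its own, but it serves as a baseline against which the performance of other classification methods can be measured.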

# Print the PLAY GOLF column (the label)

df['PLAY GOLF']

0 No

1 No

2 Yes

3 Yes

4 Yes

5 No

6 Yes

7 No

8 Yes

9 Yes

10 Yes

11 Yes

12 Yes

13 No

Name: PLAY GOLF, dtype: category

Categories (2, object): ['No', 'Yes']

# PLAY GOLF is a binary variable. Count the instances in each category.

df['PLAY GOLF'].value_counts(sort=True)

Yes 9

No 5

Name: PLAY GOLF, dtype: int64

# Compute the accuracy of a ZeroR model that always predicts "Play Golf = Yes" on this dataset.

9 / (9+5)

0.6428571428571429

On the dataset above, a ZeroR model that always predicts "Play Golf = Yes" has an accuracy of 0.64.
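The same baseline can be sketched as a small function in plain Python. This is only an illustration: the helper name `zero_r_fit` is hypothetical, and the label list is copied from the PLAY GOLF column printed above.

```python
from collections import Counter

# PLAY GOLF labels in row order, copied from the output printed above.
labels = ["No", "No", "Yes", "Yes", "Yes", "No", "Yes",
          "No", "Yes", "Yes", "Yes", "Yes", "Yes", "No"]


def zero_r_fit(y):
    """Return the majority class; ZeroR never looks at the features."""
    return Counter(y).most_common(1)[0][0]


majority = zero_r_fit(labels)

# ZeroR predicts the same class for every instance.
predictions = [majority] * len(labels)
accuracy = sum(p == t for p, t in zip(predictions, labels)) / len(labels)
print(majority, accuracy)
```

scikit-learn ships a comparable baseline as `DummyClassifier(strategy="most_frequent")`, which may be more convenient inside an existing scikit-learn workflow.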

OneR

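OneR, short for One Rule, is a simple yet fairly accurate classification algorithm.

OneR generates one rule set for each feature of the data. Among the generated rule sets, the one with the smallest error over the whole dataset is chosen as the One Rule.

The rule set for each feature can be built from a frequency table.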
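As a rough illustration of such a frequency table, the sketch below uses `pd.crosstab` on the OUTLOOK feature. The (OUTLOOK, PLAY GOLF) pairs are rebuilt from the per-value class counts that appear in this post's OneR output rather than loaded from the CSV, so the row order is arbitrary; the variable names are assumptions made for the example.

```python
import pandas as pd

# Class counts per OUTLOOK value, taken from the counts printed in this post.
counts = {
    ("Rainy", "No"): 3, ("Rainy", "Yes"): 2,
    ("Overcast", "Yes"): 4,
    ("Sunny", "Yes"): 3, ("Sunny", "No"): 2,
}

# Expand the counts back into one row per instance (order does not matter here).
rows = [pair for pair, n in counts.items() for _ in range(n)]
pairs = pd.DataFrame(rows, columns=["OUTLOOK", "PLAY GOLF"])

# Frequency table: one row per OUTLOOK value, one column per class.
freq = pd.crosstab(pairs["OUTLOOK"], pairs["PLAY GOLF"])
print(freq)

# The majority class of each row is the rule for that value;
# every other instance in the row counts as an error.
rules = freq.idxmax(axis=1)
errors = int((freq.sum(axis=1) - freq.max(axis=1)).sum())
print(rules.to_dict(), errors)
```

Reading the table row by row gives one rule per feature value, and summing the non-majority counts gives the feature's total error.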

OneR Algorithm

For each feature,
    For each value of that feature, make a rule as follows:
        Count how often each target class occurs among the instances with that value.
        Find the most frequent class.
        Make a rule that predicts that class whenever the feature takes that value.
    Compute the total error of the feature's rules. (Equivalently, compute the accuracy.)
Choose the feature with the smallest total error.

On this dataset, the outlook and humidity features tie for the smallest error count, 4. In this exercise, however, we use only outlook, the first of the two.

# Implement the pseudocode above.

from collections import Counter

total_errors = []

for col in df.columns[:-1]:  # every column except the label
    error = 0
    for val in df[col].unique():
        # Class counts among the instances where this feature equals val
        counts = Counter(df[df[col] == val]['PLAY GOLF']).most_common()
        print(counts)
        # Instances outside the majority class are misclassified by the rule
        length = len(df[df[col] == val])
        error += length - counts[0][1]
    print(f"\nerror of {col}: [{error}]\n")
    total_errors.append(error)

[('No', 3), ('Yes', 2)]

[('Yes', 4)]

[('Yes', 3), ('No', 2)]

error of OUTLOOK: [4]

[('No', 2), ('Yes', 2)]

[('Yes', 4), ('No', 2)]

[('Yes', 3), ('No', 1)]

error of TEMPERATURE: [5]

[('No', 4), ('Yes', 3)]

[('Yes', 6), ('No', 1)]

error of HUMIDITY: [4]

[('Yes', 6), ('No', 2)]

[('No', 3), ('Yes', 3)]

error of WINDY: [5]

# Pick the feature with the smallest error (np.argmin returns the first one on a tie).

best_feature = df.columns[np.argmin(total_errors)]

print(best_feature)

OUTLOOK

# Build the rule set for the best feature.

oneRules = []

for val in df[best_feature].unique():
    # The majority class among instances with this value becomes the rule
    majority = Counter(df[df[best_feature] == val]['PLAY GOLF']).most_common()[0][0]
    print(f"{best_feature} : {val}", "->", majority)
    oneRules.append((best_feature, val, majority))

OUTLOOK : Rainy -> No

OUTLOOK : Overcast -> Yes

OUTLOOK : Sunny -> Yes
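Once the rule set exists, prediction is a single lookup. The sketch below is a hypothetical illustration (the names `one_rules` and `predict_one_r` are not from the original notebook): it applies the three OUTLOOK rules to the five rows shown by `df.head()` and derives the training accuracy from the error count of 4 reported for OUTLOOK.

```python
# Rules printed above for the best feature (OUTLOOK).
one_rules = {"Rainy": "No", "Overcast": "Yes", "Sunny": "Yes"}


def predict_one_r(outlook):
    """Return the class that the OUTLOOK rule set assigns to one instance."""
    return one_rules[outlook]


# (OUTLOOK, PLAY GOLF) for the five rows shown by df.head().
head_rows = [("Rainy", "No"), ("Rainy", "No"), ("Overcast", "Yes"),
             ("Sunny", "Yes"), ("Sunny", "Yes")]
predictions = [predict_one_r(outlook) for outlook, _ in head_rows]
print(predictions)

# Training accuracy from the error count: 4 errors out of 14 instances.
n_instances, n_errors = 14, 4
accuracy = (n_instances - n_errors) / n_instances
print(round(accuracy, 4))
```

With 10 of 14 instances classified correctly, this single rule already improves on the ZeroR baseline of 9 of 14.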

Naive Bayes Classifier with scikit-learn

scikit-learn documentation for Naive Bayes classifiers: https://scikit-learn.org/stable/modules/naive_bayes.html


df.head()

OUTLOOK	TEMPERATURE	HUMIDITY	WINDY	PLAY GOLF
Rainy	Hot	High	False	No
Rainy	Hot	High	True	No
Overcast	Hot	High	False	Yes
Sunny	Mild	High	False	Yes
Sunny	Cool	Normal	False	Yes

df.describe()

	OUTLOOK	TEMPERATURE	HUMIDITY	WINDY	PLAY GOLF
count	14	14	14	14	14
unique	3	3	2	2	2
top	Rainy	Mild	High	False	Yes
freq	5	6	7	8	9

# Encode the categorical data as integers

df_enc = pd.DataFrame()

df_enc['OUTLOOK'] = df['OUTLOOK'].cat.codes

df_enc['TEMPERATURE'] = df['TEMPERATURE'].cat.codes

df_enc['HUMIDITY'] = df['HUMIDITY'].cat.codes

df_enc['WINDY'] = df['WINDY'].cat.codes

df_enc['PLAY GOLF'] = df['PLAY GOLF'].cat.codes

df_enc.head()

OUTLOOK	TEMPERATURE	HUMIDITY	WINDY	PLAY GOLF
1	1	0	0	0
1	1	0	1	0
0	1	0	0	1
2	2	0	0	1
2	0	1	0	1
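To read the encoded table, it helps to know which integer stands for which category. The following minimal sketch assumes only the three OUTLOOK values seen in this post; pandas sorts string categories alphabetically, and `.cat.codes` is the position of each value in that sorted list.

```python
import pandas as pd

# The three OUTLOOK values, in the order they first appear in the data.
outlook = pd.Series(["Rainy", "Overcast", "Sunny"]).astype("category")

# Categories are sorted alphabetically; each code is an index into this list.
mapping = dict(enumerate(outlook.cat.categories))
print(mapping)

# Codes for the three values above, in their original order.
codes = outlook.cat.codes.tolist()
print(codes)
```

The same lookup on the other columns explains the remaining codes in the table, for example why a Hot temperature is encoded as 1 and a Mild one as 2.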

# Print the data types of the encoded data.

df_enc.dtypes

OUTLOOK int8

TEMPERATURE int8

HUMIDITY int8

WINDY int8

PLAY GOLF int8

dtype: object

# Separate the features fed to the classifier from their labels.

features = df_enc.drop(columns=['PLAY GOLF'])

label = df_enc['PLAY GOLF']

from sklearn.naive_bayes import CategoricalNB

model = CategoricalNB()

model.fit(features, label)

score = model.score(features, label)

score

0.9285714285714286

# Print p(x_i | y_j): one array per feature, with one row per class and one column per feature value

from pprint import pprint

feature_log_probs = model.feature_log_prob_

for feature_log_prob in feature_log_probs:
    pprint(np.exp(feature_log_prob))

array([[0.125     , 0.5       , 0.375     ],
       [0.41666667, 0.25      , 0.33333333]])

array([[0.25      , 0.375     , 0.375     ],
       [0.33333333, 0.25      , 0.41666667]])

array([[0.71428571, 0.28571429],
       [0.36363636, 0.63636364]])

array([[0.42857143, 0.57142857],
       [0.63636364, 0.36363636]])
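These values are not raw relative frequencies: no Overcast instance has the label No, yet the first entry is 0.125 rather than 0. The sketch below shows one way to reproduce the OUTLOOK array by hand, assuming the additive (Laplace) smoothing that `CategoricalNB` applies with its default `alpha=1.0`. The counts come from the OneR output earlier in this post, and the helper `smoothed` is hypothetical.

```python
# OUTLOOK counts per class, taken from the OneR output in this post.
# Column order follows the integer codes: Overcast=0, Rainy=1, Sunny=2.
counts = {
    "No": [0, 3, 2],
    "Yes": [4, 2, 3],
}
alpha = 1.0  # assumed to match CategoricalNB's default smoothing parameter


def smoothed(row, alpha):
    """Smoothed estimate (count + alpha) / (class total + alpha * n_categories)."""
    total = sum(row) + alpha * len(row)
    return [(n + alpha) / total for n in row]


table = {label: smoothed(row, alpha) for label, row in counts.items()}
for label, probs in table.items():
    print(label, [round(p, 8) for p in probs])
```

Smoothing keeps a category that never occurs with a class from forcing the whole product of probabilities to zero.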

# Print p(y_j)

np.exp(model.class_log_prior_)

array([0.35714286, 0.64285714])

# Predict the class of a few instances.

# ("Sunny", "Hot", "Normal", False) [2, 1, 1, 0]

# ("Rainy", "Mild", "High", False) [1, 2, 0, 0]

print(model.predict_proba([[2, 1, 1, 0]]), model.predict([[2, 1, 1, 0]]))

print(model.predict_proba([[1, 2, 0, 0]]), model.predict([[1, 2, 0, 0]]))

[[0.22086561 0.77913439]] [1]

[[0.5695011 0.4304989]] [0]
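The first prediction can be checked by hand with the Naive Bayes rule: multiply the class prior by the conditional probability of each feature value, then normalize over the classes. The sketch below does this for ("Sunny", "Hot", "Normal", False). The fractions are the exact values behind the decimals printed earlier (for example 3/8 = 0.375 and 7/11 = 0.63636364), and the variable names are chosen for this illustration only.

```python
# Index 0 is the class "No", index 1 is the class "Yes".
prior = [5 / 14, 9 / 14]

# p(feature value | class), read from the arrays printed earlier.
likelihood = {
    "OUTLOOK=Sunny": [3 / 8, 4 / 12],
    "TEMPERATURE=Hot": [3 / 8, 3 / 12],
    "HUMIDITY=Normal": [2 / 7, 7 / 11],
    "WINDY=False": [3 / 7, 7 / 11],
}

# Naive Bayes: joint score = prior * product of per-feature conditionals.
joint = list(prior)
for probs in likelihood.values():
    joint = [j * p for j, p in zip(joint, probs)]

# Normalize so the two class scores sum to 1.
posterior = [j / sum(joint) for j in joint]
print([round(p, 4) for p in posterior])
```

The normalized scores agree with the `predict_proba` output above to the printed precision, and the larger one corresponds to class 1 (Yes).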
