Kaggle Data Analysis (Mushroom Edible)
Mushroom Edible Classification
Description
Dataset Source :
https://www.kaggle.com/datasets/uciml/mushroom-classification
Problem :
The purpose of this model is to use the categorical properties of mushrooms to classify whether they are edible or not.
- EDA :
CountPlot, Cramér's V coefficient
- Feature Selection :
Based on Cramér's V coefficient
T1 : Drop gill-attachment, stalk-color-above-ring
Used Features : 19
T2 : Use odor, spore-print-color
Used Features : 2
- Preprocessing :
LabelEncoder : All Features
OneHotEncoder : features with 3 or more unique values
- Modeling :
Train : Test = 8 : 2
All models : Stratified K-Fold (5 folds) with Optuna tuning
Models : SVM, DT, MLPClassifier
Hyperparameters from Optuna, Stratified K-Fold (5)
SVM : params {'C': 72.17831419000564, 'kernel': 'poly'}
DT : params {'criterion': 'gini', 'max_depth': 27}
MLPClassifier : params {'learning_rate': 'constant', 'alpha': 0.0001, 'activation': 'tanh'}
- Result :
All models :
Macro-F1 score : 1.0 , Accuracy : 1.0
Data Load & Simple Analysis
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import numpy as np
data = pd.read_csv('/content/drive/MyDrive/dataset/BIohealthDataset/mushrooms.csv')
data.head(10)
 | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | p | x | s | n | t | p | f | c | n | k | e | e | s | s | w | w | p | w | o | p | k | s | u |
1 | e | x | s | y | t | a | f | c | b | k | e | c | s | s | w | w | p | w | o | p | n | n | g |
2 | e | b | s | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | n | m |
3 | p | x | y | w | t | p | f | c | n | n | e | e | s | s | w | w | p | w | o | p | k | s | u |
4 | e | x | s | g | f | n | f | w | b | k | t | e | s | s | w | w | p | w | o | e | n | a | g |
5 | e | x | y | y | t | a | f | c | b | n | e | c | s | s | w | w | p | w | o | p | k | n | g |
6 | e | b | s | w | t | a | f | c | b | g | e | c | s | s | w | w | p | w | o | p | k | n | m |
7 | e | b | y | w | t | l | f | c | b | n | e | c | s | s | w | w | p | w | o | p | n | s | m |
8 | p | x | y | w | t | p | f | c | n | p | e | e | s | s | w | w | p | w | o | p | k | v | g |
9 | e | b | s | y | t | a | f | c | b | g | e | c | s | s | w | w | p | w | o | p | k | s | m |
There are no missing values.
data.isnull().sum()
class 0
cap-shape 0
cap-surface 0
cap-color 0
bruises 0
odor 0
gill-attachment 0
gill-spacing 0
gill-size 0
gill-color 0
stalk-shape 0
stalk-root 0
stalk-surface-above-ring 0
stalk-surface-below-ring 0
stalk-color-above-ring 0
stalk-color-below-ring 0
veil-type 0
veil-color 0
ring-number 0
ring-type 0
spore-print-color 0
population 0
habitat 0
dtype: int64
data.describe()
 | class | cap-shape | cap-surface | cap-color | bruises | odor | gill-attachment | gill-spacing | gill-size | gill-color | stalk-shape | stalk-root | stalk-surface-above-ring | stalk-surface-below-ring | stalk-color-above-ring | stalk-color-below-ring | veil-type | veil-color | ring-number | ring-type | spore-print-color | population | habitat |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 | 8124 |
unique | 2 | 6 | 4 | 10 | 2 | 9 | 2 | 2 | 2 | 12 | 2 | 5 | 4 | 4 | 9 | 9 | 1 | 4 | 3 | 5 | 9 | 6 | 7 |
top | e | x | y | n | f | n | f | c | b | b | t | b | s | s | w | w | p | w | o | p | w | v | d |
freq | 4208 | 3656 | 3244 | 2284 | 4748 | 3528 | 7914 | 6812 | 5612 | 1728 | 4608 | 3776 | 5176 | 4936 | 4464 | 4384 | 8124 | 7924 | 7488 | 3968 | 2388 | 4040 | 3148 |
Attribute information (class: edible=e, poisonous=p)
cap-shape: bell=b, conical=c, convex=x, flat=f, knobbed=k, sunken=s
cap-surface: fibrous=f, grooves=g, scaly=y, smooth=s
cap-color: brown=n, buff=b, cinnamon=c, gray=g, green=r, pink=p, purple=u, red=e, white=w, yellow=y
bruises: bruises=t, no=f
odor: almond=a, anise=l, creosote=c, fishy=y, foul=f, musty=m, none=n, pungent=p, spicy=s
gill-attachment: attached=a, descending=d, free=f, notched=n
gill-spacing: close=c, crowded=w, distant=d
gill-size: broad=b, narrow=n
gill-color: black=k, brown=n, buff=b, chocolate=h, gray=g, green=r, orange=o, pink=p, purple=u, red=e, white=w, yellow=y
stalk-shape: enlarging=e, tapering=t
stalk-root: bulbous=b, club=c, cup=u, equal=e, rhizomorphs=z, rooted=r, missing=?
stalk-surface-above-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-surface-below-ring: fibrous=f, scaly=y, silky=k, smooth=s
stalk-color-above-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
stalk-color-below-ring: brown=n, buff=b, cinnamon=c, gray=g, orange=o, pink=p, red=e, white=w, yellow=y
veil-type: partial=p, universal=u
veil-color: brown=n, orange=o, white=w, yellow=y
ring-number: none=n, one=o, two=t
ring-type: cobwebby=c, evanescent=e, flaring=f, large=l, none=n, pendant=p, sheathing=s, zone=z
spore-print-color: black=k, brown=n, buff=b, chocolate=h, green=r, orange=o, purple=u, white=w, yellow=y
population: abundant=a, clustered=c, numerous=n, scattered=s, several=v, solitary=y
habitat: grasses=g, leaves=l, meadows=m, paths=p, urban=u, waste=w, woods=d
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8124 entries, 0 to 8123
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 class 8124 non-null object
1 cap-shape 8124 non-null object
2 cap-surface 8124 non-null object
3 cap-color 8124 non-null object
4 bruises 8124 non-null object
5 odor 8124 non-null object
6 gill-attachment 8124 non-null object
7 gill-spacing 8124 non-null object
8 gill-size 8124 non-null object
9 gill-color 8124 non-null object
10 stalk-shape 8124 non-null object
11 stalk-root 8124 non-null object
12 stalk-surface-above-ring 8124 non-null object
13 stalk-surface-below-ring 8124 non-null object
14 stalk-color-above-ring 8124 non-null object
15 stalk-color-below-ring 8124 non-null object
16 veil-type 8124 non-null object
17 veil-color 8124 non-null object
18 ring-number 8124 non-null object
19 ring-type 8124 non-null object
20 spore-print-color 8124 non-null object
21 population 8124 non-null object
22 habitat 8124 non-null object
dtypes: object(23)
memory usage: 1.4+ MB
Unique values per column
for i in data.columns:
    print(i, " : ", data[i].nunique())
class : 2
cap-shape : 6
cap-surface : 4
cap-color : 10
bruises : 2
odor : 9
gill-attachment : 2
gill-spacing : 2
gill-size : 2
gill-color : 12
stalk-shape : 2
stalk-root : 5
stalk-surface-above-ring : 4
stalk-surface-below-ring : 4
stalk-color-above-ring : 9
stalk-color-below-ring : 9
veil-type : 1
veil-color : 4
ring-number : 3
ring-type : 5
spore-print-color : 9
population : 6
habitat : 7
EDA
data['class'].value_counts()
e 4208
p 3916
Name: class, dtype: int64
def count_plot(data, col):
    sns.countplot(data=data, x=col, hue='class')
    plt.title(col)
    plt.legend()
    plt.show()

for col in data.columns:
    count_plot(data, col)
# The veil-type feature has only one unique value, so drop it.
data = data.drop('veil-type', axis=1)
Cramér's V Correlation Analysis
from sklearn.feature_selection import chi2, RFECV
from sklearn.model_selection import train_test_split, cross_validate, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.tree import DecisionTreeClassifier, plot_tree
import scipy.stats as ss
from sklearn.model_selection import train_test_split
train_df, test_df=train_test_split(data,test_size=.2,random_state=5)
train_df['class'].value_counts()
e 3381
p 3118
Name: class, dtype: int64
test_df['class'].value_counts()
e 827
p 798
Name: class, dtype: int64
train_df.columns
Index(['class', 'cap-shape', 'cap-surface', 'cap-color', 'bruises', 'odor',
'gill-attachment', 'gill-spacing', 'gill-size', 'gill-color',
'stalk-shape', 'stalk-root', 'stalk-surface-above-ring',
'stalk-surface-below-ring', 'stalk-color-above-ring',
'stalk-color-below-ring', 'veil-color', 'ring-number', 'ring-type',
'spore-print-color', 'population', 'habitat'],
dtype='object')
def cramers_v(confusion_matrix):
    # Bias-corrected Cramér's V (Bergsma correction)
    chi2 = ss.chi2_contingency(confusion_matrix)[0]
    n = confusion_matrix.sum()
    phi2 = chi2 / n
    r, k = confusion_matrix.shape
    phi2corr = max(0, phi2 - ((k-1)*(r-1))/(n-1))
    rcorr = r - ((r-1)**2)/(n-1)
    kcorr = k - ((k-1)**2)/(n-1)
    return np.sqrt(phi2corr / min((kcorr-1), (rcorr-1)))
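For reference, this helper computes the bias-corrected Cramér's V. With $\chi^2$ the contingency-table statistic, $n$ the total count, and $r \times k$ the table shape:

$$
\tilde{V}=\sqrt{\frac{\tilde{\varphi}^2}{\min(\tilde{k}-1,\,\tilde{r}-1)}},\qquad
\tilde{\varphi}^2=\max\!\left(0,\;\frac{\chi^2}{n}-\frac{(k-1)(r-1)}{n-1}\right),\qquad
\tilde{k}=k-\frac{(k-1)^2}{n-1},\quad
\tilde{r}=r-\frac{(r-1)^2}{n-1}.
$$

Values near 1 indicate a strong association between the two categorical variables; values near 0 indicate independence.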
rows = []
for var1 in data:
    col = []
    for var2 in data:
        confusion_matrix = pd.crosstab(data[var1], data[var2])
        # Cramér's V test: keep the rounded value for each feature pair
        col.append(round(cramers_v(confusion_matrix.values), 2))
    rows.append(col)
cramers_results = np.array(rows)
df_corr = pd.DataFrame(cramers_results, columns = data.columns, index =data.columns)
plt.figure(figsize=(30,15))
sns.heatmap(df_corr[df_corr > 0.6], annot=True, cmap=plt.cm.CMRmap_r)
plt.show()
TRY1 : Drop gill-attachment and stalk-color-above-ring, then proceed.
train_col=list(train_df.columns)
train_col.remove('gill-attachment')
train_col.remove('stalk-color-above-ring')
TRY2 : Use only odor and spore-print-color, the features most strongly associated with the target.
train_col = ['odor','spore-print-color','class']
train_col
['class',
'cap-shape',
'cap-surface',
'cap-color',
'bruises',
'odor',
'gill-spacing',
'gill-size',
'gill-color',
'stalk-shape',
'stalk-root',
'stalk-surface-above-ring',
'stalk-surface-below-ring',
'stalk-color-below-ring',
'veil-color',
'ring-number',
'ring-type',
'spore-print-color',
'population',
'habitat']
Data Preprocessing
Since every column is categorical, features with 3 or more unique values are one-hot encoded, while columns with exactly 2 unique values get label encoding only.
train_df=train_df[train_col]
test_df = test_df[train_col]
one_hot_cols=list(train_df.columns[train_df.nunique()>2])
binary_cols=list(train_df.columns[train_df.nunique()==2])
binary_cols
['class']
one_hot_cols
['odor', 'spore-print-color']
One-hot encoding is applied only to features with 3 or more unique values; features with 2 unique values get label encoding only.
First, run LabelEncoder over every column.
train_df.nunique()
class 2
stalk-shape 2
dtype: int64
from sklearn.preprocessing import LabelEncoder

enc = LabelEncoder()
for col in train_df.columns:
    enc.fit(train_df[col])
    train_df[col] = enc.transform(train_df[col])
    # Beware of data leakage: fit on train only, transform test with the same encoder.
    test_df[col] = enc.transform(test_df[col])
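One caveat with this loop: LabelEncoder is intended for target labels, and transform raises a ValueError if the test split contains a category never seen during fit. A more defensive sketch for the feature columns, assuming scikit-learn >= 0.24 (where OrdinalEncoder gained handle_unknown):

from sklearn.preprocessing import OrdinalEncoder

# Encode all feature columns at once; categories unseen at fit time map to -1 instead of raising
feature_cols = [c for c in train_df.columns if c != 'class']
ord_enc = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
train_df[feature_cols] = ord_enc.fit_transform(train_df[feature_cols])
test_df[feature_cols] = ord_enc.transform(test_df[feature_cols])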
train_df
 | class | stalk-shape |
---|---|---|
5174 | 1 | 0 |
7927 | 0 | 0 |
2641 | 0 | 1 |
5669 | 0 | 0 |
2043 | 0 | 1 |
... | ... | ... |
3046 | 0 | 1 |
1725 | 1 | 0 |
4079 | 1 | 0 |
2254 | 0 | 1 |
2915 | 0 | 1 |
6499 rows × 2 columns
test_df
 | class | gill-color |
---|---|---|
5350 | 1 | 3 |
7926 | 1 | 0 |
4804 | 1 | 2 |
569 | 0 | 10 |
1844 | 0 | 7 |
... | ... | ... |
529 | 0 | 10 |
6348 | 1 | 0 |
4659 | 1 | 2 |
5700 | 1 | 7 |
4578 | 1 | 3 |
1625 rows × 2 columns
Reset the index
train_df = train_df.reset_index(drop=True)
test_df = test_df.reset_index(drop=True)
One-Hot Encoding
from sklearn.preprocessing import OneHotEncoder

ohe = OneHotEncoder(sparse=False)
# Use fit_transform on train only; on test, only transform with the fitted encoder.
for col in one_hot_cols:
    ohe.fit(train_df[[col]])
    train_cat = ohe.transform(train_df[[col]])
    train_df = pd.concat([train_df.drop(columns=[col]),
                          pd.DataFrame(train_cat, columns=[col + "_" + str(c) for c in ohe.categories_[0]])], axis=1)
    test_cat = ohe.transform(test_df[[col]])
    test_df = pd.concat([test_df.drop(columns=[col]),
                         pd.DataFrame(test_cat, columns=[col + "_" + str(c) for c in ohe.categories_[0]])], axis=1)
/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
warnings.warn(
/usr/local/lib/python3.9/dist-packages/sklearn/preprocessing/_encoders.py:868: FutureWarning: `sparse` was renamed to `sparse_output` in version 1.2 and will be removed in 1.4. `sparse_output` is ignored unless you leave `sparse` to its default value.
warnings.warn(
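The FutureWarning above comes from the sparse → sparse_output rename in scikit-learn 1.2. A version-proof construction, sketched under the assumption of scikit-learn >= 1.2, that also tolerates categories unseen at fit time:

# sparse_output=False returns a dense array (scikit-learn >= 1.2);
# handle_unknown='ignore' encodes unseen test categories as all-zero rows.
ohe = OneHotEncoder(sparse_output=False, handle_unknown='ignore')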
test_df
 | class | stalk-shape |
---|---|---|
0 | 1 | 1 |
1 | 1 | 1 |
2 | 1 | 0 |
3 | 0 | 0 |
4 | 0 | 1 |
... | ... | ... |
1620 | 0 | 0 |
1621 | 1 | 1 |
1622 | 1 | 0 |
1623 | 1 | 0 |
1624 | 1 | 0 |
1625 rows × 2 columns
y=train_df['class']
train=train_df.drop('class',axis=1)
y_test = test_df['class']
test=test_df.drop('class',axis=1)
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score, auc, roc_curve, ConfusionMatrixDisplay
Modeling
SVC
DT
NN
Define a helper function up front for displaying confusion matrices.
def show_confusion(pred, y, model_name):
    confusionmatrix = confusion_matrix(y, pred)
    cm_display = ConfusionMatrixDisplay(confusionmatrix)
    cm_display.plot()
    plt.title(model_name)
    plt.show()
SVC
Use Optuna to find the best hyperparameters on the validation folds, then apply them to the test set.
pip install optuna
Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting optuna
  Downloading optuna-3.1.1-py3-none-any.whl (365 kB)
Requirement already satisfied: numpy in /usr/local/lib/python3.9/dist-packages (from optuna) (1.22.4)
Collecting cmaes>=0.9.1
  Downloading cmaes-0.9.1-py3-none-any.whl (21 kB)
Requirement already satisfied: packaging>=20.0 in /usr/local/lib/python3.9/dist-packages (from optuna) (23.0)
Requirement already satisfied: tqdm in /usr/local/lib/python3.9/dist-packages (from optuna) (4.65.0)
Requirement already satisfied: sqlalchemy>=1.3.0 in /usr/local/lib/python3.9/dist-packages (from optuna) (2.0.9)
Collecting colorlog
  Downloading colorlog-6.7.0-py2.py3-none-any.whl (11 kB)
Requirement already satisfied: PyYAML in /usr/local/lib/python3.9/dist-packages (from optuna) (6.0)
Collecting alembic>=1.5.0
  Downloading alembic-1.10.3-py3-none-any.whl (212 kB)
Collecting Mako
  Downloading Mako-1.2.4-py3-none-any.whl (78 kB)
Requirement already satisfied: typing-extensions>=4 in /usr/local/lib/python3.9/dist-packages (from alembic>=1.5.0->optuna) (4.5.0)
Requirement already satisfied: greenlet!=0.4.17 in /usr/local/lib/python3.9/dist-packages (from sqlalchemy>=1.3.0->optuna) (2.0.2)
Requirement already satisfied: MarkupSafe>=0.9.2 in /usr/local/lib/python3.9/dist-packages (from Mako->alembic>=1.5.0->optuna) (2.1.2)
Installing collected packages: Mako, colorlog, cmaes, alembic, optuna
Successfully installed Mako-1.2.4 alembic-1.10.3 cmaes-0.9.1 colorlog-6.7.0 optuna-3.1.1
import optuna
from optuna import Trial
from optuna.samplers import TPESampler
from sklearn.model_selection import StratifiedKFold
# Search for the best hyperparameters using the train/validation folds.
sampler = TPESampler(seed=10)

# Define the Optuna objective
def objective(trial):
    cbrm_param = {
        "C": trial.suggest_float('C', 0.01, 100.0),
        "kernel": trial.suggest_categorical("kernel", ["linear", "poly", "rbf"]),
        "probability": True
    }
    # Generate model
    model_cbrm = SVC(**cbrm_param)
    # Run stratified k-fold cross-validation
    n_folds = 5
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True)
    m_f1 = 0.0
    for folder_counter, (train_index, valid_index) in enumerate(kf.split(train, y)):
        # Extract this fold's train/validation split from the training data
        print('\t fold set: ', folder_counter + 1, ' start')
        X_tr = train.loc[train_index]
        y_tr = y.loc[train_index]
        X_te = train.loc[valid_index]
        y_te = y.loc[valid_index]
        model_cbrm = model_cbrm.fit(X_tr, y_tr)
        pred = model_cbrm.predict(X_te)
        print("macro-f1 score : ", f1_score(y_te, pred, average='macro'))
        m_f1 += f1_score(y_te, pred, average='macro')
    # Return the mean macro-F1 across folds (swap in another metric if desired)
    m_f1 = m_f1 / n_folds
    return m_f1
hold_model=SVC()
hold_model.fit(train,y)
test_pred=hold_model.predict(test)
accuracy_score(y_test,test_pred)
0.9938461538461538
optuna_cbrm = optuna.create_study(direction='maximize', sampler=sampler)
optuna_cbrm.optimize(objective, n_trials=3)
[I 2023-04-13 13:53:14,435] A new study created in memory with name: no-name-df3fec2e-ba0a-43b9-9edd-10bfd90e1988
[I 2023-04-13 13:53:15,423] Trial 0 finished with value: 0.9941399574362515 and parameters: {'C': 0.40478715012816596, 'kernel': 'poly'}. Best is trial 0 with value: 0.9941399574362515.
[I 2023-04-13 13:53:16,066] Trial 1 finished with value: 0.9941404214645356 and parameters: {'C': 72.17831419000564, 'kernel': 'poly'}. Best is trial 1 with value: 0.9941404214645356.
[I 2023-04-13 13:53:16,892] Trial 2 finished with value: 0.9941403386299712 and parameters: {'C': 54.259011357446006, 'kernel': 'rbf'}. Best is trial 1 with value: 0.9941404214645356.
(per-fold macro-F1 printouts omitted)
cbrm_trial = optuna_cbrm.best_trial
cbrm_trial_params = cbrm_trial.params
print('Best Trial: score {},\nparams {}'.format(cbrm_trial.value, cbrm_trial_params))
Best Trial: score 0.9941404214645356,
params {'C': 72.17831419000564, 'kernel': 'poly'}
model = SVC(**cbrm_trial_params,probability=True)
model.fit(train, y)
SVC(C=72.17831419000564, kernel='poly', probability=True)
y_test_pred = model.predict(test)
y_test_proba = model.predict_proba(test)
accuracy_score(y_test,y_test_pred)
1.0
f1_score(y_test,y_test_pred,average="macro")
1.0
SVC Confusion Matrix
show_confusion(y_test_pred,y_test,"SVC")
SVC Roc Curve
plt.figure(figsize=(10,10))
fprs, tprs, thresholds = roc_curve(y_test, y_test_proba[:,1])
# Draw the ROC curve
plt.plot(fprs, tprs, label='SVM Default Model', color='red')
plt.plot([0,1],[0,1], 'k--', label='Random')
# Set the X-axis (FPR) ticks in 0.1 steps and label the axes
start, end = plt.xlim()
plt.xticks(np.round(np.arange(start, end, 0.1), 2))
#plt.xlim(0, 1); plt.ylim(0, 1)
plt.xlabel('FPR(1 - Specificity)'); plt.ylabel('TPR(Recall)')
plt.title('ROC Curve')
plt.legend()
<matplotlib.legend.Legend at 0x7f3f341900a0>
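Since auc is already imported from sklearn.metrics, the area under this curve can also be reported numerically; given the perfect test-set scores it should print 1.0 here:

# Area under the ROC curve, from the fpr/tpr arrays computed above
print('SVC test AUC : ', auc(fprs, tprs))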
DecisionTree
Hyperparameter Tuning
# Search for the best hyperparameters using the train/validation folds.
sampler = TPESampler(seed=10)

# Define the Optuna objective
def objective(trial):
    cbrm_param = {
        "criterion": trial.suggest_categorical("criterion", ["gini", "entropy", "log_loss"]),
        "max_depth": trial.suggest_int("max_depth", 4, 30)
    }
    # Generate model and run stratified k-fold cross-validation
    n_folds = 5
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True)
    model_cbrm = DecisionTreeClassifier(**cbrm_param)
    m_f1 = 0.0
    for folder_counter, (train_index, valid_index) in enumerate(kf.split(train, y)):
        # Extract this fold's train/validation split from the training data
        print('\t fold set: ', folder_counter + 1, ' start')
        X_tr = train.loc[train_index]
        y_tr = y.loc[train_index]
        X_te = train.loc[valid_index]
        y_te = y.loc[valid_index]
        model_cbrm = model_cbrm.fit(X_tr, y_tr)
        pred = model_cbrm.predict(X_te)
        print("macro-f1 score : ", f1_score(y_te, pred, average='macro'))
        m_f1 += f1_score(y_te, pred, average='macro')
    # Return the mean macro-F1 across folds
    m_f1 = m_f1 / n_folds
    return m_f1
optuna_cbrm = optuna.create_study(direction='maximize', sampler=sampler)
optuna_cbrm.optimize(objective, n_trials=20)
[I 2023-04-13 13:54:07,810] A new study created in memory with name: no-name-538ad95e-3e47-47e0-867c-25aa774359f3
[I 2023-04-13 13:54:08,066] Trial 0 finished with value: 0.9941404061134543 and parameters: {'criterion': 'gini', 'max_depth': 24}. Best is trial 0 with value: 0.9941404061134543.
[I 2023-04-13 13:54:08,370] Trial 1 finished with value: 0.9941404841551428 and parameters: {'criterion': 'gini', 'max_depth': 24}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:08,653] Trial 2 finished with value: 0.9941403756734786 and parameters: {'criterion': 'log_loss', 'max_depth': 29}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:08,851] Trial 3 finished with value: 0.9941396892109434 and parameters: {'criterion': 'log_loss', 'max_depth': 20}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:08,944] Trial 4 finished with value: 0.9941401366240677 and parameters: {'criterion': 'log_loss', 'max_depth': 23}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,026] Trial 5 finished with value: 0.9941402235724436 and parameters: {'criterion': 'gini', 'max_depth': 22}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,104] Trial 6 finished with value: 0.994139368965735 and parameters: {'criterion': 'log_loss', 'max_depth': 17}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,184] Trial 7 finished with value: 0.9941396861197674 and parameters: {'criterion': 'log_loss', 'max_depth': 18}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,268] Trial 8 finished with value: 0.9941404295300813 and parameters: {'criterion': 'gini', 'max_depth': 12}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,346] Trial 9 finished with value: 0.9941403027156713 and parameters: {'criterion': 'entropy', 'max_depth': 20}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,437] Trial 10 finished with value: 0.9941397625237048 and parameters: {'criterion': 'entropy', 'max_depth': 4}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,529] Trial 11 finished with value: 0.994139257416301 and parameters: {'criterion': 'gini', 'max_depth': 12}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,615] Trial 12 finished with value: 0.9941402258654672 and parameters: {'criterion': 'gini', 'max_depth': 11}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,698] Trial 13 finished with value: 0.99414016387632 and parameters: {'criterion': 'gini', 'max_depth': 12}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,784] Trial 14 finished with value: 0.9941401589374108 and parameters: {'criterion': 'gini', 'max_depth': 30}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,870] Trial 15 finished with value: 0.9941402078743946 and parameters: {'criterion': 'gini', 'max_depth': 6}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:09,961] Trial 16 finished with value: 0.9941404810181729 and parameters: {'criterion': 'gini', 'max_depth': 26}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:10,055] Trial 17 finished with value: 0.9941400534967688 and parameters: {'criterion': 'entropy', 'max_depth': 26}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:10,144] Trial 18 finished with value: 0.994140178983136 and parameters: {'criterion': 'gini', 'max_depth': 27}. Best is trial 1 with value: 0.9941404841551428.
[I 2023-04-13 13:54:10,233] Trial 19 finished with value: 0.9941404967196202 and parameters: {'criterion': 'gini', 'max_depth': 27}. Best is trial 19 with value: 0.9941404967196202.
(per-fold macro-F1 printouts omitted)
cbrm_trial = optuna_cbrm.best_trial
cbrm_trial_params = cbrm_trial.params
print('Best Trial: score {},\nparams {}'.format(cbrm_trial.value, cbrm_trial_params))
Best Trial: score 0.9941404967196202,
params {'criterion': 'gini', 'max_depth': 27}
model2=DecisionTreeClassifier(**cbrm_trial_params)
model2.fit(train, y)
DecisionTreeClassifier(max_depth=27)
y_test_pred = model2.predict(test)
y_test_proba = model2.predict_proba(test)
accuracy_score(y_test,y_test_pred)
1.0
f1_score(y_test,y_test_pred,average='macro')
1.0
DT Confusion Matrix
show_confusion(y_test_pred,y_test,"DT")
DT Roc Curve
plt.figure(figsize=(10,10))
fprs, tprs, thresholds = roc_curve(y_test, y_test_proba[:,1])
# Draw the ROC curve
plt.plot(fprs, tprs, label='DT', color='red')
plt.plot([0,1],[0,1], 'k--', label='Random')
# Set the X-axis (FPR) ticks in 0.1 steps and label the axes
start, end = plt.xlim()
plt.xticks(np.round(np.arange(start, end, 0.1), 2))
#plt.xlim(0, 1); plt.ylim(0, 1)
plt.xlabel('FPR(1 - Specificity)'); plt.ylabel('TPR(Recall)')
plt.title('ROC Curve')
plt.legend()
<matplotlib.legend.Legend at 0x7f3f340c3b50>
MLP
Hyperparameter Tuning
# Search for the best hyperparameters using the train/validation folds.
sampler = TPESampler(seed=10)

# Define the Optuna objective
def objective(trial):
    cbrm_param = {
        "learning_rate": trial.suggest_categorical("learning_rate", ['constant', 'invscaling', 'adaptive']),
        "alpha": trial.suggest_categorical("alpha", [.3, .1, .01, .001, .0001]),
        "activation": trial.suggest_categorical("activation", ['logistic', 'relu', 'tanh'])
    }
    # Run stratified k-fold cross-validation
    n_folds = 5
    kf = StratifiedKFold(n_splits=n_folds, shuffle=True)
    model_cbrm = MLPClassifier(**cbrm_param)
    m_f1 = 0.0
    for folder_counter, (train_index, valid_index) in enumerate(kf.split(train, y)):
        # Extract this fold's train/validation split from the training data
        print('\t fold set: ', folder_counter + 1, ' start')
        X_tr = train.loc[train_index]
        y_tr = y.loc[train_index]
        X_te = train.loc[valid_index]
        y_te = y.loc[valid_index]
        model_cbrm = model_cbrm.fit(X_tr, y_tr)
        pred = model_cbrm.predict(X_te)
        print("macro-f1 score : ", f1_score(y_te, pred, average='macro'))
        m_f1 += f1_score(y_te, pred, average='macro')
    # Return the mean macro-F1 across folds
    m_f1 = m_f1 / n_folds
    return m_f1
optuna_cbrm = optuna.create_study(direction='maximize', sampler=sampler)
optuna_cbrm.optimize(objective, n_trials=4)
[I 2023-04-13 13:55:12,348] A new study created in memory with name: no-name-15a4683e-0212-4c96-96bc-df64bbd267ef
[I 2023-04-13 13:55:26,491] Trial 0 finished with value: 0.9941406350093718 and parameters: {'learning_rate': 'constant', 'alpha': 0.0001, 'activation': 'tanh'}. Best is trial 0 with value: 0.9941406350093718.
[I 2023-04-13 13:55:44,619] Trial 1 finished with value: 0.9941405211609811 and parameters: {'learning_rate': 'constant', 'alpha': 0.0001, 'activation': 'logistic'}. Best is trial 0 with value: 0.9941406350093718.
[I 2023-04-13 13:55:59,304] Trial 2 finished with value: 0.9941399790677619 and parameters: {'learning_rate': 'invscaling', 'alpha': 0.001, 'activation': 'tanh'}. Best is trial 0 with value: 0.9941406350093718.
[I 2023-04-13 13:56:23,298] Trial 3 finished with value: 0.9941403849095412 and parameters: {'learning_rate': 'constant', 'alpha': 0.1, 'activation': 'tanh'}. Best is trial 0 with value: 0.9941406350093718.
(per-fold macro-F1 printouts omitted)
cbrm_trial = optuna_cbrm.best_trial
cbrm_trial_params = cbrm_trial.params
print('Best Trial: score {},\nparams {}'.format(cbrm_trial.value, cbrm_trial_params))
Best Trial: score 0.9941406350093718,
params {'learning_rate': 'constant', 'alpha': 0.0001, 'activation': 'tanh'}
mlp=MLPClassifier(**cbrm_trial_params)
mlp.fit(train, y)
MLPClassifier(activation='tanh')
y_test_pred = mlp.predict(test)
y_test_proba = mlp.predict_proba(test)
accuracy_score(y_test,y_test_pred)
1.0
MLP Confusion Matrix
show_confusion(y_test_pred,y_test,"MLP")
MLP Roc Curve
plt.figure(figsize=(10,10))
fprs, tprs, thresholds = roc_curve(y_test, y_test_proba[:,1])
# Draw the ROC curve
plt.plot(fprs, tprs, label='MLP Default Model', color='red')
plt.plot([0,1],[0,1], 'k--', label='Random')
# Set the X-axis (FPR) ticks in 0.1 steps and label the axes
start, end = plt.xlim()
plt.xticks(np.round(np.arange(start, end, 0.1), 2))
#plt.xlim(0, 1); plt.ylim(0, 1)
plt.xlabel('FPR(1 - Specificity)'); plt.ylabel('TPR(Recall)')
plt.title('ROC Curve')
plt.legend()
<matplotlib.legend.Legend at 0x7f3f3460fd00>
DNN
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras.utils import to_categorical
# train : train, y / test : test / y_test
y_test
0 0
1 1
2 1
3 0
4 0
..
1620 1
1621 0
1622 1
1623 1
1624 1
Name: class, Length: 1625, dtype: int64
y_Train=to_categorical(y)
y_Test = to_categorical(y_test)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6499 entries, 0 to 6498
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 odor_0 6499 non-null float64
1 odor_1 6499 non-null float64
2 odor_2 6499 non-null float64
3 odor_3 6499 non-null float64
4 odor_4 6499 non-null float64
5 odor_5 6499 non-null float64
6 odor_6 6499 non-null float64
7 odor_7 6499 non-null float64
8 odor_8 6499 non-null float64
9 spore-print-color_0 6499 non-null float64
10 spore-print-color_1 6499 non-null float64
11 spore-print-color_2 6499 non-null float64
12 spore-print-color_3 6499 non-null float64
13 spore-print-color_4 6499 non-null float64
14 spore-print-color_5 6499 non-null float64
15 spore-print-color_6 6499 non-null float64
16 spore-print-color_7 6499 non-null float64
17 spore-print-color_8 6499 non-null float64
dtypes: float64(18)
memory usage: 914.0 KB
model = keras.Sequential([keras.layers.Dense(5, input_shape=(18,)),
                          keras.layers.Dense(3, activation='relu'),
                          keras.layers.Dense(2, activation='softmax')])
model.summary()
Model: "sequential_6"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
dense_18 (Dense) (None, 5) 95
dense_19 (Dense) (None, 3) 18
dense_20 (Dense) (None, 2) 8
=================================================================
Total params: 121
Trainable params: 121
Non-trainable params: 0
_________________________________________________________________
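The parameter counts follow (inputs + 1) × units for each Dense layer: 18×5 + 5 = 95, 5×3 + 3 = 18, and 3×2 + 2 = 8, for 121 parameters in total.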
train.shape
(6499, 18)
model.compile(optimizer='adam',loss='binary_crossentropy',metrics=['accuracy'])
history = model.fit(train,y_Train,validation_split=0.1,epochs=25)
Epoch 1/25
183/183 [==============================] - 2s 5ms/step - loss: 0.6181 - accuracy: 0.7812 - val_loss: 0.5082 - val_accuracy: 0.9446
Epoch 2/25
183/183 [==============================] - 0s 2ms/step - loss: 0.4422 - accuracy: 0.9085 - val_loss: 0.3348 - val_accuracy: 0.9446
Epoch 3/25
183/183 [==============================] - 0s 2ms/step - loss: 0.3134 - accuracy: 0.9598 - val_loss: 0.2488 - val_accuracy: 0.9892
Epoch 4/25
183/183 [==============================] - 0s 2ms/step - loss: 0.2489 - accuracy: 0.9846 - val_loss: 0.2075 - val_accuracy: 0.9892
Epoch 5/25
183/183 [==============================] - 0s 2ms/step - loss: 0.2144 - accuracy: 0.9846 - val_loss: 0.1802 - val_accuracy: 0.9892
Epoch 6/25
183/183 [==============================] - 0s 2ms/step - loss: 0.1888 - accuracy: 0.9846 - val_loss: 0.1589 - val_accuracy: 0.9892
Epoch 7/25
183/183 [==============================] - 0s 2ms/step - loss: 0.1680 - accuracy: 0.9846 - val_loss: 0.1410 - val_accuracy: 0.9892
Epoch 8/25
183/183 [==============================] - 0s 2ms/step - loss: 0.1504 - accuracy: 0.9846 - val_loss: 0.1260 - val_accuracy: 0.9892
Epoch 9/25
183/183 [==============================] - 0s 2ms/step - loss: 0.1353 - accuracy: 0.9915 - val_loss: 0.1129 - val_accuracy: 0.9969
Epoch 10/25
183/183 [==============================] - 0s 2ms/step - loss: 0.1221 - accuracy: 0.9938 - val_loss: 0.1011 - val_accuracy: 0.9969
Epoch 11/25
183/183 [==============================] - 0s 2ms/step - loss: 0.1099 - accuracy: 0.9938 - val_loss: 0.0903 - val_accuracy: 0.9969
Epoch 12/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0986 - accuracy: 0.9938 - val_loss: 0.0807 - val_accuracy: 0.9969
Epoch 13/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0892 - accuracy: 0.9938 - val_loss: 0.0728 - val_accuracy: 0.9969
Epoch 14/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0817 - accuracy: 0.9938 - val_loss: 0.0665 - val_accuracy: 0.9969
Epoch 15/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0756 - accuracy: 0.9938 - val_loss: 0.0610 - val_accuracy: 0.9969
Epoch 16/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0704 - accuracy: 0.9938 - val_loss: 0.0564 - val_accuracy: 0.9969
Epoch 17/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0658 - accuracy: 0.9938 - val_loss: 0.0523 - val_accuracy: 0.9969
Epoch 18/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0619 - accuracy: 0.9938 - val_loss: 0.0487 - val_accuracy: 0.9969
Epoch 19/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0584 - accuracy: 0.9938 - val_loss: 0.0455 - val_accuracy: 0.9969
Epoch 20/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0554 - accuracy: 0.9938 - val_loss: 0.0427 - val_accuracy: 0.9969
Epoch 21/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0527 - accuracy: 0.9938 - val_loss: 0.0402 - val_accuracy: 0.9969
Epoch 22/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0504 - accuracy: 0.9938 - val_loss: 0.0379 - val_accuracy: 0.9969
Epoch 23/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0483 - accuracy: 0.9938 - val_loss: 0.0360 - val_accuracy: 0.9969
Epoch 24/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0465 - accuracy: 0.9938 - val_loss: 0.0342 - val_accuracy: 0.9969
Epoch 25/25
183/183 [==============================] - 0s 2ms/step - loss: 0.0449 - accuracy: 0.9938 - val_loss: 0.0327 - val_accuracy: 0.9969
plt.plot(history.history['accuracy'])
plt.plot(history.history['val_accuracy'])
plt.title("DNN Accuracy")
plt.legend(['training','valid'],loc='best')
plt.show()
plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.title("DNN Loss")
plt.show()
y_pred = model.predict(test)
y_pred_labels = []
for i in y_pred:
    if np.argmax(i) > 0:
        y_pred_labels.append(1)
    else:
        y_pred_labels.append(0)
51/51 [==============================] - 0s 1ms/step
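The labeling loop above is equivalent to taking the argmax over the softmax outputs; a one-line vectorized alternative:

# Pick the highest-probability class index for each test row
y_pred_labels = np.argmax(y_pred, axis=1)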
accuracy_score(y_test,y_pred_labels)
0.9938461538461538
show_confusion(y_pred_labels , y_test,'DNN')