scikit-learn

yuuuun 2021. 8. 20. 15:18

train_test_split(data, label, test_size)

import pandas as pd
data = pd.read_csv('**.csv')
# Keep the feature columns in x and the label column in y
x = data[['col1', 'col2', 'col3']]
y = data['ans']

from sklearn.model_selection import train_test_split

x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.1)
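A minimal, self-contained sketch of the same split, in case there is no CSV at hand. The toy arrays, `random_state=0`, and `stratify=y` are assumptions added for illustration and are not part of the original note.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 100 rows, 3 feature columns, balanced binary labels.
rng = np.random.default_rng(0)
x = rng.normal(size=(100, 3))
y = np.array([0, 1] * 50)

# test_size=0.1 holds out 10% of the rows for testing.
# random_state makes the shuffle repeatable; stratify=y keeps the class
# ratio the same in the train and test parts.
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0, stratify=y
)

print(x_train.shape, x_test.shape)  # (90, 3) (10, 3)
```

Without `random_state`, each run shuffles differently, so accuracy numbers computed later will not be reproducible.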

RandomForestClassifier

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

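# 20 trees, each limited to depth 5; random_state fixes the randomness
clf = RandomForestClassifier(n_estimators=20, max_depth=5, random_state=0)
clf.fit(x_train, y_train)

# Predict on the held-out rows and compare against the true labels
pred = clf.predict(x_test)
print(accuracy_score(y_test, pred))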

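A single decision tree is prone to overfitting, so we need a way to build a model that generalizes better.

-> Build several decision trees, pass each new data point through all of them, and take a vote over the per-tree classifications; the class that receives the most votes becomes the final classification result.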

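Bagging (Bootstrap Aggregating; bootstrapping means drawing multiple training samples from the full dataset by random sampling with replacement) draws n_estimators bootstrap samples from the full dataset and trains a separate decision tree on each one.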
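The bootstrap-then-vote idea can be sketched by hand with plain decision trees. This is only an illustration under stated assumptions, not how RandomForestClassifier is implemented: the synthetic data from `make_classification` is hypothetical, a real random forest also considers a random subset of features at each split, and scikit-learn's forest averages the trees' predicted probabilities rather than counting hard votes.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Hypothetical synthetic data standing in for the CSV used earlier.
x, y = make_classification(n_samples=200, n_features=5, random_state=0)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.1, random_state=0
)

n_estimators = 20
rng = np.random.default_rng(0)
trees = []
for _ in range(n_estimators):
    # Bootstrap: draw len(x_train) row indices with replacement.
    idx = rng.integers(0, len(x_train), size=len(x_train))
    tree = DecisionTreeClassifier(max_depth=5, random_state=0)
    tree.fit(x_train[idx], y_train[idx])
    trees.append(tree)

# Each tree classifies every test row: shape (n_estimators, n_test).
votes = np.stack([t.predict(x_test) for t in trees])

# Majority vote per test row: the most frequent label in each column wins.
pred = np.array([np.bincount(col).argmax() for col in votes.T])

print("bagged accuracy:", (pred == y_test).mean())
```

Because every tree sees a different resampled dataset, their individual errors differ, and the vote smooths them out; that is the reason bagging tends to overfit less than one deep tree.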