Tobigs 11th & 12th Cohort, Week 5: SVM - 이홍정 (12th cohort)

by 올타임넘버원메시 posted Aug 27, 2019
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
 
Preparing Data
 
train_df = pd.read_csv("train.csv")
test_df = pd.read_csv("test.csv")
 
 
train_df.head()
test_df.head()
 
train_df.shape, test_df.shape
 
train_df.describe()
test_df.describe()
 
# Check whether the data contains any null values
train_df.isnull().sum().any()
 
# Split into X and y (target)
train_X = train_df.drop(['ID_code','target'], axis = 1)
train_y = train_df['target']
 
# Check whether the target classes are imbalanced
train_y.value_counts()
 
graph_1 = sns.countplot(train_y)
<Figure 1>: The class ratio differs substantially, so the target is highly imbalanced.
 
Q1. Training will then be skewed toward class '0'...
Given that there are 200,000 rows and the data is imbalanced, let's address this with under-sampling (resampling).
 
 
from imblearn.under_sampling import RandomUnderSampler
 
rus = RandomUnderSampler(random_state=0)
resample_X, resample_y = rus.fit_resample(train_X, train_y)
 
resample_X.shape, resample_y.shape
 
resample_X
 
graph_2 = sns.countplot(resample_y)
<Figure 2>: Not only has the target been resampled to equal class proportions, the number of training rows has also dropped.
 
 
 
 
 
 
Dimensionality Reduction
There are 200 features; let's reduce the dimensionality.
 
X = pd.DataFrame(resample_X)
features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
X.columns = features
y = pd.DataFrame(resample_y, columns=['target'])
 
# Use the feature importances of a tree-based model (random forest) to gauge how important each of the 200 features is
from sklearn.ensemble import RandomForestClassifier
forest = RandomForestClassifier(n_estimators=5, random_state=2)
forest = forest.fit(X,y)
forest.feature_importances_
 
feature_list = pd.DataFrame(forest.feature_importances_)
feature_list.columns = ['importance']
feature_list.sort_values('importance', ascending = False).head()
 
 
# Select the features this model considers important
from sklearn.feature_selection import SelectFromModel
sfm = SelectFromModel(forest, prefit=True)
X = sfm.transform(X)
 
X.shape
The number of features has been reduced to 61.
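By default SelectFromModel keeps the features whose importance is above the mean importance; which of the original columns survived can be recovered from its boolean support mask. A minimal sketch, using the sfm and features objects defined above:

# Which of the original 200 columns did SelectFromModel keep?
# By default the cutoff is the mean feature importance of the fitted forest.
mask = sfm.get_support()                                   # boolean mask, length 200
selected_features = [f for f, keep in zip(features, mask) if keep]
print(len(selected_features))                              # matches X.shape[1]
print(selected_features[:10])                              # first few kept columns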
 
 
Q2. How can the number of features be reduced further?
Let's use multicollinearity (VIF) to prune features.
 
X = pd.DataFrame(X)
from statsmodels.stats.outliers_influence import variance_inflation_factor
vif = pd.DataFrame()
vif["VIF Factor"] = [variance_inflation_factor(X.values, i) for i in range(X.shape[1])]
vif["features"] = X.columns
vif.sort_values(["VIF Factor"], ascending=[False])
<Table 2>: Drop columns 33, 4, and 27, whose multicollinearity is far too high (VIF >= 1000).
X.drop([33,4,27], axis=1, inplace=True)
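Dropping a column changes the VIFs of the remaining columns, so another option is to remove the worst column one at a time and recompute until everything falls under the cutoff. A minimal sketch (the 1000 cutoff is taken from the note above; the helper drop_high_vif is only for illustration):

# Iteratively drop the column with the largest VIF, recomputing after each drop,
# until every remaining column has VIF below the cutoff.
def drop_high_vif(df, cutoff=1000):
    df = df.copy()
    while True:
        vifs = pd.Series(
            [variance_inflation_factor(df.values, i) for i in range(df.shape[1])],
            index=df.columns)
        if vifs.max() < cutoff:
            return df
        df = df.drop(columns=[vifs.idxmax()])

# X = drop_high_vif(X, cutoff=1000)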
 
 
 
 
 
 
SVM (Model)
The data is still large and running an SVM on it takes a long time, so before modeling let's shrink the number of rows with stratified sampling!
 
df = pd.concat([X,y], axis=1)
# Stratified (proportional) sampling function
def sampling_func(data, sample_pct):
    np.random.seed(123)
    N = len(data)
    sample_n = int(len(data)*sample_pct)
    sample = data.take(np.random.permutation(N)[:sample_n])
    return sample
 
sample_set = df.groupby('target',group_keys=False).apply(sampling_func, sample_pct=0.05)
len(sample_set)
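For reference, the same proportional (stratified) 5% sample could also be drawn with scikit-learn's train_test_split by passing stratify; a minimal sketch, assuming the df built above:

# Equivalent stratified 5% sample: stratify keeps the class ratio of 'target'.
from sklearn.model_selection import train_test_split
sample_alt, _ = train_test_split(df, train_size=0.05,
                                 stratify=df['target'], random_state=123)
print(len(sample_alt), sample_alt['target'].value_counts())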
 
X = sample_set.iloc[:, :-1]
y = sample_set.iloc[:, -1:]
 
from sklearn.preprocessing import StandardScaler
sc_X = StandardScaler()
scaled_X = sc_X.fit_transform(X)
warnings.filterwarnings(action='once')
 
# SVM - linear
from sklearn.svm import SVC
svm_linear = SVC(kernel='linear', C=1.0, random_state=0)
svm_linear.fit(scaled_X, y)
svm_linear.score(scaled_X, y)
 
from sklearn.model_selection import cross_val_score
cross_val_score(svm_linear, scaled_X, y, cv=3).mean()
 
 
 
# SVM - rbf
svm_rbf = SVC(kernel='rbf', gamma=0.0001, random_state=0)
svm_rbf.fit(scaled_X, y)
svm_rbf.score(scaled_X, y)
cross_val_score(svm_rbf, scaled_X, y,cv=3).mean()
 
 
 
# poly
svm_poly = SVC(kernel='poly', C=1.0, random_state=0)
svm_poly.fit(scaled_X, y)
svm_poly.score(scaled_X, y)
cross_val_score(svm_poly, scaled_X, y, cv=3).mean()
 
 
Q3. The scores came out low. Is this because of the parameters, or because of the feature selection?
First, let's search for the best parameters.
 
 
1) For the linear kernel, let's tune C.
: C controls how much margin violation is tolerated, which changes the SVM's decision boundary. Let's find it with grid search.
from sklearn.svm import SVC
svm_linear = SVC(kernel='linear', C=1.0, random_state=0)
svm_linear.fit(scaled_X, y)
 
param_grid = {
    'C': [i for i in range(1,25)],
}
 
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=svm_linear, param_grid=param_grid, cv=3, n_jobs=-1)
grid = grid.fit(scaled_X, y)
grid.best_params_
grid.best_score_
 
: With grid search, the best score is about 0.70 when C = 1.
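To see how sensitive the score really is to C (and not just the single best value), the full cross-validation table of the fitted grid can be inspected; a minimal sketch using the grid object above:

# CV score for every C value that was tried, best first.
cv_results = pd.DataFrame(grid.cv_results_)
print(cv_results[['param_C', 'mean_test_score', 'std_test_score']]
      .sort_values('mean_test_score', ascending=False).head())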
 
 
2) For the rbf kernel
gamma acts roughly as the inverse of the kernel's spread (standard deviation): the larger gamma is, the less likely two points are to be grouped together.
C again controls how strongly margin violations are penalized, which changes the shape of the decision boundary.
 
from sklearn.svm import SVC
svm_rbf = SVC(kernel='rbf', C=1.0, random_state=0)
svm_rbf.fit(scaled_X, y)
 
param_grid = {
    'C': [i for i in range(1,25)],
}
 
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=svm_rbf, param_grid=param_grid, cv=3, n_jobs=-1)
grid = grid.fit(scaled_X, y)
grid.best_params_
param_grid = {
    'C' : [1],
    'gamma': [0.0001,0.001,0.01,0.1,1,10,100],
}
 
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=svm_rbf, param_grid=param_grid, cv=3, n_jobs=-1)
grid = grid.fit(scaled_X, y)
grid.best_params_
grid.best_score_
: With grid search, the score is about 0.72 when C = 1 and gamma = 0.01.
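Above, C and gamma were tuned one after the other, but they interact, so they can also be searched jointly in a single grid. A minimal sketch with hypothetical ranges and the same 3-fold CV:

# Joint grid over C and gamma instead of tuning them sequentially.
param_grid_joint = {
    'C': [0.1, 1, 10],
    'gamma': [0.0001, 0.001, 0.01, 0.1, 1],
}
grid_joint = GridSearchCV(SVC(kernel='rbf', random_state=0),
                          param_grid=param_grid_joint, cv=3, n_jobs=-1)
grid_joint.fit(scaled_X, y)
print(grid_joint.best_params_, grid_joint.best_score_)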
 
 
3) For the poly kernel

Same procedure as for 'rbf'.
from sklearn.svm import SVC
svm_poly = SVC(kernel='poly', C=1.0, random_state=0)
svm_poly.fit(scaled_X, y)
 
param_grid = {
    'C': [i for i in range(1,25)],
}
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=svm_poly, param_grid=param_grid, cv=3, n_jobs=-1)
grid = grid.fit(scaled_X, y)
 
grid.best_params_
 
param_grid = {
    'C' : [1],
    'gamma': [0.0001,0.001,0.01,0.1,1,10,100],
}
 
from sklearn.model_selection import GridSearchCV
grid = GridSearchCV(estimator=svm_poly, param_grid=param_grid, cv=3, n_jobs=-1)
grid = grid.fit(scaled_X, y)
grid.best_params_
grid.best_score_
: With grid search, the score is about 0.67 when C = 1 and gamma = 0.1.
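The polynomial kernel also has a degree parameter (default 3) that was not tuned above; it could be added to the grid as well. A minimal sketch with hypothetical values:

# degree controls the order of the polynomial decision boundary.
param_grid_poly = {
    'C': [1],
    'gamma': [0.01, 0.1, 1],
    'degree': [2, 3, 4],
}
grid_poly = GridSearchCV(SVC(kernel='poly', random_state=0),
                         param_grid=param_grid_poly, cv=3, n_jobs=-1)
grid_poly.fit(scaled_X, y)
print(grid_poly.best_params_, grid_poly.best_score_)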
 
Q4. Overall, the best_score_ values obtained from grid search are low. Why?
Most likely the data rebuilt by resampling (undersampling) learns class 0 less well than the original data would, since most of the class-0 rows were discarded.
If so, the result will be worse when the true targets in test_df contain many 0s, and somewhat better when they contain many 1s!
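One way to avoid discarding most of the class-0 rows would be to keep the imbalanced data and let the SVM reweight the classes instead of undersampling; a minimal sketch with SVC's class_weight='balanced' option (an alternative to the approach above, not what was actually run here; scaled_imbalanced_X and imbalanced_y are hypothetical names for a scaled stratified sample of the original training data):

# Alternative to undersampling: keep the imbalanced data and penalize mistakes
# on the rare class more heavily via class_weight='balanced'.
svm_weighted = SVC(kernel='rbf', C=1.0, gamma=0.01,
                   class_weight='balanced', random_state=0)
# e.g. on a scaled stratified sample of the original train_X / train_y:
# cross_val_score(svm_weighted, scaled_imbalanced_X, imbalanced_y, cv=3).mean()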
 
 
 
 
 
 
 
Predicting on the Test Data
: Since the results were poor, let's generate predictions from the data without dimensionality reduction (still with the reduced number of training rows).
 
1) Prepare the dataset
X = pd.DataFrame(resample_X)
features = [c for c in train_df.columns if c not in ['ID_code', 'target']]
X.columns = features
y = pd.DataFrame(resample_y, columns=['target'])
 
real_df = pd.concat([X,y], axis=1)
sample_set = real_df.groupby('target',group_keys=False).apply(sampling_func, sample_pct=0.05)
X = sample_set.iloc[:, :-1]
y = sample_set.iloc[:, -1:]
scaled_X = sc_X.fit_transform(X)
warnings.filterwarnings(action='once')
 
2) Scale the test data
test_df.drop('ID_code', axis=1, inplace=True)
scaled_test = sc_X.transform(test_df)  # reuse the scaler fitted on the training sample
warnings.filterwarnings(action='once')
 
 
 
3) Modeling
from sklearn.svm import SVC
svm_real = SVC(kernel='rbf', C=1.0, random_state=0)
svm_real.fit(scaled_X, y)
y_pred = svm_real.predict(scaled_test)
 
 
4) Submission
test_df = pd.read_csv("test.csv")
submission_data= test_df["ID_code"]
submission_data = pd.concat([submission_data, pd.Series(y_pred)], axis= 1)
submission_data.columns = ['ID_code', 'target']
 
submission_data['target'].value_counts()
 
 
submission_data.to_csv(r"C:\Users\ehj14\Desktop\TOBIGS\5week\SVM\submission_data_Ver2.csv", header=True, index=False)
 