In this post, we will build an artificial neural network in R.
The practice data set is the German credit data, a credit-scoring data set from Germany.
german <- read.csv('German_credit.csv')
colnames(german)
[1] "ID" "Checking_account" "Duration_in_month" "Credit_history"
[5] "Purpose" "Credit_amount" "Saving_accout" "Present_employment"
[9] "Installment_rate" "Personal_status___sex" "Other_debtors_guarantors" "Present_residence"
[13] "Property" "Age" "Other_installment_plan" "Housing"
[17] "Num_of_existing_credits" "Job" "Num_of_people_liable" "Telephone"
[21] "Foreign_worker" "Credit_status"
First, we load the German credit data into the variable german.
The data set has 22 columns in total.
We will use Credit_status as the target variable and build a binary classification model.
# 범주형 변수 설정
german$Checking_account <- as.factor(german$Checking_account)
german$Credit_history <- as.factor(german$Credit_history)
german$Purpose <- as.factor(german$Purpose)
german$Saving_accout <- as.factor(german$Saving_accout)
german$Present_employment <- as.factor(german$Present_employment)
german$Personal_status___sex <- as.factor(german$Personal_status___sex)
german$Other_debtors_guarantors <- as.factor(german$Other_debtors_guarantors)
german$Property <- as.factor(german$Property)
german$Other_installment_plan <- as.factor(german$Other_installment_plan)
german$Housing <- as.factor(german$Housing)
german$Job <- as.factor(german$Job)
german$Telephone <- as.factor(german$Telephone)
german$Foreign_worker <- as.factor(german$Foreign_worker)
german$Credit_status <- as.factor(german$Credit_status)
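The fourteen as.factor lines above can also be written in one pass; a minimal sketch, where factor_cols is just a helper vector introduced here and the column names come from the colnames() output above:
# convert all categorical columns to factors in one pass
factor_cols <- c("Checking_account", "Credit_history", "Purpose", "Saving_accout",
                 "Present_employment", "Personal_status___sex", "Other_debtors_guarantors",
                 "Property", "Other_installment_plan", "Housing", "Job",
                 "Telephone", "Foreign_worker", "Credit_status")
german[factor_cols] <- lapply(german[factor_cols], as.factor)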
summary(german)
ID Checking_account Duration_in_month Credit_history Purpose Credit_amount Saving_accout
Min. : 1.0 1:274 Min. : 4.0 0: 40 3 :280 Min. : 250 1:603
1st Qu.: 250.8 2:269 1st Qu.:12.0 1: 49 0 :234 1st Qu.: 1366 2:103
Median : 500.5 3: 63 Median :18.0 2:530 2 :181 Median : 2320 3: 63
Mean : 500.5 4:394 Mean :20.9 3: 88 1 :103 Mean : 3271 4: 48
3rd Qu.: 750.2 3rd Qu.:24.0 4:293 9 : 97 3rd Qu.: 3972 5:183
Max. :1000.0 Max. :72.0 6 : 50 Max. :18424
(Other): 55
Present_employment Installment_rate Personal_status___sex Other_debtors_guarantors Present_residence Property
1: 62 Min. :1.000 1: 50 1:907 Min. :1.000 1:282
2:172 1st Qu.:2.000 2:310 2: 41 1st Qu.:2.000 2:232
3:339 Median :3.000 3:548 3: 52 Median :3.000 3:332
4:174 Mean :2.973 4: 92 Mean :2.845 4:154
5:253 3rd Qu.:4.000 3rd Qu.:4.000
Max. :4.000 Max. :4.000
Age Other_installment_plan Housing Num_of_existing_credits Job Num_of_people_liable Telephone
Min. :19.00 1:139 1:179 Min. :1.000 1: 22 Min. :1.000 1:596
1st Qu.:27.00 2: 47 2:713 1st Qu.:1.000 2:200 1st Qu.:1.000 2:404
Median :33.00 3:814 3:108 Median :1.000 3:630 Median :1.000
Mean :35.55 Mean :1.407 4:148 Mean :1.155
3rd Qu.:42.00 3rd Qu.:2.000 3rd Qu.:1.000
Max. :75.00 Max. :4.000 Max. :2.000
Foreign_worker Credit_status
1:963 N:700
2: 37 Y:300
As the first preprocessing step, we convert the categorical variables to the factor type.
Looking at the preprocessed data with summary(), the target variable Credit_status has 700 N and 300 Y observations, so a model fitted to the full data may suffer from a class imbalance problem.
We therefore undersample by stratified sampling, keeping 300 observations of each class.
library(sampling)
stratified_sampling <- strata(german, stratanames = c("Credit_status"), size =c(300,300),
method="srswor")
st_data <- getdata(german, stratified_sampling)
table(st_data$Credit_status)
N Y
300 300
rm(stratified_sampling)
The sampling package provides strata, a stratified sampling function.
As a result, st_data contains 300 N and 300 Y observations of the Credit_status factor.
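For reference, the same 300/300 undersampling can be reproduced with base R alone; a minimal sketch, where idx_N, idx_Y, and base_sample are hypothetical helper names and the seed is arbitrary:
# sample 300 row indices from each class and stack them
set.seed(1)
idx_N <- sample(which(german$Credit_status == "N"), 300)
idx_Y <- sample(which(german$Credit_status == "Y"), 300)
base_sample <- german[c(idx_N, idx_Y), ]
table(base_sample$Credit_status)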
# scaling
st_data$Duration_in_month = (st_data$Duration_in_month - min(st_data$Duration_in_month))/(max(st_data$Duration_in_month)-min(st_data$Duration_in_month))
st_data$Credit_amount = (st_data$Credit_amount - min(st_data$Credit_amount))/(max(st_data$Credit_amount)-min(st_data$Credit_amount))
st_data$Installment_rate = (st_data$Installment_rate - min(st_data$Installment_rate))/(max(st_data$Installment_rate)-min(st_data$Installment_rate))
st_data$Present_residence = (st_data$Present_residence - min(st_data$Present_residence))/(max(st_data$Present_residence)-min(st_data$Present_residence))
st_data$Age = (st_data$Age - min(st_data$Age))/(max(st_data$Age)-min(st_data$Age))
st_data$Num_of_people_liable = (st_data$Num_of_people_liable - min(st_data$Num_of_people_liable))/(max(st_data$Num_of_people_liable)-min(st_data$Num_of_people_liable))
st_data$Num_of_existing_credits = (st_data$Num_of_existing_credits - min(st_data$Num_of_existing_credits))/(max(st_data$Num_of_existing_credits)-min(st_data$Num_of_existing_credits))
summary(st_data)
ID Checking_account Duration_in_month Credit_history Purpose Credit_amount Saving_accout
Min. : 5.0 1:201 Min. :0.0000 0: 29 0 :160 Min. :0.00000 1:383
1st Qu.: 368.5 2:170 1st Qu.:0.1176 1: 33 3 :154 1st Qu.:0.06092 2: 71
Median : 700.5 3: 35 Median :0.2059 2:329 2 :109 Median :0.11263 3: 33
Mean : 605.5 4:194 Mean :0.2609 3: 52 9 : 56 Mean :0.17242 4: 18
3rd Qu.: 850.2 3rd Qu.:0.3382 4:157 1 : 52 3rd Qu.:0.21832 5: 95
Max. :1000.0 Max. :1.0000 6 : 34 Max. :1.00000
(Other): 35
Present_employment Installment_rate Personal_status___sex Other_debtors_guarantors Present_residence Property
1: 42 Min. :0.0000 1: 32 1:539 Min. :0.0000 1:155
2:101 1st Qu.:0.3333 2:186 2: 28 1st Qu.:0.3333 2:139
3:213 Median :1.0000 3:324 3: 33 Median :0.6667 3:203
4:101 Mean :0.6806 4: 58 Mean :0.6289 4:103
5:143 3rd Qu.:1.0000 3rd Qu.:1.0000
Max. :1.0000 Max. :1.0000
Age Other_installment_plan Housing Num_of_existing_credits Job Num_of_people_liable Telephone
Min. :0.0000 1: 94 1:115 Min. :0.0000 1: 10 Min. :0.000 1:356
1st Qu.:0.1250 2: 26 2:416 1st Qu.:0.0000 2:114 1st Qu.:0.000 2:244
Median :0.2321 3:480 3: 69 Median :0.0000 3:385 Median :0.000
Mean :0.2871 Mean :0.1333 4: 91 Mean :0.155
3rd Qu.:0.3973 3rd Qu.:0.3333 3rd Qu.:0.000
Max. :1.0000 Max. :1.0000 Max. :1.000
Foreign_worker Credit_status ID_unit Prob Stratum
1:580 N:300 Min. : 5.0 Min. :0.4286 Min. :1.0
2: 20 Y:300 1st Qu.: 368.5 1st Qu.:0.4286 1st Qu.:1.0
Median : 700.5 Median :0.7143 Median :1.5
Mean : 605.5 Mean :0.7143 Mean :1.5
3rd Qu.: 850.2 3rd Qu.:1.0000 3rd Qu.:2.0
Max. :1000.0 Max. :1.0000 Max. :2.0
Next, because the numeric variables should be scaled before fitting a neural network, we applied min-max normalization, x' = (x - min(x)) / (max(x) - min(x)), which maps each numeric column onto the [0, 1] range.
The last three columns (ID_unit, Prob, Stratum) were created by the stratified sampling step, so they must be removed before modeling.
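The seven scaling lines above can also be condensed with a small helper; a minimal sketch, where min_max and num_cols are helper names introduced here and the column names are the numeric predictors used above:
# apply one min-max helper to every numeric predictor
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
num_cols <- c("Duration_in_month", "Credit_amount", "Installment_rate",
              "Present_residence", "Age", "Num_of_people_liable", "Num_of_existing_credits")
st_data[num_cols] <- lapply(st_data[num_cols], min_max)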
# Data partition
library(caret)
train <- createDataPartition(st_data$ID, p=0.7, list=FALSE)
td <- st_data[train,]
vd <- st_data[-train,]
rm(st_data, train)
colnames(td)
# drop ID (column 1) and the sampling columns ID_unit, Prob, Stratum (columns 23-25)
td <- td[, -c(1,23,24,25)]
vd <- vd[, -c(1,23,24,25)]
As the last preprocessing step, we split the data into a training set and a validation set at a 7:3 ratio. Note that createDataPartition is called on ID here, so the 300/300 class balance is only approximately preserved in each partition; passing st_data$Credit_status instead would stratify the split on the target.
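A quick sanity check on the resulting partitions (a minimal sketch using the objects just created):
# check the split sizes and the class balance in each partition
dim(td); dim(vd)
table(td$Credit_status); table(vd$Credit_status)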
## Fit neural network
# install library
install.packages("neuralnet")
# load library
library(neuralnet)
We will use the neuralnet package.
set.seed(2)
# one-hot encode the factor predictors with model.matrix
td_x = model.matrix(Credit_status ~ ., td)
# recode the target as 0 (N) / 1 (Y)
Credit_status = ifelse(td$Credit_status == 'N', 0, 1)
td1 = data.frame(cbind(td_x, Credit_status))
# drop the intercept column added by model.matrix
td1 = td1[,-1]
NN = neuralnet(Credit_status ~ ., td1, hidden = c(3,3,3,3,3), linear.output = F, err.fct = 'ce', likelihood = T)
We fix the seed so the results are reproducible, and one-hot encode the categorical variables of the training set with model.matrix, since the neuralnet function expects numeric inputs.
The network has 5 hidden layers with 3 neurons each (hidden = c(3,3,3,3,3)); linear.output = F applies the logistic activation at the output node, and err.fct = 'ce' uses the cross-entropy error, which suits binary classification.
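If you want a numeric summary of training rather than the plot below, the fitted object stores the final error, the number of steps, and (because likelihood = T) AIC/BIC in its result.matrix component; a quick way to look at them:
# final cross-entropy error, reached threshold, steps, AIC and BIC
head(NN$result.matrix)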
# plot neural network
plot(NN)
The plot function draws the fitted network so you can see all of the estimated weights at a glance.
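If you prefer the raw numbers to the diagram, the estimated weight matrices are also stored in the fitted object; a quick look (one matrix per layer, for the single repetition trained here):
# list of weight matrices, one per layer
str(NN$weights[[1]])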
Finally, let's use the fitted model to make predictions on the validation set.
# validation
# one-hot encode the validation set in the same way as the training set
vd_x = model.matrix(Credit_status ~ ., vd)
Credit_status = ifelse(vd$Credit_status == 'N', 0, 1)
vd1 = data.frame(cbind(vd_x, Credit_status))
# forward-pass the validation covariates through the fitted network
nn.results <- compute(NN, vd1)
# classify as Y (1) when the predicted probability exceeds 0.5
predict_y = ifelse(nn.results$net.result > 0.5, 1, 0)
confusionMatrix(as.factor(predict_y), as.factor(vd1$Credit_status))
Confusion Matrix and Statistics
Reference
Prediction 0 1
0 63 30
1 27 60
Accuracy : 0.6833
95% CI : (0.61, 0.7505)
No Information Rate : 0.5
P-Value [Acc > NIR] : 4.847e-07
Kappa : 0.3667
Mcnemar's Test P-Value : 0.7911
Sensitivity : 0.7000
Specificity : 0.6667
Pos Pred Value : 0.6774
Neg Pred Value : 0.6897
Prevalence : 0.5000
Detection Rate : 0.3500
Detection Prevalence : 0.5167
Balanced Accuracy : 0.6833
'Positive' Class : 0
The accuracy on the validation set is 0.6833.