
[R] Neural Network


In this post, we will practice fitting an artificial neural network in R.

 

Data file: German_credit.csv (0.05 MB)

The practice data is the German credit dataset, a credit assessment dataset from Germany.

german <- read.csv('German_credit.csv')
colnames(german)

 [1] "ID"                       "Checking_account"         "Duration_in_month"        "Credit_history"          
 [5] "Purpose"                  "Credit_amount"            "Saving_accout"            "Present_employment"      
 [9] "Installment_rate"         "Personal_status___sex"    "Other_debtors_guarantors" "Present_residence"       
[13] "Property"                 "Age"                      "Other_installment_plan"   "Housing"                 
[17] "Num_of_existing_credits"  "Job"                      "Num_of_people_liable"     "Telephone"               
[21] "Foreign_worker"           "Credit_status"  

First, we stored the German credit data in the variable german.

The data has 22 columns in total.

We will use Credit_status as the target variable and build a binary classification model.

# Convert the categorical variables to factors
cat_cols <- c("Checking_account", "Credit_history", "Purpose", "Saving_accout",
              "Present_employment", "Personal_status___sex", "Other_debtors_guarantors",
              "Property", "Other_installment_plan", "Housing", "Job", "Telephone",
              "Foreign_worker", "Credit_status")
german[cat_cols] <- lapply(german[cat_cols], as.factor)

summary(german)


       ID         Checking_account Duration_in_month Credit_history    Purpose    Credit_amount   Saving_accout
 Min.   :   1.0   1:274            Min.   : 4.0      0: 40          3      :280   Min.   :  250   1:603        
 1st Qu.: 250.8   2:269            1st Qu.:12.0      1: 49          0      :234   1st Qu.: 1366   2:103        
 Median : 500.5   3: 63            Median :18.0      2:530          2      :181   Median : 2320   3: 63        
 Mean   : 500.5   4:394            Mean   :20.9      3: 88          1      :103   Mean   : 3271   4: 48        
 3rd Qu.: 750.2                    3rd Qu.:24.0      4:293          9      : 97   3rd Qu.: 3972   5:183        
 Max.   :1000.0                    Max.   :72.0                     6      : 50   Max.   :18424                
                                                                    (Other): 55                                
 Present_employment Installment_rate Personal_status___sex Other_debtors_guarantors Present_residence Property
 1: 62              Min.   :1.000    1: 50                 1:907                    Min.   :1.000     1:282   
 2:172              1st Qu.:2.000    2:310                 2: 41                    1st Qu.:2.000     2:232   
 3:339              Median :3.000    3:548                 3: 52                    Median :3.000     3:332   
 4:174              Mean   :2.973    4: 92                                          Mean   :2.845     4:154   
 5:253              3rd Qu.:4.000                                                   3rd Qu.:4.000             
                    Max.   :4.000                                                   Max.   :4.000             
                                                                                                              
      Age        Other_installment_plan Housing Num_of_existing_credits Job     Num_of_people_liable Telephone
 Min.   :19.00   1:139                  1:179   Min.   :1.000           1: 22   Min.   :1.000        1:596    
 1st Qu.:27.00   2: 47                  2:713   1st Qu.:1.000           2:200   1st Qu.:1.000        2:404    
 Median :33.00   3:814                  3:108   Median :1.000           3:630   Median :1.000                 
 Mean   :35.55                                  Mean   :1.407           4:148   Mean   :1.155                 
 3rd Qu.:42.00                                  3rd Qu.:2.000                   3rd Qu.:1.000                 
 Max.   :75.00                                  Max.   :4.000                   Max.   :2.000                 
                                                                                                              
 Foreign_worker Credit_status
 1:963          N:700        
 2: 37          Y:300      

First, we convert the categorical variables to the factor type.

Looking at the preprocessed data with summary, the target variable Credit_status has 700 N and 300 Y observations, so a model fitted on it may suffer from class imbalance.

We therefore draw a stratified sample of 300 observations from each class.

library(sampling)
stratified_sampling <- strata(german, stratanames = c("Credit_status"), size =c(300,300),
                              method="srswor")

st_data <- getdata(german, stratified_sampling)
table(st_data$Credit_status)

  N   Y 
300 300 


rm(stratified_sampling)

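The sampling package provides strata, a function for stratified sampling. Here method="srswor" draws a simple random sample without replacement within each stratum.

As a result, st_data contains 300 rows with Credit_status N and 300 rows with Credit_status Y.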
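The same kind of balanced draw can be sketched in base R without the sampling package. The snippet below is only an illustration on a made-up data frame (toy, n_per_class, rows_by_class, picked, and balanced are hypothetical names, not part of the original analysis): it draws the same number of rows from each class, without replacement.

```r
# Hypothetical toy data: 7 rows of class N and 3 rows of class Y
toy <- data.frame(id = 1:10,
                  status = factor(c(rep("N", 7), rep("Y", 3))))

set.seed(1)  # arbitrary seed so the draw is repeatable

# Size of the smallest class (3 here)
n_per_class <- min(table(toy$status))

# Group row indices by class, then draw a simple random sample
# without replacement of n_per_class indices from each group
rows_by_class <- split(seq_len(nrow(toy)), toy$status)
picked <- unlist(lapply(rows_by_class, function(i) {
  i[sample.int(length(i), n_per_class)]
}))

balanced <- toy[picked, ]
table(balanced$status)
```

Indexing with sample.int(length(i), ...) rather than calling sample(i, ...) directly avoids the surprising behavior of sample when a group contains a single row.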

# scaling
# Min-max normalization: rescale each numeric column to [0, 1]
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
num_cols <- c("Duration_in_month", "Credit_amount", "Installment_rate", "Present_residence",
              "Age", "Num_of_people_liable", "Num_of_existing_credits")
st_data[num_cols] <- lapply(st_data[num_cols], min_max)

summary(st_data)

       ID         Checking_account Duration_in_month Credit_history    Purpose    Credit_amount     Saving_accout
 Min.   :   5.0   1:201            Min.   :0.0000    0: 29          0      :160   Min.   :0.00000   1:383        
 1st Qu.: 368.5   2:170            1st Qu.:0.1176    1: 33          3      :154   1st Qu.:0.06092   2: 71        
 Median : 700.5   3: 35            Median :0.2059    2:329          2      :109   Median :0.11263   3: 33        
 Mean   : 605.5   4:194            Mean   :0.2609    3: 52          9      : 56   Mean   :0.17242   4: 18        
 3rd Qu.: 850.2                    3rd Qu.:0.3382    4:157          1      : 52   3rd Qu.:0.21832   5: 95        
 Max.   :1000.0                    Max.   :1.0000                   6      : 34   Max.   :1.00000                
                                                                    (Other): 35                                  
 Present_employment Installment_rate Personal_status___sex Other_debtors_guarantors Present_residence Property
 1: 42              Min.   :0.0000   1: 32                 1:539                    Min.   :0.0000    1:155   
 2:101              1st Qu.:0.3333   2:186                 2: 28                    1st Qu.:0.3333    2:139   
 3:213              Median :1.0000   3:324                 3: 33                    Median :0.6667    3:203   
 4:101              Mean   :0.6806   4: 58                                          Mean   :0.6289    4:103   
 5:143              3rd Qu.:1.0000                                                  3rd Qu.:1.0000            
                    Max.   :1.0000                                                  Max.   :1.0000            
                                                                                                              
      Age         Other_installment_plan Housing Num_of_existing_credits Job     Num_of_people_liable Telephone
 Min.   :0.0000   1: 94                  1:115   Min.   :0.0000          1: 10   Min.   :0.000        1:356    
 1st Qu.:0.1250   2: 26                  2:416   1st Qu.:0.0000          2:114   1st Qu.:0.000        2:244    
 Median :0.2321   3:480                  3: 69   Median :0.0000          3:385   Median :0.000                 
 Mean   :0.2871                                  Mean   :0.1333          4: 91   Mean   :0.155                 
 3rd Qu.:0.3973                                  3rd Qu.:0.3333                  3rd Qu.:0.000                 
 Max.   :1.0000                                  Max.   :1.0000                  Max.   :1.000                 
                                                                                                               
 Foreign_worker Credit_status    ID_unit            Prob           Stratum   
 1:580          N:300         Min.   :   5.0   Min.   :0.4286   Min.   :1.0  
 2: 20          Y:300         1st Qu.: 368.5   1st Qu.:0.4286   1st Qu.:1.0  
                              Median : 700.5   Median :0.7143   Median :1.5  
                              Mean   : 605.5   Mean   :0.7143   Mean   :1.5  
                              3rd Qu.: 850.2   3rd Qu.:1.0000   3rd Qu.:2.0  
                              Max.   :1000.0   Max.   :1.0000   Max.   :2.0 

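Next, because the numeric variables need to be scaled before fitting a neural network, we applied min-max normalization.

The last three columns (ID_unit, Prob, Stratum) were created by the stratified sampling step and must be removed later.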
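As a quick check of the min-max formula (x - min) / (max - min), the sketch below applies it to a few Duration_in_month values. The raw values are an assumption: 4 and 72 are taken as the minimum and maximum of the sampled data, and 12, 18, 27 are back-calculated from the scaled quartiles in the summary above rather than printed anywhere in the original.

```r
# Assumed raw durations in months: min, 1st quartile, median, 3rd quartile, max
duration <- c(4, 12, 18, 27, 72)

# Min-max normalization maps the smallest value to 0 and the largest to 1
scaled <- (duration - min(duration)) / (max(duration) - min(duration))

round(scaled, 4)
# 0.0000 0.1176 0.2059 0.3382 1.0000
```

Under these assumptions the rounded values agree with the Duration_in_month quartiles shown in the scaled summary.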

# Data partition
library(caret)
# Stratify the split on the target so both sets keep the 1:1 class ratio
train <- createDataPartition(st_data$Credit_status, p=0.7, list=FALSE)
td <- st_data[train,]
vd <- st_data[-train,]
rm(st_data, train)

colnames(td)
# Drop ID (column 1) and the sampling columns ID_unit, Prob, Stratum (columns 23-25)
td <- td[, -c(1,23,24,25)]
vd <- vd[, -c(1,23,24,25)]

Finally, we split the data 7:3 into a training set (td) and a validation set (vd), and removed ID together with the three sampling columns.
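Dropping columns by position works here, but it silently breaks if the column order ever changes. A minimal sketch of dropping the same columns by name instead, using a small made-up data frame (toy, drop_cols, and toy_clean are hypothetical names):

```r
# Hypothetical stand-in for the sampled data frame
toy <- data.frame(ID = 1:3,
                  Age = c(0.1, 0.5, 0.9),
                  Credit_status = factor(c("N", "Y", "N")),
                  ID_unit = 1:3,
                  Prob = c(1, 1, 1),
                  Stratum = c(1, 2, 1))

# Columns that should not be used as predictors
drop_cols <- c("ID", "ID_unit", "Prob", "Stratum")

# Keep every column whose name is not in drop_cols
toy_clean <- toy[, !(names(toy) %in% drop_cols), drop = FALSE]
names(toy_clean)
# "Age" "Credit_status"
```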

## Fit neural network

# install library
install.packages("neuralnet")

# load library
library(neuralnet)

We will use the neuralnet package.

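set.seed(2)

# Dummy-encode the predictors; model.matrix also adds an intercept column
td_x = model.matrix(Credit_status ~ ., td)
# Recode the target as 0/1 (N = 0, Y = 1)
Credit_status = ifelse(td$Credit_status == 'N', 0, 1)
td1 = data.frame(cbind(td_x, Credit_status))
td1 = td1[,-1]  # drop the intercept column
# Five hidden layers of three neurons each, logistic output, cross-entropy error
NN = neuralnet(Credit_status ~ . ,td1, hidden = c(3,3,3,3,3), linear.output = FALSE, err.fct = 'ce', likelihood = TRUE)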

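We set a seed for reproducibility and, because neuralnet needs numeric inputs, dummy-encoded the categorical variables of the training set with model.matrix (one-hot style, but with one reference level dropped per factor).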
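To see what this encoding step produces, here is a small sketch on a made-up data frame (toy, x, and encoded are hypothetical names; only model.matrix and the 0/1 recoding mirror the code above). A three-level factor becomes two indicator columns, and the intercept column is renamed when the matrix is turned into a data frame.

```r
# Hypothetical toy data with one factor predictor and one numeric predictor
toy <- data.frame(
  Credit_status = factor(c("N", "Y", "N", "Y")),
  Housing = factor(c(1, 2, 3, 2)),
  Age = c(0.1, 0.4, 0.7, 1.0)
)

# Design matrix: intercept + indicators for Housing levels 2 and 3 + Age
x <- model.matrix(Credit_status ~ ., toy)
colnames(x)
# "(Intercept)" "Housing2" "Housing3" "Age"

# Attach the 0/1 target; data.frame() rewrites "(Intercept)" to a valid name
encoded <- data.frame(cbind(x, Credit_status = ifelse(toy$Credit_status == "N", 0, 1)))
names(encoded)
# "X.Intercept." "Housing2" "Housing3" "Age" "Credit_status"

# Drop the intercept column, as in the training code
encoded <- encoded[, -1]
```

Level 1 of Housing is the reference level, so a row with Housing2 = 0 and Housing3 = 0 corresponds to Housing = 1.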

The network has 5 hidden layers, each with 3 neurons.
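To get a feel for the size of such a network, the sketch below counts the weights of a fully connected feed-forward network with one bias per neuron. The helper count_weights is hypothetical, and the input count of 10 is an arbitrary example rather than the number of columns in td1.

```r
# Count weights (including biases) of a fully connected feed-forward network
count_weights <- function(n_inputs, hidden, n_outputs = 1) {
  sizes <- c(n_inputs, hidden, n_outputs)
  # Each layer needs (units in previous layer + 1 bias) * (units in this layer)
  sum((head(sizes, -1) + 1) * tail(sizes, -1))
}

# Example: 10 inputs, five hidden layers of 3 neurons, 1 output
count_weights(10, c(3, 3, 3, 3, 3))
# (10 + 1) * 3 + 4 * (3 + 1) * 3 + (3 + 1) * 1 = 85
```

With narrow layers like these, most of the weights sit between the inputs and the first hidden layer, so the count grows mainly with the number of encoded predictors.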

# plot neural network
plot(NN)

The plot function lets you see the network and its weights at a glance.

Finally, let's use the fitted model to make predictions on the validation set.

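# validation
# Encode the validation set the same way as the training set
vd_x = model.matrix(Credit_status ~ ., vd)
Credit_status = ifelse(vd$Credit_status == 'N', 0, 1)
vd1 = data.frame(cbind(vd_x, Credit_status))

nn.results <- compute(NN, vd1)
# Classify as 1 (Y) when the predicted probability exceeds 0.5
predict_y = ifelse(nn.results$net.result > 0.5, 1, 0)

# Fix the factor levels so the table stays 2x2 even if one class is never predicted
confusionMatrix(factor(predict_y, levels = c(0, 1)), factor(vd1$Credit_status, levels = c(0, 1)))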

Confusion Matrix and Statistics

          Reference
Prediction  0  1
         0 63 30
         1 27 60
                                        
               Accuracy : 0.6833        
                 95% CI : (0.61, 0.7505)
    No Information Rate : 0.5           
    P-Value [Acc > NIR] : 4.847e-07     
                                        
                  Kappa : 0.3667        
                                        
 Mcnemar's Test P-Value : 0.7911        
                                        
            Sensitivity : 0.7000        
            Specificity : 0.6667        
         Pos Pred Value : 0.6774        
         Neg Pred Value : 0.6897        
             Prevalence : 0.5000        
         Detection Rate : 0.3500        
   Detection Prevalence : 0.5167        
      Balanced Accuracy : 0.6833        
                                        
       'Positive' Class : 0             

The accuracy on the validation set is 0.6833 (123 of 180 cases classified correctly).
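The headline numbers in this output can be recomputed by hand from the four cell counts. The sketch below rebuilds the table in base R (cm, accuracy, sensitivity, and specificity are illustrative variable names) and treats class 0 as the positive class, as the output above does.

```r
# Confusion matrix from the output above: rows = prediction, columns = reference
cm <- matrix(c(63, 27, 30, 60), nrow = 2,
             dimnames = list(Prediction = c("0", "1"), Reference = c("0", "1")))

# Accuracy: correctly classified cases over all cases
accuracy <- sum(diag(cm)) / sum(cm)

# Sensitivity: share of true class-0 cases predicted as 0
sensitivity <- cm["0", "0"] / sum(cm[, "0"])

# Specificity: share of true class-1 cases predicted as 1
specificity <- cm["1", "1"] / sum(cm[, "1"])

round(c(Accuracy = accuracy, Sensitivity = sensitivity, Specificity = specificity), 4)
# Accuracy 0.6833, Sensitivity 0.7000, Specificity 0.6667
```

Because the validation set holds 90 cases of each class, the balanced accuracy (the mean of sensitivity and specificity) equals the plain accuracy here.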
