[R] Data Partition (Splitting Data)

# data partition
install.packages('caret')
library(caret)
train <- createDataPartition(german$ID, p=0.7, list=FALSE)
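# Arguments: y (the vector the split is stratified on), p (proportion of rows used for training),
# list (TRUE returns a list, FALSE returns a matrix of row indices).
# Note: german$ID is numeric, so caret groups it by percentiles; pass the outcome (e.g. german$Credit_status) to preserve class proportions.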

training_data <- german[train,]
validation_data <- german[-train,]

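The createDataPartition function in the caret package provides a convenient way to split data.

The p argument sets the proportion of the data to sample. With p=0.7, about 70% of the rows are selected.

The sampled row indices can then be used to divide the data into a training set (the selected rows) and a validation set (the remaining rows).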
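
To make the idea of a class-preserving split concrete, here is a minimal base R sketch of the same kind of partition. It is not caret's implementation, and the toy data frame (700 "Y" rows and 300 "N" rows) is a hypothetical stand-in for german, since the original data is not included in this post.

```r
# Hypothetical stand-in for the german data: 700 "Y" and 300 "N" rows
set.seed(42)
toy <- data.frame(
  ID = 1:1000,
  Credit_status = factor(rep(c("Y", "N"), times = c(700, 300)))
)

# Sample 70% of the row indices within each class, then combine them
p <- 0.7
idx_by_class <- split(seq_len(nrow(toy)), toy$Credit_status)
train_idx <- unlist(lapply(idx_by_class, function(idx) {
  idx[sample.int(length(idx), size = round(p * length(idx)))]
}), use.names = FALSE)

training_toy <- toy[train_idx, ]
validation_toy <- toy[-train_idx, ]

# Both parts keep the original 70:30 ratio of Y to N
table(training_toy$Credit_status)
table(validation_toy$Credit_status)
```

Sampling inside each class separately is what keeps the class ratio the same in both sets; a plain sample() over all rows would only match it approximately.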

# sampling
random_sampling <- german[sample(1:nrow(german), 300, replace = FALSE),]

install.packages('sampling')
library(sampling)
stratified_sampling <- strata(german, stratanames = c("Credit_status"), size =c(100,100),
                              method="srswor")
#srswor=simple random sampling without replacement
st_data <- getdata(german, stratified_sampling)
summary(st_data) # 100 rows from each stratum

       ID         Checking_account Duration_in_month Credit_history    Purpose       Credit_amount   Saving_accout 
 Min.   :   4.0   Min.   :1.000    Min.   : 5.00     Min.   :0.00   Min.   : 0.000   Min.   :  362   Min.   :1.00  
 1st Qu.: 311.8   1st Qu.:1.000    1st Qu.:12.00     1st Qu.:2.00   1st Qu.: 0.000   1st Qu.: 1366   1st Qu.:1.00  
 Median : 701.0   Median :2.000    Median :18.00     Median :2.00   Median : 2.000   Median : 2499   Median :1.00  
 Mean   : 588.6   Mean   :2.275    Mean   :21.79     Mean   :2.49   Mean   : 2.765   Mean   : 3513   Mean   :1.84  
 3rd Qu.: 851.0   3rd Qu.:4.000    3rd Qu.:30.00     3rd Qu.:4.00   3rd Qu.: 3.000   3rd Qu.: 4485   3rd Qu.:2.00  
 Max.   :1000.0   Max.   :4.000    Max.   :60.00     Max.   :4.00   Max.   :10.000   Max.   :18424   Max.   :5.00  
 Present_employment Installment_rate Personal_status___sex Other_debtors_guarantors Present_residence    Property   
 Min.   :1.00       Min.   :1.000    Min.   :1.000         Min.   :1.000            Min.   :1.000     Min.   :1.00  
 1st Qu.:2.00       1st Qu.:2.000    1st Qu.:2.000         1st Qu.:1.000            1st Qu.:2.000     1st Qu.:2.00  
 Median :3.00       Median :3.000    Median :3.000         Median :1.000            Median :3.000     Median :2.50  
 Mean   :3.27       Mean   :2.815    Mean   :2.645         Mean   :1.155            Mean   :2.865     Mean   :2.43  
 3rd Qu.:4.00       3rd Qu.:4.000    3rd Qu.:3.000         3rd Qu.:1.000            3rd Qu.:4.000     3rd Qu.:3.00  
 Max.   :5.00       Max.   :4.000    Max.   :4.000         Max.   :3.000            Max.   :4.000     Max.   :4.00  
      Age        Other_installment_plan    Housing      Num_of_existing_credits      Job       Num_of_people_liable
 Min.   :19.00   Min.   :1.00           Min.   :1.000   Min.   :1.00            Min.   :1.00   Min.   :1.00        
 1st Qu.:25.75   1st Qu.:3.00           1st Qu.:2.000   1st Qu.:1.00            1st Qu.:3.00   1st Qu.:1.00        
 Median :31.00   Median :3.00           Median :2.000   Median :1.00            Median :3.00   Median :1.00        
 Mean   :34.52   Mean   :2.69           Mean   :1.985   Mean   :1.36            Mean   :2.91   Mean   :1.15        
 3rd Qu.:40.00   3rd Qu.:3.00           3rd Qu.:2.000   3rd Qu.:2.00            3rd Qu.:3.00   3rd Qu.:1.00        
 Max.   :74.00   Max.   :3.00           Max.   :3.000   Max.   :3.00            Max.   :4.00   Max.   :2.00        
   Telephone     Foreign_worker  Credit_status    ID_unit            Prob           Stratum   
 Min.   :1.000   Min.   :1.000   N:100         Min.   :   4.0   Min.   :0.1429   Min.   :1.0  
 1st Qu.:1.000   1st Qu.:1.000   Y:100         1st Qu.: 311.8   1st Qu.:0.1429   1st Qu.:1.0  
 Median :1.000   Median :1.000                 Median : 701.0   Median :0.2381   Median :1.5  
 Mean   :1.375   Mean   :1.025                 Mean   : 588.6   Mean   :0.2381   Mean   :1.5  
 3rd Qu.:2.000   3rd Qu.:1.000                 3rd Qu.: 851.0   3rd Qu.:0.3333   3rd Qu.:2.0  
 Max.   :2.000   Max.   :2.000                 Max.   :1000.0   Max.   :0.3333   Max.   :2.0  

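The strata function performs stratified sampling: it draws a random sample within each level of the chosen categorical variable.

In the example above, 100 rows were drawn from each of the N and Y levels of Credit_status. The Prob column in the output is each row's inclusion probability; the two values, 0.1429 and 0.3333, correspond to 100/700 and 100/300, so the two strata hold 700 and 300 rows.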
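
As a rough illustration of what method="srswor" does within each stratum, the following base R sketch draws 100 rows per class without replacement and computes the inclusion probabilities by hand. It does not use the sampling package, and the toy data frame is hypothetical: the stratum sizes of 700 and 300 are inferred from the Prob values above, and which class is the larger one is an assumption.

```r
# Hypothetical stand-in for the german data (assumed: 700 "Y" rows, 300 "N" rows)
set.seed(1)
toy <- data.frame(
  ID = 1:1000,
  Credit_status = factor(rep(c("Y", "N"), times = c(700, 300)))
)

n_per_stratum <- 100
idx_by_stratum <- split(seq_len(nrow(toy)), toy$Credit_status)

# Simple random sampling without replacement inside each stratum
picked <- unlist(lapply(idx_by_stratum, function(idx) {
  idx[sample.int(length(idx), size = n_per_stratum, replace = FALSE)]
}), use.names = FALSE)

st_toy <- toy[picked, ]

# Inclusion probability of a row = sample size / size of its stratum
stratum_size <- table(toy$Credit_status)
st_toy$Prob <- n_per_stratum /
  as.numeric(stratum_size[as.character(st_toy$Credit_status)])

table(st_toy$Credit_status)          # 100 rows in each class
round(sort(unique(st_toy$Prob)), 4)  # 0.1429 0.3333
```

Because the smaller stratum contributes the same number of rows as the larger one, its rows are more likely to be picked, which is why the two Prob values differ.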
