# data partition
install.packages('caret')
library(caret)
train <- createDataPartition(german$ID, p=0.7, list=FALSE)
# arguments: the vector to split on (the outcome/class), the training proportion, and whether to return a list or a matrix
training_data <- german[train,]
validation_data <- german[-train,]
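createDataPartition in the caret package provides a convenient way to split data.
The p argument sets the proportion of rows to sample (here 0.7, i.e. 70% of the rows go to training).
The sampled row indices can then be used to split the data into a training set and a validation set.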
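The split above can be checked on a small stand-in data set. The sketch below is only an illustration: the `toy` data frame and its 700/300 class mix are assumptions (the german data is not reproduced here), and it stratifies on the outcome column rather than on ID, so that both splits keep roughly the same class proportions.

```r
library(caret)

set.seed(42)  # make the random split reproducible

# Hypothetical stand-in for the german data: 700 "Y" rows and 300 "N" rows
toy <- data.frame(
  ID = 1:1000,
  Credit_status = factor(rep(c("Y", "N"), times = c(700, 300)))
)

# Sample about 70% of the rows within each level of Credit_status
idx <- createDataPartition(toy$Credit_status, p = 0.7, list = FALSE)
train_toy <- toy[idx, ]
valid_toy <- toy[-idx, ]

# Compare the class shares in the two splits
prop.table(table(train_toy$Credit_status))
prop.table(table(valid_toy$Credit_status))
```

Passing a factor as the first argument makes createDataPartition sample within each class, which is usually what you want when the classes are imbalanced.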
# sampling
random_sampling <- german[sample(1:nrow(german), 300, replace = FALSE),]
install.packages('sampling')
library(sampling)
stratified_sampling <- strata(german, stratanames = c("Credit_status"), size =c(100,100),
method="srswor")
#srswor=simple random sampling without replacement
st_data <- getdata(german, stratified_sampling)
summary(st_data) # 100 rows each for N and Y
ID Checking_account Duration_in_month Credit_history Purpose Credit_amount Saving_accout
Min. : 4.0 Min. :1.000 Min. : 5.00 Min. :0.00 Min. : 0.000 Min. : 362 Min. :1.00
1st Qu.: 311.8 1st Qu.:1.000 1st Qu.:12.00 1st Qu.:2.00 1st Qu.: 0.000 1st Qu.: 1366 1st Qu.:1.00
Median : 701.0 Median :2.000 Median :18.00 Median :2.00 Median : 2.000 Median : 2499 Median :1.00
Mean : 588.6 Mean :2.275 Mean :21.79 Mean :2.49 Mean : 2.765 Mean : 3513 Mean :1.84
3rd Qu.: 851.0 3rd Qu.:4.000 3rd Qu.:30.00 3rd Qu.:4.00 3rd Qu.: 3.000 3rd Qu.: 4485 3rd Qu.:2.00
Max. :1000.0 Max. :4.000 Max. :60.00 Max. :4.00 Max. :10.000 Max. :18424 Max. :5.00
Present_employment Installment_rate Personal_status___sex Other_debtors_guarantors Present_residence Property
Min. :1.00 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.000 Min. :1.00
1st Qu.:2.00 1st Qu.:2.000 1st Qu.:2.000 1st Qu.:1.000 1st Qu.:2.000 1st Qu.:2.00
Median :3.00 Median :3.000 Median :3.000 Median :1.000 Median :3.000 Median :2.50
Mean :3.27 Mean :2.815 Mean :2.645 Mean :1.155 Mean :2.865 Mean :2.43
3rd Qu.:4.00 3rd Qu.:4.000 3rd Qu.:3.000 3rd Qu.:1.000 3rd Qu.:4.000 3rd Qu.:3.00
Max. :5.00 Max. :4.000 Max. :4.000 Max. :3.000 Max. :4.000 Max. :4.00
Age Other_installment_plan Housing Num_of_existing_credits Job Num_of_people_liable
Min. :19.00 Min. :1.00 Min. :1.000 Min. :1.00 Min. :1.00 Min. :1.00
1st Qu.:25.75 1st Qu.:3.00 1st Qu.:2.000 1st Qu.:1.00 1st Qu.:3.00 1st Qu.:1.00
Median :31.00 Median :3.00 Median :2.000 Median :1.00 Median :3.00 Median :1.00
Mean :34.52 Mean :2.69 Mean :1.985 Mean :1.36 Mean :2.91 Mean :1.15
3rd Qu.:40.00 3rd Qu.:3.00 3rd Qu.:2.000 3rd Qu.:2.00 3rd Qu.:3.00 3rd Qu.:1.00
Max. :74.00 Max. :3.00 Max. :3.000 Max. :3.00 Max. :4.00 Max. :2.00
Telephone Foreign_worker Credit_status ID_unit Prob Stratum
Min. :1.000 Min. :1.000 N:100 Min. : 4.0 Min. :0.1429 Min. :1.0
1st Qu.:1.000 1st Qu.:1.000 Y:100 1st Qu.: 311.8 1st Qu.:0.1429 1st Qu.:1.0
Median :1.000 Median :1.000 Median : 701.0 Median :0.2381 Median :1.5
Mean :1.375 Mean :1.025 Mean : 588.6 Mean :0.2381 Mean :1.5
3rd Qu.:2.000 3rd Qu.:1.000 3rd Qu.: 851.0 3rd Qu.:0.3333 3rd Qu.:2.0
Max. :2.000 Max. :2.000 Max. :1000.0 Max. :0.3333 Max. :2.0
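The strata function performs stratified sampling: it draws a random sample within each level of the chosen categorical variable.
In the example above, 100 rows were drawn from each of the N and Y levels of Credit_status.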
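To make the idea concrete, the same kind of stratified draw can be sketched in base R, without the sampling package. This is not how strata is implemented, just a minimal equivalent of method="srswor" with equal stratum sizes. The `toy` data frame is hypothetical: its stratum sizes of 700 and 300 are an assumption, chosen because the Prob values in the summary above (0.1429 and 0.3333) are consistent with 100/700 and 100/300; which level is the larger one is also assumed.

```r
set.seed(42)  # make the random draw reproducible

# Hypothetical stand-in for the german data: 700 "Y" rows and 300 "N" rows
toy <- data.frame(
  ID = 1:1000,
  Credit_status = factor(rep(c("Y", "N"), times = c(700, 300)))
)

n_per_stratum <- 100

# Row indices grouped by level of the stratification variable
rows_by_level <- split(seq_len(nrow(toy)), toy$Credit_status)

# Simple random sampling without replacement inside each stratum
picked <- unlist(lapply(rows_by_level, function(rows) sample(rows, n_per_stratum)))

st_toy <- toy[picked, ]
table(st_toy$Credit_status)  # 100 rows in each level
```

Unlike getdata, this sketch does not add the ID_unit, Prob, and Stratum columns; the inclusion probability for a row would simply be n_per_stratum divided by the size of its stratum.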