Data Science Arts: Forming Train and Test Data Sets Using R

Before building any machine learning model, remember one of the golden rules of machine learning and modelling in general: models are built using training data and evaluated on testing data. The reason is overfitting: the accuracy of most models can be artificially increased to the point where they “learn” every single detail of the data used to build them; unfortunately, that usually means they lose the ability to generalise. That’s why we need unseen data (i.e., the testing set): if we overfit the training data, performance on the testing data will be poor. In real life, simple models often beat complex ones because they generalise much better.

We will do a random 70:30 split of our data set (70% for training models, 30% for evaluating them). For reproducibility, we need to set the seed of the random number generator, which means that every time I run the code, I’ll get the same train and test sets. Here is the code:

> # Fix the seed so the same split is reproduced on every run; 222 is an arbitrary choice
> set.seed(222)

> # Randomly pick the row indices of 70% of the observations (mydata has 400 rows)
> train_idx <- sample(1:nrow(mydata), size = floor(0.7 * nrow(mydata)))

> # Training set: only the sampled rows
> train <- mydata[train_idx, ]

> # Testing set: all remaining rows, i.e. the other 30%
> test <- mydata[-train_idx, ]

> nrow(train)
[1] 280

> nrow(test)
[1] 120
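
To see why the held-out 30% matters, here is a small sketch that applies the same 70:30 recipe to simulated data and compares training and testing error for polynomial models of increasing complexity. Everything in it is an assumption made for illustration and not part of the original data: the simulated data frame `sim`, the helper functions `rmse` and `fit_and_score`, the sample size of 50, and the choice of degrees 1, 3 and 15.

```r
# Simulated data (an assumption for illustration; this is not the post's mydata)
set.seed(222)
n <- 50
sim <- data.frame(x = runif(n, -3, 3))
sim$y <- sin(sim$x) + rnorm(n, sd = 0.3)

# Same 70:30 recipe as above
idx <- sample(1:nrow(sim), size = floor(0.7 * nrow(sim)))
sim_train <- sim[idx, ]
sim_test  <- sim[-idx, ]

# Root mean squared error between observed and predicted values
rmse <- function(actual, predicted) sqrt(mean((actual - predicted)^2))

# Fit a polynomial of the given degree on the training rows only,
# then report the error on both the training and the testing rows
fit_and_score <- function(degree) {
  fit <- lm(y ~ poly(x, degree), data = sim_train)
  c(degree     = degree,
    train_rmse = rmse(sim_train$y, predict(fit, newdata = sim_train)),
    test_rmse  = rmse(sim_test$y,  predict(fit, newdata = sim_test)))
}

# One row per model: a straight line, a cubic, and a very flexible polynomial
scores <- t(sapply(c(1, 3, 15), fit_and_score))
print(scores)
```

The training error can only go down as the degree grows, because each model contains the simpler ones as special cases. The testing error carries no such guarantee, so comparing the two columns is what reveals whether the extra flexibility actually helps on unseen data.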

You can use ggplot2 (library(ggplot2)) to plot the train and test data: combine both sets into a single data frame with a column marking which set each row belongs to, then plot it.
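
One possible way to do this is sketched below. Since the post does not show the columns of `mydata`, the sketch builds a stand-in with two hypothetical numeric columns, `x1` and `x2`; the label column `set` and the data frame `plot_df` are also names chosen here for illustration. Swap in your own columns as needed.

```r
library(ggplot2)

# Stand-in for the post's mydata (assumption): 400 rows, two numeric columns
set.seed(222)
mydata <- data.frame(x1 = rnorm(400), x2 = rnorm(400))

# Same 70:30 split as above
idx   <- sample(1:nrow(mydata), size = floor(0.7 * nrow(mydata)))
train <- mydata[idx, ]
test  <- mydata[-idx, ]

# Stack both sets into one data frame, with a column saying where each row came from
plot_df <- rbind(
  data.frame(train, set = "train"),
  data.frame(test,  set = "test")
)

# Scatter plot coloured by set membership
p <- ggplot(plot_df, aes(x = x1, y = x2, colour = set)) +
  geom_point(alpha = 0.7) +
  labs(title = "Train vs. test observations", colour = "Set")

print(p)
```

With a purely random split, the two colours should be scattered over the same region; a visible separation between them would suggest the split was not random.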
