R: Exploring Data for Machine Learning Modeling

These are my notes on the Practical Machine Learning course (Week 2: Plotting Predictors - Tutorial).

When exploring data for Machine Learning, we're looking for:

imbalanced outcomes/predictors
outliers
groups of outcome points not explained by any of the predictors
skewed variables (that need to be transformed)
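As a rough illustration of the checks listed above, here is a minimal base R sketch. It runs on simulated stand-in data (the `d` data frame and the `skewness` helper are my own assumptions, not part of the course material), so the numbers themselves mean nothing; only the pattern matters.

```r
# Simulated stand-in data (hypothetical; not the Wage dataset)
set.seed(1)
d <- data.frame(
  outcome = factor(sample(c("yes", "no"), 200, replace = TRUE, prob = c(0.1, 0.9))),
  income  = rlnorm(200, meanlog = 4.6, sdlog = 0.4)  # right-skewed by construction
)

# Imbalance: share of each outcome class
class_share <- prop.table(table(d$outcome))

# Skewness: sample version of the third standardized moment
# (positive values indicate a long right tail)
skewness <- function(x) mean((x - mean(x))^3) / sd(x)^3
skew_raw <- skewness(d$income)
skew_log <- skewness(log(d$income))  # a log transform reduces right skew

# Outliers: values beyond 1.5 * IQR from the hinges (the boxplot rule)
outlier_values <- boxplot.stats(d$income)$out
```

Comparing `skew_raw` with `skew_log` gives a quick hint on whether a log transform is worth trying, which is presumably why the Wage dataset ships a `logwage` column next to `wage`.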

We'll use the Wage dataset from the ISLR package.
This dataset reports wage and other data (age, education, jobclass, etc.) for a group of 3000 male workers in the Mid-Atlantic region.

library(ISLR)
data(Wage)
summary(Wage)

##       year           age               sex                    maritl           race     
##  Min.   :2003   Min.   :18.00   1. Male  :3000   1. Never Married: 648   1. White:2480  
##  1st Qu.:2004   1st Qu.:33.75   2. Female:   0   2. Married      :2074   2. Black: 293  
##  Median :2006   Median :42.00                    3. Widowed      :  19   3. Asian: 190  
##  Mean   :2006   Mean   :42.41                    4. Divorced     : 204   4. Other:  37  
##  3rd Qu.:2008   3rd Qu.:51.00                    5. Separated    :  55                  
##  Max.   :2009   Max.   :80.00                                                           
##                                                                                         
##               education                     region               jobclass               health    
##  1. < HS Grad      :268   2. Middle Atlantic   :3000   1. Industrial :1544   1. <=Good     : 858  
##  2. HS Grad        :971   1. New England       :   0   2. Information:1456   2. >=Very Good:2142  
##  3. Some College   :650   3. East North Central:   0                                              
##  4. College Grad   :685   4. West North Central:   0                                              
##  5. Advanced Degree:426   5. South Atlantic    :   0                                              
##                           6. East South Central:   0                                              
##                           (Other)              :   0                                              
##   health_ins      logwage           wage       
##  1. Yes:2083   Min.   :3.000   Min.   : 20.09  
##  2. No : 917   1st Qu.:4.447   1st Qu.: 85.38  
##                Median :4.653   Median :104.92  
##                Mean   :4.654   Mean   :111.70  
##                3rd Qu.:4.857   3rd Qu.:128.68  
##                Max.   :5.763   Max.   :318.34  
##

Building training and testing sets (50% of the Wage dataset each):

library(caret)
# Row indices of the training half (list = FALSE returns a matrix of indices)
intrain <- createDataPartition(y = Wage$wage, p = 0.5, list = FALSE)
training <- Wage[intrain, ]
testing <- Wage[-intrain, ]
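As far as I understand the caret documentation, when the outcome is numeric createDataPartition bins it into percentile groups and samples within each group, so that training and testing have similar outcome distributions. The sketch below imitates that idea in base R on simulated data. The `stratified_split` helper is hypothetical and is not caret's implementation; setting a seed first makes the split reproducible.

```r
# Hypothetical helper: rough base R imitation of a quantile-stratified split
stratified_split <- function(y, p = 0.5, groups = 5) {
  # Bin the outcome into quantile groups
  breaks <- unique(quantile(y, probs = seq(0, 1, length.out = groups + 1)))
  bins <- cut(y, breaks = breaks, include.lowest = TRUE)
  # Sample a proportion p of the row indices inside each bin
  idx <- lapply(split(seq_along(y), bins), function(i) {
    i[sample.int(length(i), size = ceiling(p * length(i)))]
  })
  sort(unlist(idx, use.names = FALSE))
}

set.seed(123)                                   # fix the seed so the split is reproducible
y <- rlnorm(1000, meanlog = 4.6, sdlog = 0.35)  # simulated outcome, not Wage$wage
in_train <- stratified_split(y, p = 0.5)
train_y <- y[in_train]
test_y  <- y[-in_train]

# The two halves should have similar quantiles
rbind(train = quantile(train_y), test = quantile(test_y))
```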

The exploration is always done on the training set only, so that the testing set stays untouched for the final evaluation.

Plotting predictors against outcome

Plotting wage versus age, education and jobclass using the R featurePlot function (from the caret package):

featurePlot(x = training[, c("age", "education", "jobclass")], y = training$wage, plot = "pairs")

Plotting wage versus age:

library(ggplot2)
qplot(age, wage, data = training)

The graph shows some patterns: a trend in wage with age, and a group of outlying observations (above 250 dollars raw wage).

Plotting wage versus age, grouping by jobclass:

qplot(age, wage, color = jobclass, data = training)

The jobclass difference could explain the two distinct groups.
The jobclass variable might be able to explain at least part of the variability that appears at the top of the plot.

Plotting wage versus age, grouping by education, adding regression smoothers:

qq <- qplot(age, wage, color = education, data = training)
qq + geom_smooth(method = "lm", formula = y ~ x)

The "Advanced Degree" education level also seems to explain a lot of the variation at the top.
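Since color is mapped to education, the regression smoothers above amount to one straight-line fit of wage on age per education level. A minimal sketch of the same computation in base R, on simulated data (the `sim` data frame and its coefficients are made up for illustration, not taken from the Wage dataset):

```r
# Simulated stand-in data (hypothetical; not the Wage dataset)
set.seed(42)
n <- 300
sim <- data.frame(
  age = runif(n, 18, 80),
  education = factor(sample(c("HS Grad", "College Grad", "Advanced Degree"),
                            n, replace = TRUE))
)
base_wage <- c("HS Grad" = 90, "College Grad" = 120, "Advanced Degree" = 150)
sim$wage <- unname(base_wage[as.character(sim$education)]) +
  0.3 * sim$age + rnorm(n, sd = 20)

# One linear fit per education level: the lines a per-group "lm" smoother draws
fits <- lapply(split(sim, sim$education), function(g) coef(lm(wage ~ age, data = g)))
coef_table <- do.call(rbind, fits)  # one row per level: intercept and age slope
coef_table
```

Looking at the intercepts and slopes side by side gives a numeric counterpart to the colored lines on the plot.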

Data Distribution

Breaking up the wage variable into three groups of roughly equal size (the result is a factor with three levels) with the R cut2 function (from the Hmisc package):

library(Hmisc)
cutWage <- cut2(training$wage, g = 3)
table(cutWage)

## cutWage
## [ 20.1, 91.7) [ 91.7,118.9) [118.9,314.3] 
##           506           519           476
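For reference, a similar quantile-based grouping can be sketched in base R with cut and quantile. This is an approximation under my own assumptions rather than a drop-in replacement for cut2: the data are simulated, and cut builds right-closed intervals by default, whereas the cut2 output above shows left-closed ones.

```r
set.seed(7)
wage_sim <- rlnorm(1500, meanlog = 4.6, sdlog = 0.35)  # simulated values, not the Wage data

# Tertile break points: minimum, 1/3 quantile, 2/3 quantile, maximum
breaks <- quantile(wage_sim, probs = c(0, 1/3, 2/3, 1))

# include.lowest = TRUE keeps the minimum value inside the first group
wage_group <- cut(wage_sim, breaks = breaks, include.lowest = TRUE)
group_sizes <- table(wage_group)
group_sizes
```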

Looking at the distribution of jobclass across these groups (tabulated further down), there are more industrial jobs than information jobs in the low-wage group; the trend reverses in the high-wage group.

Plotting boxplots of age for each of the wage groups created above:

qplot(cutWage, age, data = training, fill = cutWage, geom = "boxplot")

Exploring the distribution of jobclass across wage groups:

t1 <- table(cutWage, training$jobclass)
t1

##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 91.7)           313            193
##   [ 91.7,118.9)           262            257
##   [118.9,314.3]           190            286

Using the prop.table function to get the proportion of each jobclass within each wage group (the second argument, 1, means proportions are computed by row):

prop.table(t1, 1)

##                
## cutWage         1. Industrial 2. Information
##   [ 20.1, 91.7)     0.6185771      0.3814229
##   [ 91.7,118.9)     0.5048170      0.4951830
##   [118.9,314.3]     0.3991597      0.6008403

62% of the low-wage jobs are industrial jobs and 38% are information jobs.
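To make the arithmetic explicit, the sketch below rebuilds the count table printed above by hand (so it runs without the dataset) and checks that the row proportions are just each count divided by its row total, e.g. 313 / (313 + 193) for the first cell. The object names are mine.

```r
# Counts copied from the jobclass table above
counts <- matrix(
  c(313, 193,
    262, 257,
    190, 286),
  nrow = 3, byrow = TRUE,
  dimnames = list(
    cutWage  = c("[ 20.1, 91.7)", "[ 91.7,118.9)", "[118.9,314.3]"),
    jobclass = c("1. Industrial", "2. Information")
  )
)

row_share <- prop.table(counts, margin = 1)  # each row sums to 1
manual    <- counts / rowSums(counts)        # same thing, written out
col_share <- prop.table(counts, margin = 2)  # each column sums to 1 instead

all.equal(row_share, manual)
round(row_share, 3)
```

With margin = 2 the question changes from "what share of each wage group is industrial?" to "what share of industrial jobs falls in each wage group?", so it is worth checking which one is needed.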

Density Plots

Density plots can be a more effective way than boxplots to view the distribution of a continuous variable, especially when overlaying several groups.

Plotting a density plot of the values of wages, grouping by education:

qplot(wage, color = education, data = training, geom = "density")

The "< HS Grad" workers tend to have more values in the lower part of the wage range. There is a separate group of Advanced Degree and College Grad workers with higher wages.
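The density curves above are kernel density estimates computed separately for each group. A comparable per-group estimate can be sketched in base R with the density function, here on simulated data (the two groups and their parameters are invented for illustration, and I am assuming default bandwidth settings are good enough for a first look):

```r
# Simulated stand-in data (hypothetical; not the Wage dataset)
set.seed(99)
sim <- data.frame(
  education = rep(c("< HS Grad", "Advanced Degree"), each = 400),
  wage = c(rlnorm(400, log(85), 0.25), rlnorm(400, log(150), 0.30))
)

# One kernel density estimate per education level
dens <- lapply(split(sim$wage, sim$education), density)

# Location of the peak of each curve
modes <- sapply(dens, function(d) d$x[which.max(d$y)])
modes

# Overlay the curves on a common plot
plot(dens[[2]], main = "Simulated wage density by education", xlim = range(sim$wage))
lines(dens[[1]], lty = 2)
```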

In the next post, we'll fit a linear model with wage as the outcome and age, jobclass and education as predictors, and run some diagnostics on it.

Part 1 of the Machine Learning Modeling series