As we mentioned in Chapter 15, variable selection is very important when dealing with bioinformatics, healthcare, and biomedical data, where we may have more features than observations. Variable selection, or feature selection, helps us focus on the core information contained in the observations instead of every piece of information. Due to the presence of intrinsic and extrinsic noise, the volume and complexity of big health data, and various methodological and technological challenges, identifying the salient features may resemble finding a needle in a haystack. Here, we will illustrate alternative strategies for feature selection using filtering (e.g., correlation-based feature selection), wrapping (e.g., recursive feature elimination), and embedding (e.g., variable importance via random forest classification) techniques.

The next Chapter, Chapter 17, provides the details of another powerful technique for variable-selection using decoy features to control for the false discovery rate of inconsequential features.

1 Feature selection methods

There are three major classes of variable or feature selection techniques: filtering-based, wrapper-based, and embedded methods.

1.1 Filtering techniques

Univariate: Univariate filtering methods select individual features that score highly on some statistic, such as $\chi^2$ or the Information Gain Ratio. Each feature is viewed as independent of the others, effectively ignoring interactions between features.
- Examples: $\chi^2$, Euclidean distance, $t$-test, and Information gain.
Multivariate: Multivariate filtering methods rely on various (multivariate) statistics to select the principal features. They typically account for between-feature interactions by using higher-order statistics like correlation. The basic idea is that we iteratively triage variables that have high correlations with other features.
- Examples: Correlation-based feature selection, Markov blanket filter, and fast correlation-based feature selection.
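The two filtering flavors can be sketched in a few lines of base R. This is only an illustration on simulated data: the feature names `x1` to `x6`, the absolute-correlation score, and the 0.9 redundancy cutoff are all assumptions made for the example, not choices prescribed by any of the methods listed above.

```r
# Simulated data: 200 observations, 6 features (all names are hypothetical)
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.1)   # nearly a copy of x1 (redundant feature)
x3 <- rnorm(n)
x4 <- rnorm(n)                  # noise
x5 <- rnorm(n)                  # noise
x6 <- rnorm(n)                  # noise
X  <- data.frame(x1, x2, x3, x4, x5, x6)
y  <- 2 * x1 - 1.5 * x3 + rnorm(n)

# Univariate filter: score each feature on its own, ignoring all the others
scores <- sapply(X, function(col) abs(cor(col, y)))
ranked <- names(sort(scores, decreasing = TRUE))

# Multivariate (correlation-based) filter: walk down the ranking and drop any
# feature that is highly correlated with a feature that was already kept
keep <- character(0)
for (f in ranked) {
  if (length(keep) == 0 || all(abs(cor(X[[f]], X[keep])) < 0.9)) {
    keep <- c(keep, f)
  }
}

print(round(scores, 3))   # univariate scores
print(keep)               # features surviving the redundancy filter
```

The univariate ranking happily retains both `x1` and its near-duplicate `x2`, whereas the correlation-based pass keeps only one of them. That difference is the practical meaning of "accounting for between-feature interactions".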

1.2 Wrapper techniques

Deterministic: Deterministic wrapper feature selection methods either start with no features (forward-selection) or with all features included in the model (backward-selection) and iteratively refine the set of chosen features according to some model quality measures. The iterative process of adding or removing features may rely on statistics like the Jaccard similarity coefficient.
- Examples: Sequential forward selection, Recursive Feature Elimination, Plus $q$ take-away $r$, and Beam search.
Randomized: Stochastic wrapper feature selection procedures utilize a binary feature-indexing vector indicating whether or not each variable should be included in the list of salient features. At each iteration, we randomly perturb the binary indicator vector and compare the combinations of features before and after the random inclusion-exclusion change, accepting or discarding the change according to some metric, such as an acceptance probability. The iterative process continues until no improvement of the objective function is observed. Finally, we pick the indexing vector corresponding to the optimal performance.
- Examples: Simulated annealing, Genetic algorithms, Estimation of distribution algorithms.
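A randomized wrapper can be sketched as a small simulated-annealing loop over the binary indicator vector. Everything specific below is an assumption chosen for illustration: the simulated data, the use of a linear model's BIC as the objective, the starting temperature, and the cooling rate.

```r
set.seed(2)
n <- 150; p <- 8
X <- as.data.frame(matrix(rnorm(n * p), n, p))
names(X) <- paste0("x", 1:p)               # hypothetical feature names
y <- 3 * X$x1 - 2 * X$x4 + rnorm(n)        # only x1 and x4 carry signal

# Objective (lower is better): BIC of a linear model restricted to the
# features flagged by the 0/1 indicator vector z
objective <- function(z) {
  if (sum(z) == 0) return(BIC(lm(y ~ 1)))
  d <- data.frame(y = y, X[, z == 1, drop = FALSE])
  BIC(lm(y ~ ., data = d))
}

z        <- rbinom(p, 1, 0.5)              # random starting indicator vector
cur_val  <- objective(z)
best     <- z
best_val <- cur_val
temp     <- 5                              # initial "temperature"

for (iter in 1:300) {
  cand    <- z
  j       <- sample(p, 1)
  cand[j] <- 1 - cand[j]                   # flip one inclusion indicator
  cand_val <- objective(cand)
  # Always accept improvements; accept deteriorations with a probability
  # that shrinks as the temperature decreases
  if (cand_val < cur_val || runif(1) < exp((cur_val - cand_val) / temp)) {
    z <- cand
    cur_val <- cand_val
  }
  if (cur_val < best_val) {
    best <- z
    best_val <- cur_val
  }
  temp <- temp * 0.98                      # cooling schedule
}

selected <- names(X)[best == 1]
print(selected)
```

Swapping the objective for a cross-validated error of any other learner turns the same loop into a wrapper around that learner. The random flips are what distinguish it from the deterministic forward and backward searches.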

1.3 Embedded Techniques

Embedded feature selection techniques are based on various classifiers, predictors, or clustering procedures. For instance, we can accomplish feature selection by using decision trees, where the separation of the training data relies on the features associated with the highest information gain. Further tree branching, separating the data deeper, may utilize weaker features. This process of choosing the vital features based on their separability characteristics continues until the classifier generates group labels that are mostly homogeneous within clusters/classes and largely heterogeneous across groups, and the information gain of further tree branching is marginal. The entire process may be iterated multiple times, after which we select the features that appear most frequently.
- Examples: Decision trees, random forests, weighted naive Bayes, and feature selection using weighted-SVM.
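To make the information-gain criterion concrete, the following base R sketch computes, for each feature, the best information gain achievable by a single threshold split, which is how a decision tree would choose its root node. The simulated data and the helper names `entropy()` and `best_gain()` are hypothetical. A real tree learner would repeat this search recursively within each branch.

```r
set.seed(3)
n <- 300
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n), x4 = rnorm(n))
# The class label depends strongly on x1, weakly on x2, and not on x3 or x4
d$class <- factor(ifelse(d$x1 + 0.5 * d$x2 + rnorm(n, sd = 0.3) > 0, "A", "B"))

# Shannon entropy (in bits) of a vector of class labels
entropy <- function(labels) {
  p <- table(labels) / length(labels)
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Largest information gain achievable by one threshold split on feature x
best_gain <- function(x, labels) {
  v    <- sort(unique(x))
  cuts <- (head(v, -1) + tail(v, -1)) / 2     # midpoints between sorted values
  gains <- sapply(cuts, function(cc) {
    left  <- labels[x <= cc]
    right <- labels[x >  cc]
    entropy(labels) -
      (length(left) * entropy(left) + length(right) * entropy(right)) /
      length(labels)
  })
  max(gains)
}

gain <- sapply(d[, c("x1", "x2", "x3", "x4")], best_gain, labels = d$class)
root_feature <- names(which.max(gain))
print(round(gain, 3))
print(root_feature)
```

Features that never win such a comparison at any depth are the ones an embedded method implicitly discards.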

The different types of feature selection methods have their own pros and cons. In this chapter, we introduce a randomized wrapper method using the Boruta package, which relies on random forest classification to output variable importance measures (VIMs). Then, we will compare its results with Recursive Feature Elimination, a classical deterministic wrapper method.

2 Case Study - ALS

2.1 Step 1: Collecting Data

First things first, let’s explore the dataset we will be using. Case Study 15 examines the patterns, symmetries, associations and causality in a rare but devastating disease, amyotrophic lateral sclerosis (ALS), also known as Lou Gehrig disease. This ALS case-study reflects a large clinical trial including big, multi-source and heterogeneous datasets. It would be interesting to interrogate the data and attempt to derive potential biomarkers that can be used for detecting, prognosticating, and forecasting the progression of this neurodegenerative disorder. Establishing complete, efficient, and reproducible protocols for such complex data requires overcoming many scientific, technical and infrastructure barriers. These pipeline workflows start with ingesting the raw data and continue with preprocessing, aggregating, harmonizing, analyzing, visualizing and interpreting the findings.

In this case-study, we use the training dataset that contains 2,223 observations and 101 numeric variables (a record ID, the outcome, and 99 candidate predictors, as the data summary in Step 2 shows). We select ALSFRS slope as our outcome variable, as it captures the patients’ clinical decline over a year. Although we have more observations than features, this is one of the examples where multiple features are highly correlated. Therefore, we need to preprocess the variables before commencing with feature selection.

2.2 Step 2: Exploring and preparing the data

The dataset is located in our case-studies archive. We can use read.csv() to directly import the CSV dataset into R using the URL reference.

ALS.train<-read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
summary(ALS.train)

##        ID            Age_mean      Albumin_max    Albumin_median 
##  Min.   :   1.0   Min.   :18.00   Min.   :37.00   Min.   :34.50  
##  1st Qu.: 614.5   1st Qu.:47.00   1st Qu.:45.00   1st Qu.:42.00  
##  Median :1213.0   Median :55.00   Median :47.00   Median :44.00  
##  Mean   :1214.9   Mean   :54.55   Mean   :47.01   Mean   :43.95  
##  3rd Qu.:1815.5   3rd Qu.:63.00   3rd Qu.:49.00   3rd Qu.:46.00  
##  Max.   :2424.0   Max.   :81.00   Max.   :70.30   Max.   :51.10  
##   Albumin_min    Albumin_range       ALSFRS_slope     ALSFRS_Total_max
##  Min.   :24.00   Min.   :0.000000   Min.   :-4.3452   Min.   :11.00   
##  1st Qu.:39.00   1st Qu.:0.009042   1st Qu.:-1.0863   1st Qu.:29.00   
##  Median :41.00   Median :0.012111   Median :-0.6207   Median :33.00   
##  Mean   :40.77   Mean   :0.013779   Mean   :-0.7283   Mean   :31.69   
##  3rd Qu.:43.00   3rd Qu.:0.015873   3rd Qu.:-0.2838   3rd Qu.:36.00   
##  Max.   :49.00   Max.   :0.243902   Max.   : 1.2070   Max.   :40.00   
##  ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range ALT.SGPT._max   
##  Min.   : 2.5        Min.   : 0.00    Min.   :0.00000    Min.   : 10.00  
##  1st Qu.:23.0        1st Qu.:14.00    1st Qu.:0.01404    1st Qu.: 32.00  
##  Median :28.0        Median :20.00    Median :0.02330    Median : 45.00  
##  Mean   :27.1        Mean   :19.88    Mean   :0.02604    Mean   : 54.44  
##  3rd Qu.:32.0        3rd Qu.:27.00    3rd Qu.:0.03480    3rd Qu.: 65.00  
##  Max.   :40.0        Max.   :40.00    Max.   :0.11765    Max.   :944.00  
##  ALT.SGPT._median ALT.SGPT._min    ALT.SGPT._range    AST.SGOT._max   
##  Min.   :  8.00   Min.   :  1.60   Min.   :0.002747   Min.   : 11.00  
##  1st Qu.: 22.00   1st Qu.: 15.00   1st Qu.:0.030303   1st Qu.: 30.00  
##  Median : 30.00   Median : 21.00   Median :0.047619   Median : 38.00  
##  Mean   : 32.99   Mean   : 23.01   Mean   :0.071137   Mean   : 43.13  
##  3rd Qu.: 40.00   3rd Qu.: 28.00   3rd Qu.:0.077539   3rd Qu.: 48.00  
##  Max.   :193.00   Max.   :109.00   Max.   :2.383117   Max.   :911.00  
##  AST.SGOT._median AST.SGOT._min   AST.SGOT._range   Bicarbonate_max
##  Min.   :  9.00   Min.   : 1.00   Min.   :0.00000   Min.   :20.0   
##  1st Qu.: 22.00   1st Qu.:17.00   1st Qu.:0.02352   1st Qu.:29.0   
##  Median : 27.00   Median :20.00   Median :0.03502   Median :31.0   
##  Mean   : 29.08   Mean   :21.54   Mean   :0.04919   Mean   :30.9   
##  3rd Qu.: 34.00   3rd Qu.:25.00   3rd Qu.:0.05243   3rd Qu.:32.0   
##  Max.   :100.00   Max.   :86.00   Max.   :1.91667   Max.   :52.0   
##  Bicarbonate_median Bicarbonate_min Bicarbonate_range
##  Min.   :19.50      Min.   : 2.50   Min.   :0.00000  
##  1st Qu.:26.00      1st Qu.:22.00   1st Qu.:0.01266  
##  Median :27.00      Median :23.00   Median :0.01493  
##  Mean   :26.96      Mean   :23.16   Mean   :0.01687  
##  3rd Qu.:28.00      3rd Qu.:24.45   3rd Qu.:0.01815  
##  Max.   :39.50      Max.   :34.00   Max.   :0.21429  
##  Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
##  Min.   : 2.921                Min.   : 2.191                  
##  1st Qu.: 5.842                1st Qu.: 4.640                  
##  Median : 6.937                Median : 5.423                  
##  Mean   : 7.353                Mean   : 5.558                  
##  3rd Qu.: 8.210                3rd Qu.: 6.353                  
##  Max.   :25.192                Max.   :11.866                  
##  Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range
##  Min.   : 0.5842               Min.   :0.000000               
##  1st Qu.: 3.2859               1st Qu.:0.004109               
##  Median : 4.0700               Median :0.005817               
##  Mean   : 4.1609               Mean   :0.007133               
##  3rd Qu.: 5.0000               3rd Qu.:0.008353               
##  Max.   :10.2228               Max.   :0.069543               
##  bp_diastolic_max bp_diastolic_median bp_diastolic_min bp_diastolic_range
##  Min.   : 70.00   Min.   : 56.00      Min.   : 20.00   Min.   :0.00000   
##  1st Qu.: 88.00   1st Qu.: 78.00      1st Qu.: 65.00   1st Qu.:0.03527   
##  Median : 90.00   Median : 80.00      Median : 70.00   Median :0.04337   
##  Mean   : 92.03   Mean   : 81.11      Mean   : 69.89   Mean   :0.04766   
##  3rd Qu.: 98.00   3rd Qu.: 85.00      3rd Qu.: 75.00   3rd Qu.:0.05435   
##  Max.   :140.00   Max.   :110.00      Max.   :100.00   Max.   :0.71429   
##  bp_systolic_max bp_systolic_median bp_systolic_min bp_systolic_range
##  Min.   :100.0   Min.   : 90.0      Min.   : 72.0   Min.   :0.00000  
##  1st Qu.:138.0   1st Qu.:120.0      1st Qu.:108.0   1st Qu.:0.05272  
##  Median :145.0   Median :130.0      Median :110.0   Median :0.06494  
##  Mean   :147.1   Mean   :129.6      Mean   :113.4   Mean   :0.07118  
##  3rd Qu.:157.0   3rd Qu.:136.0      3rd Qu.:120.0   3rd Qu.:0.08190  
##  Max.   :220.0   Max.   :190.0      Max.   :165.0   Max.   :0.40462  
##   Calcium_max    Calcium_median   Calcium_min     Calcium_range      
##  Min.   :2.171   Min.   :2.046   Min.   :0.2438   Min.   :0.0000000  
##  1st Qu.:2.400   1st Qu.:2.283   1st Qu.:2.1707   1st Qu.:0.0003741  
##  Median :2.470   Median :2.345   Median :2.2300   Median :0.0004739  
##  Mean   :2.475   Mean   :2.346   Mean   :2.2229   Mean   :0.0005407  
##  3rd Qu.:2.530   3rd Qu.:2.400   3rd Qu.:2.2977   3rd Qu.:0.0005893  
##  Max.   :9.460   Max.   :2.800   Max.   :2.6500   Max.   :0.0129009  
##   Chloride_max   Chloride_median  Chloride_min    Chloride_range   
##  Min.   : 96.0   Min.   : 90.0   Min.   : 76.00   Min.   :0.00000  
##  1st Qu.:106.0   1st Qu.:102.0   1st Qu.: 98.00   1st Qu.:0.01250  
##  Median :107.0   Median :104.0   Median :100.00   Median :0.01587  
##  Mean   :107.2   Mean   :103.5   Mean   : 99.26   Mean   :0.01787  
##  3rd Qu.:109.0   3rd Qu.:105.0   3rd Qu.:101.00   3rd Qu.:0.01990  
##  Max.   :119.0   Max.   :111.0   Max.   :109.00   Max.   :0.21429  
##  Creatinine_max   Creatinine_median Creatinine_min   Creatinine_range 
##  Min.   : 22.00   Min.   : 18.00    Min.   :  0.00   Min.   :0.00000  
##  1st Qu.: 65.00   1st Qu.: 53.04    1st Qu.: 39.00   1st Qu.:0.03824  
##  Median : 79.56   Median : 62.00    Median : 53.00   Median :0.04865  
##  Mean   : 78.78   Mean   : 65.19    Mean   : 51.98   Mean   :0.05842  
##  3rd Qu.: 88.40   3rd Qu.: 78.85    3rd Qu.: 61.88   3rd Qu.:0.07026  
##  Max.   :248.00   Max.   :176.80    Max.   :167.96   Max.   :0.42095  
##   Gender_mean     Glucose_max     Glucose_median    Glucose_min    
##  Min.   :1.000   Min.   : 4.160   Min.   : 3.497   Min.   : 0.000  
##  1st Qu.:1.000   1st Qu.: 5.827   1st Qu.: 4.911   1st Qu.: 4.051  
##  Median :2.000   Median : 6.500   Median : 5.300   Median : 4.440  
##  Mean   :1.637   Mean   : 7.160   Mean   : 5.487   Mean   : 4.265  
##  3rd Qu.:2.000   3rd Qu.: 7.600   3rd Qu.: 5.695   3rd Qu.: 4.800  
##  Max.   :2.000   Max.   :33.688   Max.   :26.196   Max.   :12.200  
##  Glucose_range        hands_max      hands_median     hands_min    
##  Min.   :0.000000   Min.   :0.000   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.003051   1st Qu.:5.000   1st Qu.:3.000   1st Qu.:0.000  
##  Median :0.004695   Median :7.000   Median :5.500   Median :3.000  
##  Mean   :0.006319   Mean   :6.181   Mean   :4.905   Mean   :3.047  
##  3rd Qu.:0.007373   3rd Qu.:8.000   3rd Qu.:7.000   3rd Qu.:5.000  
##  Max.   :0.097463   Max.   :8.000   Max.   :8.000   Max.   :8.000  
##   hands_range       Hematocrit_max   Hematocrit_median Hematocrit_min  
##  Min.   :0.000000   Min.   : 0.373   Min.   : 0.362    Min.   : 0.311  
##  1st Qu.:0.003610   1st Qu.:42.300   1st Qu.:40.000    1st Qu.:37.000  
##  Median :0.006652   Median :45.200   Median :42.600    Median :40.000  
##  Mean   :0.006883   Mean   :41.939   Mean   :39.467    Mean   :36.962  
##  3rd Qu.:0.009513   3rd Qu.:47.700   3rd Qu.:45.000    3rd Qu.:42.700  
##  Max.   :0.042857   Max.   :81.000   Max.   :56.000    Max.   :52.900  
##  Hematocrit_range   Hemoglobin_max  Hemoglobin_median Hemoglobin_min   
##  Min.   :0.000000   Min.   :116.0   Min.   :106.0     Min.   :  6.204  
##  1st Qu.:0.007164   1st Qu.:144.0   1st Qu.:136.0     1st Qu.:128.000  
##  Median :0.009701   Median :152.0   Median :145.0     Median :136.000  
##  Mean   :0.011431   Mean   :152.1   Mean   :144.3     Mean   :135.461  
##  3rd Qu.:0.013579   3rd Qu.:160.0   3rd Qu.:152.0     3rd Qu.:145.000  
##  Max.   :0.185714   Max.   :280.0   Max.   :182.0     Max.   :180.000  
##  Hemoglobin_range     leg_max       leg_median      leg_min     
##  Min.   :0.00000   Min.   :0.00   Min.   :0.00   Min.   :0.000  
##  1st Qu.:0.02321   1st Qu.:3.00   1st Qu.:2.50   1st Qu.:1.000  
##  Median :0.03106   Median :5.00   Median :3.00   Median :2.000  
##  Mean   :0.03824   Mean   :5.31   Mean   :4.05   Mean   :2.493  
##  3rd Qu.:0.04205   3rd Qu.:8.00   3rd Qu.:6.00   3rd Qu.:3.000  
##  Max.   :0.56180   Max.   :8.00   Max.   :8.00   Max.   :8.000  
##    leg_range          mouth_max      mouth_median      mouth_min     
##  Min.   :0.000000   Min.   : 1.00   Min.   : 0.000   Min.   : 0.000  
##  1st Qu.:0.003378   1st Qu.:10.00   1st Qu.: 8.000   1st Qu.: 5.000  
##  Median :0.005435   Median :12.00   Median :11.000   Median : 9.000  
##  Mean   :0.006163   Mean   :10.74   Mean   : 9.703   Mean   : 7.778  
##  3rd Qu.:0.008718   3rd Qu.:12.00   3rd Qu.:12.000   3rd Qu.:11.000  
##  Max.   :0.042017   Max.   :12.00   Max.   :12.000   Max.   :12.000  
##   mouth_range       onset_delta_mean onset_site_mean Platelets_max  
##  Min.   :0.000000   Min.   :-3119    Min.   :1.000   Min.   : 84.0  
##  1st Qu.:0.001815   1st Qu.: -887    1st Qu.:2.000   1st Qu.:239.0  
##  Median :0.005329   Median : -572    Median :2.000   Median :275.0  
##  Mean   :0.006595   Mean   : -683    Mean   :1.801   Mean   :285.3  
##  3rd Qu.:0.010251   3rd Qu.: -374    3rd Qu.:2.000   3rd Qu.:320.0  
##  Max.   :0.036765   Max.   :  -16    Max.   :3.000   Max.   :866.0  
##  Platelets_median Platelets_min     Potassium_max    Potassium_median
##  Min.   : 73.0    Min.   :  0.197   Min.   : 3.400   Min.   :3.000   
##  1st Qu.:204.0    1st Qu.:175.000   1st Qu.: 4.400   1st Qu.:4.000   
##  Median :233.0    Median :204.000   Median : 4.500   Median :4.200   
##  Mean   :238.8    Mean   :208.382   Mean   : 4.628   Mean   :4.189   
##  3rd Qu.:270.0    3rd Qu.:236.000   3rd Qu.: 4.800   3rd Qu.:4.300   
##  Max.   :526.0    Max.   :476.000   Max.   :43.000   Max.   :5.100   
##  Potassium_min   Potassium_range      pulse_max       pulse_median   
##  Min.   :2.400   Min.   :0.000000   Min.   : 53.00   Min.   : 50.00  
##  1st Qu.:3.700   1st Qu.:0.001058   1st Qu.: 84.00   1st Qu.: 72.00  
##  Median :3.900   Median :0.001425   Median : 90.00   Median : 77.00  
##  Mean   :3.857   Mean   :0.001744   Mean   : 90.64   Mean   : 76.97  
##  3rd Qu.:4.000   3rd Qu.:0.001913   3rd Qu.: 96.00   3rd Qu.: 81.00  
##  Max.   :5.100   Max.   :0.098674   Max.   :144.00   Max.   :115.00  
##    pulse_min       pulse_range       respiratory_max respiratory_median
##  Min.   : 18.00   Min.   :0.005425   Min.   :2.00    Min.   :0.000     
##  1st Qu.: 60.00   1st Qu.:0.036755   1st Qu.:4.00    1st Qu.:3.000     
##  Median : 64.00   Median :0.048821   Median :4.00    Median :4.000     
##  Mean   : 65.37   Mean   :0.053587   Mean   :3.91    Mean   :3.593     
##  3rd Qu.: 70.00   3rd Qu.:0.062365   3rd Qu.:4.00    3rd Qu.:4.000     
##  Max.   :102.00   Max.   :0.500000   Max.   :4.00    Max.   :4.000     
##  respiratory_min respiratory_range    Sodium_max    Sodium_median  
##  Min.   :0.000   Min.   :0.000000   Min.   :134.0   Min.   :128.0  
##  1st Qu.:2.000   1st Qu.:0.000000   1st Qu.:142.0   1st Qu.:139.0  
##  Median :3.000   Median :0.001828   Median :143.0   Median :140.0  
##  Mean   :2.791   Mean   :0.002513   Mean   :143.4   Mean   :140.1  
##  3rd Qu.:4.000   3rd Qu.:0.003653   3rd Qu.:145.0   3rd Qu.:141.0  
##  Max.   :4.000   Max.   :0.025424   Max.   :169.0   Max.   :146.5  
##    Sodium_min     Sodium_range       SubjectID        trunk_max    
##  Min.   :112.0   Min.   :0.00000   Min.   :   533   Min.   :0.000  
##  1st Qu.:135.0   1st Qu.:0.01058   1st Qu.:240826   1st Qu.:5.000  
##  Median :137.0   Median :0.01312   Median :496835   Median :7.000  
##  Mean   :136.8   Mean   :0.01500   Mean   :498880   Mean   :6.204  
##  3rd Qu.:138.0   3rd Qu.:0.01728   3rd Qu.:750301   3rd Qu.:8.000  
##  Max.   :145.0   Max.   :0.14286   Max.   :999482   Max.   :8.000  
##   trunk_median     trunk_min      trunk_range        Urine.Ph_max 
##  Min.   :0.000   Min.   :0.000   Min.   :0.000000   Min.   :5.00  
##  1st Qu.:3.000   1st Qu.:1.000   1st Qu.:0.003643   1st Qu.:6.00  
##  Median :5.000   Median :3.000   Median :0.006920   Median :7.00  
##  Mean   :4.893   Mean   :2.956   Mean   :0.007136   Mean   :6.82  
##  3rd Qu.:6.500   3rd Qu.:5.000   3rd Qu.:0.009639   3rd Qu.:7.00  
##  Max.   :8.000   Max.   :8.000   Max.   :0.042017   Max.   :9.00  
##  Urine.Ph_median  Urine.Ph_min  
##  Min.   :5.000   Min.   :5.000  
##  1st Qu.:5.000   1st Qu.:5.000  
##  Median :6.000   Median :5.000  
##  Mean   :5.711   Mean   :5.183  
##  3rd Qu.:6.000   3rd Qu.:5.000  
##  Max.   :9.000   Max.   :8.000

The summary lists 101 variables: the record ID, the outcome ALSFRS_slope, and 99 candidate features. Many of these features represent summary statistics (max, median, min, and range) of the same clinical measurement.
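Because max, median, and min summaries of one measurement tend to move together, it is worth checking how many feature pairs are strongly correlated before selecting variables. The sketch below uses a small simulated stand-in data frame. The column names `Lab_max`, `Lab_median`, `Lab_min`, `Other_mean` and the 0.8 cutoff are hypothetical.

```r
set.seed(4)
n    <- 500
base <- rnorm(n, mean = 44, sd = 3)          # latent per-patient level

# Toy stand-in for three summaries of one lab test plus an unrelated feature
toy <- data.frame(
  Lab_max    = base + abs(rnorm(n, sd = 1)),
  Lab_median = base,
  Lab_min    = base - abs(rnorm(n, sd = 1)),
  Other_mean = rnorm(n)
)

cm   <- cor(toy)
# Index the strongly correlated pairs, looking only above the diagonal
high <- which(abs(cm) > 0.8 & upper.tri(cm), arr.ind = TRUE)
pairs <- data.frame(
  feature1 = rownames(cm)[high[, "row"]],
  feature2 = colnames(cm)[high[, "col"]],
  r        = round(cm[high], 3)
)
print(pairs)
```

The same few lines could be pointed at the predictor columns of ALS.train (after dropping ID and ALSFRS_slope) to list the redundant feature pairs in the real data.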

2.3 Step 3: Training a model on the data

Now let’s explore the Boruta() function in the Boruta package to perform variable selection based on random forest classification. A call to Boruta() takes the following arguments:

vs<-Boruta(class~features, data=Mydata, pValue = 0.01, mcAdj = TRUE, maxRuns = 100, doTrace=0, getImp = getImpRfZ, ...)

class: variable for class labels.
features: potential features to select from.
data: dataset containing classes and features.
pValue: confidence level, i.e., the significance threshold of the importance tests. Default value is 0.01 (notice that we are testing multiple variables simultaneously).
mcAdj: Default TRUE to apply a multiple comparisons adjustment using the Bonferroni method.
maxRuns: maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
doTrace: verbosity level. The default, 0, means no tracing; 1 means reporting the decision about each attribute as soon as it is justified; 2 means the same as 1, plus reporting the number of attributes at each importance source run.
getImp: function used to obtain attribute importance. The default is getImpRfZ, which runs a random forest from the ranger package and gathers $Z$-scores of the mean decrease accuracy measure.
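The interplay of pValue and mcAdj can be illustrated with base R alone. The sketch assumes a simplified view of the procedure: each attribute collects a "hit" whenever it beats the best shadow attribute in a run, and the hit counts are compared against chance with a binomial test. The hit counts and attribute names below are made up, and the one-sided test is a simplification rather than a copy of the package's internal code.

```r
runs <- 50
# Hypothetical number of runs in which each attribute beat the best shadow
hits <- c(strong = 48, weak = 31, noise = 24, junk = 6)

# One-sided binomial p-values: P(at least this many hits | pure chance, p = 0.5)
p_raw <- sapply(hits, function(h) pbinom(h - 1, runs, 0.5, lower.tail = FALSE))

# Bonferroni adjustment: multiply by the number of tests, cap at 1
p_adj <- p.adjust(p_raw, method = "bonferroni")

confirmed <- names(p_adj)[p_adj < 0.01]      # 0.01 mirrors the default pValue
print(signif(rbind(p_raw, p_adj), 3))
print(confirmed)
```

Without the adjustment, borderline attributes are more likely to be flagged as the number of candidate features grows, which is why a multiple-comparisons correction is applied by default.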

The resulting vs object is of class Boruta and contains two important components:

finalDecision: a factor of three values: Confirmed, Rejected or Tentative, containing the final results of the feature selection process.
ImpHistory: a data frame of importance of attributes gathered in each importance source run. Besides the predictors’ importance, it contains maximal, mean and minimal importance of shadow attributes for each run. Rejected attributes get -Inf importance. This output is set to NULL if we specify holdHistory=FALSE in the Boruta call.
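The shadow attributes mentioned above are permuted copies of the real features, and they provide the reference level that a real feature must exceed. The following base R sketch mimics the idea on simulated data. To stay self-contained it uses absolute correlation with the outcome as a stand-in importance measure (Boruta itself uses random forest importance), and all names are hypothetical.

```r
set.seed(5)
n <- 200
X <- data.frame(a = rnorm(n), b = rnorm(n), c = rnorm(n), d = rnorm(n))
y <- 2 * X$a + X$b + rnorm(n)                 # c and d are irrelevant

# Stand-in importance: absolute correlation with the outcome (an assumption)
importance <- function(df, y) sapply(df, function(col) abs(cor(col, y)))

runs <- 30
hits <- setNames(numeric(ncol(X)), names(X))
for (r in seq_len(runs)) {
  # Shadow attributes: each column shuffled, which destroys any link to y
  shadow <- as.data.frame(lapply(X, sample))
  names(shadow) <- paste0("shadow_", names(X))
  imp <- importance(cbind(X, shadow), y)
  shadow_max <- max(imp[names(shadow)])
  # A real feature scores a hit when it beats the best shadow in this run
  hits <- hits + (imp[names(X)] > shadow_max)
}
print(hits)
```

Features that score a hit in nearly every run are candidates for confirmation, those that rarely do are candidates for rejection, and the in-between cases correspond to the Tentative label.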

Note: Running the code below will take several minutes.

# install.packages("Boruta")
library(Boruta)

## Loading required package: ranger

set.seed(123)
als<-Boruta(ALSFRS_slope~.-ID, data=ALS.train, doTrace=0)
print(als)

## Boruta performed 99 iterations in 4.627568 mins.
##  28 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_median and 23 more;
##  59 attributes confirmed unimportant: Albumin_max, Albumin_median,
## Albumin_min, ALT.SGPT._max, ALT.SGPT._median and 54 more;
##  12 tentative attributes left: Age_mean, Albumin_range,
## Creatinine_max, Hematocrit_median, Hematocrit_range and 7 more;

als$ImpHistory[1:6, 1:10]

##        Age_mean Albumin_max Albumin_median Albumin_min Albumin_range
## [1,]  1.2031427   1.4969268      0.6976378   0.9385041      1.979510
## [2,] -0.1998469   0.7204092     -1.5626360   0.5777092      2.573882
## [3,]  1.9272058  -1.0274668      0.2216170  -1.2234402      1.843967
## [4,]  0.5763244   0.9097371      0.2960979   0.6137624      2.184383
## [5,]  3.3655147   1.9412326      0.3849548   1.7309793      1.134676
## [6,]  0.2603118  -0.0287943      1.4164860   2.3251879      2.259974
##      ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min
## [1,]         6.925233            9.551064         15.92924
## [2,]         8.124101            7.867399         14.94650
## [3,]         7.443326            8.735702         17.26469
## [4,]         7.578267            7.868885         16.95563
## [5,]         7.554582            7.248834         15.42697
## [6,]         7.516362            7.145460         14.94824
##      ALSFRS_Total_range ALT.SGPT._max
## [1,]           25.78135     4.1516252
## [2,]           26.11722     1.2187027
## [3,]           25.61523     2.1618804
## [4,]           28.19229     0.4305607
## [5,]           24.90620     1.2043325
## [6,]           26.57093     0.8463782

This is a fairly time-consuming computation. Boruta sorts the attributes into important (confirmed), unimportant (rejected), and tentative features. Here, the importance is derived from the out-of-bag (OOB) error: the default measure is the mean decrease in accuracy. The OOB error estimates the prediction error of machine learning methods (e.g., random forests and boosted decision trees) that utilize bootstrap aggregation to sub-sample the training data. It represents the mean prediction error on each training sample $x_i$, using only the trees that did not include $x_i$ in their bootstrap samples. Out-of-bag estimates provide an internal assessment of the learning accuracy and avoid the need for an independent external validation dataset.
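The OOB idea can be sketched without any tree library by bagging simple linear models. This is a deliberate substitution: the learner (`lm`), the simulated data, and the helper `oob_mse()` are assumptions for illustration. A random forest computes its permutation importance tree by tree rather than on averaged predictions as done here.

```r
set.seed(6)
n <- 200
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(n)
B <- 100                                       # number of bootstrap models

# OOB mean squared error; optionally permute one feature in the OOB rows
oob_mse <- function(permute = NULL) {
  pred_sum <- numeric(n)
  pred_cnt <- numeric(n)
  for (b in seq_len(B)) {
    in_bag <- sample(n, n, replace = TRUE)     # bootstrap sample
    oob    <- setdiff(seq_len(n), in_bag)      # rows this model never saw
    fit    <- lm(y ~ x1 + x2, data = d[in_bag, ])
    held_out <- d[oob, ]
    if (!is.null(permute)) {
      held_out[[permute]] <- sample(held_out[[permute]])
    }
    pred_sum[oob] <- pred_sum[oob] + predict(fit, newdata = held_out)
    pred_cnt[oob] <- pred_cnt[oob] + 1
  }
  seen <- pred_cnt > 0
  mean((d$y[seen] - pred_sum[seen] / pred_cnt[seen])^2)
}

baseline   <- oob_mse()
# Importance as the increase in OOB error after permuting each feature
importance <- c(x1 = oob_mse("x1") - baseline,
                x2 = oob_mse("x2") - baseline)
print(baseline)
print(importance)
```

Each observation is predicted only by models whose bootstrap sample excluded it, so the baseline error behaves like a validation error obtained without setting data aside.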

The importance scores for all features at every iteration are stored in the data frame als$ImpHistory. Let’s plot a graph depicting the essential features.

Note: Again, running this code will take several minutes to complete.

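# Boxplots of the importance scores; suppress the default x-axis
plot(als, xlab="", xaxt="n")
# For each column of ImpHistory, keep only the finite importance scores
lz<-lapply(1:ncol(als$ImpHistory), function(i)
  als$ImpHistory[is.finite(als$ImpHistory[, i]), i])
names(lz)<-colnames(als$ImpHistory)
# Order the attributes by median importance and use their names as axis labels
lb<-sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(als$ImpHistory), cex.axis=0.5, font = 4)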

We can see that plotting the graph is easy, but extracting the matched feature names requires more work. The basic plot is produced by the call plot(als, xlab="", xaxt="n"), where xaxt="n" suppresses the plotting of the x-axis. The following lines in the script reconstruct the x-axis labels. lz is a list created by the lapply() function. Each element in lz contains all the finite importance scores for a single feature in the original dataset; the -Inf values recorded for a feature after it is rejected are excluded. Then, we sort the features according to their median importance and print their names on the x-axis by using axis().
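The label-reconstruction steps can be tried on a tiny hand-made importance history, which makes the role of the -Inf entries easy to see. The matrix below is invented for illustration. Only its layout (one column per attribute, one row per run) follows the ImpHistory component described earlier.

```r
# Toy importance history: 4 runs, 2 features and the best shadow attribute
imp_history <- cbind(
  featA     = c(5.1, 4.8, 5.3, 5.0),
  featB     = c(0.4, 0.2, -Inf, -Inf),   # rejected after the second run
  shadowMax = c(1.9, 2.2, 2.0, 2.1)
)

# Keep only the finite scores in each column
lz <- lapply(seq_len(ncol(imp_history)), function(i)
  imp_history[is.finite(imp_history[, i]), i])
names(lz) <- colnames(imp_history)

# Sort the attributes by median importance (ascending, left to right)
lb <- sort(sapply(lz, median))
print(lb)
```

Without the is.finite() filter, the median of a column containing -Inf in half or more of its runs would itself be -Inf or undefined, and the sorting step would no longer reflect the scores recorded before rejection.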

We have already seen similar groups of boxplots back in Chapter 2 and Chapter 3. In this graph, variables with green boxes (confirmed) are more important than the ones represented with red boxes (rejected); yellow boxes mark tentative attributes and blue boxes summarize the shadow attributes. The spread of each box shows the range of importance scores of a single variable across runs.

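It may be desirable to get rid of tentative features. The function TentativeRoughFix() resolves them by confirming a tentative attribute when its median importance exceeds the median importance of the best shadow attribute, and rejecting it otherwise. Notice that this function should be used only when a strict decision is highly desired, because this test is much weaker than Boruta and can lower the confidence of the final result.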

final.als<-TentativeRoughFix(als)
print(final.als)

## Boruta performed 99 iterations in 4.627568 mins.
## Tentatives roughfixed over the last 99 iterations.
##  32 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_median and 27 more;
##  67 attributes confirmed unimportant: Age_mean, Albumin_max,
## Albumin_median, Albumin_min, Albumin_range and 62 more;
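The rule behind this rough fix is a comparison of medians, which can be reproduced by hand. The sketch below applies it to an invented importance history. The attribute names `tentA` and `tentB` are hypothetical, and the code illustrates the decision rule rather than reproducing the package's implementation.

```r
# Toy importance history for two tentative attributes and the best shadow
imp_history <- cbind(
  tentA     = c(2.6, 2.9, 2.4, 3.1),
  tentB     = c(1.2, 2.3, 1.5, 1.8),
  shadowMax = c(2.0, 2.2, 1.9, 2.1)
)
tentative <- c("tentA", "tentB")

# Reference level: median importance of the best shadow attribute
shadow_med <- median(imp_history[, "shadowMax"])

# Confirm a tentative attribute only if its median importance is higher
tent_med <- apply(imp_history[, tentative, drop = FALSE], 2, median)
decision <- ifelse(tent_med > shadow_med, "Confirmed", "Rejected")
print(decision)
```

Because the rule ignores the run-to-run variability that the full Boruta test accounts for, an attribute hovering near the shadow level can flip between outcomes from one run of the analysis to the next.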

final.als$finalDecision

##                         Age_mean                      Albumin_max 
##                         Rejected                         Rejected 
##                   Albumin_median                      Albumin_min 
##                         Rejected                         Rejected 
##                    Albumin_range                 ALSFRS_Total_max 
##                         Rejected                        Confirmed 
##              ALSFRS_Total_median                 ALSFRS_Total_min 
##                        Confirmed                        Confirmed 
##               ALSFRS_Total_range                    ALT.SGPT._max 
##                        Confirmed                         Rejected 
##                 ALT.SGPT._median                    ALT.SGPT._min 
##                         Rejected                         Rejected 
##                  ALT.SGPT._range                    AST.SGOT._max 
##                         Rejected                         Rejected 
##                 AST.SGOT._median                    AST.SGOT._min 
##                         Rejected                         Rejected 
##                  AST.SGOT._range                  Bicarbonate_max 
##                         Rejected                         Rejected 
##               Bicarbonate_median                  Bicarbonate_min 
##                         Rejected                         Rejected 
##                Bicarbonate_range    Blood.Urea.Nitrogen..BUN._max 
##                         Rejected                         Rejected 
## Blood.Urea.Nitrogen..BUN._median    Blood.Urea.Nitrogen..BUN._min 
##                         Rejected                         Rejected 
##  Blood.Urea.Nitrogen..BUN._range                 bp_diastolic_max 
##                         Rejected                         Rejected 
##              bp_diastolic_median                 bp_diastolic_min 
##                         Rejected                         Rejected 
##               bp_diastolic_range                  bp_systolic_max 
##                         Rejected                         Rejected 
##               bp_systolic_median                  bp_systolic_min 
##                         Rejected                         Rejected 
##                bp_systolic_range                      Calcium_max 
##                         Rejected                         Rejected 
##                   Calcium_median                      Calcium_min 
##                         Rejected                         Rejected 
##                    Calcium_range                     Chloride_max 
##                         Rejected                         Rejected 
##                  Chloride_median                     Chloride_min 
##                         Rejected                         Rejected 
##                   Chloride_range                   Creatinine_max 
##                         Rejected                         Rejected 
##                Creatinine_median                   Creatinine_min 
##                        Confirmed                        Confirmed 
##                 Creatinine_range                      Gender_mean 
##                         Rejected                         Rejected 
##                      Glucose_max                   Glucose_median 
##                         Rejected                         Rejected 
##                      Glucose_min                    Glucose_range 
##                         Rejected                         Rejected 
##                        hands_max                     hands_median 
##                        Confirmed                        Confirmed 
##                        hands_min                      hands_range 
##                        Confirmed                        Confirmed 
##                   Hematocrit_max                Hematocrit_median 
##                        Confirmed                         Rejected 
##                   Hematocrit_min                 Hematocrit_range 
##                        Confirmed                        Confirmed 
##                   Hemoglobin_max                Hemoglobin_median 
##                         Rejected                        Confirmed 
##                   Hemoglobin_min                 Hemoglobin_range 
##                         Rejected                        Confirmed 
##                          leg_max                       leg_median 
##                        Confirmed                        Confirmed 
##                          leg_min                        leg_range 
##                        Confirmed                        Confirmed 
##                        mouth_max                     mouth_median 
##                        Confirmed                        Confirmed 
##                        mouth_min                      mouth_range 
##                        Confirmed                        Confirmed 
##                 onset_delta_mean                  onset_site_mean 
##                        Confirmed                         Rejected 
##                    Platelets_max                 Platelets_median 
##                         Rejected                         Rejected 
##                    Platelets_min                    Potassium_max 
##                         Rejected                         Rejected 
##                 Potassium_median                    Potassium_min 
##                         Rejected                         Rejected 
##                  Potassium_range                        pulse_max 
##                         Rejected                        Confirmed 
##                     pulse_median                        pulse_min 
##                         Rejected                         Rejected 
##                      pulse_range                  respiratory_max 
##                         Rejected                         Rejected 
##               respiratory_median                  respiratory_min 
##                        Confirmed                        Confirmed 
##                respiratory_range                       Sodium_max 
##                        Confirmed                         Rejected 
##                    Sodium_median                       Sodium_min 
##                         Rejected                         Rejected 
##                     Sodium_range                        SubjectID 
##                         Rejected                         Rejected 
##                        trunk_max                     trunk_median 
##                        Confirmed                        Confirmed 
##                        trunk_min                      trunk_range 
##                        Confirmed                        Confirmed 
##                     Urine.Ph_max                  Urine.Ph_median 
##                         Rejected                         Rejected 
##                     Urine.Ph_min 
##                         Rejected 
## Levels: Tentative Confirmed Rejected

getConfirmedFormula(final.als)

## ALSFRS_slope ~ ALSFRS_Total_max + ALSFRS_Total_median + ALSFRS_Total_min + 
##     ALSFRS_Total_range + Creatinine_median + Creatinine_min + 
##     hands_max + hands_median + hands_min + hands_range + Hematocrit_max + 
##     Hematocrit_min + Hematocrit_range + Hemoglobin_median + Hemoglobin_range + 
##     leg_max + leg_median + leg_min + leg_range + mouth_max + 
##     mouth_median + mouth_min + mouth_range + onset_delta_mean + 
##     pulse_max + respiratory_median + respiratory_min + respiratory_range + 
##     trunk_max + trunk_median + trunk_min + trunk_range
## <environment: 0x00000000279059b0>

# report the Boruta "Confirmed" & "Tentative" features, removing the "Rejected" ones
print(final.als$finalDecision[final.als$finalDecision %in% c("Confirmed", "Tentative")])

##    ALSFRS_Total_max ALSFRS_Total_median    ALSFRS_Total_min 
##           Confirmed           Confirmed           Confirmed 
##  ALSFRS_Total_range   Creatinine_median      Creatinine_min 
##           Confirmed           Confirmed           Confirmed 
##           hands_max        hands_median           hands_min 
##           Confirmed           Confirmed           Confirmed 
##         hands_range      Hematocrit_max      Hematocrit_min 
##           Confirmed           Confirmed           Confirmed 
##    Hematocrit_range   Hemoglobin_median    Hemoglobin_range 
##           Confirmed           Confirmed           Confirmed 
##             leg_max          leg_median             leg_min 
##           Confirmed           Confirmed           Confirmed 
##           leg_range           mouth_max        mouth_median 
##           Confirmed           Confirmed           Confirmed 
##           mouth_min         mouth_range    onset_delta_mean 
##           Confirmed           Confirmed           Confirmed 
##           pulse_max  respiratory_median     respiratory_min 
##           Confirmed           Confirmed           Confirmed 
##   respiratory_range           trunk_max        trunk_median 
##           Confirmed           Confirmed           Confirmed 
##           trunk_min         trunk_range 
##           Confirmed           Confirmed 
## Levels: Tentative Confirmed Rejected

# how many are actually "confirmed" as important/salient?
impBoruta <- final.als$finalDecision[final.als$finalDecision %in% c("Confirmed")]; length(impBoruta)

## [1] 32

This shows the final feature selection result: 32 features are confirmed as important.

2.4 Step 4 - evaluating model performance

2.4.1 Comparing with RFE

Let’s compare the Boruta results against a classical variable selection method - recursive feature elimination (RFE). First, we need to load two packages: caret and randomForest. Then, as in Chapter 14, we must specify a resampling method. Here we use 10-fold cross-validation (CV).

library(caret)

## Loading required package: lattice

## Loading required package: ggplot2

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

## The following object is masked from 'package:ranger':
## 
##     importance

set.seed(123)
control<-rfeControl(functions = rfFuncs, method = "cv", number=10)

Now, all preparations are complete and we are ready to do the RFE variable selection.

rf.train<-rfe(ALS.train[, -c(1, 7)], ALS.train[, 7], sizes=c(10, 20, 30, 40), rfeControl=control)
rf.train

## 
## Recursive feature selection
## 
## Outer resampling method: Cross-Validated (10 fold) 
## 
## Resampling performance over subset size:
## 
##  Variables   RMSE Rsquared  RMSESD RsquaredSD Selected
##         10 0.3500   0.6837 0.03451    0.03837         
##         20 0.3471   0.6894 0.03230    0.03374         
##         30 0.3468   0.6900 0.03135    0.02967        *
##         40 0.3473   0.6895 0.03061    0.02887         
##         99 0.3503   0.6842 0.02995    0.02868         
## 
## The top 5 variables (out of 30):
##    ALSFRS_Total_range, trunk_range, hands_range, mouth_range, ALSFRS_Total_min

This calculation may take a long time to complete. The RFE invocation differs from the Boruta call: here we have to pass the data frame of features and the outcome vector separately. Also, the sizes= option specifies the candidate numbers of features to evaluate. We used sizes=c(10, 20, 30, 40) to compare the model performance for alternative numbers of features; the full set of 99 predictors is always evaluated as well.
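The calling convention described above can be tried quickly on simulated data. The following is only a minimal sketch, not the ALS analysis: the data frame `X`, the outcome `y`, and the choice of `lmFuncs` (linear-model helper functions, used here instead of `rfFuncs` purely for speed) are illustrative assumptions.

```r
library(caret)

set.seed(123)
n <- 150
# Hypothetical toy data: 8 candidate features, only x1 and x2 drive the outcome
X <- as.data.frame(matrix(rnorm(n * 8), ncol = 8))
names(X) <- paste0("x", 1:8)
y <- 3 * X$x1 - 2 * X$x2 + rnorm(n, sd = 0.5)

# Resampling scheme for RFE: 5-fold CV with linear-model ranking functions
ctrl <- rfeControl(functions = lmFuncs, method = "cv", number = 5)

# Features and outcome are passed separately; sizes lists candidate subset sizes
fit <- rfe(x = X, y = y, sizes = c(2, 4), rfeControl = ctrl)

# Performance for each evaluated subset size (the full set of 8 is added automatically)
print(fit$results[, c("Variables", "RMSE", "Rsquared")])
# Names of the features in the selected subset
print(predictors(fit))
```

Replacing `lmFuncs` with `rfFuncs` gives the random forest variant used in this chapter, at a higher computational cost.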

To visualize the results, we can plot the performance for the 5 feature subset sizes listed in the summary. The model with 30 features has the lowest RMSE. This result is similar to the Boruta output, which confirmed 32 features.

plot(rf.train, type=c("g", "o"), cex=1, col=1:4)

Using the functions predictors() and getSelectedAttributes(), we can compare the final results of the two alternative feature selection methods.

predRFE <- predictors(rf.train)
predBoruta <- getSelectedAttributes(final.als, withTentative = FALSE)

The two selections overlap substantially:

intersect(predBoruta, predRFE)

##  [1] "ALSFRS_Total_max"    "ALSFRS_Total_median" "ALSFRS_Total_min"   
##  [4] "ALSFRS_Total_range"  "Creatinine_min"      "hands_max"          
##  [7] "hands_median"        "hands_min"           "hands_range"        
## [10] "Hematocrit_max"      "Hemoglobin_median"   "leg_max"            
## [13] "leg_median"          "leg_min"             "leg_range"          
## [16] "mouth_median"        "mouth_min"           "mouth_range"        
## [19] "onset_delta_mean"    "respiratory_median"  "respiratory_min"    
## [22] "respiratory_range"   "trunk_max"           "trunk_median"       
## [25] "trunk_min"           "trunk_range"

There are 26 common variables chosen by the two techniques, which suggests that the Boruta and RFE selections are consistent with each other. Also, notice that the Boruta method gives similar results without requiring the sizes option. If we want RFE to consider 10 or more different sizes, the procedure will be quite time consuming. Thus, the Boruta method is effective when dealing with complex real-world problems.
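One way to quantify such agreement is the Jaccard index, the size of the intersection divided by the size of the union. With 32 Boruta features, 30 RFE features, and 26 in common, this gives 26/(32 + 30 - 26) = 26/36, or about 0.72. The helper below is a hypothetical illustration (the function name `overlap_summary` and the toy feature sets are assumptions, not part of the original analysis).

```r
# Hypothetical helper: summarize the overlap between two sets of feature names
overlap_summary <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  common <- intersect(a, b)
  c(n_a = length(a),
    n_b = length(b),
    n_common = length(common),
    jaccard = length(common) / length(union(a, b)))
}

# Toy feature sets, for illustration only
set_a <- c("trunk_range", "hands_range", "mouth_range", "leg_max")
set_b <- c("trunk_range", "hands_range", "pulse_max")

res <- overlap_summary(set_a, set_b)
print(res)
```

In the analysis above, the same helper could be called as overlap_summary(predBoruta, predRFE).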

2.4.2 Comparing with stepwise feature selection

Next, we can contrast the Boruta feature selection results against another classical variable selection method - stepwise model selection. Let’s start by fitting a linear model with bidirectional stepwise feature selection.

data2 <- ALS.train[, -1]
# Define a base model - intercept only
base.mod <- lm(ALSFRS_slope ~ 1 , data= data2)
# Define the full model - including all predictors
all.mod <- lm(ALSFRS_slope ~ . , data= data2)
# Bidirectional stepwise selection; k=2 corresponds to the AIC penalty
ols_step <- step(base.mod, scope = list(lower = base.mod, upper = all.mod), direction = 'both', k=2, trace = FALSE)
summary(ols_step); ols_step

## 
## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median + 
##     ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min + 
##     onset_delta_mean + Calcium_min + Albumin_range + Glucose_range + 
##     ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min + 
##     Creatinine_range + Potassium_range + Chloride_range + Chloride_min + 
##     Sodium_median + respiratory_min + respiratory_range + respiratory_max + 
##     trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range + 
##     Chloride_max + onset_site_mean + trunk_max + Gender_mean + 
##     Creatinine_min, data = data2)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -2.22558 -0.17875 -0.02024  0.17098  1.95100 
## 
## Coefficients:
##                       Estimate Std. Error t value Pr(>|t|)    
## (Intercept)          4.176e-01  6.064e-01   0.689 0.491091    
## ALSFRS_Total_range  -2.260e+01  1.359e+00 -16.631  < 2e-16 ***
## ALSFRS_Total_median -3.388e-02  2.868e-03 -11.812  < 2e-16 ***
## ALSFRS_Total_min     2.821e-02  3.310e-03   8.524  < 2e-16 ***
## Calcium_range        2.410e+02  4.188e+01   5.754 9.94e-09 ***
## Calcium_max         -4.258e-01  8.846e-02  -4.813 1.59e-06 ***
## bp_diastolic_min    -2.249e-03  8.856e-04  -2.540 0.011161 *  
## onset_delta_mean    -5.461e-05  1.980e-05  -2.758 0.005856 ** 
## Calcium_min          3.579e-01  9.501e-02   3.767 0.000169 ***
## Albumin_range       -2.305e+00  8.197e-01  -2.812 0.004967 ** 
## Glucose_range       -1.510e+01  2.929e+00  -5.156 2.75e-07 ***
## ALT.SGPT._median    -2.300e-03  7.998e-04  -2.876 0.004062 ** 
## AST.SGOT._median     3.369e-03  1.276e-03   2.641 0.008316 ** 
## Glucose_max          3.279e-02  7.082e-03   4.630 3.88e-06 ***
## Glucose_min         -3.507e-02  8.718e-03  -4.023 5.95e-05 ***
## Creatinine_range     5.076e-01  2.214e-01   2.293 0.021925 *  
## Potassium_range     -4.535e+00  2.607e+00  -1.739 0.082128 .  
## Chloride_range       5.318e+00  1.188e+00   4.475 8.04e-06 ***
## Chloride_min         1.672e-02  3.797e-03   4.404 1.12e-05 ***
## Sodium_median       -9.830e-03  4.639e-03  -2.119 0.034227 *  
## respiratory_min     -1.453e-01  2.442e-02  -5.948 3.14e-09 ***
## respiratory_range   -5.834e+01  1.013e+01  -5.757 9.78e-09 ***
## respiratory_max      1.712e-01  3.395e-02   5.042 4.99e-07 ***
## trunk_range         -8.705e+00  3.088e+00  -2.819 0.004860 ** 
## pulse_range         -5.117e-01  3.016e-01  -1.697 0.089874 .  
## Bicarbonate_max      7.526e-03  2.931e-03   2.568 0.010292 *  
## Bicarbonate_range   -2.204e+00  9.567e-01  -2.304 0.021329 *  
## Chloride_max        -6.918e-03  3.952e-03  -1.751 0.080143 .  
## onset_site_mean      3.359e-02  2.019e-02   1.663 0.096359 .  
## trunk_max            2.288e-02  8.453e-03   2.706 0.006854 ** 
## Gender_mean         -3.360e-02  1.751e-02  -1.919 0.055066 .  
## Creatinine_min       7.643e-04  4.977e-04   1.536 0.124771    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.3355 on 2191 degrees of freedom
## Multiple R-squared:  0.7135, Adjusted R-squared:  0.7094 
## F-statistic:   176 on 31 and 2191 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = ALSFRS_slope ~ ALSFRS_Total_range + ALSFRS_Total_median + 
##     ALSFRS_Total_min + Calcium_range + Calcium_max + bp_diastolic_min + 
##     onset_delta_mean + Calcium_min + Albumin_range + Glucose_range + 
##     ALT.SGPT._median + AST.SGOT._median + Glucose_max + Glucose_min + 
##     Creatinine_range + Potassium_range + Chloride_range + Chloride_min + 
##     Sodium_median + respiratory_min + respiratory_range + respiratory_max + 
##     trunk_range + pulse_range + Bicarbonate_max + Bicarbonate_range + 
##     Chloride_max + onset_site_mean + trunk_max + Gender_mean + 
##     Creatinine_min, data = data2)
## 
## Coefficients:
##         (Intercept)   ALSFRS_Total_range  ALSFRS_Total_median  
##           4.176e-01           -2.260e+01           -3.388e-02  
##    ALSFRS_Total_min        Calcium_range          Calcium_max  
##           2.821e-02            2.410e+02           -4.258e-01  
##    bp_diastolic_min     onset_delta_mean          Calcium_min  
##          -2.249e-03           -5.461e-05            3.579e-01  
##       Albumin_range        Glucose_range     ALT.SGPT._median  
##          -2.305e+00           -1.510e+01           -2.300e-03  
##    AST.SGOT._median          Glucose_max          Glucose_min  
##           3.369e-03            3.279e-02           -3.507e-02  
##    Creatinine_range      Potassium_range       Chloride_range  
##           5.076e-01           -4.535e+00            5.318e+00  
##        Chloride_min        Sodium_median      respiratory_min  
##           1.672e-02           -9.830e-03           -1.453e-01  
##   respiratory_range      respiratory_max          trunk_range  
##          -5.834e+01            1.712e-01           -8.705e+00  
##         pulse_range      Bicarbonate_max    Bicarbonate_range  
##          -5.117e-01            7.526e-03           -2.204e+00  
##        Chloride_max      onset_site_mean            trunk_max  
##          -6.918e-03            3.359e-02            2.288e-02  
##         Gender_mean       Creatinine_min  
##          -3.360e-02            7.643e-04

We can report the stepwise “Confirmed” (important) features:

# get the shortlisted variables (names of the fitted coefficients)
stepwiseConfirmedVars <- names(coef(ols_step))
# remove the intercept 
stepwiseConfirmedVars <- stepwiseConfirmedVars[!stepwiseConfirmedVars %in% "(Intercept)"]
print(stepwiseConfirmedVars)

##  [1] "ALSFRS_Total_range"  "ALSFRS_Total_median" "ALSFRS_Total_min"   
##  [4] "Calcium_range"       "Calcium_max"         "bp_diastolic_min"   
##  [7] "onset_delta_mean"    "Calcium_min"         "Albumin_range"      
## [10] "Glucose_range"       "ALT.SGPT._median"    "AST.SGOT._median"   
## [13] "Glucose_max"         "Glucose_min"         "Creatinine_range"   
## [16] "Potassium_range"     "Chloride_range"      "Chloride_min"       
## [19] "Sodium_median"       "respiratory_min"     "respiratory_range"  
## [22] "respiratory_max"     "trunk_range"         "pulse_range"        
## [25] "Bicarbonate_max"     "Bicarbonate_range"   "Chloride_max"       
## [28] "onset_site_mean"     "trunk_max"           "Gender_mean"        
## [31] "Creatinine_min"

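We can also estimate the importance of the stepwise-selected features and compare the stepwise selection with the Boruta results. For linear models, varImp() reports the absolute value of each coefficient's t-statistic.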

library(mlbench)
library(caret)

# estimate variable importance
predStepwise <- varImp(ols_step, scale=FALSE)
# summarize importance
print(predStepwise)

##                       Overall
## ALSFRS_Total_range  16.630592
## ALSFRS_Total_median 11.812263
## ALSFRS_Total_min     8.523606
## Calcium_range        5.754045
## Calcium_max          4.812942
## bp_diastolic_min     2.539766
## onset_delta_mean     2.758465
## Calcium_min          3.767450
## Albumin_range        2.812018
## Glucose_range        5.156259
## ALT.SGPT._median     2.876338
## AST.SGOT._median     2.641369
## Glucose_max          4.629759
## Glucose_min          4.022642
## Creatinine_range     2.293301
## Potassium_range      1.739268
## Chloride_range       4.474709
## Chloride_min         4.403551
## Sodium_median        2.118710
## respiratory_min      5.948488
## respiratory_range    5.756735
## respiratory_max      5.041816
## trunk_range          2.819029
## pulse_range          1.696811
## Bicarbonate_max      2.568068
## Bicarbonate_range    2.303757
## Chloride_max         1.750666
## onset_site_mean      1.663481
## trunk_max            2.706410
## Gender_mean          1.919380
## Creatinine_min       1.535642

# plot predStepwise
# plot(predStepwise)

# Boruta vs. Stepwise feature selection
intersect(predBoruta, stepwiseConfirmedVars)

## [1] "ALSFRS_Total_median" "ALSFRS_Total_min"    "ALSFRS_Total_range" 
## [4] "Creatinine_min"      "onset_delta_mean"    "respiratory_min"    
## [7] "respiratory_range"   "trunk_max"           "trunk_range"

There are $9$ common variables chosen by the Boruta and stepwise feature selection methods, so these two selections differ considerably.

A more elaborate stepwise feature selection technique is implemented in the function MASS::stepAIC(), which is useful for a wider range of object classes.
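As a minimal sketch of how MASS::stepAIC() may be invoked, the example below uses the built-in mtcars data rather than the ALS data; the object names `base.toy`, `full.toy`, and `aic.fit` are illustrative assumptions.

```r
# Intercept-only and full linear models on the built-in mtcars data
base.toy <- lm(mpg ~ 1, data = mtcars)
full.toy <- lm(mpg ~ ., data = mtcars)

# Bidirectional search between the two models; k = 2 is the AIC penalty,
# while k = log(nrow(mtcars)) would give a BIC-type penalty
aic.fit <- MASS::stepAIC(base.toy,
                         scope = list(lower = ~ 1, upper = formula(full.toy)),
                         direction = "both", k = 2, trace = FALSE)

# Selected predictors (dropping the intercept)
print(names(coef(aic.fit))[-1])
```

Calling the function with the MASS:: prefix avoids attaching the whole package, which would otherwise mask functions such as select() from other packages.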

3 Practice Problem

You can practice variable selection with the SOCR_Data_AD_BiomedBigMetadata dataset on the SOCR website. This is a smaller dataset that has 744 observations and 63 variables. Here we utilize DXCURREN (current diagnosis) as the class variable.

Let’s import the dataset first.

library(rvest)

## Loading required package: xml2

wiki_url <- read_html("http://wiki.socr.umich.edu/index.php/SOCR_Data_AD_BiomedBigMetadata")
html_nodes(wiki_url, "#content")

## {xml_nodeset (1)}
## [1] <div id="content" class="mw-body-primary" role="main">\n\t<a id="top ...

alzh <- html_table(html_nodes(wiki_url, "table")[[1]])
summary(alzh)

##       SID            MMSCORE        FAQTOTAL            GDTOTAL     
##  Min.   :   2.0   Min.   :18.00   Length:744         Min.   :0.000  
##  1st Qu.: 355.5   1st Qu.:25.00   Class :character   1st Qu.:0.000  
##  Median : 697.5   Median :27.00   Mode  :character   Median :1.000  
##  Mean   : 707.5   Mean   :26.81                      Mean   :1.367  
##  3rd Qu.:1063.0   3rd Qu.:29.00                      3rd Qu.:2.000  
##  Max.   :1435.0   Max.   :30.00                      Max.   :6.000  
##    adascog              sobcdr         DXCURREN     DX_Conversion     
##  Length:744         Min.   :0.000   Min.   :1.000   Length:744        
##  Class :character   1st Qu.:0.000   1st Qu.:1.000   Class :character  
##  Mode  :character   Median :1.500   Median :2.000   Mode  :character  
##                     Mean   :1.785   Mean   :1.958                     
##                     3rd Qu.:2.625   3rd Qu.:2.000                     
##                     Max.   :9.000   Max.   :3.000                     
##     DXCONTYP      DX_Confidence          Gender         Married     
##  Min.   :-4.000   Length:744         Min.   :1.000   Min.   :1.000  
##  1st Qu.:-4.000   Class :character   1st Qu.:1.000   1st Qu.:1.000  
##  Median :-4.000   Mode  :character   Median :1.000   Median :1.000  
##  Mean   :-3.962                      Mean   :1.407   Mean   :1.083  
##  3rd Qu.:-4.000                      3rd Qu.:2.000   3rd Qu.:1.000  
##  Max.   : 3.000                      Max.   :2.000   Max.   :2.000  
##    Education          Age          Weight_Kg         VSBPSYS     
##  Min.   : 6.00   Min.   :55.00   Min.   : -1.00   Min.   : 90.0  
##  1st Qu.:14.00   1st Qu.:71.00   1st Qu.: 64.67   1st Qu.:122.0  
##  Median :16.00   Median :76.00   Median : 74.39   Median :135.0  
##  Mean   :15.64   Mean   :75.49   Mean   : 75.28   Mean   :135.5  
##  3rd Qu.:18.00   3rd Qu.:80.00   3rd Qu.: 84.48   3rd Qu.:146.0  
##  Max.   :20.00   Max.   :91.00   Max.   :137.44   Max.   :206.0  
##     VSBPDIA          VSPULSE           VSRESP          VSTEMP     
##  Min.   : 43.00   Min.   : 40.00   Min.   :-1.00   Min.   :-1.00  
##  1st Qu.: 68.00   1st Qu.: 58.00   1st Qu.:16.00   1st Qu.:36.10  
##  Median : 75.00   Median : 64.00   Median :16.00   Median :36.40  
##  Mean   : 74.56   Mean   : 65.17   Mean   :16.68   Mean   :36.35  
##  3rd Qu.: 82.00   3rd Qu.: 72.00   3rd Qu.:18.00   3rd Qu.:36.70  
##  Max.   :103.00   Max.   :100.00   Max.   :32.00   Max.   :37.70  
##  SymptomeSeverety   SymptomeChronicity    BC.USEA         BCVOMIT     
##  Length:744         Length:744         Min.   :1.000   Min.   :1.000  
##  Class :character   Class :character   1st Qu.:1.000   1st Qu.:1.000  
##  Mode  :character   Mode  :character   Median :1.000   Median :1.000  
##                                        Mean   :1.032   Mean   :1.016  
##                                        3rd Qu.:1.000   3rd Qu.:1.000  
##                                        Max.   :2.000   Max.   :2.000  
##     BCDIARRH        BCCONSTP        BCABDOMN        BCSWEATN    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.097   Mean   :1.106   Mean   :1.074   Mean   :1.056  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCDIZZY         BCENERGY      BCDROWSY       BCVISION    
##  Min.   :1.000   Min.   :1.0   Min.   :1.00   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.0   1st Qu.:1.00   1st Qu.:1.000  
##  Median :1.000   Median :1.0   Median :1.00   Median :1.000  
##  Mean   :1.125   Mean   :1.2   Mean   :1.13   Mean   :1.059  
##  3rd Qu.:1.000   3rd Qu.:1.0   3rd Qu.:1.00   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.0   Max.   :2.00   Max.   :2.000  
##     BCHDACHE        BCDRYMTH        BCBREATH        BCCOUGH     
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.093   Mean   :1.087   Mean   :1.078   Mean   :1.116  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCPALPIT        BCCHEST         BCURNDIS        BCURNFRQ    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.031   Mean   :1.017   Mean   :1.023   Mean   :1.218  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCANKLE         BCMUSCLE         BCRASH         BCINSOMN    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.078   Mean   :1.364   Mean   :1.073   Mean   :1.112  
##  3rd Qu.:1.000   3rd Qu.:2.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##     BCDPMOOD        BCCRYING        BCELMOOD        BCWANDER    
##  Min.   :1.000   Min.   :1.000   Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000   1st Qu.:1.000  
##  Median :1.000   Median :1.000   Median :1.000   Median :1.000  
##  Mean   :1.122   Mean   :1.035   Mean   :1.012   Mean   :1.004  
##  3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000   3rd Qu.:1.000  
##  Max.   :2.000   Max.   :2.000   Max.   :2.000   Max.   :2.000  
##      BCFALL         BCOTHER        CTWHITE             CTRED          
##  Min.   :1.000   Min.   :1.000   Length:744         Length:744        
##  1st Qu.:1.000   1st Qu.:1.000   Class :character   Class :character  
##  Median :1.000   Median :1.000   Mode  :character   Mode  :character  
##  Mean   :1.046   Mean   :1.046                                        
##  3rd Qu.:1.000   3rd Qu.:1.000                                        
##  Max.   :2.000   Max.   :2.000                                        
##    PROTEIN            GLUCOSE          ApoEGeneAllele1 ApoEGeneAllele2
##  Length:744         Length:744         Min.   :2.000   Min.   :2.000  
##  Class :character   Class :character   1st Qu.:3.000   1st Qu.:3.000  
##  Mode  :character   Mode  :character   Median :3.000   Median :3.000  
##                                        Mean   :3.023   Mean   :3.489  
##                                        3rd Qu.:3.000   3rd Qu.:4.000  
##                                        Max.   :4.000   Max.   :4.000  
##     CDMEMORY    CDORIENT         CDJUDGE          CDCOMMUN     
##  Min.   :0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0   Median :0.5000   Median :0.0000   Median :0.5000  
##  Mean   :0   Mean   :0.5047   Mean   :0.3085   Mean   :0.3683  
##  3rd Qu.:0   3rd Qu.:1.0000   3rd Qu.:0.5000   3rd Qu.:0.5000  
##  Max.   :0   Max.   :2.0000   Max.   :2.0000   Max.   :2.0000  
##      CDHOME           CDCARE          CDGLOBAL     
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :0.2513   Mean   :0.2849   Mean   :0.0672  
##  3rd Qu.:0.5000   3rd Qu.:0.5000   3rd Qu.:0.0000  
##  Max.   :2.0000   Max.   :2.0000   Max.   :2.0000

The data summary shows that several variables were imported as character vectors. After converting them to numeric, we find some missing data. We can manage this issue by selecting only the complete observations of the original dataset or by using multivariate imputation, see Chapter 2.

chrtofactor<-c(3, 5, 8, 10, 21:22, 51:54)
alzh[chrtofactor]<-data.frame(apply(alzh[chrtofactor], 2, as.numeric))

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

## Warning in apply(alzh[chrtofactor], 2, as.numeric): NAs introduced by
## coercion

alzh<-alzh[complete.cases(alzh), ]
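The warnings above arise because as.numeric() turns every string that cannot be parsed as a number into NA, and complete.cases() then drops the affected rows. A toy sketch of this mechanism, using a hypothetical data frame `toy` rather than the Alzheimer's data:

```r
# Hypothetical toy data: one column was read in as character
toy <- data.frame(id = 1:4,
                  score = c("12", "7.5", "n/a", "3"),
                  stringsAsFactors = FALSE)

# Non-numeric strings become NA; suppressWarnings() hides the coercion warning
toy$score <- suppressWarnings(as.numeric(toy$score))

# Count the missing values per column
print(colSums(is.na(toy)))

# Keep only the rows without any missing values
toy_complete <- toy[complete.cases(toy), ]
print(nrow(toy_complete))
```

Counting the NA values before discarding rows helps to decide whether listwise deletion is acceptable or whether imputation would preserve more of the data.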

For simplicity, here we eliminated the missing data and are left with 408 complete observations. Now, we can apply the Boruta method for feature selection.

## Boruta performed 99 iterations in 8.643105 secs.
##  12 attributes confirmed important: adascog, BCBREATH, CDCARE,
## CDCOMMUN, CDGLOBAL and 7 more;
##  47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN,
## BCANKLE, BCCHEST and 42 more;
##  2 tentative attributes left: ApoEGeneAllele1, ApoEGeneAllele2;

Your results may differ slightly because Boruta relies on randomly permuted shadow features and random forests. We can plot the variable importance graph in the same way as in the ALS example above.

The final step is to resolve the tentative features, for example with TentativeRoughFix(), and to list the confirmed attributes.

## Boruta performed 99 iterations in 8.643105 secs.
## Tentatives roughfixed over the last 99 iterations.
##  14 attributes confirmed important: adascog, ApoEGeneAllele1,
## ApoEGeneAllele2, BCBREATH, CDCARE and 9 more;
##  47 attributes confirmed unimportant: Age, BC.USEA, BCABDOMN,
## BCANKLE, BCCHEST and 42 more;

##  [1] "MMSCORE"         "FAQTOTAL"        "adascog"        
##  [4] "sobcdr"          "DX_Confidence"   "BCBREATH"       
##  [7] "ApoEGeneAllele1" "ApoEGeneAllele2" "CDORIENT"       
## [10] "CDJUDGE"         "CDCOMMUN"        "CDHOME"         
## [13] "CDCARE"          "CDGLOBAL"
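The sequence of steps in this practice problem (run Boruta, resolve tentative attributes, extract the confirmed features) can be sketched on simulated data. This is only an illustration under stated assumptions: the data frame `sim`, its columns, and the seed are hypothetical, and the code is not the one that produced the output above.

```r
library(Boruta)

set.seed(2024)
n <- 300
# Hypothetical simulated data: x1 and x2 are informative, x3..x6 are pure noise
sim <- data.frame(matrix(rnorm(n * 6), ncol = 6))
names(sim) <- paste0("x", 1:6)
sim$y <- 2 * sim$x1 - 2 * sim$x2 + rnorm(n, sd = 0.3)

# Step 1: run Boruta on all candidate features
bor <- Boruta(y ~ ., data = sim, doTrace = 0)

# Step 2: resolve tentative attributes, only if any are left
bor.fixed <- if (any(bor$finalDecision == "Tentative")) TentativeRoughFix(bor) else bor

# Step 3: extract the names of the confirmed features
selected <- getSelectedAttributes(bor.fixed, withTentative = FALSE)
print(bor.fixed)
print(selected)
```

For the Alzheimer's data, the analogous call would use DXCURREN as the outcome and exclude the subject identifier SID from the predictors.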

Can you reproduce these results? Also try to apply some of these techniques to other data from the list of our Case-Studies.

Data Science and Predictive Analytics (UMich HS650)

Variable/Feature Selection