
As we mentioned in Chapter 15, variable selection is very important when dealing with bioinformatics, healthcare, and biomedical data, where we may have more features than observations. Instead of trying to interrogate the complete data in its native high-dimensional state, variable selection, or feature selection, helps us focus on the most salient information contained in the observations. Due to the presence of intrinsic and extrinsic noise, the volume and complexity of big health data, and various methodological and technological challenges, the process of identifying the salient features may resemble finding a needle in a haystack. Here, we will illustrate alternative strategies for feature selection using filtering (e.g., correlation-based feature selection), wrapping (e.g., recursive feature elimination), and embedding (e.g., variable importance via random forest classification) techniques.

Variable selection is related to dimensionality reduction, which we saw in Chapter 5; however, there are differences between them, summarized in the table below.

| Method | Process type | Goals | Approach |
|--------|--------------|-------|----------|
| Variable selection | Discrete process | Select unique representative features from each group of similar features | Identify highly correlated variables and choose a representative feature by post-processing the data |
| Dimension reduction | Continuous process | Denoise the data, enable simpler prediction, or group features so that low-impact features have smaller weights | Find the essential, \(k\ll n\), components, factors, or clusters representing linear or nonlinear functions of the \(n\) variables that maximize an objective function, such as the proportion of explained variance |

Relative to the lower-variance estimates of continuous dimensionality reduction, the intrinsically discrete nature of the feature selection process yields higher variance in bootstrap estimation and cross-validation.
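
To make the distinction concrete, below is a minimal base-R sketch using the built-in mtcars data purely for illustration; keeping the top 3 predictors and \(k=3\) components are arbitrary choices.

X <- mtcars[ , -1]          # predictors
y <- mtcars$mpg             # outcome
# discrete variable selection: keep the 3 predictors most correlated with y;
# the result is a subset of the original, named, interpretable features
sel <- names(sort(abs(cor(X, y)[ , 1]), decreasing = TRUE))[1:3]
X_selected <- X[ , sel]
# continuous dimension reduction: the first 3 principal components are
# linear mixtures of ALL predictors, not original variables
X_reduced <- prcomp(X, center = TRUE, scale. = TRUE)$x[ , 1:3]
colnames(X_selected); colnames(X_reduced)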

In Chapter 17, we will learn about another powerful variable selection technique using decoy features (knockoffs) to control the false discovery rate of selecting inconsequential features as important.

1 Feature selection methods

There are three major classes of variable or feature selection techniques: filtering-based, wrapper-based, and embedded methods.

1.1 Filtering techniques

  • Univariate: Univariate filtering methods focus on selecting individual features with high scores based on statistics like \(\chi^2\) or the information gain ratio. Each feature is viewed as independent of the others, effectively ignoring interactions between features.
    • Examples: \(\chi^2\), Euclidean distance, \(t\)-test, and information gain.
  • Multivariate: Multivariate filtering methods rely on various (multivariate) statistics to select the principal features. They typically account for between-feature interactions by using higher-order statistics like correlation. The basic idea is to iteratively triage variables that have high correlations with other features; a minimal sketch of correlation-based filtering follows this list.
    • Examples: Correlation-based feature selection, Markov blanket filter, and fast correlation-based feature selection.
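
Here is the promised correlation-based filtering sketch, using caret's findCorrelation() helper on the built-in mtcars data; the 0.75 cutoff is an arbitrary illustrative choice.

# install.packages("caret")
library(caret)
corMat <- cor(mtcars[ , -1])                        # pairwise feature correlations
highCor <- findCorrelation(corMat, cutoff = 0.75)   # indices of columns to triage
X_filtered <- mtcars[ , -1][ , -highCor]            # retained representative features
colnames(X_filtered)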

1.2 Wrapper techniques

  • Deterministic: Deterministic wrapper feature selection methods either start with no features (forward selection) or with all features included in the model (backward selection) and iteratively refine the set of chosen features according to some model quality measure. The iterative process of adding or removing features may rely on statistics like the Jaccard similarity coefficient; a minimal recursive feature elimination sketch follows this list.
    • Examples: Sequential forward selection, Recursive Feature Elimination, Plus \(q\) take-away \(r\), and Beam search.
  • Randomized: Stochastic wrapper feature selection procedures utilize a binary feature-indexing vector indicating whether or not each variable should be included in the list of salient features. At each iteration, we randomly perturb the binary indicator vector and compare the feature combinations before and after the random inclusion-exclusion change. Finally, we pick the indexing vector corresponding to the optimal performance based on some metric, like acceptance probability measures. The iterative process continues until no improvement of the objective function is observed.
    • Examples: Simulated annealing, Genetic algorithms, Estimation of distribution algorithms.
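
Here is the promised recursive feature elimination sketch (a deterministic wrapper) using caret::rfe() with its random-forest fitting functions, rfFuncs; the subset sizes, fold count, and mtcars toy data are illustrative assumptions.

# install.packages("caret")
library(caret)
set.seed(1234)
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
rfeFit <- rfe(x = mtcars[ , -1], y = mtcars$mpg,
              sizes = c(2, 4, 6), rfeControl = ctrl)
predictors(rfeFit)    # the selected feature subset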

1.3 Embedded techniques

  • Embedded feature selection techniques are based on various classifiers, predictors, or clustering procedures. For instance, we can accomplish feature selection by using decision trees, where the separation of the training data relies on the features associated with the highest information gain; deeper tree branching may then utilize weaker features. This process of choosing the vital features based on their separability characteristics continues until the classifier generates group labels that are mostly homogeneous within clusters/classes and largely heterogeneous across groups, and the information gain of further branching is marginal. The entire process may be iterated multiple times, selecting the features that appear most frequently; a minimal random forest sketch follows this list.
    • Examples: Decision trees, random forests, weighted naive Bayes, and feature selection using weighted-SVM.
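
As a minimal embedded-selection sketch, we can fit a random forest with importance tracking and rank the variables; the randomForest package, the tree count, and the mtcars toy data are illustrative choices.

# install.packages("randomForest")
library(randomForest)
set.seed(1234)
rf <- randomForest(mpg ~ ., data = mtcars, importance = TRUE, ntree = 500)
importance(rf)     # %IncMSE (permutation) and IncNodePurity (impurity) measures
varImpPlot(rf)     # rank variables by both importance criteria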

The different types of feature selection methods have their own pros and cons. In this chapter, we introduce the randomized wrapper method implemented in the Boruta package, which utilizes a random forest classifier to output variable importance measures (VIMs). Then, we will compare its results with Recursive Feature Elimination, a classical deterministic wrapper method.

2 Random Forest Feature Selection

Let’s start by examining random forest based feature selection as an embedded technique. The good performance of random forests for classification, regression, and clustering is coupled with their ease of use and accurate, robust results. A random forest, or more broadly a decision tree, prediction model naturally leads to feature selection via the mean decrease in impurity or the mean decrease in accuracy criteria.

The many decision trees comprising a random forest include explicit conditions at each branching node, each based on a single feature. The intrinsic bifurcation conditions splitting the data may be based on cost-function optimization using the impurity (see Chapter 8); for classification problems, we can also use other metrics such as information gain or entropy. These measures capture a variable's importance by computing its impact, i.e., how much the feature-based splitting decision decreases the weighted impurity in a tree. In random forests, ranking features by the average impurity decrease attributable to each variable leads to effective feature selection.
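
For concreteness, here is one standard formulation of this idea (a sketch of the general mechanism, not Boruta's exact internal computation). For a classification node \(t\) with class proportions \(p_k(t)\), the Gini impurity is

\[ I_G(t)=1-\sum_{k=1}^{K}p_k^2(t), \]

and a split \(s\) at node \(t\) decreases the impurity by \(\Delta I(s,t)=I(t)-w_L I(t_L)-w_R I(t_R)\), where \(w_L\) and \(w_R\) are the fractions of observations routed to the left and right child nodes. The mean decrease impurity importance of feature \(X_j\) then averages these weighted gains over all nodes that split on \(X_j\) across the \(T\) trees of the forest,

\[ VIM(X_j)=\frac{1}{T}\sum_{m=1}^{T}\sum_{\substack{t\in \text{tree}_m \\ v(t)=X_j}} w_t\,\Delta I(s,t), \]

where \(w_t\) is the proportion of samples reaching node \(t\) and \(v(t)\) is the variable used to split node \(t\).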

3 Case Study - ALS

3.1 Step 1: Collecting data

First things first, let’s explore the dataset we will be using. Case Study 15, Amyotrophic Lateral Sclerosis (ALS), examines the patterns, symmetries, associations, and causality in this rare but devastating disease, also known as Lou Gehrig’s disease. The ALS case-study reflects a large clinical trial including big, multi-source, and heterogeneous datasets. It would be interesting to interrogate the data and attempt to derive potential biomarkers that can be used for detecting, prognosticating, and forecasting the progression of this neurodegenerative disorder. Establishing complete, efficient, and reproducible protocols for such complex data requires overcoming many scientific, technical, and infrastructure barriers. These pipeline workflows start with ingesting the raw data and proceed through preprocessing, aggregating, harmonizing, analyzing, visualizing, and interpreting the findings.

In this case-study, we use the training dataset that contains 2,223 observations and 131 numeric variables. We select ALSFRS slope as our outcome variable, as it captures the patients’ clinical decline over a year. Although we have more observations than features, this is one of the examples where multiple features are highly correlated. Therefore, we need to preprocess the variables before commencing with feature selection.

3.2 Step 2: Exploring and preparing the data

The dataset is located in our case-studies archive. We can use read.csv() to directly import the CSV dataset into R using the URL reference.

ALS.train<-read.csv("https://umich.instructure.com/files/1789624/download?download_frd=1")
summary(ALS.train)
##        ID            Age_mean      Albumin_max    Albumin_median 
##  Min.   :   1.0   Min.   :18.00   Min.   :37.00   Min.   :34.50  
##  1st Qu.: 614.5   1st Qu.:47.00   1st Qu.:45.00   1st Qu.:42.00  
##  Median :1213.0   Median :55.00   Median :47.00   Median :44.00  
##  Mean   :1214.9   Mean   :54.55   Mean   :47.01   Mean   :43.95  
##  3rd Qu.:1815.5   3rd Qu.:63.00   3rd Qu.:49.00   3rd Qu.:46.00  
##  Max.   :2424.0   Max.   :81.00   Max.   :70.30   Max.   :51.10  
##   Albumin_min    Albumin_range       ALSFRS_slope     ALSFRS_Total_max
##  Min.   :24.00   Min.   :0.000000   Min.   :-4.3452   Min.   :11.00   
##  1st Qu.:39.00   1st Qu.:0.009042   1st Qu.:-1.0863   1st Qu.:29.00   
##  Median :41.00   Median :0.012111   Median :-0.6207   Median :33.00   
##  Mean   :40.77   Mean   :0.013779   Mean   :-0.7283   Mean   :31.69   
##  3rd Qu.:43.00   3rd Qu.:0.015873   3rd Qu.:-0.2838   3rd Qu.:36.00   
##  Max.   :49.00   Max.   :0.243902   Max.   : 1.2070   Max.   :40.00   
##  ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range ALT.SGPT._max   
##  Min.   : 2.5        Min.   : 0.00    Min.   :0.00000    Min.   : 10.00  
##  1st Qu.:23.0        1st Qu.:14.00    1st Qu.:0.01404    1st Qu.: 32.00  
##  Median :28.0        Median :20.00    Median :0.02330    Median : 45.00  
##  Mean   :27.1        Mean   :19.88    Mean   :0.02604    Mean   : 54.44  
##  3rd Qu.:32.0        3rd Qu.:27.00    3rd Qu.:0.03480    3rd Qu.: 65.00  
##  Max.   :40.0        Max.   :40.00    Max.   :0.11765    Max.   :944.00  
##  ALT.SGPT._median ALT.SGPT._min    ALT.SGPT._range    AST.SGOT._max   
##  Min.   :  8.00   Min.   :  1.60   Min.   :0.002747   Min.   : 11.00  
##  1st Qu.: 22.00   1st Qu.: 15.00   1st Qu.:0.030303   1st Qu.: 30.00  
##  Median : 30.00   Median : 21.00   Median :0.047619   Median : 38.00  
##  Mean   : 32.99   Mean   : 23.01   Mean   :0.071137   Mean   : 43.13  
##  3rd Qu.: 40.00   3rd Qu.: 28.00   3rd Qu.:0.077539   3rd Qu.: 48.00  
##  Max.   :193.00   Max.   :109.00   Max.   :2.383117   Max.   :911.00  
##  AST.SGOT._median AST.SGOT._min   AST.SGOT._range   Bicarbonate_max
##  Min.   :  9.00   Min.   : 1.00   Min.   :0.00000   Min.   :20.0   
##  1st Qu.: 22.00   1st Qu.:17.00   1st Qu.:0.02352   1st Qu.:29.0   
##  Median : 27.00   Median :20.00   Median :0.03502   Median :31.0   
##  Mean   : 29.08   Mean   :21.54   Mean   :0.04919   Mean   :30.9   
##  3rd Qu.: 34.00   3rd Qu.:25.00   3rd Qu.:0.05243   3rd Qu.:32.0   
##  Max.   :100.00   Max.   :86.00   Max.   :1.91667   Max.   :52.0   
##  Bicarbonate_median Bicarbonate_min Bicarbonate_range
##  Min.   :19.50      Min.   : 2.50   Min.   :0.00000  
##  1st Qu.:26.00      1st Qu.:22.00   1st Qu.:0.01266  
##  Median :27.00      Median :23.00   Median :0.01493  
##  Mean   :26.96      Mean   :23.16   Mean   :0.01687  
##  3rd Qu.:28.00      3rd Qu.:24.45   3rd Qu.:0.01815  
##  Max.   :39.50      Max.   :34.00   Max.   :0.21429  
##  Blood.Urea.Nitrogen..BUN._max Blood.Urea.Nitrogen..BUN._median
##  Min.   : 2.921                Min.   : 2.191                  
##  1st Qu.: 5.842                1st Qu.: 4.640                  
##  Median : 6.937                Median : 5.423                  
##  Mean   : 7.353                Mean   : 5.558                  
##  3rd Qu.: 8.210                3rd Qu.: 6.353                  
##  Max.   :25.192                Max.   :11.866                  
##  Blood.Urea.Nitrogen..BUN._min Blood.Urea.Nitrogen..BUN._range bp_diastolic_max
##  Min.   : 0.5842               Min.   :0.000000                Min.   : 70.00  
##  1st Qu.: 3.2859               1st Qu.:0.004109                1st Qu.: 88.00  
##  Median : 4.0700               Median :0.005817                Median : 90.00  
##  Mean   : 4.1609               Mean   :0.007133                Mean   : 92.03  
##  3rd Qu.: 5.0000               3rd Qu.:0.008353                3rd Qu.: 98.00  
##  Max.   :10.2228               Max.   :0.069543                Max.   :140.00  
##  bp_diastolic_median bp_diastolic_min bp_diastolic_range bp_systolic_max
##  Min.   : 56.00      Min.   : 20.00   Min.   :0.00000    Min.   :100.0  
##  1st Qu.: 78.00      1st Qu.: 65.00   1st Qu.:0.03527    1st Qu.:138.0  
##  Median : 80.00      Median : 70.00   Median :0.04337    Median :145.0  
##  Mean   : 81.11      Mean   : 69.89   Mean   :0.04766    Mean   :147.1  
##  3rd Qu.: 85.00      3rd Qu.: 75.00   3rd Qu.:0.05435    3rd Qu.:157.0  
##  Max.   :110.00      Max.   :100.00   Max.   :0.71429    Max.   :220.0  
##  bp_systolic_median bp_systolic_min bp_systolic_range  Calcium_max   
##  Min.   : 90.0      Min.   : 72.0   Min.   :0.00000   Min.   :2.171  
##  1st Qu.:120.0      1st Qu.:108.0   1st Qu.:0.05272   1st Qu.:2.400  
##  Median :130.0      Median :110.0   Median :0.06494   Median :2.470  
##  Mean   :129.6      Mean   :113.4   Mean   :0.07118   Mean   :2.475  
##  3rd Qu.:136.0      3rd Qu.:120.0   3rd Qu.:0.08190   3rd Qu.:2.530  
##  Max.   :190.0      Max.   :165.0   Max.   :0.40462   Max.   :9.460  
##  Calcium_median   Calcium_min     Calcium_range        Chloride_max  
##  Min.   :2.046   Min.   :0.2438   Min.   :0.0000000   Min.   : 96.0  
##  1st Qu.:2.283   1st Qu.:2.1707   1st Qu.:0.0003741   1st Qu.:106.0  
##  Median :2.345   Median :2.2300   Median :0.0004739   Median :107.0  
##  Mean   :2.346   Mean   :2.2229   Mean   :0.0005407   Mean   :107.2  
##  3rd Qu.:2.400   3rd Qu.:2.2977   3rd Qu.:0.0005893   3rd Qu.:109.0  
##  Max.   :2.800   Max.   :2.6500   Max.   :0.0129009   Max.   :119.0  
##  Chloride_median  Chloride_min    Chloride_range    Creatinine_max  
##  Min.   : 90.0   Min.   : 76.00   Min.   :0.00000   Min.   : 22.00  
##  1st Qu.:102.0   1st Qu.: 98.00   1st Qu.:0.01250   1st Qu.: 65.00  
##  Median :104.0   Median :100.00   Median :0.01587   Median : 79.56  
##  Mean   :103.5   Mean   : 99.26   Mean   :0.01787   Mean   : 78.78  
##  3rd Qu.:105.0   3rd Qu.:101.00   3rd Qu.:0.01990   3rd Qu.: 88.40  
##  Max.   :111.0   Max.   :109.00   Max.   :0.21429   Max.   :248.00  
##  Creatinine_median Creatinine_min   Creatinine_range   Gender_mean   
##  Min.   : 18.00    Min.   :  0.00   Min.   :0.00000   Min.   :1.000  
##  1st Qu.: 53.04    1st Qu.: 39.00   1st Qu.:0.03824   1st Qu.:1.000  
##  Median : 62.00    Median : 53.00   Median :0.04865   Median :2.000  
##  Mean   : 65.19    Mean   : 51.98   Mean   :0.05842   Mean   :1.637  
##  3rd Qu.: 78.85    3rd Qu.: 61.88   3rd Qu.:0.07026   3rd Qu.:2.000  
##  Max.   :176.80    Max.   :167.96   Max.   :0.42095   Max.   :2.000  
##   Glucose_max     Glucose_median    Glucose_min     Glucose_range     
##  Min.   : 4.160   Min.   : 3.497   Min.   : 0.000   Min.   :0.000000  
##  1st Qu.: 5.827   1st Qu.: 4.911   1st Qu.: 4.051   1st Qu.:0.003051  
##  Median : 6.500   Median : 5.300   Median : 4.440   Median :0.004695  
##  Mean   : 7.160   Mean   : 5.487   Mean   : 4.265   Mean   :0.006319  
##  3rd Qu.: 7.600   3rd Qu.: 5.695   3rd Qu.: 4.800   3rd Qu.:0.007373  
##  Max.   :33.688   Max.   :26.196   Max.   :12.200   Max.   :0.097463  
##    hands_max      hands_median     hands_min      hands_range      
##  Min.   :0.000   Min.   :0.000   Min.   :0.000   Min.   :0.000000  
##  1st Qu.:5.000   1st Qu.:3.000   1st Qu.:0.000   1st Qu.:0.003610  
##  Median :7.000   Median :5.500   Median :3.000   Median :0.006652  
##  Mean   :6.181   Mean   :4.905   Mean   :3.047   Mean   :0.006883  
##  3rd Qu.:8.000   3rd Qu.:7.000   3rd Qu.:5.000   3rd Qu.:0.009513  
##  Max.   :8.000   Max.   :8.000   Max.   :8.000   Max.   :0.042857  
##  Hematocrit_max   Hematocrit_median Hematocrit_min   Hematocrit_range  
##  Min.   : 0.373   Min.   : 0.362    Min.   : 0.311   Min.   :0.000000  
##  1st Qu.:42.300   1st Qu.:40.000    1st Qu.:37.000   1st Qu.:0.007164  
##  Median :45.200   Median :42.600    Median :40.000   Median :0.009701  
##  Mean   :41.939   Mean   :39.467    Mean   :36.962   Mean   :0.011431  
##  3rd Qu.:47.700   3rd Qu.:45.000    3rd Qu.:42.700   3rd Qu.:0.013579  
##  Max.   :81.000   Max.   :56.000    Max.   :52.900   Max.   :0.185714  
##  Hemoglobin_max  Hemoglobin_median Hemoglobin_min    Hemoglobin_range 
##  Min.   :116.0   Min.   :106.0     Min.   :  6.204   Min.   :0.00000  
##  1st Qu.:144.0   1st Qu.:136.0     1st Qu.:128.000   1st Qu.:0.02321  
##  Median :152.0   Median :145.0     Median :136.000   Median :0.03106  
##  Mean   :152.1   Mean   :144.3     Mean   :135.461   Mean   :0.03824  
##  3rd Qu.:160.0   3rd Qu.:152.0     3rd Qu.:145.000   3rd Qu.:0.04205  
##  Max.   :280.0   Max.   :182.0     Max.   :180.000   Max.   :0.56180  
##     leg_max       leg_median      leg_min        leg_range       
##  Min.   :0.00   Min.   :0.00   Min.   :0.000   Min.   :0.000000  
##  1st Qu.:3.00   1st Qu.:2.50   1st Qu.:1.000   1st Qu.:0.003378  
##  Median :5.00   Median :3.00   Median :2.000   Median :0.005435  
##  Mean   :5.31   Mean   :4.05   Mean   :2.493   Mean   :0.006163  
##  3rd Qu.:8.00   3rd Qu.:6.00   3rd Qu.:3.000   3rd Qu.:0.008718  
##  Max.   :8.00   Max.   :8.00   Max.   :8.000   Max.   :0.042017  
##    mouth_max      mouth_median      mouth_min       mouth_range      
##  Min.   : 1.00   Min.   : 0.000   Min.   : 0.000   Min.   :0.000000  
##  1st Qu.:10.00   1st Qu.: 8.000   1st Qu.: 5.000   1st Qu.:0.001815  
##  Median :12.00   Median :11.000   Median : 9.000   Median :0.005329  
##  Mean   :10.74   Mean   : 9.703   Mean   : 7.778   Mean   :0.006595  
##  3rd Qu.:12.00   3rd Qu.:12.000   3rd Qu.:11.000   3rd Qu.:0.010251  
##  Max.   :12.00   Max.   :12.000   Max.   :12.000   Max.   :0.036765  
##  onset_delta_mean onset_site_mean Platelets_max   Platelets_median
##  Min.   :-3119    Min.   :1.000   Min.   : 84.0   Min.   : 73.0   
##  1st Qu.: -887    1st Qu.:2.000   1st Qu.:239.0   1st Qu.:204.0   
##  Median : -572    Median :2.000   Median :275.0   Median :233.0   
##  Mean   : -683    Mean   :1.801   Mean   :285.3   Mean   :238.8   
##  3rd Qu.: -374    3rd Qu.:2.000   3rd Qu.:320.0   3rd Qu.:270.0   
##  Max.   :  -16    Max.   :3.000   Max.   :866.0   Max.   :526.0   
##  Platelets_min     Potassium_max    Potassium_median Potassium_min  
##  Min.   :  0.197   Min.   : 3.400   Min.   :3.000    Min.   :2.400  
##  1st Qu.:175.000   1st Qu.: 4.400   1st Qu.:4.000    1st Qu.:3.700  
##  Median :204.000   Median : 4.500   Median :4.200    Median :3.900  
##  Mean   :208.382   Mean   : 4.628   Mean   :4.189    Mean   :3.857  
##  3rd Qu.:236.000   3rd Qu.: 4.800   3rd Qu.:4.300    3rd Qu.:4.000  
##  Max.   :476.000   Max.   :43.000   Max.   :5.100    Max.   :5.100  
##  Potassium_range      pulse_max       pulse_median      pulse_min     
##  Min.   :0.000000   Min.   : 53.00   Min.   : 50.00   Min.   : 18.00  
##  1st Qu.:0.001058   1st Qu.: 84.00   1st Qu.: 72.00   1st Qu.: 60.00  
##  Median :0.001425   Median : 90.00   Median : 77.00   Median : 64.00  
##  Mean   :0.001744   Mean   : 90.64   Mean   : 76.97   Mean   : 65.37  
##  3rd Qu.:0.001913   3rd Qu.: 96.00   3rd Qu.: 81.00   3rd Qu.: 70.00  
##  Max.   :0.098674   Max.   :144.00   Max.   :115.00   Max.   :102.00  
##   pulse_range       respiratory_max respiratory_median respiratory_min
##  Min.   :0.005425   Min.   :2.00    Min.   :0.000      Min.   :0.000  
##  1st Qu.:0.036755   1st Qu.:4.00    1st Qu.:3.000      1st Qu.:2.000  
##  Median :0.048821   Median :4.00    Median :4.000      Median :3.000  
##  Mean   :0.053587   Mean   :3.91    Mean   :3.593      Mean   :2.791  
##  3rd Qu.:0.062365   3rd Qu.:4.00    3rd Qu.:4.000      3rd Qu.:4.000  
##  Max.   :0.500000   Max.   :4.00    Max.   :4.000      Max.   :4.000  
##  respiratory_range    Sodium_max    Sodium_median     Sodium_min   
##  Min.   :0.000000   Min.   :134.0   Min.   :128.0   Min.   :112.0  
##  1st Qu.:0.000000   1st Qu.:142.0   1st Qu.:139.0   1st Qu.:135.0  
##  Median :0.001828   Median :143.0   Median :140.0   Median :137.0  
##  Mean   :0.002513   Mean   :143.4   Mean   :140.1   Mean   :136.8  
##  3rd Qu.:0.003653   3rd Qu.:145.0   3rd Qu.:141.0   3rd Qu.:138.0  
##  Max.   :0.025424   Max.   :169.0   Max.   :146.5   Max.   :145.0  
##   Sodium_range       SubjectID        trunk_max      trunk_median  
##  Min.   :0.00000   Min.   :   533   Min.   :0.000   Min.   :0.000  
##  1st Qu.:0.01058   1st Qu.:240826   1st Qu.:5.000   1st Qu.:3.000  
##  Median :0.01312   Median :496835   Median :7.000   Median :5.000  
##  Mean   :0.01500   Mean   :498880   Mean   :6.204   Mean   :4.893  
##  3rd Qu.:0.01728   3rd Qu.:750301   3rd Qu.:8.000   3rd Qu.:6.500  
##  Max.   :0.14286   Max.   :999482   Max.   :8.000   Max.   :8.000  
##    trunk_min      trunk_range        Urine.Ph_max  Urine.Ph_median
##  Min.   :0.000   Min.   :0.000000   Min.   :5.00   Min.   :5.000  
##  1st Qu.:1.000   1st Qu.:0.003643   1st Qu.:6.00   1st Qu.:5.000  
##  Median :3.000   Median :0.006920   Median :7.00   Median :6.000  
##  Mean   :2.956   Mean   :0.007136   Mean   :6.82   Mean   :5.711  
##  3rd Qu.:5.000   3rd Qu.:0.009639   3rd Qu.:7.00   3rd Qu.:6.000  
##  Max.   :8.000   Max.   :0.042017   Max.   :9.00   Max.   :9.000  
##   Urine.Ph_min  
##  Min.   :5.000  
##  1st Qu.:5.000  
##  Median :5.000  
##  Mean   :5.183  
##  3rd Qu.:5.000  
##  Max.   :8.000

There are 131 features, and some variables represent summary statistics, like the max, min, and median values, of the same clinical measurements.
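
A quick, illustrative check of this redundancy is shown below; the \(|r|>0.75\) threshold is an arbitrary choice for this example.

cor(ALS.train$Albumin_max, ALS.train$Albumin_median)  # summaries of the same lab
# count predictor pairs with high pairwise correlation
cm <- cor(ALS.train[ , !(names(ALS.train) %in% c("ID"))])
sum(abs(cm[upper.tri(cm)]) > 0.75)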

3.3 Step 3: Training a model on the data

Now let’s explore the Boruta() function in the Boruta package to perform variable selection based on random forest classification. Boruta() includes the following components:

vs<-Boruta(class~features, data=Mydata, pValue = 0.01, mcAdj = TRUE, maxRuns = 100, doTrace=0, getImp = getImpRfZ, ...)

  • class: variable for class labels.
  • features: potential features to select from.
  • data: dataset containing classes and features.
  • pValue: confidence level; the default value is 0.01. (Notice that we are testing many variables simultaneously, which is why the multiple-comparisons adjustment below is applied.)
  • mcAdj: Default TRUE to apply a multiple comparisons adjustment using the Bonferroni method.
  • maxRuns: maximal number of importance source runs. You may increase it to resolve attributes left Tentative.
  • doTrace: verbosity level. The default, 0, means no tracing; 1 reports the decision about each attribute as soon as it is justified; 2 adds, at each importance source run, a report of the number of attributes under consideration.
  • getImp: function used to obtain attribute importance. The default, getImpRfZ, runs random forest from the ranger package and gathers \(Z\)-scores of the mean decrease in accuracy measure (see the example after this list).
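
For example, we can swap the importance source. The hedged sketch below uses getImpRfGini, another importance adapter shipped with Boruta that requests ranger's impurity-based importance (Gini for classification, variance for regression); the object name alsGini is purely illustrative.

# alternative importance source (impurity-based instead of accuracy Z-scores)
alsGini <- Boruta(ALSFRS_slope ~ . - ID, data = ALS.train, getImp = getImpRfGini)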

The resulting vs object is of class Boruta and contains two important components:

  • finalDecision: a factor of three values: Confirmed, Rejected or Tentative, containing the final results of the feature selection process.
  • ImpHistory: a data frame of importance of attributes gathered in each importance source run. Besides the predictors’ importance, it contains maximal, mean and minimal importance of shadow attributes for each run. Rejected attributes get -Inf importance. This output is set to NULL if we specify holdHistory=FALSE in the Boruta call.

Caution: Running the code below will take several minutes.

# install.packages("Boruta")
library(Boruta)
set.seed(123)
als<-Boruta(ALSFRS_slope~.-ID, data=ALS.train, doTrace=0)
print(als)
## Boruta performed 99 iterations in 3.629015 mins.
##  27 attributes confirmed important: ALSFRS_Total_max,
## ALSFRS_Total_median, ALSFRS_Total_min, ALSFRS_Total_range,
## Creatinine_max and 22 more;
##  60 attributes confirmed unimportant: Age_mean, Albumin_max,
## Albumin_median, Albumin_min, Albumin_range and 55 more;
##  12 tentative attributes left: ALT.SGPT._min, Chloride_range,
## Hematocrit_max, Hematocrit_median, Hematocrit_min and 7 more;
als$ImpHistory[1:6, 1:10]
##       Age_mean Albumin_max Albumin_median Albumin_min Albumin_range
## [1,] 2.2680963  0.37764697      0.2375537  -0.1580937     2.7918574
## [2,] 2.0267252  1.39739377      1.4813602   0.6770461     1.7430500
## [3,] 2.3157588 -0.58408581      1.0305236   2.0934090     0.8981331
## [4,] 2.4953558 -0.94574532      0.1539726   1.4514634     2.2579837
## [5,] 0.6570802  0.07801328     -0.7698394   1.6172399     1.9590540
## [6,] 2.9302386  0.99320619     -0.1421461   1.2192271     1.7620833
##      ALSFRS_Total_max ALSFRS_Total_median ALSFRS_Total_min ALSFRS_Total_range
## [1,]         7.197587            7.769678         17.48219           25.79845
## [2,]         7.887404            8.688664         15.49813           26.35402
## [3,]         7.779168            8.822599         16.64904           25.56681
## [4,]         8.694571            7.061077         17.00731           24.67569
## [5,]         8.352961            8.404101         16.20194           27.51207
## [6,]         8.704381            7.606126         17.05258           27.01024
##      ALT.SGPT._max
## [1,]     0.5698794
## [2,]     0.6220453
## [3,]     1.3444379
## [4,]     1.9128324
## [5,]    -0.3869214
## [6,]     2.1655440

This is a fairly time-consuming computation. Boruta separates the important attributes from the unimportant and tentative features. Here, the importance is measured by the out-of-bag (OOB) error. The OOB error estimates the prediction error of machine learning methods (e.g., random forests and boosted decision trees) that utilize bootstrap aggregation to subsample the training data. It represents the mean prediction error on each training sample \(x_i\), using only the trees that did not include \(x_i\) in their bootstrap samples. Out-of-bag estimates provide an internal assessment of the learning accuracy and avoid the need for an independent external validation dataset.
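
In symbols, if \(\hat{f}^{OOB}(x_i)\) denotes the aggregate prediction from only those trees whose bootstrap samples exclude \((x_i, y_i)\), then for a loss function \(L\) (e.g., squared error for regression, misclassification for classification)

\[ \widehat{err}_{OOB}=\frac{1}{n}\sum_{i=1}^{n} L\left(y_i,\ \hat{f}^{OOB}(x_i)\right). \]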

The importance scores for all features at every iteration are stored in the data frame als$ImpHistory. Let’s plot a graph depicting the essential features.

Note: Again, running this code will take several minutes to complete.

plot(als, xlab="", xaxt="n")
lz<-lapply(1:ncol(als$ImpHistory), function(i)
als$ImpHistory[is.finite(als$ImpHistory[, i]), i])
names(lz)<-colnames(als$ImpHistory)
lb<-sort(sapply(lz, median))
axis(side=1, las=2, labels=names(lb), at=1:ncol(als$ImpHistory), cex.axis=0.5, font = 4)
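
Beyond the plot, the Boruta package provides helpers to extract the confirmed attributes and to force a Confirmed/Rejected call on the remaining Tentative ones (a rough fix that compares each tentative attribute's median importance with the median importance of the best shadow attribute):

getSelectedAttributes(als, withTentative = FALSE)  # confirmed features only
final.als <- TentativeRoughFix(als)                # resolve Tentative attributes
print(final.als)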