When developing a Machine Learning model, if there is a significant number of features to inspect, an initial and manual Exploratory Data Analysis may become tedious and nonproductive. One option is to facilitate the process by programmatically testing and identifying important variables based on statistical methods to help trim down features. And that is where Boruta comes in place.
A forest spirit in the Slavic mythology, Boruta (also called Leśny or Lešny) was portrayed as an imposing figure, with horns over the head, surrounded by packs of wolves and bears. Fortunately, in R Boruta is a helpful package for facilitating a feature selection process. Here’s a description from the documentation:
Boruta is an all relevant feature selection wrapper algorithm, capable of working with any classification method that output variable importance measure (VIM); by default, Boruta uses Random Forest. The method performs a top-down search for relevant features by comparing original attributes’ importance with importance achievable at random, estimated using their permuted copies, and progressively eliminating irrelevant features to stabilize that test.
The following is a sample routine in R demonstrating how I used Boruta to find a starting point for features selection. Some noticeable settings include:
- The input data was train.imp.
- doTrace=2 will log the activities and show progress to console.
- maxRuns is how many times Boruta should run. In some circumstances (too short Boruta run, unfortunate mixing of shadow attributes, tricky dataset. . . ), Boruta may leave some attributes Tentative. For my particular case, the first 100 runs (which is a good initial value to start) confirmed most of the features with a few remain tentative. And I set it to 500 to finally resolve 80 features I was interested in and it took about half an hour.
- TentativeRoughFix performs a simplified, weaker test for judging such attributes. This function should be used with discretion, since this weak test can lower the confidence of the final results.
- getSelectedAttributes does what it sounds like.
- attStats keep the statistics and the result of each resolved variable.
train.boruta <- Boruta(SalePrice~., data=train.imp, doTrace=2, maxRuns=500)
plot(train.boruta , las=2, cex.axis=0.7, xlab='')
train.boruta.fix <- TentativeRoughFix(train.boruta)
train.boruta.selected.features <- getSelectedAttributes(train.boruta.fix, withTentative = F)
train.boruta.selected.features.stats <- attStats(train.boruta.fix)
Also included are plots of Boruta output and attStats. Those confirmed important were in green and rejected in red. Unresolved variables were in yellow and classified as tentative which Boruta was not able to conclude their importance. And attStats kept and reported the statistics associated with the decisions.
Boruta uses Random Forest algorithm to provide educated sets of important and not so important features, respectively. Not only save time, but offer a repeatable and automatic way for initial exploratory data analysis.
Feature selection is a critical task in developing a Machine Learning model. Extraneous features introduce multicollinearity, increase variance and lead to overfitting. Data is everything and feature selection is as critical. This is a task that can consume much of model development time. And for me, making the routine a code snippet and getting the mechanics in place help me become productive much quicker. A next logical step is to programmatically consume and integrate Boruta output to build and train a preliminary Machine Learning model to possibly establish a baseline of a target algorithm. Stay tuned for that.