An AUC-based permutation variable importance measure for random forests

Type: Article

Publication Date: 2013-04-05

Citations: 233

DOI: https://doi.org/10.1186/1471-2105-14-119

Abstract

The random forest (RF) method is a commonly used tool for classification with high-dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions have been made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that the new AUC-based permutation VIM outperforms the standard permutation VIM for unbalanced data settings, while both permutation VIMs have equal performance for balanced data settings. With increasing class imbalance, the standard permutation VIM loses its ability to discriminate between predictors that are associated with the response and predictors that are not. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The code implementing our study is available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
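
As a rough illustration of the idea in the abstract, the following Python sketch contrasts an error-rate-based permutation importance with an AUC-based one on a small, strongly unbalanced toy data set. It is not the authors' implementation (that is in the R package party). The function names, the toy data, and the fixed scoring function `score_fn` standing in for a fitted forest are all assumptions made for this example. The sketch also assumes that the AUC-based measure is the drop in AUC after permuting a predictor, by analogy with the increase in error rate used by the standard measure, and it ignores trees and out-of-bag samples entirely.

```python
import random


def auc(scores, labels):
    """AUC as the Mann-Whitney statistic: P(score_pos > score_neg), ties count 1/2."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one observation from each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def error_rate(scores, labels, cutoff=0.5):
    """Share of observations misclassified when predicting class 1 for score > cutoff."""
    wrong = sum((s > cutoff) != (y == 1) for s, y in zip(scores, labels))
    return wrong / len(labels)


def mean_change_after_permutation(score_fn, X, y, column, metric, n_perm=50, seed=0):
    """Average of metric(permuted data) - metric(original data) over n_perm shuffles."""
    rng = random.Random(seed)
    baseline = metric([score_fn(row) for row in X], y)
    total = 0.0
    for _ in range(n_perm):
        values = [row[column] for row in X]
        rng.shuffle(values)  # break the link between this predictor and the response
        X_perm = [row[:column] + [v] + row[column + 1:] for row, v in zip(X, values)]
        total += metric([score_fn(row) for row in X_perm], y) - baseline
    return total / n_perm


# Toy data (assumed): 2 positives among 20 observations.
# Column 0 separates the classes perfectly; column 1 is unrelated to the response.
y = [1, 1] + [0] * 18
X = [[float(label), float(i % 3)] for i, label in enumerate(y)]


def score_fn(row):
    """Hypothetical stand-in for a fitted model: uses column 0 only, never exceeds 0.5."""
    return 0.05 + 0.3 * row[0]


# Error-based importance: increase in error rate after permutation.
vim_error_x0 = mean_change_after_permutation(score_fn, X, y, 0, error_rate)
# AUC-based importance: decrease in AUC after permutation (hence the sign flip).
vim_auc_x0 = -mean_change_after_permutation(score_fn, X, y, 0, auc)
vim_auc_x1 = -mean_change_after_permutation(score_fn, X, y, 1, auc)

print("error-based importance, informative predictor:", vim_error_x0)
print("AUC-based importance, informative predictor:  ", vim_auc_x0)
print("AUC-based importance, noise predictor:        ", vim_auc_x1)
```

In this toy setting every score stays below the 0.5 cutoff, so all observations are assigned to the majority class both before and after permutation, and the error-rate-based importance of the informative predictor is exactly zero. The AUC-based importance is positive because permutation degrades the ranking of the scores, which is the kind of robustness to class imbalance the abstract describes.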

Locations

  • BMC Bioinformatics - View - PDF
  • PubMed Central - View
  • CiteSeer X (The Pennsylvania State University) - View - PDF
  • Europe PMC (PubMed Central) - View - PDF
  • Zurich Open Repository and Archive (University of Zurich) - View - PDF
  • PubMed - View
  • Open Access LMU (Ludwig-Maximilians-Universität München) - View - PDF

Similar Works

  • Efficient permutation testing of variable importance measures by the example of random forests (2023). Alexander Hapfelmeier, Roman Hornung, Bernhard Haller
  • Sequential Permutation Testing of Random Forest Variable Importance Measures (2022). Alexander Hapfelmeier, Roman Hornung, Bernhard Haller
  • Conditional permutation importance revisited (2020). Dries Debeer, Carolin Strobl
  • Asymptotic Unbiasedness of the Permutation Importance Measure in Random Forest Models (2019). Burim Ramosaj, Markus Pauly
  • Random Forests for Ordinal Response Data: Prediction and Variable Selection (2014). Silke Janitza, Gerhard Tutz, Anne‐Laure Boulesteix
  • Consistent and unbiased variable selection under independent features using Random Forest permutation importance (2023). Burim Ramosaj, Markus Pauly
  • Party on! - A new, conditional variable importance measure for random forests available in party (2009). Carolin Strobl, Achim Zeileis
  • Challenges in Variable Importance Ranking Under Correlation (2024). Annie Liang, Thomas Jemielita, Andy Liaw, Vladimir Svetnik, Lingkang Huang, Richard Baumgartner, Jason M. Klusowski
  • Multi forests: Variable importance for multi-class outcomes (2024). Roman Hornung, Alexander Hapfelmeier
  • A Central Limit Theorem for the permutation importance measure (2024). Nico Föge, Lena Schmid, Marc Ditzhaus, Markus Pauly
  • rfvimptest: Sequential Permutation Testing of Random Forest Variable Importance Measures (2022). Alexander Hapfelmeier, Roman Hornung
  • Assessing agreement between permutation and dropout variable importance methods for regression and random forest models (2024). Kelvyn Bladen, D. Richard Cutler
  • Exploitation of surrogate variables in random forests for unbiased analysis of mutual impact and importance of features (2023). Lucas F. Voges, Lukas C. Jarren, Stephan Seifert
  • Exploiting tree-based variable importances to selectively identify relevant variables (2008). Vân Anh Huynh‐Thu, Louis Wehenkel, Pierre Geurts
  • Conditional variable importance for random forests (2008). Carolin Strobl, Anne‐Laure Boulesteix, Thomas Kneib, Thomas Augustin, Achim Zeileis
  • Danger: High Power! – Exploring the Statistical Properties of a Test for Random Forest Variable Importance (2008). Carolin Strobl, Achim Zeileis
  • Variable Importance Analysis in Imbalanced Datasets: A New Approach (2020). Ismael Ahrazem Dfuf, Joaquin Forte Perez-Minayo, José Manuel Mira McWilliams, C. González
  • Thresholding Gini variable importance with a single-trained random forest: An empirical Bayes approach (2023). Robert Dunne, Roc Reguant, Priya Ramarao-Milne, Piotr Szul, Letitia M. F. Sng, Mischa Lundberg, Natalie A. Twine, Denis C. Bauer

Works That Cite This (22)

  • Random forest for ordinal responses: Prediction and variable selection (2015). Silke Janitza, Gerhard Tutz, Anne‐Laure Boulesteix
  • Using machine learning to understand physics graduate school admissions (2020). Nicholas T. Young, Marcos D. Caballero
  • Bias in the intervention in prediction measure in random forests: illustrations and recommendations (2018). Stefano Nembrini
  • Variable Importance Analysis in Imbalanced Datasets: A New Approach (2020). Ismael Ahrazem Dfuf, Joaquin Forte Perez-Minayo, José Manuel Mira McWilliams, C. González
  • Predictive and explanatory models might miss informative features in educational data (2021). Nicholas T. Young, Marcos D. Caballero
  • A computationally fast variable importance test for random forests for high-dimensional data (2016). Silke Janitza, Ender Celik, Anne‐Laure Boulesteix
  • Subsampling Versus Bootstrapping in Resampling-Based Model Selection for Multivariable Regression (2015). Riccardo De Bin, Silke Janitza, Willi Sauerbrei, Anne‐Laure Boulesteix
  • Rubric-based holistic review represents a change from traditional graduate admissions approaches in physics (2023). Nicholas T. Young, N. Verboncoeur, Dao Chi Lam, Marcos D. Caballero
  • FeatureLTE: Learning to Estimate Feature Importance (2024). Tianping Zhang, Zhaoyang Wang, Chen Qian, Jian Li, Yin Lou