Statistics Seminar - Ana Maria Kenney

Location
Mathematical Sciences Building 1147

Speaker: Ana Maria Kenney, Post-Doctoral Scholar, UC Berkeley

Title: "Leveraging problem structure to improve feature recommendations in biomedical research"

Abstract: Random forests (RFs) have been shown to achieve high prediction accuracy and to capture the underlying data-generating mechanisms more precisely in many biomedical problems. However, they can lack interpretability, making it challenging to determine which biological features should be prioritized for further study. The cost and resources required for follow-up investigations (e.g., clinical trials) necessitate the development of improved, stable feature recommendations based on these highly predictive models.

In this talk, we first discuss highly collaborative, interdisciplinary work on identifying and testing epistatic drivers of cardiomyocyte hypertrophy through a machine learning approach based on iterative random forests. Through follow-up gene silencing experiments and cell imaging analysis, we were able to validate our recommendations and show that cardiomyocyte hypertrophy is modifiable by two specific pairwise interactions.

The challenges encountered in this work, and the insights into RFs gained from it, motivated the development of generalized mean decrease in impurity (GMDI), a unifying framework for random forest feature importances. We show that, for each fitted tree in an RF, the mean decrease in impurity (MDI) of a feature, MDI being the default importance method for RFs, equals the unnormalized R-squared value from a linear regression fitted using only the local decision stumps corresponding to nodes that split on that feature. GMDI goes beyond this restrictive ordinary least squares setting and allows the use of more appropriate models and metrics that can be tailored to different problem structures (e.g., using robust regression when there is potential data contamination). We demonstrate that this flexibility improves feature importance rankings, often by 10% or more, in terms of the AUROC for classifying signal vs. non-signal features. Moreover, in a case study on drug response prediction, GMDI extracts well-established predictive genes with greater stability and robustness than existing feature importance measures.
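
For readers who want a concrete picture of the MDI and R-squared connection, the following is a minimal sketch rather than the speaker's implementation. It uses a single tree with one split on made-up data. It assumes that "unnormalized R-squared" means the explained sum of squares divided by the sample size, and it uses one possible scaling of the local decision stump. The data, the threshold, and all function names are hypothetical.

```python
# Toy data: one feature x and a response y (made-up values, illustration only).
x = [0.1, 0.4, 0.5, 0.9, 1.3, 1.7, 2.0, 2.4]
y = [1.0, 1.2, 0.8, 1.1, 3.0, 3.4, 2.9, 3.3]
threshold = 1.0  # hypothetical split point at the root node


def mean(values):
    return sum(values) / len(values)


def variance(values):
    # Population variance, i.e. the squared-error impurity of a node.
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / len(values)


def mdi_single_split(x, y, t):
    """Impurity decrease of a one-split tree: parent variance minus
    the sample-weighted variances of the two children."""
    left = [yi for xi, yi in zip(x, y) if xi <= t]
    right = [yi for xi, yi in zip(x, y) if xi > t]
    n = len(y)
    return (variance(y)
            - len(left) / n * variance(left)
            - len(right) / n * variance(right))


def stump_unnormalized_r2(x, y, t):
    """Regress y on the local decision stump for the split and return the
    explained sum of squares divided by n (assumed meaning of
    'unnormalized R-squared')."""
    n = len(y)
    n_left = sum(1 for xi in x if xi <= t)
    n_right = n - n_left
    scale = (n_left * n_right) ** 0.5
    # Stump feature: positive on the left child, negative on the right,
    # scaled so that it sums to zero over the samples.
    psi = [n_right / scale if xi <= t else -n_left / scale for xi in x]
    y_bar = mean(y)
    # OLS with intercept; since psi has mean zero, the intercept is y_bar.
    slope = (sum(p * (yi - y_bar) for p, yi in zip(psi, y))
             / sum(p * p for p in psi))
    fitted = [y_bar + slope * p for p in psi]
    return sum((f - y_bar) ** 2 for f in fitted) / n


mdi_value = mdi_single_split(x, y, threshold)
r2_value = stump_unnormalized_r2(x, y, threshold)
print(mdi_value, r2_value)
```

The two printed numbers agree up to floating-point error, because the regression on the stump reproduces the two child means as fitted values. Replacing the least squares fit with another model or metric is the kind of generalization the abstract attributes to GMDI; this sketch does not attempt it.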

 

SEMINAR TIME/DATE: Wednesday, January 25, 11:00am

LOCATION: MSB 1147
