Nonparametric IPSS: fast, flexible feature selection with false discovery control.

dc.contributor.author

Melikechi, Omar

dc.contributor.author

Dunson, David B

dc.contributor.author

Miller, Jeffrey W

dc.contributor.editor

Mathelier, Anthony

dc.date.accessioned

2025-11-22T03:10:13Z

dc.date.available

2025-11-22T03:10:13Z

dc.date.issued

2025-05

dc.description.abstract

Motivation

Feature selection is a critical task in machine learning and statistics. However, existing feature selection methods either (i) rely on parametric methods such as linear or generalized linear models, (ii) lack theoretical false discovery control, or (iii) identify few true positives.

Results

We introduce a general feature selection method with finite-sample false discovery control based on applying integrated path stability selection (IPSS) to arbitrary feature importance scores. The method is nonparametric whenever the importance scores are nonparametric, and it estimates q-values, which are better suited to high-dimensional data than P-values. We focus on two special cases using importance scores from gradient boosting (IPSSGB) and random forests (IPSSRF). Extensive nonlinear simulations with RNA sequencing data show that both methods accurately control the false discovery rate and detect more true positives than existing methods. Both methods are also efficient, running in under 20 s when there are 500 samples and 5000 features. We apply IPSSGB and IPSSRF to detect microRNAs and genes related to cancer, finding that they yield better predictions with fewer features than existing approaches.

Availability and implementation

All code and data used in this work are available on GitHub (https://github.com/omelikechi/ipss_bioinformatics) and permanently archived on Zenodo (https://doi.org/10.5281/zenodo.15335289). A Python package for implementing IPSS is available on GitHub (https://github.com/omelikechi/ipss) and PyPI (https://pypi.org/project/ipss/). An R implementation of IPSS is also available on GitHub (https://github.com/omelikechi/ipssR).

dc.identifier

8129569

dc.identifier.issn

1367-4803

dc.identifier.issn

1367-4811

dc.identifier.uri

https://hdl.handle.net/10161/33534

dc.language

eng

dc.publisher

Oxford University Press (OUP)

dc.relation.ispartof

Bioinformatics (Oxford, England)

dc.relation.isversionof

10.1093/bioinformatics/btaf299

dc.rights.uri

https://creativecommons.org/licenses/by-nc/4.0

dc.subject

Humans

dc.subject

Neoplasms

dc.subject

MicroRNAs

dc.subject

Sequence Analysis, RNA

dc.subject

Computational Biology

dc.subject

Algorithms

dc.subject

Software

dc.subject

Machine Learning

dc.title

Nonparametric IPSS: fast, flexible feature selection with false discovery control.

dc.type

Journal article

pubs.begin-page

btaf299

pubs.issue

5

pubs.organisational-group

Duke

pubs.organisational-group

Trinity College of Arts & Sciences

pubs.organisational-group

Mathematics

pubs.organisational-group

Statistical Science

pubs.organisational-group

University Institutes and Centers

pubs.organisational-group

Duke Institute for Brain Sciences

pubs.publication-status

Published

pubs.volume

41

Files

Original bundle

Name:
Nonparametric IPSS fast, flexible feature selection with false discovery control.pdf
Size:
2.78 MB
Format:
Adobe Portable Document Format