IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences.

dc.contributor.author

Li, Fengnan

dc.contributor.author

Hill, Elliot D

dc.contributor.author

Jiang, Shu

dc.contributor.author

Gao, Jiaxin

dc.contributor.author

Engelhard, Matthew M

dc.date.accessioned

2025-09-30T16:30:56Z

dc.date.available

2025-09-30T16:30:56Z

dc.date.issued

2025-07

dc.description.abstract

Transformer-based models have achieved state-of-the-art performance in document classification but struggle with long-text processing due to the quadratic computational complexity in the self-attention module. Existing solutions, such as sparse attention, hierarchical models, and key sentence extraction, partially address the issue but still fall short when the input sequence is exceptionally lengthy. To address this challenge, we propose IRIS (Interpretable Retrieval-Augmented Classification for long Interspersed Document Sequences), a novel, lightweight framework that utilizes retrieval to efficiently classify long documents while enhancing interpretability. IRIS segments documents into chunks, stores their embeddings in a vector database, and retrieves those most relevant to a given task using learnable query vectors. A linear attention mechanism then aggregates the retrieved embeddings for classification, allowing the model to process arbitrarily long documents without increasing computational cost and remaining trainable on a single GPU. Our experiments across six datasets show that IRIS achieves comparable performance to baseline models on standard benchmarks, and excels in three clinical note disease risk prediction tasks where documents are extremely long and key information is sparse. Furthermore, IRIS provides global interpretability by revealing a clear summary of key risk factors identified by the model. These findings highlight the potential of IRIS as an efficient and interpretable solution for long-document classification, particularly in healthcare applications where both performance and explainability are crucial.
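
The pipeline summarized in the abstract (segment into chunks, store chunk embeddings, retrieve with query vectors, aggregate, classify) can be illustrated with a minimal sketch. This is not the authors' implementation. Every name below (`chunk`, `toy_embed`, `retrieve`, `aggregate`, `classify`) is hypothetical, the embedding is a toy bag-of-words hash rather than a real encoder, a plain Python list stands in for the vector database, the query vectors and classifier weights are random rather than learned, and a softmax-weighted average stands in for the linear attention mechanism, whose exact form the abstract does not specify.

```python
import math
import random


def chunk(tokens, size):
    """Split a token list into consecutive chunks of at most `size` tokens."""
    return [tokens[i:i + size] for i in range(0, len(tokens), size)]


def toy_embed(chunk_tokens, dim=8):
    """Hypothetical stand-in for a real encoder: a normalized bag-of-words hash vector."""
    vec = [0.0] * dim
    for tok in chunk_tokens:
        vec[sum(ord(c) for c in tok) % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve(query, store, k):
    """Return indices of the k stored embeddings with the highest dot product to `query`."""
    ranked = sorted(range(len(store)), key=lambda i: dot(query, store[i]), reverse=True)
    return ranked[:k]


def aggregate(query, embeddings):
    """Softmax-weighted average of the retrieved embeddings (assumed pooling step)."""
    scores = [dot(query, e) for e in embeddings]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(embeddings[0])
    return [sum(w * e[d] for w, e in zip(weights, embeddings)) / total for d in range(dim)]


def classify(tokens, queries, clf_weights, chunk_size=4, k=2):
    """Return (probability, retrieved chunk indices per query) for one document."""
    store = [toy_embed(c) for c in chunk(tokens, chunk_size)]  # list as a toy vector store
    pooled, retrieved = [], []
    for q in queries:
        idx = retrieve(q, store, k)
        retrieved.append(idx)
        pooled.extend(aggregate(q, [store[i] for i in idx]))
    logit = dot(clf_weights, pooled)
    return 1.0 / (1.0 + math.exp(-logit)), retrieved


if __name__ == "__main__":
    rng = random.Random(0)
    dim, n_queries = 8, 3
    # Random placeholders; in a trained model these would be learned parameters.
    queries = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_queries)]
    clf_weights = [rng.gauss(0, 1) for _ in range(dim * n_queries)]
    note = "patient reports chest pain and shortness of breath after mild exertion".split()
    prob, retrieved = classify(note, queries, clf_weights)
    print(prob, retrieved)
```

In this sketch the classifier input has a fixed size (number of queries times embedding dimension) regardless of how many chunks the document has, which is one way to read the abstract's claim that document length does not increase the cost of the classification step. The returned chunk indices show which chunks each query selected, a toy analogue of the interpretability the abstract describes.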

dc.identifier.issn

0736-587X

dc.identifier.uri

https://hdl.handle.net/10161/33255

dc.language

eng

dc.publisher

Association for Computational Linguistics

dc.relation.ispartof

Proceedings of the conference. Association for Computational Linguistics. Meeting

dc.relation.isversionof

10.18653/v1/2025.acl-long.1461

dc.rights.uri

https://creativecommons.org/licenses/by-nc/4.0

dc.title

IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences.

dc.type

Journal article

pubs.begin-page

30263

pubs.end-page

30283

pubs.organisational-group

Duke

pubs.organisational-group

Student

pubs.publication-status

Published

pubs.volume

2025

Files

Original bundle

Name:
IRIS Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences.pdf
Size:
548.93 KB
Format:
Adobe Portable Document Format
Description:
Accepted version