IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences.

Date

2025-07

Abstract

Transformer-based models have achieved state-of-the-art performance in document classification but struggle with long-text processing due to the quadratic computational complexity in the self-attention module. Existing solutions, such as sparse attention, hierarchical models, and key sentence extraction, partially address the issue but still fall short when the input sequence is exceptionally lengthy. To address this challenge, we propose IRIS (Interpretable Retrieval-Augmented Classification for long Interspersed Document Sequences), a novel, lightweight framework that utilizes retrieval to efficiently classify long documents while enhancing interpretability. IRIS segments documents into chunks, stores their embeddings in a vector database, and retrieves those most relevant to a given task using learnable query vectors. A linear attention mechanism then aggregates the retrieved embeddings for classification, allowing the model to process arbitrarily long documents without increasing computational cost and remaining trainable on a single GPU. Our experiments across six datasets show that IRIS achieves comparable performance to baseline models on standard benchmarks, and excels in three clinical note disease risk prediction tasks where documents are extremely long and key information is sparse. Furthermore, IRIS provides global interpretability by revealing a clear summary of key risk factors identified by the model. These findings highlight the potential of IRIS as an efficient and interpretable solution for long-document classification, particularly in healthcare applications where both performance and explainability are crucial.
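The pipeline described in the abstract (chunk, embed, retrieve with query vectors, aggregate, classify) can be illustrated with a minimal forward-pass sketch. This is not the authors' implementation. Every function name here is hypothetical, a brute-force inner-product search stands in for the vector database, and softmax-weighted pooling stands in for the linear attention mechanism, whose exact form the abstract does not specify. In a trained model the query vectors and classifier weights would be learned; here they are plain inputs.

```python
import math

def chunk_tokens(tokens, chunk_size):
    """Split a token list into consecutive fixed-size chunks (the last may be shorter)."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]

def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def retrieve_top_k(chunk_embs, query, k):
    """Brute-force stand-in for a vector-database lookup (assumption).

    Returns up to k (score, chunk_index) pairs, best score first.
    """
    scored = sorted(((dot(query, e), i) for i, e in enumerate(chunk_embs)), reverse=True)
    return scored[:k]

def pool(chunk_embs, hits):
    """Softmax-weighted average of the retrieved embeddings (assumed aggregation)."""
    top = max(score for score, _ in hits)
    weights = [math.exp(score - top) for score, _ in hits]  # shift for numerical stability
    total = sum(weights)
    dim = len(chunk_embs[0])
    return [
        sum(w * chunk_embs[i][d] for w, (_, i) in zip(weights, hits)) / total
        for d in range(dim)
    ]

def classify(chunk_embs, queries, k, W, b):
    """Pool k retrieved chunks per query, concatenate, and apply a linear layer.

    W holds one weight row per class, each of length len(queries) * embedding_dim.
    """
    features = []
    for q in queries:
        features.extend(pool(chunk_embs, retrieve_top_k(chunk_embs, q, k)))
    return [dot(row, features) + bias for row, bias in zip(W, b)]
```

In this sketch the aggregation step always sees `len(queries) * k` embeddings regardless of how many chunks the document produced, which is one way to read the abstract's claim that document length does not increase the cost of classification. The retrieval scores also indicate which chunks drove a prediction, which hints at how retrieved chunks could support interpretability.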

Citation

Published Version (Please cite this version)

10.18653/v1/2025.acl-long.1461

Publication Info

Li, Fengnan, Elliot D Hill, Shu Jiang, Jiaxin Gao and Matthew M Engelhard (2025). IRIS: Interpretable Retrieval-Augmented Classification for Long Interspersed Document Sequences. Proceedings of the conference. Association for Computational Linguistics. Meeting, 2025. pp. 30263–30283. 10.18653/v1/2025.acl-long.1461 Retrieved from https://hdl.handle.net/10161/33255.

This is constructed from limited available data and may be imprecise. To cite this article, please review & use the official citation provided by the journal.


Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.