Aggregating residue-level protein language model embeddings with optimal transport.
Date
2025-01
Authors
Navid NaderiAlizadeh, Rohit Singh
Journal Title
Bioinformatics Advances
Abstract
Motivation
Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As protein representation schemes, PLMs generate per-token (i.e., per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information across token-level representations.
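For concreteness, the average-pooling baseline collapses the (n, d) matrix of per-residue embeddings into a single d-dimensional vector via a uniform mean over residues, weighting every residue equally regardless of relevance. A minimal PyTorch sketch (the function name is ours, for illustration only):

import torch

def average_pool(tokens: torch.Tensor) -> torch.Tensor:
    """Baseline aggregation: mean over the residue axis, (n, d) -> (d,).
    Every residue contributes equally, whatever its relevance."""
    return tokens.mean(dim=0)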
Results
We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token PLM outputs as samples from a probability distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling on several downstream prediction tasks, particularly with constrained PLM sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Our aggregation scheme is especially effective for longer protein sequences, capturing essential information that might be lost through average pooling.
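The core construction can be sketched in a few lines. The snippet below is a minimal, illustrative PyTorch version under our own assumptions (the function name swe_pool, a fixed random slice matrix, and a fixed reference set are ours; the released PLM_SWE code learns these components and may differ in detail): project both point clouds onto random 1-D slices, sort to obtain per-slice empirical quantile functions, interpolate the protein's quantiles down to the reference size, and read off the matched displacements as a fixed-length vector.

import torch

def swe_pool(tokens: torch.Tensor,
             reference: torch.Tensor,
             theta: torch.Tensor) -> torch.Tensor:
    """Sliced-Wasserstein embedding (SWE) pooling, minimal sketch.

    tokens:    (n, d) per-residue embeddings; n varies by protein.
    reference: (m, d) fixed reference point set.
    theta:     (d, L) random slicing directions (unit-norm columns).
    Returns a fixed-length embedding of size m * L, independent of n.
    """
    n, m = tokens.shape[0], reference.shape[0]
    # Project both point clouds onto each 1-D slice.
    tok_proj = tokens @ theta          # (n, L)
    ref_proj = reference @ theta       # (m, L)
    # Sorting along each slice yields empirical quantile functions.
    tok_sorted, _ = torch.sort(tok_proj, dim=0)   # (n, L)
    ref_sorted, _ = torch.sort(ref_proj, dim=0)   # (m, L)
    # Resample the token quantile function at m evenly spaced levels,
    # so proteins of different lengths become comparable to the reference.
    q = torch.linspace(0.0, 1.0, m, device=tokens.device) * (n - 1)
    lo = q.floor().long()
    hi = q.ceil().long()
    frac = (q - lo.to(q.dtype)).unsqueeze(1)      # (m, 1)
    tok_interp = (1 - frac) * tok_sorted[lo] + frac * tok_sorted[hi]  # (m, L)
    # Per-slice displacements between matched quantiles form a Euclidean
    # embedding whose norm relates to the sliced-Wasserstein distance.
    return (tok_interp - ref_sorted).reshape(-1)  # (m * L,)

Proteins of different lengths then map to embeddings of identical size, which is the property average pooling also has but obtained here through an optimal-transport matching rather than a uniform mean:

# Illustrative dimensions; not taken from the paper.
d, m, L = 320, 64, 128
theta = torch.nn.functional.normalize(torch.randn(d, L), dim=0)
reference = torch.randn(m, d)
e_short = swe_pool(torch.randn(95, d), reference, theta)
e_long = swe_pool(torch.randn(740, d), reference, theta)
assert e_short.shape == e_long.shape == torch.Size([m * L])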
Availability and implementation
Our implementation code can be found at https://github.com/navid-naderi/PLM_SWE.
Publication Info
NaderiAlizadeh, Navid, and Rohit Singh (2025). Aggregating residue-level protein language model embeddings with optimal transport. Bioinformatics Advances, 5(1), vbaf060. doi:10.1093/bioadv/vbaf060. Retrieved from https://hdl.handle.net/10161/32361.
Scholars@Duke

Navid NaderiAlizadeh
Navid NaderiAlizadeh is an Assistant Research Professor in the Department of Biostatistics and Bioinformatics at Duke University. Prior to that, he was a Postdoctoral Researcher in the Department of Electrical and Systems Engineering at the University of Pennsylvania. Navid's current research interests span the foundations of machine learning, artificial intelligence, and signal processing and their applications in developing novel methods for analyzing biological data. Navid received the B.S. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2011, the M.S. degree in electrical and computer engineering from Cornell University, Ithaca, NY, USA, in 2014, and the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, CA, USA, in 2016. Upon graduating with his Ph.D., he spent four years as a Research Scientist at Intel Labs and HRL Laboratories.

Rohit Singh
Rohit Singh is an Assistant Professor in the Departments of Biostatistics & Bioinformatics and Cell Biology at Duke University. His research interests are broadly in computational biology, with a focus on using machine learning to make drug discovery more efficient. Currently, he is exploring how single-cell genomics and large language models can help decode disease mechanisms and aid in identifying new targets and drugs. He is the recipient of the Test of Time Award at RECOMB, MIT's George M. Sprowls Award for his Ph.D. thesis in Computer Science, and Stanford's Christopher Stephenson Memorial Award for Masters Research in the same field. In addition to academia, he has industry experience.
Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.