Aggregating residue-level protein language model embeddings with optimal transport.


Date

2025-01


Abstract

Motivation

Protein language models (PLMs) have emerged as powerful approaches for mapping protein sequences into embeddings suitable for various applications. As protein representation schemes, PLMs generate per-token (i.e. per-residue) representations, resulting in variable-sized outputs based on protein length. This variability poses a challenge for protein-level prediction tasks that require uniform-sized embeddings for consistent analysis across different proteins. Previous work has typically used average pooling to summarize token-level PLM outputs, but it is unclear whether this method effectively prioritizes the relevant information across token-level representations.
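To make the variable-size issue concrete, the following is a minimal NumPy sketch (with hypothetical array shapes, not tied to any particular PLM): two proteins of different lengths yield per-residue embedding matrices of different sizes, and average pooling collapses both to the same fixed dimension while weighting every residue equally.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 6  # hypothetical embedding dimension; real PLMs use hundreds to thousands

# Per-residue embeddings for two proteins of different lengths: (n_residues, d)
prot_a = rng.normal(size=(100, d))   # 100-residue protein
prot_b = rng.normal(size=(350, d))   # 350-residue protein

# Average pooling produces a fixed d-dimensional vector regardless of length,
# but gives every residue equal weight irrespective of its relevance.
pooled_a = prot_a.mean(axis=0)
pooled_b = prot_b.mean(axis=0)
assert pooled_a.shape == pooled_b.shape == (d,)
```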

Results

We introduce a novel method utilizing optimal transport to convert variable-length PLM outputs into fixed-length representations. We conceptualize per-token PLM outputs as samples from a probabilistic distribution and employ sliced-Wasserstein distances to map these samples against a reference set, creating a Euclidean embedding in the output space. The resulting embedding is agnostic to the length of the input and represents the entire protein. We demonstrate the superiority of our method over average pooling for several downstream prediction tasks, particularly with constrained PLM sizes, enabling smaller-scale PLMs to match or exceed the performance of average-pooled larger-scale PLMs. Our aggregation scheme is especially effective for longer protein sequences by capturing essential information that might be lost through average pooling.
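The aggregation described above can be sketched as follows. This is a minimal NumPy illustration of a sliced-Wasserstein embedding against a reference set, written from our reading of the abstract: the reference size, slice count, and quantile interpolation are assumptions, not the authors' implementation (see the linked repository for that). Each protein's token embeddings are projected onto random slicing directions, sorted, and compared quantile-by-quantile to the sorted reference projections, producing a vector whose size depends only on the reference and slice counts, never on protein length.

```python
import numpy as np

def swe_embed(tokens, reference, slices):
    """Sliced-Wasserstein embedding of a variable-length token set.

    tokens:    (n, d) per-residue embeddings (n varies per protein)
    reference: (m, d) fixed reference point set
    slices:    (d, L) random unit slicing directions
    Returns a fixed-size (m * L,) vector regardless of n.
    """
    n, L = tokens.shape[0], slices.shape[1]
    m = reference.shape[0]
    # Project both point sets onto each slice and sort along the slice:
    # in 1D, sorting gives the optimal transport (Monge) coupling.
    proj_x = np.sort(tokens @ slices, axis=0)     # (n, L)
    proj_r = np.sort(reference @ slices, axis=0)  # (m, L)
    # Interpolate the sorted token projections to m quantile levels so the
    # per-slice comparison to the reference is well-defined for any n.
    q_ref = (np.arange(m) + 0.5) / m
    q_x = (np.arange(n) + 0.5) / n
    emb = np.empty((m, L))
    for l in range(L):
        emb[:, l] = np.interp(q_ref, q_x, proj_x[:, l]) - proj_r[:, l]
    return emb.ravel()

rng = np.random.default_rng(0)
d, m, L = 8, 4, 16  # illustrative sizes, chosen arbitrarily
reference = rng.normal(size=(m, d))
slices = rng.normal(size=(d, L))
slices /= np.linalg.norm(slices, axis=0, keepdims=True)

# Two "proteins" of different lengths map to same-size Euclidean vectors.
e1 = swe_embed(rng.normal(size=(50, d)), reference, slices)
e2 = swe_embed(rng.normal(size=(120, d)), reference, slices)
assert e1.shape == e2.shape == (m * L,)
```

Because the embedding lives in a fixed-dimensional Euclidean space, it can be fed directly to standard downstream predictors, just like an average-pooled vector.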

Availability and implementation

Our implementation code can be found at https://github.com/navid-naderi/PLM_SWE.

Citation

Published Version (Please cite this version)

10.1093/bioadv/vbaf060

Publication Info

NaderiAlizadeh, Navid, and Rohit Singh (2025). Aggregating residue-level protein language model embeddings with optimal transport. Bioinformatics Advances, 5(1), vbaf060. doi:10.1093/bioadv/vbaf060. Retrieved from https://hdl.handle.net/10161/32361.


Scholars@Duke

Navid NaderiAlizadeh

Assistant Research Professor of Biostatistics & Bioinformatics

Navid NaderiAlizadeh is an Assistant Research Professor in the Department of Biostatistics and Bioinformatics at Duke University. Prior to that, he was a Postdoctoral Researcher in the Department of Electrical and Systems Engineering at the University of Pennsylvania. Navid's current research interests span the foundations of machine learning, artificial intelligence, and signal processing, and their applications in developing novel methods for analyzing biological data. Navid received the B.S. degree in electrical engineering from Sharif University of Technology, Tehran, Iran, in 2011, the M.S. degree in electrical and computer engineering from Cornell University, Ithaca, NY, USA, in 2014, and the Ph.D. degree in electrical engineering from the University of Southern California, Los Angeles, CA, USA, in 2016. Upon graduating with his Ph.D., he spent four years as a Research Scientist at Intel Labs and HRL Laboratories.

Rohit Singh

Assistant Professor of Biostatistics & Bioinformatics

Rohit Singh is an Assistant Professor in the Departments of Biostatistics & Bioinformatics and Cell Biology at Duke University. His research interests are broadly in computational biology, with a focus on using machine learning to make drug discovery more efficient. Currently, he is exploring how single-cell genomics and large language models can help decode disease mechanisms and aid in identifying new targets and drugs. He is the recipient of the Test of Time Award at RECOMB, MIT's George M. Sprowls Award for his Ph.D. thesis in Computer Science, and Stanford's Christopher Stephenson Memorial Award for Masters Research in the same field. In addition to academia, he has experience in industry.


Unless otherwise indicated, scholarly articles published by Duke faculty members are made available here with a CC-BY-NC (Creative Commons Attribution Non-Commercial) license, as enabled by the Duke Open Access Policy. If you wish to use the materials in ways not already permitted under CC-BY-NC, please consult the copyright owner. Other materials are made available here through the author’s grant of a non-exclusive license to make their work openly accessible.