Xpressor


This work initially stemmed from a related but different question we were working on with Gabriel: “How could we rethink the attention mechanism in transformers to make it more efficient or better?”

Unfortunately, we had to reduce the project’s scope for two reasons:

Time was lacking: Laura wanted us to focus on something more applied to biology and on the next project, scPRINT-2.
We uncovered not just a body of literature but entire communities of researchers already working on this question 😅

However, we learned a lot and came away with many ideas. The main one was the notion of an attention matrix that could recursively attend to itself and to latent tokens / latent attentions. We first tried it as a way to reduce the memory and computational cost of attention. We called it criss-cross attention. It actually worked well and is mentioned in the scPRINT-2 paper!
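To make the cost argument concrete, here is a minimal sketch of a generic latent cross-attention, where a handful of latent tokens read from the input tokens. This is not the criss-cross attention from the scPRINT-2 paper, only an illustration of why attending through latents is cheaper. The sizes, the function names, and the absence of learned projections are all assumptions made for the example.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query attends over all keys.

    Learned Q/K/V projections are left out to keep the sketch short.
    """
    d = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        weights = softmax(scores)
        out.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(len(values[0]))]
        )
    return out


random.seed(0)
n_tokens, n_latents, dim = 200, 8, 16  # hypothetical sizes
tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_tokens)]
latents = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(n_latents)]

# The latents read from the input tokens: n_latents x n_tokens attention scores.
compressed = cross_attention(latents, tokens, tokens)

full_scores = n_tokens * n_tokens      # full self-attention over the tokens
latent_scores = n_latents * n_tokens   # latent cross-attention
print(len(compressed), len(compressed[0]), full_scores, latent_scores)
```

With these made-up sizes, the latents compute 1,600 attention scores where full self-attention would compute 40,000, and the output is a fixed number of vectors regardless of how many input tokens there are.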

But it did not have enough biological applications, and Laura preferred not to publish it at an ML conference. Gabriel felt that we needed more theoretical work to understand the properties of this attention, which was something of a disagreement between my PIs.

In the end, we merged it with another question we had: “How can we make the cell embeddings of single-cell foundation models like scPRINT better?” To make an embedding from a transformer’s input, there were two main ways: either use a [CLS] token (class-pooling) or average all tokens (avg-pooling). Neither was great, and neither allowed reconstruction of the model’s input from the embedding, as an auto-encoder would.
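The two pooling strategies can be sketched in a few lines. The toy values and the helper names `class_pool` and `avg_pool` are made up for illustration, and the sketch assumes the [CLS] token sits at position 0 of the transformer’s output.

```python
def class_pool(token_embeddings):
    """Class-pooling: use the output of the first token, assumed to be [CLS]."""
    return list(token_embeddings[0])


def avg_pool(token_embeddings):
    """Avg-pooling: average the output embeddings of all tokens."""
    n = len(token_embeddings)
    dim = len(token_embeddings[0])
    return [sum(tok[j] for tok in token_embeddings) / n for j in range(dim)]


# Toy transformer outputs: one row per token, [CLS] first (hypothetical values).
outputs = [
    [0.5, 0.5],  # [CLS]
    [1.0, 0.0],  # gene token 1
    [0.0, 1.0],  # gene token 2
    [2.0, 2.0],  # gene token 3
]

cls_embedding = class_pool(outputs)
mean_embedding = avg_pool(outputs[1:])  # average over the gene tokens only
print(cls_embedding, mean_embedding)
```

Either way, the cell ends up as a single vector, and nothing in these two operations is trained to let a decoder recover the input from it.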

This is how we came up with the idea of Xpressor: a model that learns to reconstruct its input from a latent embedding, with an attention mechanism that lets it attend to itself and to latent tokens. We realized that this brought two new possibilities:

1. Cross-scale learning:

Extended beyond single cells, it could be a way to learn across biological scales: compressing genes into input tokens for cells, and compressing cells into input tokens for tissues.

2. Parameter-efficient fine-tuning:

By adding a small transformer (the Xpressor blocks), we could make the model more expressive while fine-tuning only the Xpressor blocks, which is very parameter-efficient. This was used heavily in scPRINT-2 for batch correction and cell type prediction, but it can also be used to fine-tune a model from one scale to another. We showed this in Xpressor by fine-tuning ESM-2 (a protein language model) to generate protein embeddings that were useful for single-cell predictions.
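As a rough back-of-the-envelope illustration of why this is parameter-efficient, the sketch below counts weights in a frozen base transformer and in a small trainable stack added next to it. The layer counts and widths are hypothetical, not the actual scPRINT-2 or Xpressor configurations, and the count assumes a standard transformer block (four attention projections plus a 4x-wide MLP), ignoring biases, layer norms, and embeddings.

```python
def block_params(dim, mlp_ratio=4):
    """Approximate weight count of one standard transformer block."""
    attention = 4 * dim * dim        # Q, K, V and output projections
    mlp = 2 * mlp_ratio * dim * dim  # two linear layers, hidden width = mlp_ratio * dim
    return attention + mlp


def trainable_fraction(base_layers, base_dim, small_layers, small_dim):
    """Share of all weights that are updated when only the small stack is trained."""
    frozen = base_layers * block_params(base_dim)
    trainable = small_layers * block_params(small_dim)
    return trainable / (frozen + trainable)


# Hypothetical sizes: a 12-layer base model and a 2-layer added stack.
fraction = trainable_fraction(base_layers=12, base_dim=512, small_layers=2, small_dim=256)
print(f"{fraction:.1%} of the weights are trainable")
```

Under these assumed sizes, only about 4% of the weights receive gradient updates, while the base model stays frozen.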
