scDataLoader


In my new manuscript, scPRINT, I present a tool called scDataLoader 🧬. You can find more details in our preprint. The package is available on GitHub and documentation is available here.

Origins

Last year, while starting my Ph.D. and after asking many questions on the cellxgene Slack channel, I was contacted by Alex, who wanted me to try their new tool: lamindb 👀

LaminDB had many compelling features:

Management of notebooks, metadata, and runs
Ontologies of cell types, markers, genes, tissues
Support for various datasets like anndata, spatial omics, imaging data and more

We discussed my need for a dataloader, which aligned with their thinking and previous work with Ko on anndata dataloaders 🔥

Why Dataloaders Matter

Recent papers (“Foundation Models Meet Imbalanced Single-Cell Data When Learning Cell Type Annotations”, “Reusability report: Learning the transcriptional grammar in single-cell RNA-sequencing data using transformers”) have shown that scRNAseq training datasets dominated by many similar cell types are imbalanced, which can hurt foundation model training and lead to low performance on rare cell types.

For example, if you train an image model on a dataset where 50% of the images are cars, it might overfit to cars. We wanted to keep scPRINT from overfitting to the most common cell types in the same way 😱

Implementation

We implemented our dataloader with two key capabilities:

Ability to work with any number of datasets and cells
Support for weighted random sampling

This implementation builds on lamin’s mapped dataset functionality. scDataLoader serves as a wrapper around the mapped dataset tool, providing integration with cell labels, PyTorch and Lightning.
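To make the weighted random sampling idea concrete, here is a minimal sketch in plain Python. It is not scDataLoader's actual code: the function names, the toy labels, and the inverse-frequency weighting scheme are all assumptions chosen for illustration. Each cell gets a weight of one over the size of its label group, so rare labels are drawn about as often as common ones.

```python
import random
from collections import Counter


def inverse_frequency_weights(labels):
    """Give each cell a weight of 1 / (number of cells sharing its label).

    Hypothetical helper: an illustrative weighting scheme, not scDataLoader's.
    """
    counts = Counter(labels)
    return [1.0 / counts[label] for label in labels]


def sample_indices(labels, num_samples, seed=0):
    """Draw cell indices with replacement, proportionally to their weights."""
    rng = random.Random(seed)
    weights = inverse_frequency_weights(labels)
    return rng.choices(range(len(labels)), weights=weights, k=num_samples)


# Toy, made-up labels: 90 cells of a common type and 10 of a rare type.
labels = ["common cell"] * 90 + ["rare cell"] * 10
weights = inverse_frequency_weights(labels)

# Sample many cells and count how often each label is drawn.
drawn = sample_indices(labels, num_samples=10_000)
drawn_counts = Counter(labels[i] for i in drawn)
print(drawn_counts)
```

With these weights, each label group carries the same total weight, so the two labels should each make up roughly half of the draws even though the rare one is only 10% of the data. In a PyTorch pipeline, `torch.utils.data.WeightedRandomSampler` plays a similar role by accepting per-sample weights like these.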

For more information about each component, visit lamin.ai. I highly recommend checking out lamin’s work if you work in computational biology! 💪

Recent Developments

We recently published a review of dataloaders and mapped collections with interesting findings: Arrayloader Benchmarks

Looking ahead, we’re excited about developing an interface to work with remote datasets 🚀
