scdataloader
  • Home

Example notebooks

  • one-off preparation and download of the database (optional)
    • load some known ontology names
    • Directly download a lamin database
    • Or Preprocessing + Download
  • Create the dataloader

documentation

  • dataset
  • preprocess
  • utils
  • datamodule
  • collator
scdataloader
  • Example notebooks
  • one-off preparation and download of the database (optional)

one-off preparation and download of the database (optional)¶

It is recommended to work on a local lamin database in many instances:

  1. you might want to do some dataset level preprocessing (PCA, neighboors, clustering, re-annotation, harmonization of labels, etc..)
  2. to make the loading time much faster when you do large training runs
  3. to work on compute nodes that might not have access to internet
In [1]:
Copied!
# initialize a local lamin database
# !lamin init --storage ~/scdataloader --schema bionty
! lamin load scdataloader
# initialize a local lamin database # !lamin init --storage ~/scdataloader --schema bionty ! lamin load scdataloader
💡 found cached instance metadata: /home/ml4ig1/.lamin/instance--jkobject--scdataloader.env
💡 loaded instance: jkobject/scdataloader
💡 loaded instance: jkobject/scdataloader
In [1]:
Copied!
import bionty as bt
import lamindb as ln

from scdataloader import utils

%load_ext autoreload
%autoreload 2
import bionty as bt import lamindb as ln from scdataloader import utils %load_ext autoreload %autoreload 2
→ connected lamindb: jkobject/scprint

load some known ontology names¶

first if you use a local instance you will need to populate your ontologies. Meaning loading all the elements from ontological references and build the hierarchical tree.

One can just add everything by keeping the default None value for the ontology argument, but this will take a long time.

Instead, one load only the ontologies we need. By using all the used/existing cellxgene ontology names.

In [ ]:
Copied!
utils.populate_my_ontology()
utils.populate_my_ontology()
❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving

Directly download a lamin database¶

In this context one ca either directly download a lamin database (here the cellxgene database as example).

In [5]:
Copied!
list(ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census").all())
list(ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census").all())
Out[5]:
[Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'),
 Collection(uid='dMyEX3NTfKOEYXyMKDAQ', version='2023-07-25', is_latest=False, name='cellxgene-census', hash='pEJ9uvIeTLvHkZW2TBT5', visibility=1, created_by_id=1, transform_id=18, run_id=23, updated_at='2024-01-30 09:06:05 UTC'),
 Collection(uid='dMyEX3NTfKOEYXyMKDD7', version='2024-07-01', is_latest=True, name='cellxgene-census', hash='nI8Ag-HANeOpZOz-8CSn', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC')]
In [7]:
Copied!
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
cx_dataset, len(cx_dataset.artifacts.all())
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one() cx_dataset, len(cx_dataset.artifacts.all())
! no run & transform get linked, consider calling ln.context.track()
Out[7]:
(Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'),
 1113)
In [ ]:
Copied!
# you can use it as is
# you can use it as is
In [ ]:
Copied!
# or load it locally like so
mydataset = utils.load_dataset_local(lb, cx_dataset, "~/scdataloader/", name="cellxgene-local", description="the full cellxgene database", only=(0,2))
# or load it locally like so mydataset = utils.load_dataset_local(lb, cx_dataset, "~/scdataloader/", name="cellxgene-local", description="the full cellxgene database", only=(0,2))

Or Preprocessing + Download¶

In this case we use a custom made function that applies a preprocessing after downloading each files in the database.

In [4]:
Copied!
from scdataloader.preprocess import (
    LaminPreprocessor,
    additional_postprocess,
    additional_preprocess,
)
from scdataloader.preprocess import ( LaminPreprocessor, additional_postprocess, additional_preprocess, )
In [6]:
Copied!
DESCRIPTION='preprocessed by scDataLoader'
DESCRIPTION='preprocessed by scDataLoader'
In [7]:
Copied!
# Here we also add some additional preprocessing (happens at the beginning of the preprocessing function) and post processing (happens at the end of the preprocessing function)
# this serves as an exemple of the flexibility of the function
do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
# Here we also add some additional preprocessing (happens at the beginning of the preprocessing function) and post processing (happens at the end of the preprocessing function) # this serves as an exemple of the flexibility of the function do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
In [16]:
Copied!
preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")
preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")
0
Artifact(uid='Mgilie8RUip2slElQoDx', key='cell-census/2023-12-15/h5ads/77044335-0ac5-4406-9b3f-8cdd3656d67b.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Pons (Pn) - afferent nuclei of cranial nerves in pons - PnAN', version='2023-12-15', size=131533988, hash='QCaNEZY9apeRmazmwGXAWg-16', hash_type='md5-n', n_observations=23349, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1)
AnnData object with n_obs × n_vars = 23349 × 59357
    obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'schema_version', 'title'
    obsm: 'X_UMAP', 'X_tSNE'
Removed 128 genes.
Seeing 6048 outliers (25.90% of total dataset):
/home/ml4ig1/miniconda3/envs/test/lib/python3.10/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
... storing 'dpt_group' as categorical
... storing 'symbol' as categorical
... storing 'ncbi_gene_ids' as categorical
... storing 'biotype' as categorical
... storing 'description' as categorical
... storing 'synonyms' as categorical
... storing 'organism' as categorical
1
Artifact(uid='iAZPSOBKLpaK7lqyYp9O', key='cell-census/2023-12-15/h5ads/2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Amygdaloid complex (AMY) - Corticomedial nuclear group (CMN) - anterior cortical nucleus - CoA', version='2023-12-15', size=169529860, hash='WuShpnxfduKWKn23oYUCTA-21', hash_type='md5-n', n_observations=10778, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1)
... downloading 2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad: 100.0%
AnnData object with n_obs × n_vars = 10778 × 59357
    obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'schema_version', 'title'
    obsm: 'X_UMAP', 'X_tSNE'
Removed 128 genes.
Seeing 1652 outliers (15.33% of total dataset):
/home/ml4ig1/miniconda3/envs/test/lib/python3.10/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(
---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[16], line 1
----> 1 preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")

File ~/Documents code/scDataLoader/scdataloader/preprocess.py:355, in LaminPreprocessor.__call__(self, data, name, description, start_at, version)
    353 print(adata)
    354 try:
--> 355     adata = super().__call__(adata)
    357 except ValueError as v:
    358     if v.args[0].startswith(
    359         "Dataset dropped because contains too many secondary"
    360     ):

File ~/Documents code/scDataLoader/scdataloader/preprocess.py:240, in Preprocessor.__call__(self, adata)
    229     sc.pp.highly_variable_genes(
    230         adata,
    231         layer="clean",
   (...)
    235         subset=False,
    236     )
    237 # based on the topometry paper https://www.biorxiv.org/content/10.1101/2022.03.14.484134v2
    238 # https://rapids-singlecell.readthedocs.io/en/latest/api/generated/rapids_singlecell.pp.pca.html#rapids_singlecell.pp.pca
--> 240 adata.obsm["clean_pca"] = sc.pp.pca(
    241     adata.layers["clean"],
    242     n_comps=300 if adata.shape[0] > 300 else adata.shape[0] - 2,
    243 )
    244 sc.pp.neighbors(adata, use_rep="clean_pca")
    245 sc.tl.leiden(adata, key_added="leiden_3", resolution=3.0)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:200, in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    194 if svd_solver not in {'lobpcg', 'arpack'}:
    195     raise ValueError(
    196         'svd_solver: {svd_solver} can not be used with sparse input.\n'
    197         'Use "arpack" (the default) or "lobpcg" instead.'
    198     )
--> 200 output = _pca_with_sparse(
    201     X, n_comps, solver=svd_solver, random_state=random_state
    202 )
    203 # this is just a wrapper for the results
    204 X_pca = output['X_pca']

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:303, in _pca_with_sparse(X, npcs, solver, mu, random_state)
    292     return XHmat(x) - mhmat(ones(x))
    294 XL = LinearOperator(
    295     matvec=matvec,
    296     dtype=X.dtype,
   (...)
    300     rmatmat=rmatmat,
    301 )
--> 303 u, s, v = svds(XL, solver=solver, k=npcs, v0=random_init)
    304 u, v = svd_flip(u, v)
    305 idx = np.argsort(-s)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/_svds.py:525, in svds(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors, solver, random_state, options)
    523 if v0 is None:
    524     v0 = random_state.standard_normal(size=(min(A.shape),))
--> 525 _, eigvec = eigsh(XH_X, k=k, tol=tol ** 2, maxiter=maxiter,
    526                   ncv=ncv, which=which, v0=v0)
    527 # arpack do not guarantee exactly orthonormal eigenvectors
    528 # for clustered eigenvalues, especially in complex arithmetic
    529 eigvec, _ = np.linalg.qr(eigvec)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1700, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1698 with _ARPACK_LOCK:
   1699     while not params.converged:
-> 1700         params.iterate()
   1702     return params.extract(return_eigenvectors)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:549, in _SymmetricArpackParams.iterate(self)
    546 elif self.ido == 1:
    547     # compute y = Op*x
    548     if self.mode == 1:
--> 549         self.workd[yslice] = self.OP(self.workd[xslice])
    550     elif self.mode == 2:
    551         self.workd[xslice] = self.OPb(self.workd[xslice])

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:236, in LinearOperator.matvec(self, x)
    233 if x.shape != (N,) and x.shape != (N,1):
    234     raise ValueError('dimension mismatch')
--> 236 y = self._matvec(x)
    238 if isinstance(x, np.matrix):
    239     y = asmatrix(y)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:593, in _CustomLinearOperator._matvec(self, x)
    592 def _matvec(self, x):
--> 593     return self.__matvec_impl(x)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/_svds.py:469, in svds.<locals>.matvec_XH_X(x)
    468 def matvec_XH_X(x):
--> 469     return XH_dot(X_dot(x))

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:283, in LinearOperator.rmatvec(self, x)
    280 if x.shape != (M,) and x.shape != (M,1):
    281     raise ValueError('dimension mismatch')
--> 283 y = self._rmatvec(x)
    285 if isinstance(x, np.matrix):
    286     y = asmatrix(y)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:599, in _CustomLinearOperator._rmatvec(self, x)
    597 if func is None:
    598     raise NotImplementedError("rmatvec is not defined")
--> 599 return self.__rmatvec_impl(x)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:289, in _pca_with_sparse.<locals>.rmatvec(x)
    288 def rmatvec(x):
--> 289     return XHdot(x) - mhdot(ones(x))

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:465, in _spbase.dot(self, other)
    463     return self * other
    464 else:
--> 465     return self @ other

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:678, in _spbase.__matmul__(self, other)
    675 if isscalarlike(other):
    676     raise ValueError("Scalar operands are not allowed, "
    677                      "use '*' instead")
--> 678 return self._mul_dispatch(other)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:576, in _spbase._mul_dispatch(self, other)
    573 if other.__class__ is np.ndarray:
    574     # Fast path for the most common case
    575     if other.shape == (N,):
--> 576         return self._mul_vector(other)
    577     elif other.shape == (N, 1):
    578         return self._mul_vector(other.ravel()).reshape(M, 1)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_compressed.py:494, in _cs_matrix._mul_vector(self, other)
    492 # csr_matvec or csc_matvec
    493 fn = getattr(_sparsetools, self.format + '_matvec')
--> 494 fn(M, N, self.indptr, self.indices, self.data, other, result)
    496 return result

KeyboardInterrupt: 
In [17]:
Copied!
#we have processed that many files
len(ln.Artifact.filter(version='2', description=DESCRIPTION))
#we have processed that many files len(ln.Artifact.filter(version='2', description=DESCRIPTION))
Out[17]:
1
In [18]:
Copied!
# we can load a preprocessed anndata like this
adata = ln.Artifact.filter(version='2', description=DESCRIPTION)[0].backed()
adata.obs
# we can load a preprocessed anndata like this adata = ln.Artifact.filter(version='2', description=DESCRIPTION)[0].backed() adata.obs
Out[18]:
roi organism_ontology_term_id disease_ontology_term_id self_reported_ethnicity_ontology_term_id assay_ontology_term_id sex_ontology_term_id development_stage_ontology_term_id donor_id suspension_type dissection ... total_counts_hb log1p_total_counts_hb pct_counts_hb outlier mt_outlier leiden_3 leiden_2 leiden_1 dpt_group heat_diff
73b14343-9071-4da1-884d-52f04c781a44 Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000136 H19.30.001 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 3.0 1.386294 0.023204 True False 13 11 10 NaN
91e3096c-9a19-4827-8050-9e08fe22f9e0 Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000136 H19.30.001 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 0.0 0.000000 0.000000 False False 5 4 3 NaN
7bd7c1f8-5080-4ea4-87ff-4c02bf641059 Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000136 H19.30.001 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 2.0 1.098612 0.016004 True False 19 17 6 NaN
47d8e4b5-daf3-4b3f-bdaa-c877fce88f08 Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000144 H18.30.002 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 3.0 1.386294 0.008974 True True 24 20 16 NaN
5bc7626d-4e45-4bf1-8818-922502760364 Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000144 H18.30.002 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 0.0 0.000000 0.000000 True False 34 27 22 NaN
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
f266c95f-e602-416b-9f3e-c8f081352e79 Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000144 H18.30.002 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 2.0 1.098612 0.032889 False False 12 9 9 9_PATO:0000461_CL:0002453_UBERON:0000988 0.048092
02530a0a-bd8b-40ff-b3ab-18bbbac42eda Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000144 H18.30.002 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 1.0 0.693147 0.011792 False False 12 9 9 9_PATO:0000461_CL:0002453_UBERON:0000988 0.049665
a1840018-61f1-4345-b665-90ce5ad7936d Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000144 H18.30.002 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 0.0 0.000000 0.000000 False False 12 9 9 9_PATO:0000461_CL:0002453_UBERON:0000988 0.049922
02c6c139-4346-443b-8209-9ab51d80595c Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000144 H18.30.002 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 0.0 0.000000 0.000000 False False 12 9 9 9_PATO:0000461_CL:0002453_UBERON:0000988 0.056045
cce8f57e-ab2d-404c-a546-840de717fc73 Human PnAN NCBITaxon:9606 PATO:0000461 HANCESTRO:0005 EFO:0009922 PATO:0000384 HsapDv:0000136 H19.30.001 nucleus Pons (Pn) - afferent nuclei of cranial nerves ... ... 0.0 0.000000 0.000000 False False 12 9 9 9_PATO:0000461_CL:0002453_UBERON:0000988 0.067691

23349 rows × 53 columns

In [20]:
Copied!
# I need to remake the dataset as it failed for some files and I had to restart at position 11 (As you can see in the preprocess() function)
name="preprocessed dataset"
description="preprocessed dataset using scdataloader"
dataset = ln.Collection(ln.Artifact.filter(version='2', description=DESCRIPTION), name=name, description=description)
dataset.save()
dataset.artifacts.count()
# I need to remake the dataset as it failed for some files and I had to restart at position 11 (As you can see in the preprocess() function) name="preprocessed dataset" description="preprocessed dataset using scdataloader" dataset = ln.Collection(ln.Artifact.filter(version='2', description=DESCRIPTION), name=name, description=description) dataset.save() dataset.artifacts.count()
Out[20]:
1
Previous Next

Built with MkDocs using a theme provided by Read the Docs.
« Previous Next »