one-off preparation and download of the database (optional)¶

It is recommended to work on a local lamin database in many instances:

you might want to do some dataset level preprocessing (PCA, neighboors, clustering, re-annotation, harmonization of labels, etc..)
to make the loading time much faster when you do large training runs
to work on compute nodes that might not have access to internet

In [1]:

Copied!

# initialize a local lamin database
# !lamin init --storage ~/scdataloader --schema bionty
! lamin load scdataloader
# initialize a local lamin database
# !lamin init --storage ~/scdataloader --schema bionty
! lamin load scdataloader

💡 found cached instance metadata: /home/ml4ig1/.lamin/instance--jkobject--scdataloader.env
💡 loaded instance: jkobject/scdataloader
💡 loaded instance: jkobject/scdataloader

In [1]:

Copied!

import bionty as bt
import lamindb as ln

from scdataloader import utils

%load_ext autoreload
%autoreload 2
import bionty as bt
import lamindb as ln

from scdataloader import utils

%load_ext autoreload
%autoreload 2

→ connected lamindb: jkobject/scprint

load some known ontology names¶

first if you use a local instance you will need to populate your ontologies. Meaning loading all the elements from ontological references and build the hierarchical tree.

One can just add everything by keeping the default None value for the ontology argument, but this will take a long time.

Instead, one load only the ontologies we need. By using all the used/existing cellxgene ontology names.

In [ ]:

Copied!

utils.populate_my_ontology()
utils.populate_my_ontology()

❗ now recursing through parents: this only happens once, but is much slower than bulk saving
❗ now recursing through parents: this only happens once, but is much slower than bulk saving

Directly download a lamin database¶

In this context one ca either directly download a lamin database (here the cellxgene database as example).

In [5]:

Copied!

list(ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census").all())
list(ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census").all())

Out[5]:

[Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'),
 Collection(uid='dMyEX3NTfKOEYXyMKDAQ', version='2023-07-25', is_latest=False, name='cellxgene-census', hash='pEJ9uvIeTLvHkZW2TBT5', visibility=1, created_by_id=1, transform_id=18, run_id=23, updated_at='2024-01-30 09:06:05 UTC'),
 Collection(uid='dMyEX3NTfKOEYXyMKDD7', version='2024-07-01', is_latest=True, name='cellxgene-census', hash='nI8Ag-HANeOpZOz-8CSn', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC')]

In [7]:

Copied!

cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
cx_dataset, len(cx_dataset.artifacts.all())
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
cx_dataset, len(cx_dataset.artifacts.all())

! no run & transform get linked, consider calling ln.context.track()

Out[7]:

(Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'),
 1113)

In [ ]:

Copied!

# you can use it as is
# you can use it as is

In [ ]:

Copied!

# or load it locally like so
mydataset = utils.load_dataset_local(lb, cx_dataset, "~/scdataloader/", name="cellxgene-local", description="the full cellxgene database", only=(0,2))
# or load it locally like so
mydataset = utils.load_dataset_local(lb, cx_dataset, "~/scdataloader/", name="cellxgene-local", description="the full cellxgene database", only=(0,2))

Or Preprocessing + Download¶

In this case we use a custom made function that applies a preprocessing after downloading each files in the database.

In [4]:

Copied!





from scdataloader.preprocess import (
    LaminPreprocessor,
    additional_postprocess,
    additional_preprocess,
)
from scdataloader.preprocess import (
    LaminPreprocessor,
    additional_postprocess,
    additional_preprocess,
)

In [6]:

Copied!

DESCRIPTION='preprocessed by scDataLoader'
DESCRIPTION='preprocessed by scDataLoader'

In [7]:

Copied!

# Here we also add some additional preprocessing (happens at the beginning of the preprocessing function) and post processing (happens at the end of the preprocessing function)
# this serves as an exemple of the flexibility of the function
do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
# Here we also add some additional preprocessing (happens at the beginning of the preprocessing function) and post processing (happens at the end of the preprocessing function)
# this serves as an exemple of the flexibility of the function
do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)

In [16]:

Copied!

preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")
preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")

0
Artifact(uid='Mgilie8RUip2slElQoDx', key='cell-census/2023-12-15/h5ads/77044335-0ac5-4406-9b3f-8cdd3656d67b.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Pons (Pn) - afferent nuclei of cranial nerves in pons - PnAN', version='2023-12-15', size=131533988, hash='QCaNEZY9apeRmazmwGXAWg-16', hash_type='md5-n', n_observations=23349, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1)
AnnData object with n_obs × n_vars = 23349 × 59357
    obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'schema_version', 'title'
    obsm: 'X_UMAP', 'X_tSNE'
Removed 128 genes.
Seeing 6048 outliers (25.90% of total dataset):

/home/ml4ig1/miniconda3/envs/test/lib/python3.10/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(

❗ no run & transform get linked, consider passing a `run` or calling ln.track()

... storing 'dpt_group' as categorical
... storing 'symbol' as categorical
... storing 'ncbi_gene_ids' as categorical
... storing 'biotype' as categorical
... storing 'description' as categorical
... storing 'synonyms' as categorical
... storing 'organism' as categorical

1
Artifact(uid='iAZPSOBKLpaK7lqyYp9O', key='cell-census/2023-12-15/h5ads/2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Amygdaloid complex (AMY) - Corticomedial nuclear group (CMN) - anterior cortical nucleus - CoA', version='2023-12-15', size=169529860, hash='WuShpnxfduKWKn23oYUCTA-21', hash_type='md5-n', n_observations=10778, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1)
... downloading 2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad: 100.0%
AnnData object with n_obs × n_vars = 10778 × 59357
    obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage'
    var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype'
    uns: 'batch_condition', 'schema_version', 'title'
    obsm: 'X_UMAP', 'X_tSNE'
Removed 128 genes.
Seeing 1652 outliers (15.33% of total dataset):

/home/ml4ig1/miniconda3/envs/test/lib/python3.10/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024.
  warnings.warn(

---------------------------------------------------------------------------
KeyboardInterrupt                         Traceback (most recent call last)
Cell In[16], line 1
----> 1 preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")

File ~/Documents code/scDataLoader/scdataloader/preprocess.py:355, in LaminPreprocessor.__call__(self, data, name, description, start_at, version)
    353 print(adata)
    354 try:
--> 355     adata = super().__call__(adata)
    357 except ValueError as v:
    358     if v.args[0].startswith(
    359         "Dataset dropped because contains too many secondary"
    360     ):

File ~/Documents code/scDataLoader/scdataloader/preprocess.py:240, in Preprocessor.__call__(self, adata)
    229     sc.pp.highly_variable_genes(
    230         adata,
    231         layer="clean",
   (...)
    235         subset=False,
    236     )
    237 # based on the topometry paper https://www.biorxiv.org/content/10.1101/2022.03.14.484134v2
    238 # https://rapids-singlecell.readthedocs.io/en/latest/api/generated/rapids_singlecell.pp.pca.html#rapids_singlecell.pp.pca
--> 240 adata.obsm["clean_pca"] = sc.pp.pca(
    241     adata.layers["clean"],
    242     n_comps=300 if adata.shape[0] > 300 else adata.shape[0] - 2,
    243 )
    244 sc.pp.neighbors(adata, use_rep="clean_pca")
    245 sc.tl.leiden(adata, key_added="leiden_3", resolution=3.0)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:200, in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size)
    194 if svd_solver not in {'lobpcg', 'arpack'}:
    195     raise ValueError(
    196         'svd_solver: {svd_solver} can not be used with sparse input.\n'
    197         'Use "arpack" (the default) or "lobpcg" instead.'
    198     )
--> 200 output = _pca_with_sparse(
    201     X, n_comps, solver=svd_solver, random_state=random_state
    202 )
    203 # this is just a wrapper for the results
    204 X_pca = output['X_pca']

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:303, in _pca_with_sparse(X, npcs, solver, mu, random_state)
    292     return XHmat(x) - mhmat(ones(x))
    294 XL = LinearOperator(
    295     matvec=matvec,
    296     dtype=X.dtype,
   (...)
    300     rmatmat=rmatmat,
    301 )
--> 303 u, s, v = svds(XL, solver=solver, k=npcs, v0=random_init)
    304 u, v = svd_flip(u, v)
    305 idx = np.argsort(-s)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/_svds.py:525, in svds(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors, solver, random_state, options)
    523 if v0 is None:
    524     v0 = random_state.standard_normal(size=(min(A.shape),))
--> 525 _, eigvec = eigsh(XH_X, k=k, tol=tol ** 2, maxiter=maxiter,
    526                   ncv=ncv, which=which, v0=v0)
    527 # arpack do not guarantee exactly orthonormal eigenvectors
    528 # for clustered eigenvalues, especially in complex arithmetic
    529 eigvec, _ = np.linalg.qr(eigvec)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1700, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode)
   1698 with _ARPACK_LOCK:
   1699     while not params.converged:
-> 1700         params.iterate()
   1702     return params.extract(return_eigenvectors)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:549, in _SymmetricArpackParams.iterate(self)
    546 elif self.ido == 1:
    547     # compute y = Op*x
    548     if self.mode == 1:
--> 549         self.workd[yslice] = self.OP(self.workd[xslice])
    550     elif self.mode == 2:
    551         self.workd[xslice] = self.OPb(self.workd[xslice])

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:236, in LinearOperator.matvec(self, x)
    233 if x.shape != (N,) and x.shape != (N,1):
    234     raise ValueError('dimension mismatch')
--> 236 y = self._matvec(x)
    238 if isinstance(x, np.matrix):
    239     y = asmatrix(y)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:593, in _CustomLinearOperator._matvec(self, x)
    592 def _matvec(self, x):
--> 593     return self.__matvec_impl(x)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/_svds.py:469, in svds.<locals>.matvec_XH_X(x)
    468 def matvec_XH_X(x):
--> 469     return XH_dot(X_dot(x))

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:283, in LinearOperator.rmatvec(self, x)
    280 if x.shape != (M,) and x.shape != (M,1):
    281     raise ValueError('dimension mismatch')
--> 283 y = self._rmatvec(x)
    285 if isinstance(x, np.matrix):
    286     y = asmatrix(y)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:599, in _CustomLinearOperator._rmatvec(self, x)
    597 if func is None:
    598     raise NotImplementedError("rmatvec is not defined")
--> 599 return self.__rmatvec_impl(x)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:289, in _pca_with_sparse.<locals>.rmatvec(x)
    288 def rmatvec(x):
--> 289     return XHdot(x) - mhdot(ones(x))

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:465, in _spbase.dot(self, other)
    463     return self * other
    464 else:
--> 465     return self @ other

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:678, in _spbase.__matmul__(self, other)
    675 if isscalarlike(other):
    676     raise ValueError("Scalar operands are not allowed, "
    677                      "use '*' instead")
--> 678 return self._mul_dispatch(other)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:576, in _spbase._mul_dispatch(self, other)
    573 if other.__class__ is np.ndarray:
    574     # Fast path for the most common case
    575     if other.shape == (N,):
--> 576         return self._mul_vector(other)
    577     elif other.shape == (N, 1):
    578         return self._mul_vector(other.ravel()).reshape(M, 1)

File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_compressed.py:494, in _cs_matrix._mul_vector(self, other)
    492 # csr_matvec or csc_matvec
    493 fn = getattr(_sparsetools, self.format + '_matvec')
--> 494 fn(M, N, self.indptr, self.indices, self.data, other, result)
    496 return result

KeyboardInterrupt:

In [17]:

Copied!

#we have processed that many files
len(ln.Artifact.filter(version='2', description=DESCRIPTION))
#we have processed that many files
len(ln.Artifact.filter(version='2', description=DESCRIPTION))

Out[17]:

In [18]:

Copied!

# we can load a preprocessed anndata like this
adata = ln.Artifact.filter(version='2', description=DESCRIPTION)[0].backed()
adata.obs
# we can load a preprocessed anndata like this
adata = ln.Artifact.filter(version='2', description=DESCRIPTION)[0].backed()
adata.obs

Out[18]:

	roi	organism_ontology_term_id	disease_ontology_term_id	self_reported_ethnicity_ontology_term_id	assay_ontology_term_id	sex_ontology_term_id	development_stage_ontology_term_id	donor_id	suspension_type	dissection	...	total_counts_hb	log1p_total_counts_hb	pct_counts_hb	outlier	mt_outlier	leiden_3	leiden_2	leiden_1	dpt_group	heat_diff
73b14343-9071-4da1-884d-52f04c781a44	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000136	H19.30.001	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	3.0	1.386294	0.023204	True	False	13	11	10		NaN
91e3096c-9a19-4827-8050-9e08fe22f9e0	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000136	H19.30.001	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	0.0	0.000000	0.000000	False	False	5	4	3		NaN
7bd7c1f8-5080-4ea4-87ff-4c02bf641059	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000136	H19.30.001	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	2.0	1.098612	0.016004	True	False	19	17	6		NaN
47d8e4b5-daf3-4b3f-bdaa-c877fce88f08	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000144	H18.30.002	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	3.0	1.386294	0.008974	True	True	24	20	16		NaN
5bc7626d-4e45-4bf1-8818-922502760364	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000144	H18.30.002	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	0.0	0.000000	0.000000	True	False	34	27	22		NaN
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
f266c95f-e602-416b-9f3e-c8f081352e79	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000144	H18.30.002	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	2.0	1.098612	0.032889	False	False	12	9	9	9_PATO:0000461_CL:0002453_UBERON:0000988	0.048092
02530a0a-bd8b-40ff-b3ab-18bbbac42eda	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000144	H18.30.002	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	1.0	0.693147	0.011792	False	False	12	9	9	9_PATO:0000461_CL:0002453_UBERON:0000988	0.049665
a1840018-61f1-4345-b665-90ce5ad7936d	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000144	H18.30.002	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	0.0	0.000000	0.000000	False	False	12	9	9	9_PATO:0000461_CL:0002453_UBERON:0000988	0.049922
02c6c139-4346-443b-8209-9ab51d80595c	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000144	H18.30.002	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	0.0	0.000000	0.000000	False	False	12	9	9	9_PATO:0000461_CL:0002453_UBERON:0000988	0.056045
cce8f57e-ab2d-404c-a546-840de717fc73	Human PnAN	NCBITaxon:9606	PATO:0000461	HANCESTRO:0005	EFO:0009922	PATO:0000384	HsapDv:0000136	H19.30.001	nucleus	Pons (Pn) - afferent nuclei of cranial nerves ...	...	0.0	0.000000	0.000000	False	False	12	9	9	9_PATO:0000461_CL:0002453_UBERON:0000988	0.067691

23349 rows × 53 columns

In [20]:

Copied!





# I need to remake the dataset as it failed for some files and I had to restart at position 11 (As you can see in the preprocess() function)
name="preprocessed dataset"
description="preprocessed dataset using scdataloader"
dataset = ln.Collection(ln.Artifact.filter(version='2', description=DESCRIPTION), name=name, description=description)
dataset.save()
dataset.artifacts.count()
# I need to remake the dataset as it failed for some files and I had to restart at position 11 (As you can see in the preprocess() function)
name="preprocessed dataset"
description="preprocessed dataset using scdataloader"
dataset = ln.Collection(ln.Artifact.filter(version='2', description=DESCRIPTION), name=name, description=description)
dataset.save()
dataset.artifacts.count()

Out[20]: