one-off preparation and download of the database (optional)¶
It is recommended to work on a local lamin database in many instances:
- you might want to do some dataset level preprocessing (PCA, neighboors, clustering, re-annotation, harmonization of labels, etc..)
- to make the loading time much faster when you do large training runs
- to work on compute nodes that might not have access to internet
In [1]:
Copied!
# initialize a local lamin database
# !lamin init --storage ~/scdataloader --schema bionty
! lamin load scdataloader
# initialize a local lamin database
# !lamin init --storage ~/scdataloader --schema bionty
! lamin load scdataloader
💡 found cached instance metadata: /home/ml4ig1/.lamin/instance--jkobject--scdataloader.env 💡 loaded instance: jkobject/scdataloader 💡 loaded instance: jkobject/scdataloader
In [1]:
Copied!
import bionty as bt
import lamindb as ln
from scdataloader import utils
%load_ext autoreload
%autoreload 2
import bionty as bt
import lamindb as ln
from scdataloader import utils
%load_ext autoreload
%autoreload 2
→ connected lamindb: jkobject/scprint
load some known ontology names¶
first if you use a local instance you will need to populate your ontologies. Meaning loading all the elements from ontological references and build the hierarchical tree.
One can just add everything by keeping the default None
value for the ontology
argument, but this will take a long time.
Instead, one load only the ontologies we need. By using all the used/existing cellxgene ontology names.
In [ ]:
Copied!
utils.populate_my_ontology()
utils.populate_my_ontology()
❗ now recursing through parents: this only happens once, but is much slower than bulk saving ❗ now recursing through parents: this only happens once, but is much slower than bulk saving
Directly download a lamin database¶
In this context one ca either directly download a lamin database (here the cellxgene database as example).
In [5]:
Copied!
list(ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census").all())
list(ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census").all())
Out[5]:
[Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'), Collection(uid='dMyEX3NTfKOEYXyMKDAQ', version='2023-07-25', is_latest=False, name='cellxgene-census', hash='pEJ9uvIeTLvHkZW2TBT5', visibility=1, created_by_id=1, transform_id=18, run_id=23, updated_at='2024-01-30 09:06:05 UTC'), Collection(uid='dMyEX3NTfKOEYXyMKDD7', version='2024-07-01', is_latest=True, name='cellxgene-census', hash='nI8Ag-HANeOpZOz-8CSn', visibility=1, created_by_id=1, transform_id=22, run_id=27, updated_at='2024-07-16 12:24:38 UTC')]
In [7]:
Copied!
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
cx_dataset, len(cx_dataset.artifacts.all())
cx_dataset = ln.Collection.using(instance="laminlabs/cellxgene").filter(name="cellxgene-census", version='2023-12-15').one()
cx_dataset, len(cx_dataset.artifacts.all())
! no run & transform get linked, consider calling ln.context.track()
Out[7]:
(Collection(uid='dMyEX3NTfKOEYXyMu591', version='2023-12-15', is_latest=False, name='cellxgene-census', hash='0NB32iVKG5ttaW5XILvG', visibility=1, created_by_id=1, transform_id=19, run_id=24, updated_at='2024-01-30 09:09:49 UTC'), 1113)
In [ ]:
Copied!
# you can use it as is
# you can use it as is
In [ ]:
Copied!
# or load it locally like so
mydataset = utils.load_dataset_local(lb, cx_dataset, "~/scdataloader/", name="cellxgene-local", description="the full cellxgene database", only=(0,2))
# or load it locally like so
mydataset = utils.load_dataset_local(lb, cx_dataset, "~/scdataloader/", name="cellxgene-local", description="the full cellxgene database", only=(0,2))
Or Preprocessing + Download¶
In this case we use a custom made function that applies a preprocessing after downloading each files in the database.
In [4]:
Copied!
from scdataloader.preprocess import (
LaminPreprocessor,
additional_postprocess,
additional_preprocess,
)
from scdataloader.preprocess import (
LaminPreprocessor,
additional_postprocess,
additional_preprocess,
)
In [6]:
Copied!
DESCRIPTION='preprocessed by scDataLoader'
DESCRIPTION='preprocessed by scDataLoader'
In [7]:
Copied!
# Here we also add some additional preprocessing (happens at the beginning of the preprocessing function) and post processing (happens at the end of the preprocessing function)
# this serves as an exemple of the flexibility of the function
do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
# Here we also add some additional preprocessing (happens at the beginning of the preprocessing function) and post processing (happens at the end of the preprocessing function)
# this serves as an exemple of the flexibility of the function
do_preprocess = LaminPreprocessor(additional_postprocess=additional_postprocess, additional_preprocess=additional_preprocess, skip_validate=True, subset_hvg=0)
In [16]:
Copied!
preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")
preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2")
0 Artifact(uid='Mgilie8RUip2slElQoDx', key='cell-census/2023-12-15/h5ads/77044335-0ac5-4406-9b3f-8cdd3656d67b.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Pons (Pn) - afferent nuclei of cranial nerves in pons - PnAN', version='2023-12-15', size=131533988, hash='QCaNEZY9apeRmazmwGXAWg-16', hash_type='md5-n', n_observations=23349, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1) AnnData object with n_obs × n_vars = 23349 × 59357 obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage' var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype' uns: 'batch_condition', 'schema_version', 'title' obsm: 'X_UMAP', 'X_tSNE' Removed 128 genes. Seeing 6048 outliers (25.90% of total dataset):
/home/ml4ig1/miniconda3/envs/test/lib/python3.10/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024. warnings.warn(
❗ no run & transform get linked, consider passing a `run` or calling ln.track()
... storing 'dpt_group' as categorical ... storing 'symbol' as categorical ... storing 'ncbi_gene_ids' as categorical ... storing 'biotype' as categorical ... storing 'description' as categorical ... storing 'synonyms' as categorical ... storing 'organism' as categorical
1 Artifact(uid='iAZPSOBKLpaK7lqyYp9O', key='cell-census/2023-12-15/h5ads/2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad', suffix='.h5ad', accessor='AnnData', description='Dissection: Amygdaloid complex (AMY) - Corticomedial nuclear group (CMN) - anterior cortical nucleus - CoA', version='2023-12-15', size=169529860, hash='WuShpnxfduKWKn23oYUCTA-21', hash_type='md5-n', n_observations=10778, visibility=1, key_is_virtual=False, updated_at=2024-01-29 07:45:42 UTC, storage_id=2, transform_id=16, run_id=22, created_by_id=1) ... downloading 2a8ca8f3-5599-4cda-b973-3a2dfc3c1fe6.h5ad: 100.0% AnnData object with n_obs × n_vars = 10778 × 59357 obs: 'roi', 'organism_ontology_term_id', 'disease_ontology_term_id', 'self_reported_ethnicity_ontology_term_id', 'assay_ontology_term_id', 'sex_ontology_term_id', 'development_stage_ontology_term_id', 'donor_id', 'suspension_type', 'dissection', 'fraction_mitochondrial', 'fraction_unspliced', 'cell_cycle_score', 'total_genes', 'total_UMIs', 'sample_id', 'supercluster_term', 'cluster_id', 'subcluster_id', 'cell_type_ontology_term_id', 'tissue_ontology_term_id', 'is_primary_data', 'cell_type', 'assay', 'disease', 'organism', 'sex', 'tissue', 'self_reported_ethnicity', 'development_stage' var: 'Biotype', 'Chromosome', 'End', 'Gene', 'Start', 'feature_is_filtered', 'feature_name', 'feature_reference', 'feature_biotype' uns: 'batch_condition', 'schema_version', 'title' obsm: 'X_UMAP', 'X_tSNE' Removed 128 genes. Seeing 1652 outliers (15.33% of total dataset):
/home/ml4ig1/miniconda3/envs/test/lib/python3.10/site-packages/anndata/_core/anndata.py:522: FutureWarning: The dtype argument is deprecated and will be removed in late 2024. warnings.warn(
--------------------------------------------------------------------------- KeyboardInterrupt Traceback (most recent call last) Cell In[16], line 1 ----> 1 preprocessed_dataset = do_preprocess(cx_dataset, name=DESCRIPTION, description=DESCRIPTION, start_at=6, version="2") File ~/Documents code/scDataLoader/scdataloader/preprocess.py:355, in LaminPreprocessor.__call__(self, data, name, description, start_at, version) 353 print(adata) 354 try: --> 355 adata = super().__call__(adata) 357 except ValueError as v: 358 if v.args[0].startswith( 359 "Dataset dropped because contains too many secondary" 360 ): File ~/Documents code/scDataLoader/scdataloader/preprocess.py:240, in Preprocessor.__call__(self, adata) 229 sc.pp.highly_variable_genes( 230 adata, 231 layer="clean", (...) 235 subset=False, 236 ) 237 # based on the topometry paper https://www.biorxiv.org/content/10.1101/2022.03.14.484134v2 238 # https://rapids-singlecell.readthedocs.io/en/latest/api/generated/rapids_singlecell.pp.pca.html#rapids_singlecell.pp.pca --> 240 adata.obsm["clean_pca"] = sc.pp.pca( 241 adata.layers["clean"], 242 n_comps=300 if adata.shape[0] > 300 else adata.shape[0] - 2, 243 ) 244 sc.pp.neighbors(adata, use_rep="clean_pca") 245 sc.tl.leiden(adata, key_added="leiden_3", resolution=3.0) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:200, in pca(data, n_comps, zero_center, svd_solver, random_state, return_info, use_highly_variable, dtype, copy, chunked, chunk_size) 194 if svd_solver not in {'lobpcg', 'arpack'}: 195 raise ValueError( 196 'svd_solver: {svd_solver} can not be used with sparse input.\n' 197 'Use "arpack" (the default) or "lobpcg" instead.' 198 ) --> 200 output = _pca_with_sparse( 201 X, n_comps, solver=svd_solver, random_state=random_state 202 ) 203 # this is just a wrapper for the results 204 X_pca = output['X_pca'] File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:303, in _pca_with_sparse(X, npcs, solver, mu, random_state) 292 return XHmat(x) - mhmat(ones(x)) 294 XL = LinearOperator( 295 matvec=matvec, 296 dtype=X.dtype, (...) 300 rmatmat=rmatmat, 301 ) --> 303 u, s, v = svds(XL, solver=solver, k=npcs, v0=random_init) 304 u, v = svd_flip(u, v) 305 idx = np.argsort(-s) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/_svds.py:525, in svds(A, k, ncv, tol, which, v0, maxiter, return_singular_vectors, solver, random_state, options) 523 if v0 is None: 524 v0 = random_state.standard_normal(size=(min(A.shape),)) --> 525 _, eigvec = eigsh(XH_X, k=k, tol=tol ** 2, maxiter=maxiter, 526 ncv=ncv, which=which, v0=v0) 527 # arpack do not guarantee exactly orthonormal eigenvectors 528 # for clustered eigenvalues, especially in complex arithmetic 529 eigvec, _ = np.linalg.qr(eigvec) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:1700, in eigsh(A, k, M, sigma, which, v0, ncv, maxiter, tol, return_eigenvectors, Minv, OPinv, mode) 1698 with _ARPACK_LOCK: 1699 while not params.converged: -> 1700 params.iterate() 1702 return params.extract(return_eigenvectors) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/arpack/arpack.py:549, in _SymmetricArpackParams.iterate(self) 546 elif self.ido == 1: 547 # compute y = Op*x 548 if self.mode == 1: --> 549 self.workd[yslice] = self.OP(self.workd[xslice]) 550 elif self.mode == 2: 551 self.workd[xslice] = self.OPb(self.workd[xslice]) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:236, in LinearOperator.matvec(self, x) 233 if x.shape != (N,) and x.shape != (N,1): 234 raise ValueError('dimension mismatch') --> 236 y = self._matvec(x) 238 if isinstance(x, np.matrix): 239 y = asmatrix(y) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:593, in _CustomLinearOperator._matvec(self, x) 592 def _matvec(self, x): --> 593 return self.__matvec_impl(x) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_eigen/_svds.py:469, in svds.<locals>.matvec_XH_X(x) 468 def matvec_XH_X(x): --> 469 return XH_dot(X_dot(x)) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:283, in LinearOperator.rmatvec(self, x) 280 if x.shape != (M,) and x.shape != (M,1): 281 raise ValueError('dimension mismatch') --> 283 y = self._rmatvec(x) 285 if isinstance(x, np.matrix): 286 y = asmatrix(y) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/linalg/_interface.py:599, in _CustomLinearOperator._rmatvec(self, x) 597 if func is None: 598 raise NotImplementedError("rmatvec is not defined") --> 599 return self.__rmatvec_impl(x) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scanpy/preprocessing/_pca.py:289, in _pca_with_sparse.<locals>.rmatvec(x) 288 def rmatvec(x): --> 289 return XHdot(x) - mhdot(ones(x)) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:465, in _spbase.dot(self, other) 463 return self * other 464 else: --> 465 return self @ other File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:678, in _spbase.__matmul__(self, other) 675 if isscalarlike(other): 676 raise ValueError("Scalar operands are not allowed, " 677 "use '*' instead") --> 678 return self._mul_dispatch(other) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_base.py:576, in _spbase._mul_dispatch(self, other) 573 if other.__class__ is np.ndarray: 574 # Fast path for the most common case 575 if other.shape == (N,): --> 576 return self._mul_vector(other) 577 elif other.shape == (N, 1): 578 return self._mul_vector(other.ravel()).reshape(M, 1) File ~/miniconda3/envs/test/lib/python3.10/site-packages/scipy/sparse/_compressed.py:494, in _cs_matrix._mul_vector(self, other) 492 # csr_matvec or csc_matvec 493 fn = getattr(_sparsetools, self.format + '_matvec') --> 494 fn(M, N, self.indptr, self.indices, self.data, other, result) 496 return result KeyboardInterrupt:
In [17]:
Copied!
#we have processed that many files
len(ln.Artifact.filter(version='2', description=DESCRIPTION))
#we have processed that many files
len(ln.Artifact.filter(version='2', description=DESCRIPTION))
Out[17]:
1
In [18]:
Copied!
# we can load a preprocessed anndata like this
adata = ln.Artifact.filter(version='2', description=DESCRIPTION)[0].backed()
adata.obs
# we can load a preprocessed anndata like this
adata = ln.Artifact.filter(version='2', description=DESCRIPTION)[0].backed()
adata.obs
Out[18]:
roi | organism_ontology_term_id | disease_ontology_term_id | self_reported_ethnicity_ontology_term_id | assay_ontology_term_id | sex_ontology_term_id | development_stage_ontology_term_id | donor_id | suspension_type | dissection | ... | total_counts_hb | log1p_total_counts_hb | pct_counts_hb | outlier | mt_outlier | leiden_3 | leiden_2 | leiden_1 | dpt_group | heat_diff | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
73b14343-9071-4da1-884d-52f04c781a44 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000136 | H19.30.001 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 3.0 | 1.386294 | 0.023204 | True | False | 13 | 11 | 10 | NaN | |
91e3096c-9a19-4827-8050-9e08fe22f9e0 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000136 | H19.30.001 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | False | False | 5 | 4 | 3 | NaN | |
7bd7c1f8-5080-4ea4-87ff-4c02bf641059 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000136 | H19.30.001 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 2.0 | 1.098612 | 0.016004 | True | False | 19 | 17 | 6 | NaN | |
47d8e4b5-daf3-4b3f-bdaa-c877fce88f08 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 3.0 | 1.386294 | 0.008974 | True | True | 24 | 20 | 16 | NaN | |
5bc7626d-4e45-4bf1-8818-922502760364 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | True | False | 34 | 27 | 22 | NaN | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
f266c95f-e602-416b-9f3e-c8f081352e79 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 2.0 | 1.098612 | 0.032889 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.048092 |
02530a0a-bd8b-40ff-b3ab-18bbbac42eda | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 1.0 | 0.693147 | 0.011792 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.049665 |
a1840018-61f1-4345-b665-90ce5ad7936d | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.049922 |
02c6c139-4346-443b-8209-9ab51d80595c | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000144 | H18.30.002 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.056045 |
cce8f57e-ab2d-404c-a546-840de717fc73 | Human PnAN | NCBITaxon:9606 | PATO:0000461 | HANCESTRO:0005 | EFO:0009922 | PATO:0000384 | HsapDv:0000136 | H19.30.001 | nucleus | Pons (Pn) - afferent nuclei of cranial nerves ... | ... | 0.0 | 0.000000 | 0.000000 | False | False | 12 | 9 | 9 | 9_PATO:0000461_CL:0002453_UBERON:0000988 | 0.067691 |
23349 rows × 53 columns
In [20]:
Copied!
# I need to remake the dataset as it failed for some files and I had to restart at position 11 (As you can see in the preprocess() function)
name="preprocessed dataset"
description="preprocessed dataset using scdataloader"
dataset = ln.Collection(ln.Artifact.filter(version='2', description=DESCRIPTION), name=name, description=description)
dataset.save()
dataset.artifacts.count()
# I need to remake the dataset as it failed for some files and I had to restart at position 11 (As you can see in the preprocess() function)
name="preprocessed dataset"
description="preprocessed dataset using scdataloader"
dataset = ln.Collection(ln.Artifact.filter(version='2', description=DESCRIPTION), name=name, description=description)
dataset.save()
dataset.artifacts.count()
Out[20]:
1