Documentation for `Dataset`

`scdataloader.data.Dataset` `dataclass`

Bases: Dataset

Dataset class that loads a set of AnnData objects from a lamin dataset (Collection) in a memory-efficient way.

It wraps lamin's mappedCollection to provide more features, mainly the management of hierarchical labels, the encoding of labels, and the management of multiple species.

For an example of mappedDataset, see `lamindb.Dataset.mapped`.

Note: a related data loader exists at https://github.com/Genentech/scimilarity.

lamin_dataset (lamindb.Dataset): lamin dataset to load
genedf (pd.DataFrame): dataframe containing the genes to load
organisms (list[str]): list of organisms to load
    (for now this only validates that the genes map to these organisms)
obs (list[str]): list of observations to load from the Collection
clss_to_predict (list[str]): list of observations to encode
join_vars (flag): join variables, see `lamindb.Dataset.mapped`
hierarchical_clss: list of observations to map to a hierarchy using lamin's bionty

Methods:

| Name | Description |
| --- | --- |
| `define_hierarchies` | define_hierarchies is a method to define the hierarchies for the classes to predict |
| `get_label_weights` | Get all weights for the given label keys. |
| `get_unseen_mapped_dataset_elements` | get_unseen_mapped_dataset_elements is a wrapper around mappedDataset.get_unseen_mapped_dataset_elements |

`define_hierarchies`

Defines the label hierarchies for the classes to predict. Each class must be one of the supported bionty-backed observation columns listed in the source below.

Parameters:	`clsses` (`list[str]`) – list of classes to predict

Raises:	`ValueError` – if the class is not in the accepted classes

Source code in scdataloader/data.py

def define_hierarchies(self, clsses: list[str]):
    """
    define_hierarchies is a method to define the hierarchies for the classes to predict

    Args:
        clsses (list[str]): list of classes to predict

    Raises:
        ValueError: if the class is not in the accepted classes
    """
    # TODO: use all possible hierarchies instead of just the ones for which we have a sample annotated with
    self.labels_groupings = {}
    self.class_topred = {}
    for clss in clsses:
        if clss not in [
            "cell_type_ontology_term_id",
            "tissue_ontology_term_id",
            "disease_ontology_term_id",
            "development_stage_ontology_term_id",
            "simplified_dev_stage",
            "age_group",
            "assay_ontology_term_id",
            "self_reported_ethnicity_ontology_term_id",
        ]:
            raise ValueError(
                "class {} not in accepted classes, for now only supported from bionty sources".format(
                    clss
                )
            )
        elif clss == "cell_type_ontology_term_id":
            parentdf = (
                bt.CellType.filter()
                .df(include=["parents__ontology_id"])
                .set_index("ontology_id")
            )
        elif clss == "tissue_ontology_term_id":
            parentdf = (
                bt.Tissue.filter()
                .df(include=["parents__ontology_id"])
                .set_index("ontology_id")
            )
        elif clss == "disease_ontology_term_id":
            parentdf = (
                bt.Disease.filter()
                .df(include=["parents__ontology_id"])
                .set_index("ontology_id")
            )
        elif clss in [
            "development_stage_ontology_term_id",
            "simplified_dev_stage",
            "age_group",
        ]:
            parentdf = (
                bt.DevelopmentalStage.filter()
                .df(include=["parents__ontology_id"])
                .set_index("ontology_id")
            )
        elif clss == "assay_ontology_term_id":
            parentdf = (
                bt.ExperimentalFactor.filter()
                .df(include=["parents__ontology_id"])
                .set_index("ontology_id")
            )
        elif clss == "self_reported_ethnicity_ontology_term_id":
            parentdf = (
                bt.Ethnicity.filter()
                .df(include=["parents__ontology_id"])
                .set_index("ontology_id")
            )

        else:
            raise ValueError(
                "class {} not in accepted classes, for now only supported from bionty sources".format(
                    clss
                )
            )
        cats = set(self.mapped_dataset.get_merged_categories(clss))
        addition = set(LABELS_TOADD.get(clss, {}).values())
        cats |= addition
        groupings, _, leaf_labels = get_ancestry_mapping(cats, parentdf)
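        # iterate over a copy of the items: removing keys from a dict
        # while iterating over it directly raises a RuntimeError
        for i, j in list(groupings.items()):
            if len(j) == 0:
                groupings.pop(i)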
        self.labels_groupings[clss] = groupings
        if clss in self.clss_to_predict:
            # if we have added new clss, we need to update the encoder with them too.
            mlength = len(self.mapped_dataset.encoders[clss])

            mlength -= (
                1
                if self.mapped_dataset.unknown_label
                in self.mapped_dataset.encoders[clss].keys()
                else 0
            )

            for i, v in enumerate(
                addition - set(self.mapped_dataset.encoders[clss].keys())
            ):
                self.mapped_dataset.encoders[clss].update({v: mlength + i})
            # we need to change the ordering so that the things that can't be predicted appear afterward

            self.class_topred[clss] = leaf_labels
            c = 0
            update = {}
            mlength = len(leaf_labels)
            mlength -= (
                1
                if self.mapped_dataset.unknown_label
                in self.mapped_dataset.encoders[clss].keys()
                else 0
            )
            for k, v in self.mapped_dataset.encoders[clss].items():
                if k in self.labels_groupings[clss].keys():
                    update.update({k: mlength + c})
                    c += 1
                elif k == self.mapped_dataset.unknown_label:
                    update.update({k: v})
                    self.class_topred[clss] -= set([k])
                else:
                    update.update({k: v - c})
            self.mapped_dataset.encoders[clss] = update
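
The re-indexing at the end of `define_hierarchies` may be easier to follow on a toy example. The sketch below is a simplified illustration of the idea (leaf labels get the low indices, grouping labels are pushed after them, the unknown label keeps its code), not the library's implementation. The helper `reorder_encoder`, the label names, and the `-1` code for the unknown label are all assumptions made for this example.

```python
def reorder_encoder(encoder, group_labels, unknown_label="unknown"):
    """Return a new label encoder with leaf labels first.

    Hypothetical helper for illustration only:
    - labels that are neither groups nor the unknown label get indices 0..n-1,
    - grouping (non-leaf) labels get the indices that follow,
    - the unknown label keeps whatever code it already had.
    """
    leaves = [k for k in encoder if k not in group_labels and k != unknown_label]
    groups = [k for k in encoder if k in group_labels]

    reordered = {label: i for i, label in enumerate(leaves)}
    offset = len(leaves)
    for i, label in enumerate(groups):
        reordered[label] = offset + i
    if unknown_label in encoder:
        reordered[unknown_label] = encoder[unknown_label]
    return reordered


# Toy encoder: "lymphocyte" is a parent of the two leaf cell types.
encoder = {"T cell": 0, "lymphocyte": 1, "B cell": 2, "unknown": -1}
reordered = reorder_encoder(encoder, group_labels={"lymphocyte"})
print(reordered)
```

With this layout, a classifier head sized to the number of leaf labels never needs an output slot for a grouping label, while the grouping labels remain encodable for cells annotated only at a coarser level.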

`get_label_weights`

Get all weights for the given label keys.

Source code in scdataloader/data.py

def get_label_weights(
    self,
    obs_keys: str | list[str],
    scaler: int = 10,
    return_categories=False,
    bypass_label=["neuron"],
):
    """Get all weights for the given label keys."""
    if isinstance(obs_keys, str):
        obs_keys = [obs_keys]
    labels_list = []
    for label_key in obs_keys:
        labels_to_str = (
            self.mapped_dataset.get_merged_labels(label_key).astype(str).astype("O")
        )
        labels_list.append(labels_to_str)
    if len(labels_list) > 1:
        labels = ["___".join(labels_obs) for labels_obs in zip(*labels_list)]
    else:
        labels = labels_list[0]

    counter = Counter(labels)  # type: ignore
    if return_categories:
        rn = {n: i for i, n in enumerate(counter.keys())}
        labels = np.array([rn[label] for label in labels])
        counter = np.array(list(counter.values()))
        weights = scaler / (counter + scaler)
        return weights, labels
    else:
        counts = np.array([counter[label] for label in labels])
        if scaler is None:
            weights = 1.0 / counts
        else:
            weights = scaler / (counts + scaler)
        return weights
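
The weighting rule used above, `scaler / (count + scaler)`, can be tried in isolation. The following is a minimal sketch in plain Python rather than NumPy; the function name `smoothed_weights` and the labels are assumptions for the example. It also shows how several observation keys are joined with `___` into one composite label, as in the source.

```python
from collections import Counter


def smoothed_weights(labels, scaler=10):
    """Per-element sampling weights from label frequencies.

    Hypothetical stand-alone version of the rule in get_label_weights:
    with a scaler, weight = scaler / (count + scaler);
    with scaler=None, weight = 1 / count.
    """
    counts = Counter(labels)
    if scaler is None:
        return [1.0 / counts[label] for label in labels]
    return [scaler / (counts[label] + scaler) for label in labels]


cell_types = ["T cell", "T cell", "T cell", "B cell"]
tissues = ["blood", "blood", "lung", "blood"]

# One key: the rare label gets the larger weight.
weights = smoothed_weights(cell_types, scaler=10)

# Several keys: join them into one composite label per element first.
combined = ["___".join(pair) for pair in zip(cell_types, tissues)]
combined_weights = smoothed_weights(combined, scaler=10)
```

A larger `scaler` flattens the weights toward 1 for every label, while a small one approaches pure inverse-frequency weighting.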

`get_unseen_mapped_dataset_elements`

get_unseen_mapped_dataset_elements is a wrapper around mappedDataset.get_unseen_mapped_dataset_elements

Parameters:	`idx` (`int`) – index of the element to get

Returns:	`list[str]` – list of unseen genes

Source code in scdataloader/data.py

def get_unseen_mapped_dataset_elements(self, idx: int):
    """
    get_unseen_mapped_dataset_elements is a wrapper around mappedDataset.get_unseen_mapped_dataset_elements

    Args:
        idx (int): index of the element to get

    Returns:
        list[str]: list of unseen genes
    """
    return [str(i)[2:-1] for i in self.mapped_dataset.uns(idx, "unseen_genes")]
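
The `str(i)[2:-1]` slice in the return statement strips the `b'` prefix and trailing `'` from the printed form of a byte string. The sketch below assumes the stored values are byte strings (the gene identifiers are made up for the example) and compares the slice with an explicit decode.

```python
# Assumed input: gene identifiers stored as byte strings.
raw = [b"ENSG00000139618", b"ENSG00000141510"]

# str(b"abc") is "b'abc'", so dropping the first two characters and the
# last one leaves the bare text.
sliced = [str(g)[2:-1] for g in raw]

# Decoding gives the same result for ASCII content.
decoded = [g.decode("utf-8") for g in raw]

print(sliced)
print(sliced == decoded)
```

The two approaches only differ for non-ASCII bytes, where the slice keeps escape sequences such as `\xc3\xa9` while `decode` returns the actual character.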

`scdataloader.data.SimpleAnnDataset`

Bases: Dataset

SimpleAnnDataset is a simple dataloader for an AnnData dataset. It is meant to interface nicely with the rest of scDataloader and with your model during inference.

adata (anndata.AnnData): anndata object to use
obs_to_output (list[str]): list of observations to output from anndata.obs
layer (str): layer of the anndata to use

Source code in scdataloader/data.py

def __init__(
    self,
    adata: AnnData,
    obs_to_output: Optional[list[str]] = [],
    layer: Optional[str] = None,
):
    """
    SimpleAnnDataset is a simple dataloader for an AnnData dataset. this is to interface nicely with the rest of
    scDataloader and with your model during inference.

    Args:
    ----
        adata (anndata.AnnData): anndata object to use
        obs_to_output (list[str]): list of observations to output from anndata.obs
        layer (str): layer of the anndata to use
    """
    self.adataX = adata.layers[layer] if layer is not None else adata.X
    self.adataX = self.adataX.toarray() if issparse(self.adataX) else self.adataX
    self.obs_to_output = adata.obs[obs_to_output]
