Documentation for `Preprocessor`

`scdataloader.preprocess.Preprocessor`

Prepares data for the training, validation and test splits: normalizes raw expression values and applies binning or other transforms to produce the preset model input format.

Initializes the preprocessor and configures the workflow steps.

Your dataset should contain at least the following:

- `organism_ontology_term_id` in obs, with the ontology id of the organism of your anndata
- gene names in the var.index field of your anndata that map to the ensembl_gene nomenclature or the hugo gene symbols nomenclature (if the latter, set is_symbol to True)
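The requirements above can be checked before preprocessing. The following is a minimal sketch, not part of scdataloader: `check_inputs` and `ENSEMBL_GENE_ID` are hypothetical names, and the regular expression only assumes the usual Ensembl stable gene ID shape (`ENS`, optional species letters, `G`, digits).

```python
import re

# Ensembl stable gene IDs look like "ENSG00000141510" (human) or
# "ENSMUSG00000059552" (mouse), optionally followed by a ".version" suffix.
ENSEMBL_GENE_ID = re.compile(r"^ENS[A-Z]*G\d+(\.\d+)?$")


def check_inputs(obs_columns, var_names):
    """Return a list of problems found (empty if the inputs look usable).

    Hypothetical helper for illustration; it is not provided by scdataloader.
    """
    problems = []
    if "organism_ontology_term_id" not in obs_columns:
        problems.append("obs is missing 'organism_ontology_term_id'")
    # Count how many gene names look like Ensembl gene IDs.
    n_ensembl = sum(bool(ENSEMBL_GENE_ID.match(name)) for name in var_names)
    if n_ensembl == 0:
        problems.append(
            "var.index has no Ensembl gene IDs; "
            "if it holds gene symbols, set is_symbol to True"
        )
    return problems
```

With an AnnData object `adata`, the helper could be called as `check_inputs(adata.obs.columns, adata.var.index)`.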

Parameters:

filter_gene_by_counts (int or bool, default: False ) –

Determines whether to filter genes by counts. If an int, it is used as the count threshold for filtering genes. Defaults to False.
filter_cell_by_counts (int or bool, default: False ) –

Determines whether to filter cells by counts. If an int, it is used as the count threshold for filtering cells. Defaults to False.
normalize_sum (float or bool, default: 10000.0 ) –

Determines whether to normalize the total counts of each cell to a specific value. Defaults to 1e4.
log1p (bool) –

Determines whether to apply log1p transform to the normalized data. Defaults to True.
n_hvg_for_postp (int or bool, default: 0 ) –

Determines whether to subset to highly variable genes for the PCA; if an int, the number of genes to keep. Defaults to 0 (equivalent to False).
hvg_flavor (str, default: 'seurat_v3' ) –

Specifies the flavor of highly variable genes selection. See `scanpy.pp.highly_variable_genes` for more details. Defaults to "seurat_v3".
binning (int, default: None ) –

Determines whether to bin the data into discrete values; if an int, the number of bins to use. Defaults to None.
result_binned_key (str, default: 'X_binned' ) –

Specifies the key in `anndata.AnnData` under which to store the binned data. Defaults to "X_binned".
length_normalize (bool, default: False ) –

Determines whether to length normalize the data. Defaults to False.
force_preprocess (bool, default: False ) –

Determines whether to bypass the check of raw counts. Defaults to False.
min_dataset_size (int, default: 100 ) –

The minimum size required for a dataset to be kept. Defaults to 100.
min_valid_genes_id (int, default: 10000 ) –

The minimum number of valid genes to keep a dataset. Defaults to 10_000.
min_nnz_genes (int, default: 200 ) –

The minimum number of non-zero genes to keep a cell. Defaults to 200.
maxdropamount (int, default: 50 ) –

The maximum amount of dropped cells per dataset (2 for a 50% drop, 3 for a 33% drop, etc.). Defaults to 50.
madoutlier (int, default: 5 ) –

The maximum absolute deviation of the outlier samples. Defaults to 5.
pct_mt_outlier (int, default: 8 ) –

The maximum percentage of mitochondrial gene expression allowed before a cell is considered an outlier. Defaults to 8.
batch_key (str) –

The key of `anndata.AnnData.obs` to use for batch information. This argument is used in the highly variable gene selection step.
skip_validate (bool, default: False ) –

Determines whether to skip the validation step. Defaults to False.
keepdata (bool, default: False ) –

Determines whether to keep the data in the AnnData object. Defaults to False.
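To make the effect of `normalize_sum`, `log1p` and `binning` concrete, here is a rough pure-Python sketch of those three steps on a single cell. It does not reproduce the internals of `Preprocessor`: the function names are hypothetical, and the equal-width binning scheme in particular is an assumption chosen for simplicity, since the binning method is not specified above.

```python
import math


def normalize_total(counts, target_sum=1e4):
    """Scale one cell's counts so they sum to target_sum (cf. normalize_sum)."""
    total = sum(counts)
    if total == 0:
        return [0.0] * len(counts)
    return [c * target_sum / total for c in counts]


def log1p_transform(values):
    """Apply log(1 + x) to every value (cf. log1p)."""
    return [math.log1p(v) for v in values]


def bin_equal_width(values, n_bins):
    """Map values to integer bins 0..n_bins-1 using equal-width bins.

    Assumption: equal-width bins between 0 and the cell's maximum value.
    The real binning scheme may differ.
    """
    top = max(values, default=0)
    if top <= 0:
        return [0] * len(values)
    width = top / n_bins
    # The maximum value would land in bin n_bins, so clamp it to the last bin.
    return [min(int(v / width), n_bins - 1) for v in values]


cell = [1, 3, 0, 6]                       # raw counts for four genes
normalized = normalize_total(cell)        # counts rescaled to sum to 1e4
logged = log1p_transform(normalized)      # zeros stay at zero
binned = bin_equal_width(logged, n_bins=5)
```

In this sketch zero counts remain zero through every step, and the binned values are the kind of discrete representation that would be stored under `result_binned_key`.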