Curate DataFrames and AnnDatas¶

Curating datasets typically means three things:

Validate: ensure a dataset meets predefined validation criteria
Standardize: transform a dataset so that it meets validation criteria, e.g., by fixing typos or using standardized identifiers
Annotate: link a dataset against metadata records

In LaminDB, valid metadata is metadata that’s stored in a metadata registry, and a validation criterion merely defines a mapping onto a field of a registry.

Example

"Experiment 1" is a valid value for ULabel.name if a record with this name exists in the ULabel registry.

# !pip install 'lamindb[bionty]'
!lamin init --storage ./test-curate --schema bionty
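
To make the notion of validity concrete, here is a minimal sketch in plain Python. It is not LaminDB code: the list of dictionaries stands in for a registry, the `uid` values are made up, and `is_valid` is a hypothetical helper. It only illustrates that a value is valid for a given field if some record holds that value in that field.

```python
# Toy stand-in for a registry: a list of records, each with named fields.
# The "uid" values are made up for illustration.
ulabel_registry = [
    {"uid": "u0000001", "name": "Experiment 1"},
    {"uid": "u0000002", "name": "Experiment 2"},
]


def is_valid(value, registry, field):
    """Return True if some record in the registry holds `value` in `field`."""
    return any(record[field] == value for record in registry)


print(is_valid("Experiment 1", ulabel_registry, "name"))  # True
print(is_valid("Experiment 3", ulabel_registry, "name"))  # False
print(is_valid("Experiment 1", ulabel_registry, "uid"))   # False
```

The last call shows why the field matters: the same value can be valid for one field of a registry and invalid for another.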

Validate a DataFrame¶

Let’s start with a DataFrame that we’d like to validate.

import lamindb as ln
import bionty as bt
import pandas as pd


df = pd.DataFrame(
    {
        "temperature": [37.2, 36.3, 38.2],
        "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
        "assay_ontology_id": ["EFO:0008913", "EFO:0008913", "EFO:0008913"],
        "donor": ["D0001", "D0002", "DOOO3"]
    },
    index = ["obs1", "obs2", "obs3"]
)
df


→ connected lamindb: testuser1/test-curate

	temperature	cell_type	assay_ontology_id	donor
obs1	37.2	cerebral pyramidal neuron	EFO:0008913	D0001
obs2	36.3	astrocyte	EFO:0008913	D0002
obs3	38.2	oligodendrocyte	EFO:0008913	DOOO3

Define validation criteria and create a Curator object.

# in the dictionary, each key is a column name of the dataframe, and each value
# is a registry field onto which values are mapped
categoricals = {
    "cell_type": bt.CellType.name,
    "assay_ontology_id": bt.ExperimentalFactor.ontology_id,
    "donor": ln.ULabel.name,
}

# pass validation criteria
curate = ln.Curator.from_df(df, categoricals=categoricals)

The validate() method checks our data against the defined criteria. It identifies which values are already validated (exist in our registries) and which are potentially problematic (do not yet exist in our registries).

curate.validate()
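
As a rough mental model of what such a check does, the sketch below splits each column's unique values into known and unknown ones and reports overall success only if nothing is unknown. It is an illustration under assumptions, not the Curator implementation: `validate_columns` is a hypothetical function, plain dictionaries stand in for the DataFrame and the registries, and registry fields are represented by strings.

```python
def validate_columns(table, criteria, registries):
    """Collect, per column, the unique values that are missing from the registry.

    table:      column name -> list of values (stand-in for a DataFrame)
    criteria:   column name -> registry key (stand-in for a registry field)
    registries: registry key -> set of known values
    Returns (all_valid, non_validated).
    """
    non_validated = {}
    for column, registry_key in criteria.items():
        known = registries.get(registry_key, set())
        # dict.fromkeys de-duplicates while keeping first-seen order
        missing = [v for v in dict.fromkeys(table[column]) if v not in known]
        if missing:
            non_validated[column] = missing
    return not non_validated, non_validated


table = {
    "cell_type": ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"],
    "donor": ["D0001", "D0002", "DOOO3"],
}
criteria = {"cell_type": "CellType.name", "donor": "ULabel.name"}
registries = {
    "CellType.name": {"astrocyte", "oligodendrocyte"},
    "ULabel.name": set(),  # an empty registry validates nothing
}

ok, non_validated = validate_columns(table, criteria, registries)
print(ok)
print(non_validated)
```

Columns without an entry in `criteria`, such as a numeric `temperature` column, are simply not checked in this sketch.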

Register new metadata values¶

If you see “non-validated” values, you’ll need to decide whether to add them to your registries or “fix” them in your dataset.

Because our current registries are still empty, we’ll start by populating our CellType registry with values from a public ontology.

# this adds cell types that were validated based on a public ontology
curate.add_validated_from("cell_type")

If we call validate() again, we see that one cell type still doesn’t pass validation.

curate.validate()

• mapping cell_type on CellType.name

!    1 terms is not validated: 'cerebral pyramidal neuron'
      → fix typos, remove non-existent values, or save terms via .add_new_from('cell_type')

• mapping assay_ontology_id on ExperimentalFactor.ontology_id

!    found 1 validated terms: ['EFO:0008913']
      → save terms via .add_validated_from('assay_ontology_id')

!    3 terms are not validated: 'D0001', 'D0002', 'DOOO3'
      → fix typos, remove non-existent values, or save terms via .add_new_from('donor')

False

Let’s find out which cell type in the public ontology is the likely match.

# use a lookup object to get the correct spelling of categories from a public ontology
lookup = curate.lookup(public=True)
lookup

# here is an example for the cell_type column
cell_types = lookup["cell_type"]
cell_types.cerebral_cortex_pyramidal_neuron

# fix the cell type
df.cell_type = df.cell_type.replace({"cerebral pyramidal neuron": cell_types.cerebral_cortex_pyramidal_neuron.name})
# now register curated and validated cell types
curate.add_validated_from(df.cell_type.name)
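
When browsing a lookup object by hand is impractical, approximate string matching can suggest candidates. The sketch below uses `difflib.get_close_matches` from the Python standard library, which is a different technique from the lookup object used above. The list `ontology_names` is a tiny hand-written excerpt, `suggest` is a hypothetical helper, and the `cutoff` of 0.6 is an arbitrary choice; any suggestion should be reviewed before it is applied.

```python
from difflib import get_close_matches

# Tiny hand-written excerpt of term names, for illustration only
ontology_names = ["cerebral cortex pyramidal neuron", "astrocyte", "oligodendrocyte"]


def suggest(value, candidates, cutoff=0.6):
    """Return the candidate most similar to `value`, or None if none is close."""
    matches = get_close_matches(value, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None


labels = ["cerebral pyramidal neuron", "astrocyte", "oligodendrocyte"]

# Propose a replacement for every label that is not an exact term name
fixes = {}
for label in labels:
    if label not in ontology_names:
        suggestion = suggest(label, ontology_names)
        if suggestion is not None:
            fixes[label] = suggestion

# Apply the proposed replacements, leaving exact matches untouched
standardized = [fixes.get(label, label) for label in labels]
print(fixes)
print(standardized)
```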

Now, do the same for "assay_ontology_id" and "donor".

# this adds assays that were validated based on a public ontology
curate.add_validated_from("assay_ontology_id")

# this adds donors that were _not_ validated
curate.add_new_from("donor")

# validate again
validated = curate.validate()
validated
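
The two registration paths used above differ in where a value is already known. The following sketch illustrates that distinction with plain sets. It is a toy model, not LaminDB code: `classify` is a hypothetical function, and the sketch assumes that donor identifiers have no public source.

```python
def classify(values, registry, public_ontology):
    """Sort unique values into three groups by where they are already known."""
    groups = {"in_registry": [], "in_public_only": [], "new": []}
    for value in dict.fromkeys(values):  # de-duplicate, keep order
        if value in registry:
            groups["in_registry"].append(value)
        elif value in public_ontology:
            groups["in_public_only"].append(value)
        else:
            groups["new"].append(value)
    return groups


assay_registry, donor_registry = set(), set()
public_assays = {"EFO:0008913"}
public_donors = set()  # assumption: donors are study-specific

assays = classify(["EFO:0008913"] * 3, assay_registry, public_assays)
donors = classify(["D0001", "D0002", "DOOO3"], donor_registry, public_donors)

# Counterpart of add_validated_from: save values backed by the public source
assay_registry.update(assays["in_public_only"])
# Counterpart of add_new_from: deliberately save values that nothing vouches for
donor_registry.update(donors["new"])

print(assays)
print(donors)
print(classify(["D0001"], donor_registry, public_donors))
```

Saving new values is the riskier path because a typo is registered as readily as a correct value, so it is worth inspecting the list before saving it.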

Validate an AnnData¶

Here we additionally specify which var_index to validate against.

import anndata as ad

X = pd.DataFrame(
    {
        "ENSG00000081059": [1, 2, 3], 
        "ENSG00000276977": [4, 5, 6], 
        "ENSG00000198851": [7, 8, 9], 
        "ENSG00000010610": [10, 11, 12], 
        "ENSG00000153563": [13, 14, 15],
        "corrupted": [16, 17, 18]
    }, 
    index=df.index
)

adata = ad.AnnData(X=X, obs=df)
adata

curate = ln.Curator.from_anndata(
    adata, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)

curate.validate()

Save the validated genes, as the validation output suggests:

curate.add_validated_from_var_index()

✓ added 5 records from public with Gene.ensembl_gene_id for var_index: 'ENSG00000081059', 'ENSG00000276977', 'ENSG00000198851', 'ENSG00000010610', 'ENSG00000153563'

! 1 non-validated values are not saved in Gene.ensembl_gene_id: ['corrupted']!
      → to lookup values, use lookup().var_index
      → to save, run add_new_from_var_index

Non-validated terms can be accessed via:

curate.non_validated

Subset the AnnData to validated genes only:

adata_validated = adata[:, ~adata.var.index.isin(curate.non_validated["var_index"])].copy()
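
The subsetting expression keeps every variable whose identifier is not listed as non-validated, and `.copy()` yields an independent object rather than a view. The same selection logic is sketched below in plain Python, with a dictionary of columns standing in for the AnnData matrix. The values of `matrix` and `non_validated` are hard-coded assumptions that mirror part of the example above.

```python
# Stand-in for the expression matrix: gene identifier -> column of counts
matrix = {
    "ENSG00000081059": [1, 2, 3],
    "ENSG00000276977": [4, 5, 6],
    "corrupted": [16, 17, 18],
}
# Stand-in for curate.non_validated, hard-coded for illustration
non_validated = {"var_index": ["corrupted"]}

to_drop = set(non_validated["var_index"])
# Keep only validated columns; list(values) copies each column so the
# result does not share data with the original
matrix_validated = {
    gene: list(values) for gene, values in matrix.items() if gene not in to_drop
}

print(list(matrix_validated))
print("corrupted" in matrix)  # the original is left untouched
```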

Now let’s validate the subsetted object:

curate = ln.Curator.from_anndata(
    adata_validated, 
    var_index=bt.Gene.ensembl_gene_id,  # validate var.index against Gene.ensembl_gene_id
    categoricals=categoricals, 
    organism="human",
)

curate.validate()

Save a curated artifact¶

The validated object can then be saved as an Artifact:

artifact = curate.save_artifact(description="test AnnData")

Validated features and labels are linked to the artifact:

artifact.describe()

We’ve walked through validating, standardizing, and annotating datasets in these key steps:

Defining validation criteria
Validating data against existing registries
Adding new validated entries to registries
Annotating artifacts with validated metadata

By following these steps, you can ensure your data is standardized and well-curated.

If you have datasets that aren’t DataFrame-like or AnnData-like, read: Curate datasets of any format.