Test `AnnDataAccessor`

import lamindb as ln

ln.setup.init(storage="s3://lamindb-ci/test-anndata")

! To use lamindb, you need to connect to an instance.

Connect to an instance: `ln.connect()`. Init an instance: `ln.setup.init()`.

If you used the CLI to set up lamindb in a notebook, restart the Python session.

→ go to: https://lamin.ai/testuser1/test-anndata

! updating cloud SQLite 's3://lamindb-ci/test-anndata/183bc48fd12a5d5b8ff8153b79de292c.lndb' of instance 'testuser1/test-anndata'

→ connected lamindb: testuser1/test-anndata

! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()

! no run & transform got linked, call `ln.track()` & re-run

Artifact(uid='J5QHE4FxFYM0RSua0000', is_latest=True, key='lndb-storage/pbmc68k.h5ad', suffix='.h5ad', size=638484, hash='-QNUPBbAug3jFmmk3fsOQA', _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=False, storage_id=2, created_by_id=1, created_at=2024-10-18 15:56:36 UTC)

An h5ad artifact stored on S3:

artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()

artifact.path

S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')

adata = artifact.open()

! run input wasn't tracked, call `ln.track()` and re-run

adata

AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

It is possible to access AnnData attributes without loading them into memory.

However, `.obs`, `.var` and `.uns` are always loaded fully into memory when the AnnDataAccessor is initialized:

adata.obs.head()

	cell_type	n_genes	percent_mito	louvain
index
GCAGGGCTGGATTC-1	Dendritic cells	1168	0.014345	2
CTTTAGTGGTTACG-6	CD19+ B	1121	0.019679	8
TGACTGGAACCATG-7	Dendritic cells	1277	0.012961	1
TCAATCACCCTTCG-8	CD19+ B	1139	0.018467	4
CGTTATACAGTACC-8	CD4+/CD45RO+ Memory	1034	0.010163	0

adata.var.head()

	n_counts	highly_variable
index
HES4	1153.387451	True
TNFRSF4	304.358154	True
SSU72	2530.272705	False
PARK7	7451.664062	False
RBP7	272.811035	True

adata.uns.keys()

dict_keys(['louvain', 'louvain_colors', 'neighbors', 'pca'])

Without subsetting, the AnnDataAccessor object gives references to underlying lazy h5 or zarr arrays:

adata.X

<HDF5 dataset "X": shape (70, 765), type "<f4">

adata.obsm["X_pca"]

<HDF5 dataset "X_pca": shape (70, 50), type "<f4">

Sparse arrays are exposed as a lazy sparse dataset object from the anndata package (here a `CSRDataset`):

adata.obsp["distances"]

CSRDataset: backend hdf5, shape (70, 70), data_dtype float64

Get a subset of the object; its attributes are loaded only on explicit access:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]

adata_subset

AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']

Check the cell type counts in the subset:

adata_subset.obs.cell_type.value_counts()

cell_type
Dendritic cells                 28
CD14+ Monocytes                  7
CD4+/CD25 T Reg                  0
CD4+/CD45RO+ Memory              0
CD8+ Cytotoxic T                 0
CD8+/CD45RA+ Naive Cytotoxic     0
CD19+ B                          0
CD34+                            0
CD56+ NK                         0
Name: count, dtype: int64
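The zero counts in the output above are consistent with how pandas treats categorical columns: filtering rows keeps the full category list, so `value_counts()` reports categories that no longer occur. A minimal sketch with plain pandas, using a made-up miniature of the `cell_type` column (the values and the names `counts_all` and `counts_used` are illustrative, not taken from the dataset):

```python
import pandas as pd

# Hypothetical miniature of an obs column; the values are illustrative only.
cell_type = pd.Series(
    pd.Categorical(
        ["Dendritic cells", "CD19+ B", "Dendritic cells", "CD14+ Monocytes"],
        categories=["CD14+ Monocytes", "CD19+ B", "Dendritic cells"],
    )
)

# Boolean mask built the same way as obs_idx above.
mask = cell_type.isin(["Dendritic cells", "CD14+ Monocytes"])
subset = cell_type[mask]

# The filtered Series still knows about "CD19+ B", so it is counted as 0.
counts_all = subset.value_counts()

# Dropping unused categories removes the zero rows from the summary.
counts_used = subset.cat.remove_unused_categories().value_counts()

print(counts_all)
print(counts_used)
```

Whether to drop unused categories depends on the use case: keeping them preserves a consistent category set across subsets, which can matter for plotting colors or for concatenating results later.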

Subsets load the arrays into memory only on direct access:

print(adata_subset.X)

[[-0.326 -0.191  0.499 ... -0.21  -0.636 -0.49 ]
 [ 0.811 -0.191 -0.728 ... -0.21   0.604 -0.49 ]
 [-0.326 -0.191  0.643 ... -0.21   2.303 -0.49 ]
 ...
 [-0.326 -0.191 -0.728 ... -0.21   0.626 -0.49 ]
 [-0.326 -0.191 -0.728 ... -0.21  -0.636 -0.49 ]
 [-0.326 -0.191 -0.728 ... -0.21  -0.636 -0.49 ]]

print(adata_subset.obsm["X_pca"])

[[-5.750601   -4.096395   -2.9178936  ... -0.3169805  -0.20286919
  -0.4912242 ]
 [-6.516435    4.5414424   1.629511   ... -2.0872126   2.4427452
   0.67004365]
 [-2.0939696   4.8808017  -2.0491498  ... -3.3238401  -1.6365678
   1.0325491 ]
 ...
 [-2.284083   -4.8995905  -2.5168793  ... -0.22459485 -0.28241014
  -0.45557737]
 [-7.1581526   5.147818    2.4819682  ...  2.1289759  -0.27535897
   0.5335301 ]
 [-4.0010567  -6.0705996  -3.1599348  ...  1.1530831   0.48674038
  -0.24262637]]
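The load-on-access behavior shown above can be pictured with a toy model in pure Python. This is only a sketch of the general pattern, not lamindb's implementation: `LazyArray` and `LazySubset` are hypothetical classes, and a list of lists stands in for the on-disk array.

```python
class LazyArray:
    """Stand-in for an on-disk array that counts how often it is read."""

    def __init__(self, rows):
        self._rows = rows
        self.reads = 0

    def read_rows(self, indices):
        # In a real backend this would be the I/O call against HDF5 or zarr.
        self.reads += 1
        return [self._rows[i] for i in indices]


class LazySubset:
    """Keeps only the selected row indices; data is read on attribute access."""

    def __init__(self, backing, mask):
        self._backing = backing
        self._indices = [i for i, keep in enumerate(mask) if keep]

    @property
    def n_obs(self):
        # Known from the mask alone, no read needed.
        return len(self._indices)

    @property
    def X(self):
        # Each access triggers a read of just the selected rows.
        return self._backing.read_rows(self._indices)


backing = LazyArray([[0.1, 0.2], [0.3, 0.4], [0.5, 0.6], [0.7, 0.8]])
subset = LazySubset(backing, [True, False, True, False])

reads_before_access = backing.reads  # creating the subset read nothing
subset_rows = subset.X               # first read happens here
reads_after_access = backing.reads

print(subset.n_obs, reads_before_access, reads_after_access)
print(subset_rows)
```

In this toy model every access to `X` reads again; whether a real accessor caches the result is an implementation detail that this sketch does not try to reproduce.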

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()

AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
