Test AnnDataAccessor

import lamindb as ln

ln.setup.init(storage="s3://lamindb-ci/test-anndata")
! To use lamindb, you need to connect to an instance.

Connect to an instance: `ln.connect()`. Init an instance: `ln.setup.init()`.

If you used the CLI to set up lamindb in a notebook, restart the Python session.
→ go to: https://lamin.ai/testuser1/test-anndata
! updating cloud SQLite 's3://lamindb-ci/test-anndata/183bc48fd12a5d5b8ff8153b79de292c.lndb' of instance 'testuser1/test-anndata'
→ connected lamindb: testuser1/test-anndata
! locked instance (to unlock and push changes to the cloud SQLite file, call: lamin disconnect)

We’ll need some test data:

ln.Artifact("s3://lamindb-ci/lndb-storage/pbmc68k.h5ad").save()
! no run & transform got linked, call `ln.track()` & re-run
Artifact(uid='J5QHE4FxFYM0RSua0000', is_latest=True, key='lndb-storage/pbmc68k.h5ad', suffix='.h5ad', size=638484, hash='-QNUPBbAug3jFmmk3fsOQA', _hash_type='md5', _accessor='AnnData', visibility=1, _key_is_virtual=False, storage_id=2, created_by_id=1, created_at=2024-10-18 15:56:36 UTC)

An h5ad artifact stored on S3:

artifact = ln.Artifact.filter(key="lndb-storage/pbmc68k.h5ad").one()
artifact.path
S3Path('s3://lamindb-ci/lndb-storage/pbmc68k.h5ad')
adata = artifact.open()
! run input wasn't tracked, call `ln.track()` and re-run
adata
AnnDataAccessor object with n_obs × n_vars = 70 × 765
  constructed for the AnnData object pbmc68k.h5ad
    obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
    obsm: ['X_pca', 'X_umap']
    obsp: ['connectivities', 'distances']
    uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
    var: ['highly_variable', 'index', 'n_counts']
    varm: ['PCs']

AnnData attributes can be accessed without loading them into memory:

print(adata.obsm)
print(adata.varm)
print(adata.obsp)
print(adata.varm)
Accessor for the AnnData attribute obsm
  with keys: ['X_pca', 'X_umap']
Accessor for the AnnData attribute varm
  with keys: ['PCs']
Accessor for the AnnData attribute obsp
  with keys: ['connectivities', 'distances']
Accessor for the AnnData attribute varm
  with keys: ['PCs']

However, .obs, .var and .uns are always loaded fully into memory when the AnnDataAccessor is initialized:

adata.obs.head()
cell_type n_genes percent_mito louvain
index
GCAGGGCTGGATTC-1 Dendritic cells 1168 0.014345 2
CTTTAGTGGTTACG-6 CD19+ B 1121 0.019679 8
TGACTGGAACCATG-7 Dendritic cells 1277 0.012961 1
TCAATCACCCTTCG-8 CD19+ B 1139 0.018467 4
CGTTATACAGTACC-8 CD4+/CD45RO+ Memory 1034 0.010163 0
adata.var.head()
n_counts highly_variable
index
HES4 1153.387451 True
TNFRSF4 304.358154 True
SSU72 2530.272705 False
PARK7 7451.664062 False
RBP7 272.811035 True
adata.uns.keys()
dict_keys(['louvain', 'louvain_colors', 'neighbors', 'pca'])

Without subsetting, the AnnDataAccessor object gives references to the underlying lazy HDF5 or zarr arrays:

adata.X
<HDF5 dataset "X": shape (70, 765), type "<f4">
adata.obsm["X_pca"]
<HDF5 dataset "X_pca": shape (70, 50), type "<f4">

For sparse matrices, it gives a reference to a lazy SparseDataset from the anndata package:

adata.obsp["distances"]
CSRDataset: backend hdf5, shape (70, 70), data_dtype float64
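
To make the laziness concrete, here is a minimal sketch that uses h5py directly rather than lamindb. The file name, the dataset name "X", and the toy matrix are assumptions for illustration only. It shows that an HDF5 dataset handle exposes metadata such as its shape, but reads values only when it is indexed.

```python
import h5py
import numpy as np

# Toy stand-in for an expression matrix (hypothetical values, not pbmc68k).
matrix = np.arange(70 * 5, dtype="float32").reshape(70, 5)

# driver="core" with backing_store=False keeps the HDF5 file in memory only.
with h5py.File("sketch.h5", "w", driver="core", backing_store=False) as f:
    f.create_dataset("X", data=matrix)

    lazy_x = f["X"]                              # a handle; no values are read here
    is_lazy = isinstance(lazy_x, h5py.Dataset)
    lazy_shape = lazy_x.shape                    # metadata, available without reading data

    first_rows = lazy_x[:3]                      # indexing reads just these rows
    print(type(first_rows).__name__, first_rows.shape)
```

Indexing the handle returns an ordinary numpy array, which is why the lazy references above print as dataset descriptions rather than as values.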

Get a subset of the object; its attributes are loaded only on explicit access:

obs_idx = adata.obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    adata.obs.percent_mito <= 0.05
)
adata_subset = adata[obs_idx]
adata_subset
AnnDataAccessorSubset object with n_obs × n_vars = 35 × 765
  obs: ['cell_type', 'index', 'louvain', 'n_genes', 'percent_mito']
  obsm: ['X_pca', 'X_umap']
  obsp: ['connectivities', 'distances']
  uns: ['louvain', 'louvain_colors', 'neighbors', 'pca']
  var: ['highly_variable', 'index', 'n_counts']
  varm: ['PCs']
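
The boolean index used for the subset can be sketched on a plain pandas DataFrame. The table below is made up for illustration (an assumption, not the pbmc68k obs table). It shows how the two conditions combine and why the number of True entries equals the number of rows in the subset.

```python
import pandas as pd

# Hypothetical obs table; values are made up for illustration.
obs = pd.DataFrame(
    {
        "cell_type": ["Dendritic cells", "CD19+ B", "CD14+ Monocytes", "Dendritic cells"],
        "percent_mito": [0.014, 0.019, 0.012, 0.080],
    }
)

# Each condition yields a boolean Series; `&` combines them element-wise.
# The parentheses matter because `&` binds more tightly than `<=`.
mask = obs.cell_type.isin(["Dendritic cells", "CD14+ Monocytes"]) & (
    obs.percent_mito <= 0.05
)

subset = obs[mask]
n_selected = int(mask.sum())  # each True counts as 1
print(n_selected, subset.shape)
```

Only rows that satisfy both conditions survive, so the row count of the subset matches the sum of the mask.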

Check the shapes of the subset:

num_idx = sum(obs_idx)
assert adata_subset.shape == (num_idx, adata.shape[1])
assert (adata_subset.obs.cell_type == "CD34+").sum() == 0
adata_subset.obs.cell_type.value_counts()
cell_type
Dendritic cells                 28
CD14+ Monocytes                  7
CD4+/CD25 T Reg                  0
CD4+/CD45RO+ Memory              0
CD8+ Cytotoxic T                 0
CD8+/CD45RA+ Naive Cytotoxic     0
CD19+ B                          0
CD34+                            0
CD56+ NK                         0
Name: count, dtype: int64
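
The zero counts in the output above are consistent with cell_type being a pandas categorical column: subsetting keeps every category, so value_counts() lists the unused ones with a count of 0. A small sketch with made-up labels (an assumption; it only mirrors this behavior) that also shows remove_unused_categories() for cases where the zeros are unwanted:

```python
import pandas as pd

# Hypothetical labels stored as a categorical Series.
cell_type = pd.Series(
    ["Dendritic cells", "CD19+ B", "Dendritic cells", "CD14+ Monocytes"],
    dtype="category",
)

# Subsetting drops rows but keeps all categories.
subset = cell_type[cell_type.isin(["Dendritic cells", "CD14+ Monocytes"])]
counts = subset.value_counts()

# Dropping unused categories removes the zero rows from the counts.
trimmed = subset.cat.remove_unused_categories().value_counts()

print(counts)
print(trimmed)
```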

Subsets load the arrays into memory only on direct access:

print(adata_subset.X)
[[-0.326 -0.191  0.499 ... -0.21  -0.636 -0.49 ]
 [ 0.811 -0.191 -0.728 ... -0.21   0.604 -0.49 ]
 [-0.326 -0.191  0.643 ... -0.21   2.303 -0.49 ]
 ...
 [-0.326 -0.191 -0.728 ... -0.21   0.626 -0.49 ]
 [-0.326 -0.191 -0.728 ... -0.21  -0.636 -0.49 ]
 [-0.326 -0.191 -0.728 ... -0.21  -0.636 -0.49 ]]
print(adata_subset.obsm["X_pca"])
[[-5.750601   -4.096395   -2.9178936  ... -0.3169805  -0.20286919
  -0.4912242 ]
 [-6.516435    4.5414424   1.629511   ... -2.0872126   2.4427452
   0.67004365]
 [-2.0939696   4.8808017  -2.0491498  ... -3.3238401  -1.6365678
   1.0325491 ]
 ...
 [-2.284083   -4.8995905  -2.5168793  ... -0.22459485 -0.28241014
  -0.45557737]
 [-7.1581526   5.147818    2.4819682  ...  2.1289759  -0.27535897
   0.5335301 ]
 [-4.0010567  -6.0705996  -3.1599348  ...  1.1530831   0.48674038
  -0.24262637]]
assert adata_subset.obsp["distances"].shape[0] == num_idx

To load the entire subset into memory as an actual AnnData object, use to_memory():

adata_subset.to_memory()
AnnData object with n_obs × n_vars = 35 × 765
    obs: 'cell_type', 'n_genes', 'percent_mito', 'louvain'
    var: 'n_counts', 'highly_variable'
    uns: 'louvain', 'louvain_colors', 'neighbors', 'pca'
    obsm: 'X_pca', 'X_umap'
    varm: 'PCs'
    obsp: 'connectivities', 'distances'
!lamin delete --force test-anndata
• deleting instance testuser1/test-anndata
→ deleted storage record on hub 0b060fdbd72e55ae864c531f35d458ee
→ deleted instance record on hub 183bc48fd12a5d5b8ff8153b79de292c