Multimodal configuration

After running the main conversion pipeline, you can populate the YAML parameters file required to run the multimodal integration pipeline.

Parameters file

The parameters file looks like this:

outdir: "/path/to/output/"

url: http://localhost:3000/
project: my_project
title: "My Project"

data:
  -
    dataset: scrnaseq
    obs_type: cell
    anndata: /path/to/main/output/scrnaseq-anndata.zarr
    offset: 0
    is_spatial: false
    vitessce_options:
      spatial:
        xy: obsm/spatial
      mappings:
        obsm/X_umap: [0,1]
      matrix: X
  -
    dataset: visium
    obs_type: spot
    anndata: /path/to/main/output/visium-anndata.zarr
    offset: 1000000
    is_spatial: true
    raw_image: /path/to/main/output/visium-raw.zarr
    label_image: /path/to/main/output/visium-label.zarr
    vitessce_options:
      spatial:
        xy: obsm/spatial
      matrix: X

In contrast to the main conversion pipeline’s parameters file, this file includes a single project to which multiple datasets belong.

Each dataset block defines the name of the dataset and paths to the converted data and image files (if any).

Each dataset also requires a set of vitessce_options that specify the location of certain data (spatial coordinates, embeddings, expression matrix, etc.) within the AnnData object that is processed/generated. This follows the same structure as in the main pipeline’s vitessce_options.

Additionally, each dataset requires:

  • obs_type, a string indicating the type of observation of the dataset. For example, “cell” or “spot”.

  • offset, an integer offset added to the dataset’s IDs so they don’t clash with those of the other datasets.

  • is_spatial, a boolean indicating whether the dataset contains spatial information and has associated image files (raw and/or label images).
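As a rough illustration of why offset is needed, the sketch below shifts the IDs of two datasets by the offsets used in the example parameters file. It assumes IDs are plain integers; the helper name apply_offset and the toy IDs are hypothetical and are not part of the pipeline.

```python
def apply_offset(ids, offset):
    """Return the IDs shifted by a per-dataset integer offset."""
    return [i + offset for i in ids]


# Hypothetical observation IDs: both datasets start counting from 1,
# so without an offset they would share the same IDs.
scrnaseq_ids = apply_offset([1, 2, 3], offset=0)
visium_ids = apply_offset([1, 2, 3], offset=1000000)

# IDs present in both datasets after offsetting (expected to be empty).
clash = set(scrnaseq_ids) & set(visium_ids)
```

The offset only needs to be larger than the highest ID of the datasets that come before it, which is why a round number such as 1000000 is a convenient choice.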

Because raw images are only read and never modified, the pipeline does not generate new output from them. So that the output directory (defined by outdir) contains every file that must be served to the web application, the pipeline by default copies the raw images to the output directory, unless a file with the same name already exists there. Copying can take a long time, depending on the size of the image. You may prefer to copy or move the image manually, or to serve it from its own directory, separate from the rest of the output. To disable the default copying, set copy_raw: false as a project-wide parameter (at the same level as outdir, project, etc.). For example:

outdir: "/path/to/output/"
url: http://localhost:3000/
project: my_project
title: "My Project"
copy_raw: false
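
The copying behaviour described above can be sketched as follows. This is only an approximation of the documented rules (copy by default, skip if a file with the same name exists, skip if copy_raw is false), not the pipeline's actual code; the function name copy_raw_image is hypothetical.

```python
import shutil
from pathlib import Path


def copy_raw_image(raw_image, outdir, copy_raw=True):
    """Copy a raw image into outdir unless disabled or already present.

    Returns the destination path if a copy was made, otherwise None.
    """
    if not copy_raw:
        return None
    src = Path(raw_image)
    dest = Path(outdir) / src.name
    if dest.exists():
        # A file with the same name is already in the output directory.
        return None
    if src.is_dir():
        # Zarr stores are directories, so copy the whole tree.
        shutil.copytree(src, dest)
    else:
        shutil.copy2(src, dest)
    return dest
```

Because an existing file of the same name is never overwritten, placing the image in the output directory by hand beforehand has the same effect as setting copy_raw: false for that image.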

With additional features

Running the multimodal integration pipeline with the example parameters file above performs the reindexing and intersection steps. To also concatenate additional features (such as cell types) so they can be visualised as continuous values, some extra parameters need to be added.

As a project-wide parameter (at the same level as outdir, project, etc.):

  • extend_feature_name, the name of the additional feature. For example, “celltype”.

And at a dataset level:

  • extend_feature, the location of the additional feature information. This can be either the path to a cell2location output file, or the location within the AnnData object where the feature is stored as a categorical within obs. For example, /path/to/c2l.h5ad containing predicted continuous values, or obs/celltype containing categoricals.

The full parameters file will then look like this:

outdir: "/path/to/output/"

url: http://localhost:3000/
project: my_project
title: "My Project"

extend_feature_name: celltype

data:
  -
    dataset: scrnaseq
    obs_type: cell
    anndata: /path/to/main/output/scrnaseq-anndata.zarr
    extend_feature: obs/celltype
    offset: 0
    is_spatial: false
    vitessce_options:
      spatial:
        xy: obsm/spatial
      mappings:
        obsm/X_umap: [0,1]
      matrix: X
  -
    dataset: visium
    obs_type: spot
    anndata: /path/to/main/output/visium-anndata.zarr
    extend_feature: /path/to/c2l.h5ad
    offset: 1000000
    is_spatial: true
    raw_image: /path/to/main/output/visium-raw.zarr
    label_image: /path/to/main/output/visium-label.zarr
    vitessce_options:
      spatial:
        xy: obsm/spatial
      matrix: X

With these parameters the multimodal integration pipeline will concatenate the expression matrix with the additional feature values so both can be queried and visualised across datasets within the same portal.
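To make the concatenation concrete, here is a minimal sketch using plain Python lists. It assumes that a categorical feature is turned into continuous values by one-hot encoding; that encoding, the helper names one_hot and extend_matrix, and the toy data are all assumptions for illustration and may differ from what the pipeline does internally.

```python
def one_hot(categories):
    """Encode categorical labels as 0.0/1.0 columns, one column per label."""
    names = sorted(set(categories))
    rows = [[1.0 if c == name else 0.0 for name in names] for c in categories]
    return names, rows


def extend_matrix(matrix, var_names, extra_rows, extra_names):
    """Append extra feature columns to each row of an expression matrix."""
    if len(matrix) != len(extra_rows):
        raise ValueError("number of observations does not match")
    rows = [list(r) + list(e) for r, e in zip(matrix, extra_rows)]
    return rows, list(var_names) + list(extra_names)


# Hypothetical toy data: 3 cells x 2 genes, plus a categorical obs/celltype.
X = [[5.0, 0.0], [0.0, 2.0], [1.0, 1.0]]
genes = ["GeneA", "GeneB"]
celltype = ["T cell", "B cell", "T cell"]

names, encoded = one_hot(celltype)
X_ext, var_ext = extend_matrix(X, genes, encoded, names)
```

In this sketch the extended matrix has one column per gene followed by one column per cell type, so a cell type can be queried in the same way as a gene.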

When providing a cell2location output file, you can further configure extend_feature with arguments that control how the file is processed. Instead of setting only the path to the file, define extend_feature as a map containing path and optional args:

extend_feature_name: celltype
data:
  -
    dataset: visium
    obs_type: spot
    anndata: /path/to/main/output/visium-anndata.zarr
    extend_feature:
      path: /path/to/c2l.h5ad
      args:
        sample: ["library_id", "sample_1"] # tuple containing the obs column name and value to filter the object. By default the object is not filtered.
        q: "q05_cell_abundance_w_sf" # matrix in obsm to use. Defaults to "q05_cell_abundance_w_sf".
        sort_index: "index_column" # column in the AnnData object that contains an index that matches the index in cell2location.
        sort: True # can be set to False to skip ordering the cell2location matrix but observations might not match in order between files. Defaults to True.

For example, sample can be used when a cell2location output file contains predictions for multiple samples. Setting sample to filter the output file lets the pipeline obtain the appropriate prediction matrix for the data being processed, without having to split the cell2location output file for each sample. Otherwise, if the input file contains predictions for multiple samples, its number of observations will not match that of the AnnData object and the process will throw an error.
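The effect of sample can be sketched with plain Python structures standing in for the obs table and the prediction matrix. The function filter_by_sample and the toy values are hypothetical; only the (column name, value) form of sample comes from the parameters above.

```python
def filter_by_sample(obs, matrix, sample=None):
    """Keep only the rows whose obs column equals the given value.

    sample is an (obs column name, value) pair; None keeps every row.
    """
    if sample is None:
        return obs, matrix
    column, value = sample
    keep = [i for i, row in enumerate(obs) if row[column] == value]
    return [obs[i] for i in keep], [matrix[i] for i in keep]


# Hypothetical cell2location output holding predictions for two samples.
obs = [
    {"library_id": "sample_1"},
    {"library_id": "sample_2"},
    {"library_id": "sample_1"},
]
abundance = [[0.9, 0.1], [0.5, 0.5], [0.2, 0.8]]

obs_1, abundance_1 = filter_by_sample(
    obs, abundance, sample=["library_id", "sample_1"]
)
```

After filtering, the prediction matrix has one row per observation of sample_1 only, which is what allows it to match an AnnData object holding that single sample.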

q can be set to use a different prediction matrix from the cell2location output file. It defaults to "q05_cell_abundance_w_sf".

sort and sort_index can be used to define how a cell2location output file is matched to the AnnData object. By default, the pipeline tries to ensure that the order of observations in the prediction matrix matches that of the AnnData object, so that values are concatenated correctly. To do so, it attempts to order the prediction matrix by the index of the AnnData object (or by the original index, if the main pipeline re-indexed it). However, you can use sort_index to override which column of the AnnData object contains the index that the prediction matrix should match. sort can be set to False to disable any re-ordering. If disabled, the prediction matrix is concatenated as-is into the AnnData object without checking whether the observations’ IDs match.
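The re-ordering step can be sketched as below, again with plain lists in place of the real objects. The function align_predictions is a hypothetical helper; anndata_index stands for whichever index is being matched (the AnnData index or the column named by sort_index).

```python
def align_predictions(anndata_index, pred_index, pred_rows, sort=True):
    """Reorder prediction rows to follow the order of anndata_index.

    With sort=False the rows are returned as-is, without any ID check.
    """
    if not sort:
        return pred_rows
    by_id = dict(zip(pred_index, pred_rows))
    missing = [i for i in anndata_index if i not in by_id]
    if missing:
        raise KeyError(f"observations missing from predictions: {missing}")
    return [by_id[i] for i in anndata_index]


# Hypothetical indices: the same spots, stored in a different order.
anndata_index = ["spot_a", "spot_b", "spot_c"]
pred_index = ["spot_c", "spot_a", "spot_b"]
pred_rows = [[0.3], [0.1], [0.2]]

aligned = align_predictions(anndata_index, pred_index, pred_rows)
unsorted = align_predictions(anndata_index, pred_index, pred_rows, sort=False)
```

Here aligned follows the order of the AnnData index, while unsorted keeps the file's own order, so its rows would be attached to the wrong spots. This is the risk of setting sort to False when the two files are not already in the same order.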