Practical Guide to Cell Segmentation, Domain Identification and Cell Type Annotation in Single-Cell Spatial Transcriptomics (Visium HD)

A hands-on tutorial covering cell segmentation with SOPA (StarDist + ProSeg), spatial domain identification with Novae (with batch correction on the fly), and cell type annotation with Enrichmap (also with built-in batch correction and spatial smoothing) on a Visium HD colorectal cancer dataset.

Purpose of this tutorial

flowchart LR
    subgraph INPUT["<b>Input</b>"]
        A["Visium HD\n+ H&E Image"]
    end

    subgraph SEG["<b>1. Cell Segmentation (SOPA)</b>"]
        direction TB
        B1["StarDist\n<i>Nuclear detection via\nstar-convex polygons</i>"]
        B2["ProSeg\n<i>Transcript-informed\nboundary refinement</i>"]
        B3["Aggregate\n<i>Assign transcripts\nto cells</i>"]
        B1 --> B2 --> B3
    end

    subgraph PRE["<b>2. Pre-processing</b>"]
        direction TB
        C1["Filter genes & cells"]
        C2["QC: mito %, counts"]
        C3["Normalize & HVGs"]
        C1 --> C2 --> C3
    end

    subgraph DOM["<b>3. Domain ID (Novae)</b>"]
        direction TB
        D1["Spatial neighbors"]
        D2["Fine-tune\nfoundation model"]
        D3["Assign spatial\ndomains"]
        D1 --> D2 --> D3
    end

    subgraph ANN["<b>4. Annotation (Enrichmap)</b>"]
        direction TB
        E1["Gene set\nenrichment scoring"]
        E2["Rank by group\n(Wilcoxon 1-vs-rest)"]
        E3["Elbow-gap\ncell type assignment"]
        E1 --> E2 --> E3
    end

    subgraph OUT["<b>Output</b>"]
        F["Annotated\nSpatial Dataset"]
    end

    A --> SEG --> PRE --> DOM --> ANN --> F

    style INPUT fill:#f0f0f0,stroke:#333
    style SEG fill:#e8f4e8,stroke:#2d7d2d
    style PRE fill:#e8ecf4,stroke:#2d4d7d
    style DOM fill:#f4e8f4,stroke:#7d2d7d
    style ANN fill:#f4ece8,stroke:#7d4d2d
    style OUT fill:#f0f0f0,stroke:#333

Working with high-resolution spatial transcriptomics platforms like Visium HD involves quite a few moving parts: segmenting cells, finding spatial domains, and annotating cell types. This tutorial walks through the full pipeline in Python, covering:

  1. Cell segmentation with SOPA , using StarDist for nuclear detection and ProSeg for transcript-based boundary refinement.
  2. Spatial domain identification with Novae , a graph-based foundation model with built-in batch correction.
  3. Cell type annotation with Enrichmap , which computes spatially-aware enrichment scores (also with batch correction on the fly) using gene set signatures.

We use the publicly available Human Colorectal Cancer (CRC) FFPE dataset from 10x Genomics and reproduce the main findings from their Loupe Browser tutorial, but entirely in Python and at single-cell resolution via cell segmentation.


Spatial Transcriptomics: A primer

To really understand what’s going on in a tissue, we need to know not just which genes a cell expresses, but also where that cell is relative to its neighbors. Spatial transcriptomics (ST) is a family of technologies that give you exactly that: spatially resolved gene expression .

The field has moved fast, going from a handful of measurable genes to transcriptome-wide profiling at near-single-cell resolution , with applications across cancer biology , neuroscience, developmental biology, and drug discovery . Below is a brief primer on the two main ST technology families and what they bring to the table compared to earlier approaches.

Imaging-based Spatial Transcriptomics (iST) vs Sequencing-based Spatial Transcriptomics (sST)

ST technologies generally fall into two camps: imaging-based and sequencing-based . Imaging-based methods can be further split into in situ hybridization (ISH) and in situ sequencing (ISS), while sequencing-based methods mostly rely on in situ capturing (ISC).

Imaging-based ST (iST) uses labeled probes that target specific genes, letting you visualize mRNA directly in tissue . This targeted approach gives high sensitivity and can reach subcellular or single-molecule resolution. Commercial platforms include MERFISH (Vizgen’s MERSCOPE), Xenium (10x Genomics), and CosMx (NanoString). The downside is that you need to know which genes to target upfront, and the number of genes you can profile at once is limited compared to transcriptome-wide methods .

Sequencing-based ST (sST) uses spatially barcoded arrays to capture RNA from tissue sections, which then gets sequenced via NGS . Because it’s unbiased and transcriptome-wide, no prior knowledge of target genes is needed, making it great for discovery-driven research.

The original Visium platform had 55 µm spots that typically captured a mixture of multiple cells. Visium HD brought this down to 2 µm bins with no gaps, enabling single-cell scale profiling. The tradeoff is that sequencing-based methods generally have lower transcript capture efficiency than imaging-based ones, since they rely on transcripts diffusing onto a capture surface rather than being detected directly in tissue .

What does Spatial Transcriptomics unlock compared to other transcriptomics techniques?

1. Going beyond bulk RNA-seq. Bulk RNA-seq averages gene expression across thousands of cells , hiding cellular heterogeneity and discarding any spatial information. ST preserves both, letting you spot rare cell populations that would otherwise be invisible in the bulk signal.

2. Adding context to scRNA-seq. scRNA-seq gives you single-cell resolution, a step up from bulk-RNA, but still requires dissociating the tissue first, which destroys the spatial layout of cells . It can also lose fragile cell types or introduce dissociation-induced stress artifacts . Cells with complex morphologies (e.g., neurons, glial cells) are particularly hard to recover intact . ST profiles gene expression in situ, keeping cells in their original context without the need for dissociation.

3. Enabling spatial analyses. Because ST preserves coordinates, it opens up analyses that are simply not feasible with dissociated data:

4. Building spatial atlases. ST enables 3D tissue reconstruction by aligning consecutive 2D sections , and spatiotemporal tracking of development at cellular resolution.

Shift from Deconvolution (spot based) to Cell Segmentation (bin based)

With the original Visium (55 µm spots, each covering multiple cells), you needed deconvolution methods like RCTD to infer cell type mixtures within each spot using a scRNA-seq reference.

Visium HD changed this. With 2 µm bins, you can now do cell segmentation directly on the histology image and assign transcripts to individual cells using computational methods. This gives you a per-cell expression matrix (like scRNA-seq) but with spatial coordinates preserved. Each observation is a single segmented cell rather than a statistical mixture.

This shift from deconvolution to segmentation is a key motivation for this tutorial: we get single-cell spatial transcriptomics without the assumptions that come with deconvolution.


Exploring a Visium HD Dataset

For this tutorial we will use the Visium HD Human Colorectal Cancer (FFPE) dataset from 10x Genomics:

Once you’ve downloaded the data, your project structure should look like this:

Data Structure

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
spatial/
├── Microscopy/
   ├── Visium_HD_Human_Colon_Cancer_image.tif
   └── Visium_HD_Human_Colon_Cancer_tissue_image.btf

└── SpaceRanger/
    └── Visium_HD_Human_Colon_Cancer/
        └── outs/
            ├── binned_outputs/
               ├── square_002um/
               ├── square_008um/
               └── square_016um/
            
            └── spatial/
                ├── aligned_fiducials.jpg
                ├── aligned_tissue_image.jpg
                ├── cytassist_image.tiff
                ├── detected_tissue_image.jpg
                ├── tissue_hires_image.png
                ├── tissue_lowres_image.png
                ├── Visium_HD_Human_Colon_Cancer_cloupe_008um.cloupe
                ├── Visium_HD_Human_Colon_Cancer_feature_slice.h5
                ├── Visium_HD_Human_Colon_Cancer_metrics_summary.csv
                ├── Visium_HD_Human_Colon_Cancer_probe_set.csv
                └── Visium_HD_Human_Colon_Cancer_slide_file.vtf

Check this repository link to find all the full script and structure necessary to follow this tutorial .

The imports look like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import warnings
from datetime import datetime
from pathlib import Path

import decoupler as dc
import enrichmap as em
import matplotlib as mpl
import matplotlib.pyplot as plt
import novae
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import sopa
import spatialdata as sd
import spatialdata_plot
import yaml
import zarr
from anndata import AnnData
from loguru import logger
import spatialdata_plot

warnings.filterwarnings(
    "ignore",
    message="Use `squidpy.pl.spatial_scatter`",
)

# Matplotlib defaults
mpl.rcParams["figure.dpi"] = 300
plt.style.use("bmh")
plt.rcParams.update(
    {
        "figure.figsize": (12, 8),
        "axes.facecolor": "white",
        "axes.edgecolor": "black",
    }
)

The following code is used to set up the directories and logging. Note that we will also be using the config_crc_tutorial.yaml to store and load the configuration parameters. The markers you see in this file are the ones downloaded from the 10X website (link) with some added genes for neutrophils as we need to have more than 1 gene per cell type signature to run the Enrichmap package later on this tutorial.

The yaml file looks like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# Visium HD CRC Tutorial Configuration
# 10x Genomics Human Colorectal Cancer (FFPE) dataset

# Base paths
paths:
  spaceranger_outs: "../data/spatial/SpaceRanger"
  processed: "../data/spatial/processed"
  figures: "../figures"

# Single sample for this tutorial
samples:
  - id: "Visium_HD_Human_Colon_Cancer"
    name: "CRC_tutorial"

# Analysis parameters
params:
  filter_genes_counts: 1
  filter_cells_counts: 1
  radius: 50
  percentile_pct_mito: null
  pct_mito: 10
  novae_max_epochs: 50
  novae_model: "MICS-Lab/novae-human-0"
  domain_range: [5, 15]
  clustering_col: "novae_domains_15"

# Marker genes for CRC cell type annotation
# Source: 10x Genomics Loupe Browser CRC tutorial CSVs (data/markers/*.csv)
marker_genes:
  Tumor: ["REG1A", "REG1B", "CEACAM6", "TGFBI"]
  Fibroblasts: ["COL1A1", "MMP2"]
  Macrophages: ["LYZ", "SPP1"]
  Neutrophils: ["SAT1", "CSF3R", "FCGR3B", "S100A8"]
  Goblet_cells: ["FCGBP", "MUC2", "CLCA1"]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
PROJECT_DIR = Path(
    r"C:\Users\Projects\segmentation-and-annotation"
).resolve()
CONFIG_PATH = (
    PROJECT_DIR
    / "scripts"
    / "config_crc_tutorial.yaml"
)

with open(CONFIG_PATH, "r") as f:
    cfg = yaml.safe_load(f)

sample_id = cfg["samples"][0]["id"]
sample_name = cfg["samples"][0]["name"]
params = cfg["params"]
marker_genes = cfg["marker_genes"]

PROCESSED_DIR = (
    PROJECT_DIR
    / "data"
    / "spatial"
    / "processed"
).resolve()
FIGURES_DIR = (
    PROJECT_DIR / "figures"
).resolve()

run_name = f"crc_tutorial_{datetime.now().strftime('%d%m%Y_%H%M')}"
run_dir = (
    PROCESSED_DIR / run_name
).resolve()
run_dir.mkdir(
    parents=True, exist_ok=True
)
(run_dir / "figures").mkdir(
    exist_ok=True
)

logger.info(f"Run directory: {run_dir}")
logger.info(
    f"Sample: {sample_id} ({sample_name})"
)

Reading the data

Once we have the downloaded files and in the project structure above, we can read the data using the sopa package as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
logger.info(
    "=== STEP 1: Reading Visium HD data and running segmentation ==="
)

sdata = sopa.io.visium_hd(
    str(
        PROJECT_DIR
        / "data"
        / "spatial"
        / "SpaceRanger"
        / sample_id
        / "outs"
    ),
    dataset_id=sample_id,
    fullres_image_file=str(
        PROJECT_DIR
        / "data"
        / "spatial"
        / "Microscopy"
        / "Visium_HD_Human_Colon_Cancer_tissue_image.btf"
    ),
)

In order to speed things up, we will use a subset of the data. We can crop the dataset like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
sdata_sub = sd.bounding_box_query(
    sdata,
    min_coordinate=[51000, 9000],
    max_coordinate=[56000, 14000],
    axes=("x", "y"),
    target_coordinate_system=sample_id,
    filter_table=True,
)

# Make the var_names unique, but not required for this tutorial
<!-- for (
    key,
    table,
) in sdata_sub.tables.items():
    table.var_names_make_unique() -->

<!-- sdata_sub.write("sdata_subset.zarr")  # save it -->

Which gives the following SpatialData object (to find out more about SpatialData check this repository link):

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
SpatialData object
├── Images
     ├── 'Visium_HD_Human_Colon_Cancer_full_image': DataTree[cyx] (3, 5000, 5000), (3, 2500, 2500), (3, 1250, 1250), (3, 625, 625), (3, 313, 313)
     ├── 'Visium_HD_Human_Colon_Cancer_hires_image': DataArray[cyx] (3, 398, 399)
     └── 'Visium_HD_Human_Colon_Cancer_lowres_image': DataArray[cyx] (3, 40, 40)
├── Shapes
     ├── 'Visium_HD_Human_Colon_Cancer_square_002um': GeoDataFrame shape: (470416, 1) (2D shapes)
     ├── 'Visium_HD_Human_Colon_Cancer_square_008um': GeoDataFrame shape: (29659, 1) (2D shapes)
     ├── 'Visium_HD_Human_Colon_Cancer_square_016um': GeoDataFrame shape: (7500, 1) (2D shapes)
     ├── 'image_patches': GeoDataFrame shape: (9, 3) (2D shapes)
     ├── 'region_of_interest': GeoDataFrame shape: (1, 1) (2D shapes)
     └── 'stardist_boundaries': GeoDataFrame shape: (12419, 1) (2D shapes)
└── Tables
      ├── 'square_002um': AnnData (470416, 18085)
      ├── 'square_008um': AnnData (29659, 18085)
      └── 'square_016um': AnnData (7500, 18085)
with coordinate systems:
     'Visium_HD_Human_Colon_Cancer', with elements:
        Visium_HD_Human_Colon_Cancer_full_image (Images), Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), image_patches (Shapes), region_of_interest (Shapes), stardist_boundaries (Shapes)
     'Visium_HD_Human_Colon_Cancer_downscaled_hires', with elements:
        Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), region_of_interest (Shapes)
     'Visium_HD_Human_Colon_Cancer_downscaled_lowres', with elements:
        Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), region_of_interest (Shapes)
     'microns', with elements:
        Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), region_of_interest (Shapes)

We can then plot the tissue image and the cropped image like this:

1
2
3
4
5
6
sdata_sub.pl.render_images(
    f"{sample_id}_full_image"
).pl.show(
    coordinate_systems=sample_id,
    figsize=(10, 10),
)

Which would look something like this:


Cell Segmentation in Single‑Cell Spatial Transcriptomics using SOPA

Now that the data is loaded and cropped, we can segment the cells. We’ll use sopa (Spatial Omics Pipeline and Analysis), a technology-invariant pipeline built on SpatialData that works across Xenium, MERSCOPE, CosMx, Visium HD, and others.

For segmentation, sopa uses a patch-based approach: it splits the tissue image into patches, segments each one, and stitches the results back together. This keeps memory usage manageable even for large whole-slide images.

1
2
3
# Patch-based cell segmentation
logger.info("Creating image patches...")
sopa.make_image_patches(sdata_sub)

The landscape of cell segmentation methods

Cell segmentation matters because everything downstream (domain identification, cell type annotation, cell-cell interactions) depends on getting cell boundaries right. The transcript-informed segmentation landscape breaks down into three broad families:

CNNs and Transformers: BIDCell, Bin2Cell, SCS, UCS, Bering, GeneSegNet, and StarDist (the default in SOPA).

Graph-based methods: Segger and ComSeg.

Probabilistic/Bayesian methods: ProSeg and Baysor.

Which one you pick depends on what inputs you have (H&E/DAPI images, nuclear staining, transcript coordinates) and your computational budget. In this tutorial, we use a two-step approach: StarDist for initial HE segmentation, then ProSeg to refine those boundaries using transcript locations.

StarDist

StarDist detects nuclei by representing them as star-convex polygons. For each pixel, it predicts the probability of belonging to a nucleus and the distances to the object boundary along a set of radial directions. This works much better than bounding boxes for rounded nuclei and handles crowded cells well, where other methods tend to merge neighboring cells or miss them entirely.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
logger.info(
    "Running StarDist segmentation..."
)
sopa.segmentation.stardist(
    sdata_sub, min_area=20
)
sdata_sub.pl.render_images(
    f"{sample_id}_full_image"
).pl.render_shapes(
    "stardist_boundaries",
    outline=True,
    fill_alpha=0,
    outline_alpha=0.6,
    outline_color="yellow",
).pl.show(
    coordinate_systems=sample_id,
    figsize=(12, 12),
    title="StarDist Cell Segmentation",
)

Which gives the following segmentation result:

Proseg

ProSeg takes a different approach: instead of relying on images alone, it infers cell boundaries from the spatial distribution of transcripts using ab initio cell simulation. It models cell shapes as probabilistic distributions and iteratively adjusts boundaries to best explain where the transcripts are, while preserving cell type heterogeneity.

The nice thing about ProSeg is that it can take prior segmentation results (like our StarDist boundaries) as a starting point and refine them using transcript information. So StarDist gives us the initial nuclear outlines from the H&E, and ProSeg adjusts those to better match the actual extent of each cell. In benchmarks across three commercial platforms, ProSeg shows strong performance and helps with difficult-to-segment cells like tumor-infiltrating immune cells (neutrophils, T cells).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
logger.info(
    "Running ProSeg refinement..."
)
sopa.segmentation.proseg(
    sdata_sub,
    prior_shapes_key="stardist_boundaries",
)

sdata_sub.pl.render_images(
    f"{sample_id}_full_image"
).pl.render_shapes(
    "proseg_boundaries",
    outline=True,
    fill_alpha=0,
    outline_alpha=0.6,
    outline_color="yellow",
).pl.show(
    coordinate_systems=sample_id,
    figsize=(12, 12),
    title="ProSeg Cell Segmentation Refinement",
)

Which gives the following segmentation result:

Once we have the segmentation, we can aggregate the gene expression to the cells (As seen in the sopa tutorial, this is mandatory if you used stardist only, but optional if you ran proseg)

1
2
3
4
5
6
7
8
logger.info(
    "Aggregating gene expression to cells..."
)
sopa.aggregate(
    sdata_sub,
    aggregate_channels=False,
    expand_radius_ratio=1,
)

This will give a new table in the sdata_sub.tables["table"] with the aggregated/segmented gene expression. It looks something like this:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
SpatialData object
├── Images
     ├── 'Visium_HD_Human_Colon_Cancer_full_image': DataTree[cyx] (3, 5000, 5000), (3, 2500, 2500), (3, 1250, 1250), (3, 625, 625), (3, 313, 313)
     ├── 'Visium_HD_Human_Colon_Cancer_hires_image': DataArray[cyx] (3, 398, 399)
     └── 'Visium_HD_Human_Colon_Cancer_lowres_image': DataArray[cyx] (3, 40, 40)
├── Shapes
     ├── 'Visium_HD_Human_Colon_Cancer_square_002um': GeoDataFrame shape: (470416, 1) (2D shapes)
     ├── 'Visium_HD_Human_Colon_Cancer_square_008um': GeoDataFrame shape: (29659, 1) (2D shapes)
     ├── 'Visium_HD_Human_Colon_Cancer_square_016um': GeoDataFrame shape: (7500, 1) (2D shapes)
     ├── 'image_patches': GeoDataFrame shape: (9, 3) (2D shapes)
     ├── 'proseg_boundaries': GeoDataFrame shape: (12394, 2) (2D shapes)
     ├── 'region_of_interest': GeoDataFrame shape: (1, 1) (2D shapes)
     └── 'stardist_boundaries': GeoDataFrame shape: (12419, 1) (2D shapes)
└── Tables
      ├── 'square_002um': AnnData (470416, 18085)
      ├── 'square_008um': AnnData (29659, 18085)
      ├── 'square_016um': AnnData (7500, 18085)
      └── 'table': AnnData (12394, 18085)
with coordinate systems:
     'Visium_HD_Human_Colon_Cancer', with elements:
        Visium_HD_Human_Colon_Cancer_full_image (Images), Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), image_patches (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes), stardist_boundaries (Shapes)
     'Visium_HD_Human_Colon_Cancer_downscaled_hires', with elements:
        Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes)
     'Visium_HD_Human_Colon_Cancer_downscaled_lowres', with elements:
        Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes)
     'microns', with elements:
        Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes)

This is the AnnData object that we will be using (segmented cells) to run domain identification and cell type annotation. Let’s save this segmented data:

1
2
3
4
5
6
7
8
9
10
11
12
segmented_h5ad = (
    PROJECT_DIR
    / "data"
    / "spatial"
    / "segmented.h5ad"
)
sdata_sub.tables["table"].write_h5ad(
    str(segmented_h5ad)
)
logger.info(
    f"Saved segmented table to {segmented_h5ad}"
)

Pre-processing

Before running domain identification, let’s do some pre-processing (e.g., filtering cell and gene counts, removing cells with mitochondrial percentage above a certain threshold, etc.).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
logger.info(
    "=== STEP 2: Filtering, preprocessing, and Novae clustering ==="
)

adata = sc.read_h5ad(
    str(segmented_h5ad)
)

# Filter genes and cells
logger.info(
    f"AnnData shape before filtering: {adata.shape}"
)
sc.pp.filter_genes(
    adata,
    min_counts=params[
        "filter_genes_counts"
    ],
)
sc.pp.filter_cells(
    adata,
    min_counts=params[
        "filter_cells_counts"
    ],
)
logger.info(
    f"AnnData shape after filtering: {adata.shape}"
)

# QC metrics
vmax_pct = 99
adata.var["mt"] = (
    adata.var_names.str.startswith(
        ("MT-", "mt-")
    )
)
sc.pp.calculate_qc_metrics(
    adata,
    inplace=True,
    percent_top=None,
)
sc.pp.calculate_qc_metrics(
    adata,
    qc_vars=["mt"],
    inplace=True,
    percent_top=None,
)

Let’s create some plots resulting from the QC filtering:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# Plot n_counts
fig = sc.pl.spatial(
    adata,
    color="n_counts",
    vmin=np.percentile(
        adata.obs["n_counts"], 1
    ),
    vmax=np.percentile(
        adata.obs["n_counts"], vmax_pct
    ),
    spot_size=10,
    show=False,
    return_fig=True,
)
fig.savefig(
    f"{run_dir}/n_counts_{sample_id}.png",
    bbox_inches="tight",
    dpi=150,
)
plt.close(fig)

Spatial distribution of total counts per cell:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Plot mito percentage
fig = sc.pl.spatial(
    adata,
    color="pct_counts_mt",
    vmin=np.percentile(
        adata.obs["pct_counts_mt"], 1
    ),
    vmax=np.percentile(
        adata.obs["pct_counts_mt"],
        vmax_pct,
    ),
    spot_size=10,
    show=False,
    return_fig=True,
)
fig.savefig(
    f"{run_dir}/pct_counts_mt_{sample_id}.png",
    bbox_inches="tight",
    dpi=150,
)
plt.close(fig)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# Filter high mito cells
pct_to_remove = params["pct_mito"]
if (
    params["percentile_pct_mito"]
    is not None
):
    pct_to_remove = adata.obs[
        "pct_counts_mt"
    ].quantile(
        params["percentile_pct_mito"]
    )

adata = adata[
    adata.obs["pct_counts_mt"]
    < pct_to_remove
].copy()
logger.info(
    f"After mito filtering ({pct_to_remove}%): {adata.shape}"
)

# Plot mito after filtering
fig = sc.pl.spatial(
    adata,
    color="pct_counts_mt",
    spot_size=10,
    show=False,
    return_fig=True,
)
fig.savefig(
    f"{run_dir}/mt_after_filtering_{sample_id}.png",
    bbox_inches="tight",
    dpi=150,
)
plt.close(fig)
Why is it important to filter regions with high % of MT? Biology or Artifact?

Although this example uses a different dataset from the tutorial, it highlights an important thing: regions with a high percentage of mitochondrial (MT) genes can be identified by tools like Novae as distinct spatial domains, when in reality they may simply reflect elevated MT content. In some cases, this is biologically meaningful, for exampel, high MT percentages can indicate cell death or disease progression, making these domains genuinely relevant. However, in other cases, such as FFPE samples, the elevated MT signal may be an artifact: the extremities of the tissue are exposed to harsher conditions during sample prep/embedding, leading to greater cellular stress and, consequently, higher MT percentages compared to the inner tissue.

Have a look at this figure:

You can see that the regions of this sample that have highest % of MT are the extremities of the sample.

Now look at the domains that were obtained using novae (which you will see how to use in the next section):

You can see that novae has captured this region of high % of MT as being a niche/domain. And while it certainly can be due to biology, it can also be due to artifacts as I mentioned above (especially in the extremities of FFPE samples or issues during sample preparation).

Click here to see an optional/aditional step to remove “isolated cell islands” using connectivities in sparser samples

Another thing we can do is to remove sparse spatial regions using computed connectivities. This is a good idea if we have a sample that is not as compact as the dataset we are working with, where we may have “isolated islands” of cells. This should be evaluated on a project bases, as these islands could be real and not artifacts from sample preparation.

We can do this using the connected_components function from scipy.sparse.csgraph. This function below requires connectivities to be calculated first, which we will do in the next section with Novae.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
def filter_isolated_regions(
    adatas,
    min_cells=2000,
    run_dir=None,
    remove_specific_regions=None,
):
    from scipy.sparse.csgraph import (
        connected_components,
    )

    for i, adata in enumerate(adatas):
        coords = adata.obsm["spatial"]

        sample_id = adata.obs[
            "sample_id"
        ].unique()[0]

        logger.info(
            f"Running region filtering based on number of cells (removing isolated islands). {sample_id}"
        )

        print(
            f"X range: {coords[:, 0].min():.1f} - {coords[:, 0].max():.1f}"
        )
        print(
            f"Y range: {coords[:, 1].min():.1f} - {coords[:, 1].max():.1f}"
        )

        n_components, labels = (
            connected_components(
                adata.obsp[
                    "spatial_connectivities"
                ],
                directed=False,
            )
        )
        adata.obs["spatial_region"] = (
            labels.astype(str)
        )
        print(
            f"Found {n_components} spatial regions"
        )

        # Count cells per region
        region_counts = adata.obs[
            "spatial_region"
        ].value_counts()
        print(region_counts)

        large_regions = region_counts[
            region_counts >= min_cells
        ].index.tolist()
        print(
            f"Large regions (>={min_cells} cells): {large_regions}"
        )

        # Tag regions as "main" or "island"
        adata.obs["region_type"] = (
            adata.obs[
                "spatial_region"
            ].apply(
                lambda x: "main"
                if x in large_regions
                else "island"
            )
        )

        # Visualize the regions
        fig, axes = plt.subplots(
            1, 3, figsize=(14, 5)
        )

        # Plot all regions colored
        sc.pl.embedding(
            adata,
            basis="spatial",
            color="spatial_region",
            ax=axes[0],
            show=False,
            title="All spatial regions (based on connectivities)",
        )

        # Plot main vs island
        sc.pl.embedding(
            adata,
            basis="spatial",
            color="region_type",
            ax=axes[1],
            show=False,
            title="Main vs Islands",
        )

        if (
            remove_specific_regions
            is not None
        ):
            if (
                sample_id
                in remove_specific_regions
            ):
                regions_to_remove = remove_specific_regions[
                    sample_id
                ]
                adata = adata[
                    ~adata.obs[
                        "spatial_region"
                    ].isin(
                        regions_to_remove
                    )
                ]

                logger.info(
                    f"Removed user defined regions from {sample_id}: {regions_to_remove}"
                )
        cells_before_filtering = (
            adata.n_obs
        )
        # Filter to keep only main regions
        adata = adata[
            adata.obs["region_type"]
            == "main"
        ]
        cells_after_filtering = (
            adata.n_obs
        )
        print(
            f"Filtered: {cells_before_filtering} -> {cells_after_filtering} cells"
        )
        sc.pl.embedding(
            adata,
            basis="spatial",
            color="region_type",
            ax=axes[2],
            show=False,
            title="After filtering isolated islands",
        )
        plt.tight_layout()

        if run_dir is not None:
            plt.savefig(
                f"{run_dir}/spatial_regions_{sample_id}.png",
                dpi=300,
                bbox_inches="tight",
            )
        else:
            plt.savefig(
                f"spatial_regions_{sample_id}.png",
                dpi=300,
                bbox_inches="tight",
            )
        plt.show()

        adatas[i] = adata.copy()

    return adatas

This is an example of the output from the code above, where one identify and remove the “islands” of cells that are not connected to the main regions of cells. This image is not related to the Visium HD dataset for this tutorial, and is simply an example of how this can be used to remove isolated islands of cells that don’t meed certain criteria.

Spatial Clustering and Domain Identification with Novae

With segmented cells and pre-processed data, the next step is to find spatial domains: tissue regions that share similar expression patterns and spatial organization (think tumor, stroma, immune infiltrates, normal mucosa).

We’ll use Novae , a graph-based foundation model pre-trained on ~30 million cells across 18 tissues. It can do zero-shot domain inference across different gene panels, tissues, and technologies. Some cool things that it can do out of the box:

An alternative worth mentioning is SpatialLeiden, a spatially-aware version of the Leiden algorithm. It uses a multiplex graph combining gene expression with spatial topology, and a layer_ratio parameter lets you balance the two. It integrates with the scverse ecosystem and works across Visium, Stereo-seq, and imaging-based platforms. However, it does not (yet) perform batch correction out of the box. Thus, a batch correction method such as Harmony or scVI can be used.

Check out this contribution I posted on the SpatialLeiden GitHub repository for more details: link. Edit March 19th, 2026: This contribution was turned into a tutorial, using Harmony with SpatialLeiden here.

Another interesting option is SpatialFusion, a deep-learning multimodal model for niche discovery using pathway-informed spatial niche mapping. Check the priprint here: SpatialFusion: A lightweight multimodal foundation model for pathway-informed spatial niche mapping.

For this tutorial we go with novae for its foundation model capabilities and hierarchical domains. In addition, it is easy to get lost with all the different methods being released on a weekly basis (especially for recent methods such as spatial transcriptomics), so it is important to focus on those tools/packages that 1) work/generalize well enough, 2) are easy to install, and 3) have a good documentation.

The first step with Novae is computing spatial neighbors via spatial_neighbors, which takes a radius parameter. For this cropped region, we use a radius of 50.

Note that neighbors/connectivities need to be computed per sample. If you have multiple samples in one AnnData object, use the slide_key parameter to compute them separately.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
logger.info(
    "Computing spatial neighbors."
)
novae.spatial_neighbors(
    [adata],
    radius=params["radius"],
    coord_type="generic",
)

novae.plot.connectivities([adata])
plt.savefig(
    f"{run_dir}/connectivities.png",
    dpi=600,
    bbox_inches="tight",
)
plt.close()

This radius parameter is important as too low of a radius will result in very sparse connectivities, missing important spatial relationships, while too high of a radius will start connecting cells that are not actually connected. See example of images below:

Radius of 20

Radius of 150

Novae also provides some useful utilities for normalization of the data. We can do it using this:

1
2
3
4
5
6
logger.info("Preprocessing with Novae")
novae.utils.prepare_adatas([adata]) # performs lognorm, highly variable genes, among others. 
# Copy the lognorms computed from novae to adata layers.
adata.layers["lognorm_counts"] = (
    adata.X.copy()
)

Let’s also compute the PCA using the highly variable genes computed from Novae in the previous prepare_adatas step:

1
2
3
4
logger.info("Running PCA")
sc.pp.pca(
    adata, use_highly_variable=True
)

Now, one can use the Novae model to fine-tune it on our own dataset. We load a pre-trained model and fine-tune it as follows:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Novae clustering
logger.info(
    "Fine-tuning Novae model..."
)
model = novae.Novae.from_pretrained(
    params["novae_model"]
)
# See more on the Novae documentation: https://mics-lab.github.io/novae/
model.fine_tune(
    [adata],
    max_epochs=params[
        "novae_max_epochs"
    ],
)
model.save_pretrained(
    str(run_dir / "model")
)

You can find more detailed information on fine-tuning and how to use H&E embeddings to improve model performance in the Novae documentation. For this tutorial, we will use the zero_shot parameter to compute representations. This leverages the pre-trained model’s learned representations to cluster cells without requiring task-specific training.

1
2
3
4
5
6
7
logger.info(
    "Computing Novae representations..."
)
# Inference based on zero_shot
model.compute_representations(
    [adata], zero_shot=True
)

We can then use the model to assign domains from a range of domains we wish to test. We will also save these domain assignments to the adata.obs table and save both spatial plots and domain proportions plots.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
# Assign domains across a range of resolutions
domain_min, domain_max = params[
    "domain_range"
]
for n_domains in range(
    domain_min, domain_max + 1
):
    col = f"novae_domains_{n_domains}"
    model.assign_domains(
        [adata], level=n_domains
    )

    adata.obsm[
        f"novae_latent_{n_domains}"
    ] = adata.obsm[
        "novae_latent"
    ].copy()

    novae.plot.domains(
        [adata],
        cell_size=8,
        show=False,
        obs_key=col,
    )
    plt.savefig(
        str(
            run_dir
            / f"domains_{n_domains}.png"
        ),
        dpi=600,
        bbox_inches="tight",
    )
    plt.close()

    novae.plot.domains_proportions(
        [adata], obs_key=col, show=False
    )
    plt.savefig(
        str(
            run_dir
            / f"domains_proportions_{n_domains}.png"
        ),
        dpi=600,
        bbox_inches="tight",
    )
    plt.close()

We will use the novae_domains_8 domain assignments from adata.obs for this tutorial. We get the following plots:

Domain proportions for the 8 novae domains:

Since the previous steps of segmentation and domain identification can take a while to run, let’s save the adata object to a file.

1
2
3
4
5
6
7
8
9
10
# Save clustered AnnData
adata.write_h5ad(
    str(
        run_dir
        / f"{sample_id}_clustered.h5ad"
    )
)
logger.info(
    f"Saved clustered AnnData: {adata.shape}"
)

Cell Type Annotation using Enrichmap and Cell Type Gene Set Signatures

For cell type annotation, you can either use reference-based methods like RCTD (see the Python implementation here), or gene set signatures from databases like PanglaoDB, CellMarker 2.0, and CellTypist. For common cancer types, finding a scRNA-seq reference is usually straightforward, but for rarer cancers (e.g., SCLC), good references may not exist. In those cases, gene signature-based scoring is the way to go.

We’ll use Enrichmap for this. Compared to standard enrichment methods like scanpy.tl.score_genes, decoupler.mt.ulm or gseapy, Enrichmap adds three things that matter for spatial data:

It’s been benchmarked against other scoring methods using Moran’s I and shows better spatial coherence. It also works across Visium, Xenium, MERFISH, and imaging mass cytometry.

We start by filtering our marker genes to those present in the dataset, then run em.tl.score to generate per-cell enrichment scores for each cell type signature.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
logger.info(
    "=== STEP 3: Cell type annotation with Enrichmap ==="
)

clustering_col = "novae_domains_8"
logger.info(
    f"Annotating using clustering column: {clustering_col}"
)
logger.info(
    f"Marker genes: {marker_genes}"
)

# Ensure we use lognorm counts for enrichmap
if "lognorm_counts" in adata.layers:
    adata.X = adata.layers[
        "lognorm_counts"
    ].copy()

# Filter marker genes to those present in the data
marker_genes_filtered = {
    ct: [
        g
        for g in genes
        if g in adata.var_names
    ]
    for ct, genes in marker_genes.items()
}
marker_genes_filtered = {
    ct: genes
    for ct, genes in marker_genes_filtered.items()
    if len(genes) > 0
}
logger.info(
    f"Marker genes after filtering: { {k: len(v) for k, v in marker_genes_filtered.items()} }"
)

# Run Enrichmap scoring
logger.info(
    "Running Enrichmap scoring..."
)
em.tl.score(
    adata=adata,
    gene_set=marker_genes_filtered,
    smoothing=True,
    correct_spatial_covariates=True,
)

Let’s plot some of the scores to see how they look:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
coords = adata.obsm["spatial"]
scores = adata.obs["Neutrophils_score"] # replace this with the appropriate colum name from our computed scores above

fig, ax = plt.subplots(figsize=(10, 10))
scatter = ax.scatter(
    coords[:, 0],
    coords[:, 1],
    c=scores,
    cmap="seismic",
    s=3,
    vmin=-scores.abs().max(),
    vmax=scores.abs().max(),
)
ax.set_aspect("equal")
ax.invert_yaxis()
ax.axis("off")
ax.set_title("Tumor")
plt.colorbar(scatter, ax=ax, shrink=0.5)
plt.show()

Neutrophil enrichment scores:

Fibroblasts enrichment scores:

Goblet cells enrichment scores:

Macrophages enrichment scores:

Tumor enrichment scores:

If you compare these scores distributions to the markers from the original 10X tutorial, we can see that the scores follow similar distribution. We also see that there is an overlap between Tumor and Goblet cell scores.

To assign a cell type, one can then do a simple idxmax to assign the max scores for each cell ID to a given cell type. We can also use decoupler (repository link) to run statistical tests and perform one-vs-rest comparisons between the scores of a given cell type in a given domain compared to all other domains in that sample.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
score_cols = [
    col
    for col in adata.obs.columns
    if col.endswith("_score")
]
logger.info(
    f"Score columns: {score_cols}"
)

score_matrix = adata.obs[
    score_cols
].values
score_adata = AnnData(
    X=score_matrix,
    obs=adata.obs[
        [clustering_col]
    ].copy(),
)
score_adata.var_names = [
    col.replace("_score", "")
    for col in score_cols
]

ranking_df = dc.tl.rankby_group(
    score_adata,
    groupby=clustering_col,
    reference="rest",
    method="wilcoxon",
)
ranking_df = ranking_df.sort_values(
    by=["group", "meanchange"],
    ascending=[True, False],
)
logger.info(
    f"Ranking shape: {ranking_df.shape}"
)
ranking_df.to_csv(
    str(
        run_dir
        / f"enrichment_ranking_{clustering_col}.csv"
    ),
    index=False,
)

Here is the example of the ranking dataframe (available in the github repo here):

group reference name stat meanchange pval padj
D1003 rest Goblet_cells 17.71772568858253 0.2058900842521403 3.06037424683307e-70 1.530187123416535e-69
D1003 rest Tumor 6.524446096548585 0.08057131782724745 6.825329006609873e-11 1.1375548344349789e-10
D1003 rest Fibroblasts -2.8177362507011354 -0.005792962570818874 0.004836351981595639 0.004836351981595639
D1003 rest Macrophages 5.867352722361021 -0.007929933567018977 4.42808028964004e-09 5.5351003620500505e-09
D1003 rest Neutrophils -15.045558367012578 -0.014965320227218267 3.692052243146449e-51 9.230130607866121e-51
D1009 rest Fibroblasts 11.581138126903658 0.3457794348546475 5.1359605420638064e-31 1.2839901355159517e-30
D1009 rest Macrophages 10.169842717776737 0.04183143564320674 2.7034484990128016e-24 4.5057474983546696e-24
D1009 rest Neutrophils 11.670558188973292 0.02683085310666919 1.8023675673478208e-31 9.011837836739104e-31
D1009 rest Goblet_cells -7.798139307666739 -0.016959587070388787 6.282663280243531e-15 7.853329100304415e-15
D1009 rest Tumor -7.669749266711418 -0.05728980467007975 1.723328895820832e-14 1.723328895820832e-14
D1010 rest Fibroblasts 19.745919842884664 0.2403534565845365 8.696678111413742e-87 4.348339055706871e-86
D1010 rest Macrophages -12.73580528164666 0.023629517191564146 3.739675027432825e-37 9.349187568582063e-37
D1010 rest Goblet_cells 2.72647137857913 0.0023891458429599896 0.006401550421739988 0.006401550421739988

We can then use these scores together with an elbow-gap method to assign cell types to each cell ID based on the computed scores. The rationale behind this approach is that a given cell ID might still be capturing multiple cells (cell segmentation works well but is not perfect). If the scores are close to each other (e.g., the meanchanges from the ranking dataframe above) for a given domain, then we can infer that the domain may contain a mixture of cell types and assign it to the most enriched cell type(s).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
enrichmap_gap_annotations = {}
elbow_data = {}

for domain in ranking_df[
    "group"
].unique():
    domain_data = ranking_df.loc[
        ranking_df["group"] == domain
    ].copy()

    # Filter to significant, positively enriched
    domain_data = domain_data.loc[
        (domain_data["meanchange"] > 0)
        & (domain_data["padj"] < 0.05)
    ]

    if len(domain_data) == 0:
        enrichmap_gap_annotations[
            domain
        ] = "Unknown"
        continue

    if len(domain_data) == 1:
        enrichmap_gap_annotations[
            domain
        ] = domain_data["name"].iloc[0]
        elbow_data[domain] = domain_data
        continue

    elbow_data[domain] = domain_data

    sorted_mc = domain_data[
        "meanchange"
    ].values
    gaps = (
        sorted_mc[:-1] - sorted_mc[1:]
    )
    biggest_gap_idx = int(
        np.argmax(gaps)
    )

    selected = (
        domain_data["name"]
        .iloc[: biggest_gap_idx + 1]
        .tolist()
    )
    enrichmap_gap_annotations[
        domain
    ] = "/".join(selected)

logger.info(
    f"Domain -> cell type mapping:\n{enrichmap_gap_annotations}"
)

# Assign domain-level cell type
adata.obs["cell_type_domain"] = (
    adata.obs[clustering_col].map(
        enrichmap_gap_annotations
    )
)

# Per-cell high granularity: argmax over scores
score_df = adata.obs[score_cols].copy()
score_df.columns = [
    col.replace("_score", "")
    for col in score_cols
]

Let’s also save the “high granularity” cell type assignments to the adata.obs table by simply taking the argmax over the scores for each cell ID.

1
2
3
adata.obs["cell_type_enrichmap"] = (
    score_df.idxmax(axis=1)
)
1
2
3
4
5
6
7
8
9
10
logger.info(
    f"Cell type distribution:\n{adata.obs['cell_type_domain'].value_counts()}"
)

Fibroblasts                6577
Goblet_cells               2442
Macrophages                1898
Tumor                      1069
Macrophages/Neutrophils     230
Unknown                       2

Now let’s plot the elbow plot using the meanchange from the ranking dataframe above:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
plottable_domains = [
    d
    for d in elbow_data
    if len(elbow_data[d]) > 0
]
n_domains_plot = len(plottable_domains)
n_cols_plot = min(4, n_domains_plot)
n_rows_plot = int(
    np.ceil(
        n_domains_plot / n_cols_plot
    )
)

fig_elbow, axes_elbow = plt.subplots(
    n_rows_plot,
    n_cols_plot,
    figsize=(
        5 * n_cols_plot,
        4 * n_rows_plot,
    ),
)
axes_flat = np.array(
    axes_elbow
).flatten()

for ax_idx, domain in enumerate(
    plottable_domains
):
    ax = axes_flat[ax_idx]
    domain_data = elbow_data[domain]
    mc_vals = domain_data[
        "meanchange"
    ].values
    ct_names = domain_data[
        "name"
    ].values
    x_pos = np.arange(len(mc_vals))

    selected_names = set(
        enrichmap_gap_annotations[
            domain
        ].split("/")
    )

    ax.plot(
        x_pos,
        mc_vals,
        color="gray",
        lw=1.5,
        zorder=2,
    )

    for i, (xp, mc, name) in enumerate(
        zip(x_pos, mc_vals, ct_names)
    ):
        if name in selected_names:
            color = "steelblue"
            ec = "black"
        else:
            color = "silver"
            ec = "gray"
        ax.scatter(
            xp,
            mc,
            color=color,
            s=60,
            zorder=4,
            edgecolors=ec,
            linewidths=0.5,
        )

    if len(mc_vals) > 1:
        gaps = (
            mc_vals[:-1] - mc_vals[1:]
        )
        gap_idx = int(np.argmax(gaps))
        gap_x = gap_idx + 0.5
        gap_val = gaps[gap_idx]
        ax.axvline(
            gap_x,
            color="red",
            ls="--",
            lw=1,
            alpha=0.8,
            label=f"gap = {gap_val:.3f}",
        )
        ax.axvspan(
            -0.5,
            gap_x,
            alpha=0.08,
            color="steelblue",
        )
        y_top = mc_vals[gap_idx]
        y_bot = mc_vals[gap_idx + 1]
        bracket_x = gap_x + 0.3
        ax.annotate(
            "",
            xy=(bracket_x, y_bot),
            xytext=(bracket_x, y_top),
            arrowprops=dict(
                arrowstyle="<->",
                color="red",
                lw=1.2,
            ),
        )
        ax.text(
            bracket_x + 0.15,
            (y_top + y_bot) / 2,
            f"{gap_val:.3f}",
            fontsize=6,
            color="red",
            va="center",
        )

    ax.set_xticks(x_pos)
    ax.set_xticklabels(
        ct_names,
        rotation=60,
        ha="right",
        fontsize=5,
    )
    ax.set_ylabel(
        "meanchange", fontsize=7
    )
    ax.set_title(
        f"Domain {domain}", fontsize=8
    )
    ax.legend(
        fontsize=5, loc="upper right"
    )
    ax.tick_params(labelsize=6)

for ax_idx in range(
    n_domains_plot, len(axes_flat)
):
    axes_flat[ax_idx].axis("off")

fig_elbow.suptitle(
    f"Gap-based elbow — {clustering_col}",
    fontsize=11,
)
plt.tight_layout()
plt.savefig(
    str(
        run_dir
        / f"elbow_plot_{clustering_col}.png"
    ),
    dpi=400,
    bbox_inches="tight",
)
plt.close()
logger.info("Saved elbow plot")

We see, for example, that domain D999 is enriched for both Macrophages and Neutrophils, although the meanchanges are quite small. With a larger set of cell type signatures (10+), we would expect to see more robust differences and more confidently assign each domain to one or more cell types based on this gap method.

Let’s now plot the cell type annotations:

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
logger.info("=== Generating plots ===")

cell_type_palette = {
    "Tumor": "#E64B35",
    "Fibroblasts": "#4DBBD5",
    "Macrophages": "#3C5488",
    "Neutrophils": "#F39B7F",
    "Goblet_cells": "#00599F",
    "Macrophages/Neutrophils": "#FFD700",
    "Unknown": "#B09C85",
}

sc.pl.spatial(
    adata,
    color="cell_type_domain",
    spot_size=9,
    show=False,
    title=f"{sample_name} - Cell Types",
    palette=[
        cell_type_palette.get(
            ct, "#999999"
        )
        for ct in adata.obs[
            "cell_type_domain"
        ].cat.categories
    ],
)
plt.savefig(
    str(
        run_dir
        / "cell_types_spatial.png"
    ),
    dpi=600,
    bbox_inches="tight",
)
plt.close()

Domains overlapped with the HE image:

Cell Type assignments overlapped with the HE image:


Conclusion and Next Steps

We went from raw outputs to a (cropped) annotated spatial dataset:

  1. Segmented cells with StarDist + ProSeg via SOPA .
  2. Identified spatial domains with Novae .
  3. Annotated cell types with Enrichmap scores and an argmax and elbow-gap assignment method.

We reproduced the main findings from the 10x Loupe Browser CRC tutorial, but in Python and at single-cell resolution. This plugs directly into the scverse ecosystem (Scanpy, SpatialData, AnnData) for whatever comes next. Other visualizations can include matrixplots and dotplots as well to visualize domain scores per cell type to further explore domain/niche assignemnts. Example of how it could look like (using sc.pl.matrixplot with domain assignments from Novae and cell type scores from Enrichmap):

Next Steps

With a segmented and annotated dataset, there are several interesting directions:

sopa, Novae, and Enrichmap are all under active development, so check their docs for the latest features.

Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Cancer subtyping using cNMF and literature markers