A hands-on tutorial covering cell segmentation with SOPA (StarDist + ProSeg), spatial domain identification with Novae (with batch correction on the fly), and cell type annotation with Enrichmap (also with built-in batch correction and spatial smoothing) on a Visium HD colorectal cancer dataset.
flowchart LR
subgraph INPUT["<b>Input</b>"]
A["Visium HD\n+ H&E Image"]
end
subgraph SEG["<b>1. Cell Segmentation (SOPA)</b>"]
direction TB
B1["StarDist\n<i>Nuclear detection via\nstar-convex polygons</i>"]
B2["ProSeg\n<i>Transcript-informed\nboundary refinement</i>"]
B3["Aggregate\n<i>Assign transcripts\nto cells</i>"]
B1 --> B2 --> B3
end
subgraph PRE["<b>2. Pre-processing</b>"]
direction TB
C1["Filter genes & cells"]
C2["QC: mito %, counts"]
C3["Normalize & HVGs"]
C1 --> C2 --> C3
end
subgraph DOM["<b>3. Domain ID (Novae)</b>"]
direction TB
D1["Spatial neighbors"]
D2["Fine-tune\nfoundation model"]
D3["Assign spatial\ndomains"]
D1 --> D2 --> D3
end
subgraph ANN["<b>4. Annotation (Enrichmap)</b>"]
direction TB
E1["Gene set\nenrichment scoring"]
E2["Rank by group\n(Wilcoxon 1-vs-rest)"]
E3["Elbow-gap\ncell type assignment"]
E1 --> E2 --> E3
end
subgraph OUT["<b>Output</b>"]
F["Annotated\nSpatial Dataset"]
end
A --> SEG --> PRE --> DOM --> ANN --> F
style INPUT fill:#f0f0f0,stroke:#333
style SEG fill:#e8f4e8,stroke:#2d7d2d
style PRE fill:#e8ecf4,stroke:#2d4d7d
style DOM fill:#f4e8f4,stroke:#7d2d7d
style ANN fill:#f4ece8,stroke:#7d4d2d
style OUT fill:#f0f0f0,stroke:#333
Working with high-resolution spatial transcriptomics platforms like Visium HD involves quite a few moving parts: segmenting cells, finding spatial domains, and annotating cell types. This tutorial walks through the full pipeline in Python, covering:
We use the publicly available Human Colorectal Cancer (CRC) FFPE dataset from 10x Genomics and reproduce the main findings from their Loupe Browser tutorial, but entirely in Python and at single-cell resolution via cell segmentation.
To really understand what’s going on in a tissue, we need to know not just which genes a cell expresses, but also where that cell is relative to its neighbors. Spatial transcriptomics (ST) is a family of technologies that give you exactly that: spatially resolved gene expression
The field has moved fast, going from a handful of measurable genes to transcriptome-wide profiling at near-single-cell resolution
ST technologies generally fall into two camps: imaging-based and sequencing-based
Imaging-based ST (iST) uses labeled probes that target specific genes, letting you visualize mRNA directly in tissue
Sequencing-based ST (sST) uses spatially barcoded arrays to capture RNA from tissue sections, which then gets sequenced via NGS
The original Visium platform had 55 µm spots that typically captured a mixture of multiple cells. Visium HD brought this down to 2 µm bins with no gaps, enabling single-cell scale profiling. The tradeoff is that sequencing-based methods generally have lower transcript capture efficiency than imaging-based ones, since they rely on transcripts diffusing onto a capture surface rather than being detected directly in tissue
1. Going beyond bulk RNA-seq. Bulk RNA-seq averages gene expression across thousands of cells
2. Adding context to scRNA-seq. scRNA-seq gives you single-cell resolution, a step up from bulk-RNA, but still requires dissociating the tissue first, which destroys the spatial layout of cells
3. Enabling spatial analyses. Because ST preserves coordinates, it opens up analyses that are simply not feasible with dissociated data:
4. Building spatial atlases. ST enables 3D tissue reconstruction by aligning consecutive 2D sections
With the original Visium (55 µm spots, each covering multiple cells), you needed deconvolution methods like RCTD
Visium HD changed this. With 2 µm bins, you can now do cell segmentation directly on the histology image and assign transcripts to individual cells using computational methods. This gives you a per-cell expression matrix (like scRNA-seq) but with spatial coordinates preserved. Each observation is a single segmented cell rather than a statistical mixture.
This shift from deconvolution to segmentation is a key motivation for this tutorial: we get single-cell spatial transcriptomics without the assumptions that come with deconvolution.
For this tutorial we will use the Visium HD Human Colorectal Cancer (FFPE) dataset from 10x Genomics:
Once you’ve downloaded the data, your project structure should look like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
spatial/
├── Microscopy/
│ ├── Visium_HD_Human_Colon_Cancer_image.tif
│ └── Visium_HD_Human_Colon_Cancer_tissue_image.btf
│
└── SpaceRanger/
└── Visium_HD_Human_Colon_Cancer/
└── outs/
├── binned_outputs/
│ ├── square_002um/
│ ├── square_008um/
│ └── square_016um/
│
└── spatial/
├── aligned_fiducials.jpg
├── aligned_tissue_image.jpg
├── cytassist_image.tiff
├── detected_tissue_image.jpg
├── tissue_hires_image.png
├── tissue_lowres_image.png
├── Visium_HD_Human_Colon_Cancer_cloupe_008um.cloupe
├── Visium_HD_Human_Colon_Cancer_feature_slice.h5
├── Visium_HD_Human_Colon_Cancer_metrics_summary.csv
├── Visium_HD_Human_Colon_Cancer_probe_set.csv
└── Visium_HD_Human_Colon_Cancer_slide_file.vtf
Check this repository link to find all the full script and structure necessary to follow this tutorial .
The imports look like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
import warnings
from datetime import datetime
from pathlib import Path
import decoupler as dc
import enrichmap as em
import matplotlib as mpl
import matplotlib.pyplot as plt
import novae
import numpy as np
import pandas as pd
import scanpy as sc
import seaborn as sns
import sopa
import spatialdata as sd
import spatialdata_plot
import yaml
import zarr
from anndata import AnnData
from loguru import logger
import spatialdata_plot
warnings.filterwarnings(
"ignore",
message="Use `squidpy.pl.spatial_scatter`",
)
# Matplotlib defaults
mpl.rcParams["figure.dpi"] = 300
plt.style.use("bmh")
plt.rcParams.update(
{
"figure.figsize": (12, 8),
"axes.facecolor": "white",
"axes.edgecolor": "black",
}
)
The following code is used to set up the directories and logging. Note that we will also be using the config_crc_tutorial.yaml to store and load the configuration parameters. The markers you see in this file are the ones downloaded from the 10X website (link) with some added genes for neutrophils as we need to have more than 1 gene per cell type signature to run the Enrichmap package later on this tutorial.
The yaml file looks like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# Visium HD CRC Tutorial Configuration
# 10x Genomics Human Colorectal Cancer (FFPE) dataset
# Base paths
paths:
spaceranger_outs: "../data/spatial/SpaceRanger"
processed: "../data/spatial/processed"
figures: "../figures"
# Single sample for this tutorial
samples:
- id: "Visium_HD_Human_Colon_Cancer"
name: "CRC_tutorial"
# Analysis parameters
params:
filter_genes_counts: 1
filter_cells_counts: 1
radius: 50
percentile_pct_mito: null
pct_mito: 10
novae_max_epochs: 50
novae_model: "MICS-Lab/novae-human-0"
domain_range: [5, 15]
clustering_col: "novae_domains_15"
# Marker genes for CRC cell type annotation
# Source: 10x Genomics Loupe Browser CRC tutorial CSVs (data/markers/*.csv)
marker_genes:
Tumor: ["REG1A", "REG1B", "CEACAM6", "TGFBI"]
Fibroblasts: ["COL1A1", "MMP2"]
Macrophages: ["LYZ", "SPP1"]
Neutrophils: ["SAT1", "CSF3R", "FCGR3B", "S100A8"]
Goblet_cells: ["FCGBP", "MUC2", "CLCA1"]
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
PROJECT_DIR = Path(
r"C:\Users\Projects\segmentation-and-annotation"
).resolve()
CONFIG_PATH = (
PROJECT_DIR
/ "scripts"
/ "config_crc_tutorial.yaml"
)
with open(CONFIG_PATH, "r") as f:
cfg = yaml.safe_load(f)
sample_id = cfg["samples"][0]["id"]
sample_name = cfg["samples"][0]["name"]
params = cfg["params"]
marker_genes = cfg["marker_genes"]
PROCESSED_DIR = (
PROJECT_DIR
/ "data"
/ "spatial"
/ "processed"
).resolve()
FIGURES_DIR = (
PROJECT_DIR / "figures"
).resolve()
run_name = f"crc_tutorial_{datetime.now().strftime('%d%m%Y_%H%M')}"
run_dir = (
PROCESSED_DIR / run_name
).resolve()
run_dir.mkdir(
parents=True, exist_ok=True
)
(run_dir / "figures").mkdir(
exist_ok=True
)
logger.info(f"Run directory: {run_dir}")
logger.info(
f"Sample: {sample_id} ({sample_name})"
)
Once we have the downloaded files and in the project structure above, we can read the data using the sopa package
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
logger.info(
"=== STEP 1: Reading Visium HD data and running segmentation ==="
)
sdata = sopa.io.visium_hd(
str(
PROJECT_DIR
/ "data"
/ "spatial"
/ "SpaceRanger"
/ sample_id
/ "outs"
),
dataset_id=sample_id,
fullres_image_file=str(
PROJECT_DIR
/ "data"
/ "spatial"
/ "Microscopy"
/ "Visium_HD_Human_Colon_Cancer_tissue_image.btf"
),
)
In order to speed things up, we will use a subset of the data. We can crop the dataset like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
sdata_sub = sd.bounding_box_query(
sdata,
min_coordinate=[51000, 9000],
max_coordinate=[56000, 14000],
axes=("x", "y"),
target_coordinate_system=sample_id,
filter_table=True,
)
# Make the var_names unique, but not required for this tutorial
<!-- for (
key,
table,
) in sdata_sub.tables.items():
table.var_names_make_unique() -->
<!-- sdata_sub.write("sdata_subset.zarr") # save it -->
Which gives the following SpatialData object (to find out more about SpatialData check this repository link):
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
SpatialData object
├── Images
│ ├── 'Visium_HD_Human_Colon_Cancer_full_image': DataTree[cyx] (3, 5000, 5000), (3, 2500, 2500), (3, 1250, 1250), (3, 625, 625), (3, 313, 313)
│ ├── 'Visium_HD_Human_Colon_Cancer_hires_image': DataArray[cyx] (3, 398, 399)
│ └── 'Visium_HD_Human_Colon_Cancer_lowres_image': DataArray[cyx] (3, 40, 40)
├── Shapes
│ ├── 'Visium_HD_Human_Colon_Cancer_square_002um': GeoDataFrame shape: (470416, 1) (2D shapes)
│ ├── 'Visium_HD_Human_Colon_Cancer_square_008um': GeoDataFrame shape: (29659, 1) (2D shapes)
│ ├── 'Visium_HD_Human_Colon_Cancer_square_016um': GeoDataFrame shape: (7500, 1) (2D shapes)
│ ├── 'image_patches': GeoDataFrame shape: (9, 3) (2D shapes)
│ ├── 'region_of_interest': GeoDataFrame shape: (1, 1) (2D shapes)
│ └── 'stardist_boundaries': GeoDataFrame shape: (12419, 1) (2D shapes)
└── Tables
├── 'square_002um': AnnData (470416, 18085)
├── 'square_008um': AnnData (29659, 18085)
└── 'square_016um': AnnData (7500, 18085)
with coordinate systems:
▸ 'Visium_HD_Human_Colon_Cancer', with elements:
Visium_HD_Human_Colon_Cancer_full_image (Images), Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), image_patches (Shapes), region_of_interest (Shapes), stardist_boundaries (Shapes)
▸ 'Visium_HD_Human_Colon_Cancer_downscaled_hires', with elements:
Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), region_of_interest (Shapes)
▸ 'Visium_HD_Human_Colon_Cancer_downscaled_lowres', with elements:
Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), region_of_interest (Shapes)
▸ 'microns', with elements:
Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), region_of_interest (Shapes)
We can then plot the tissue image and the cropped image like this:
1
2
3
4
5
6
sdata_sub.pl.render_images(
f"{sample_id}_full_image"
).pl.show(
coordinate_systems=sample_id,
figsize=(10, 10),
)
Which would look something like this:
Now that the data is loaded and cropped, we can segment the cells. We’ll use sopa
For segmentation, sopa uses a patch-based approach: it splits the tissue image into patches, segments each one, and stitches the results back together. This keeps memory usage manageable even for large whole-slide images.
1
2
3
# Patch-based cell segmentation
logger.info("Creating image patches...")
sopa.make_image_patches(sdata_sub)
Cell segmentation matters because everything downstream (domain identification, cell type annotation, cell-cell interactions) depends on getting cell boundaries right. The transcript-informed segmentation landscape breaks down into three broad families:
CNNs and Transformers: BIDCell, Bin2Cell, SCS, UCS, Bering, GeneSegNet, and StarDist
Graph-based methods: Segger and ComSeg.
Probabilistic/Bayesian methods: ProSeg
Which one you pick depends on what inputs you have (H&E/DAPI images, nuclear staining, transcript coordinates) and your computational budget. In this tutorial, we use a two-step approach: StarDist for initial HE segmentation, then ProSeg to refine those boundaries using transcript locations.
StarDist
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
logger.info(
"Running StarDist segmentation..."
)
sopa.segmentation.stardist(
sdata_sub, min_area=20
)
sdata_sub.pl.render_images(
f"{sample_id}_full_image"
).pl.render_shapes(
"stardist_boundaries",
outline=True,
fill_alpha=0,
outline_alpha=0.6,
outline_color="yellow",
).pl.show(
coordinate_systems=sample_id,
figsize=(12, 12),
title="StarDist Cell Segmentation",
)
Which gives the following segmentation result:
ProSeg
The nice thing about ProSeg is that it can take prior segmentation results (like our StarDist boundaries) as a starting point and refine them using transcript information. So StarDist gives us the initial nuclear outlines from the H&E, and ProSeg adjusts those to better match the actual extent of each cell. In benchmarks across three commercial platforms, ProSeg shows strong performance and helps with difficult-to-segment cells like tumor-infiltrating immune cells (neutrophils, T cells).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
logger.info(
"Running ProSeg refinement..."
)
sopa.segmentation.proseg(
sdata_sub,
prior_shapes_key="stardist_boundaries",
)
sdata_sub.pl.render_images(
f"{sample_id}_full_image"
).pl.render_shapes(
"proseg_boundaries",
outline=True,
fill_alpha=0,
outline_alpha=0.6,
outline_color="yellow",
).pl.show(
coordinate_systems=sample_id,
figsize=(12, 12),
title="ProSeg Cell Segmentation Refinement",
)
Which gives the following segmentation result:
Once we have the segmentation, we can aggregate the gene expression to the cells (As seen in the sopa tutorial, this is mandatory if you used stardist only, but optional if you ran proseg)
1
2
3
4
5
6
7
8
logger.info(
"Aggregating gene expression to cells..."
)
sopa.aggregate(
sdata_sub,
aggregate_channels=False,
expand_radius_ratio=1,
)
This will give a new table in the sdata_sub.tables["table"] with the aggregated/segmented gene expression. It looks something like this:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
SpatialData object
├── Images
│ ├── 'Visium_HD_Human_Colon_Cancer_full_image': DataTree[cyx] (3, 5000, 5000), (3, 2500, 2500), (3, 1250, 1250), (3, 625, 625), (3, 313, 313)
│ ├── 'Visium_HD_Human_Colon_Cancer_hires_image': DataArray[cyx] (3, 398, 399)
│ └── 'Visium_HD_Human_Colon_Cancer_lowres_image': DataArray[cyx] (3, 40, 40)
├── Shapes
│ ├── 'Visium_HD_Human_Colon_Cancer_square_002um': GeoDataFrame shape: (470416, 1) (2D shapes)
│ ├── 'Visium_HD_Human_Colon_Cancer_square_008um': GeoDataFrame shape: (29659, 1) (2D shapes)
│ ├── 'Visium_HD_Human_Colon_Cancer_square_016um': GeoDataFrame shape: (7500, 1) (2D shapes)
│ ├── 'image_patches': GeoDataFrame shape: (9, 3) (2D shapes)
│ ├── 'proseg_boundaries': GeoDataFrame shape: (12394, 2) (2D shapes)
│ ├── 'region_of_interest': GeoDataFrame shape: (1, 1) (2D shapes)
│ └── 'stardist_boundaries': GeoDataFrame shape: (12419, 1) (2D shapes)
└── Tables
├── 'square_002um': AnnData (470416, 18085)
├── 'square_008um': AnnData (29659, 18085)
├── 'square_016um': AnnData (7500, 18085)
└── 'table': AnnData (12394, 18085)
with coordinate systems:
▸ 'Visium_HD_Human_Colon_Cancer', with elements:
Visium_HD_Human_Colon_Cancer_full_image (Images), Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), image_patches (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes), stardist_boundaries (Shapes)
▸ 'Visium_HD_Human_Colon_Cancer_downscaled_hires', with elements:
Visium_HD_Human_Colon_Cancer_hires_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes)
▸ 'Visium_HD_Human_Colon_Cancer_downscaled_lowres', with elements:
Visium_HD_Human_Colon_Cancer_lowres_image (Images), Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes)
▸ 'microns', with elements:
Visium_HD_Human_Colon_Cancer_square_002um (Shapes), Visium_HD_Human_Colon_Cancer_square_008um (Shapes), Visium_HD_Human_Colon_Cancer_square_016um (Shapes), proseg_boundaries (Shapes), region_of_interest (Shapes)
This is the AnnData object that we will be using (segmented cells) to run domain identification and cell type annotation. Let’s save this segmented data:
1
2
3
4
5
6
7
8
9
10
11
12
segmented_h5ad = (
PROJECT_DIR
/ "data"
/ "spatial"
/ "segmented.h5ad"
)
sdata_sub.tables["table"].write_h5ad(
str(segmented_h5ad)
)
logger.info(
f"Saved segmented table to {segmented_h5ad}"
)
Before running domain identification, let’s do some pre-processing (e.g., filtering cell and gene counts, removing cells with mitochondrial percentage above a certain threshold, etc.).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
logger.info(
"=== STEP 2: Filtering, preprocessing, and Novae clustering ==="
)
adata = sc.read_h5ad(
str(segmented_h5ad)
)
# Filter genes and cells
logger.info(
f"AnnData shape before filtering: {adata.shape}"
)
sc.pp.filter_genes(
adata,
min_counts=params[
"filter_genes_counts"
],
)
sc.pp.filter_cells(
adata,
min_counts=params[
"filter_cells_counts"
],
)
logger.info(
f"AnnData shape after filtering: {adata.shape}"
)
# QC metrics
vmax_pct = 99
adata.var["mt"] = (
adata.var_names.str.startswith(
("MT-", "mt-")
)
)
sc.pp.calculate_qc_metrics(
adata,
inplace=True,
percent_top=None,
)
sc.pp.calculate_qc_metrics(
adata,
qc_vars=["mt"],
inplace=True,
percent_top=None,
)
Let’s create some plots resulting from the QC filtering:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
# Plot n_counts
fig = sc.pl.spatial(
adata,
color="n_counts",
vmin=np.percentile(
adata.obs["n_counts"], 1
),
vmax=np.percentile(
adata.obs["n_counts"], vmax_pct
),
spot_size=10,
show=False,
return_fig=True,
)
fig.savefig(
f"{run_dir}/n_counts_{sample_id}.png",
bbox_inches="tight",
dpi=150,
)
plt.close(fig)
Spatial distribution of total counts per cell:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
# Plot mito percentage
fig = sc.pl.spatial(
adata,
color="pct_counts_mt",
vmin=np.percentile(
adata.obs["pct_counts_mt"], 1
),
vmax=np.percentile(
adata.obs["pct_counts_mt"],
vmax_pct,
),
spot_size=10,
show=False,
return_fig=True,
)
fig.savefig(
f"{run_dir}/pct_counts_mt_{sample_id}.png",
bbox_inches="tight",
dpi=150,
)
plt.close(fig)
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
# Filter high mito cells
pct_to_remove = params["pct_mito"]
if (
params["percentile_pct_mito"]
is not None
):
pct_to_remove = adata.obs[
"pct_counts_mt"
].quantile(
params["percentile_pct_mito"]
)
adata = adata[
adata.obs["pct_counts_mt"]
< pct_to_remove
].copy()
logger.info(
f"After mito filtering ({pct_to_remove}%): {adata.shape}"
)
# Plot mito after filtering
fig = sc.pl.spatial(
adata,
color="pct_counts_mt",
spot_size=10,
show=False,
return_fig=True,
)
fig.savefig(
f"{run_dir}/mt_after_filtering_{sample_id}.png",
bbox_inches="tight",
dpi=150,
)
plt.close(fig)
Although this example uses a different dataset from the tutorial, it highlights an important thing: regions with a high percentage of mitochondrial (MT) genes can be identified by tools like Novae as distinct spatial domains, when in reality they may simply reflect elevated MT content. In some cases, this is biologically meaningful, for exampel, high MT percentages can indicate cell death or disease progression, making these domains genuinely relevant. However, in other cases, such as FFPE samples, the elevated MT signal may be an artifact: the extremities of the tissue are exposed to harsher conditions during sample prep/embedding, leading to greater cellular stress and, consequently, higher MT percentages compared to the inner tissue.
Have a look at this figure:
You can see that the regions of this sample that have highest % of MT are the extremities of the sample.
Now look at the domains that were obtained using novae (which you will see how to use in the next section):
You can see that novae has captured this region of high % of MT as being a niche/domain. And while it certainly can be due to biology, it can also be due to artifacts as I mentioned above (especially in the extremities of FFPE samples or issues during sample preparation).
Another thing we can do is to remove sparse spatial regions using computed connectivities. This is a good idea if we have a sample that is not as compact as the dataset we are working with, where we may have “isolated islands” of cells. This should be evaluated on a project bases, as these islands could be real and not artifacts from sample preparation.
We can do this using the connected_components function from scipy.sparse.csgraph. This function below requires connectivities to be calculated first, which we will do in the next section with Novae.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
def filter_isolated_regions(
adatas,
min_cells=2000,
run_dir=None,
remove_specific_regions=None,
):
from scipy.sparse.csgraph import (
connected_components,
)
for i, adata in enumerate(adatas):
coords = adata.obsm["spatial"]
sample_id = adata.obs[
"sample_id"
].unique()[0]
logger.info(
f"Running region filtering based on number of cells (removing isolated islands). {sample_id}"
)
print(
f"X range: {coords[:, 0].min():.1f} - {coords[:, 0].max():.1f}"
)
print(
f"Y range: {coords[:, 1].min():.1f} - {coords[:, 1].max():.1f}"
)
n_components, labels = (
connected_components(
adata.obsp[
"spatial_connectivities"
],
directed=False,
)
)
adata.obs["spatial_region"] = (
labels.astype(str)
)
print(
f"Found {n_components} spatial regions"
)
# Count cells per region
region_counts = adata.obs[
"spatial_region"
].value_counts()
print(region_counts)
large_regions = region_counts[
region_counts >= min_cells
].index.tolist()
print(
f"Large regions (>={min_cells} cells): {large_regions}"
)
# Tag regions as "main" or "island"
adata.obs["region_type"] = (
adata.obs[
"spatial_region"
].apply(
lambda x: "main"
if x in large_regions
else "island"
)
)
# Visualize the regions
fig, axes = plt.subplots(
1, 3, figsize=(14, 5)
)
# Plot all regions colored
sc.pl.embedding(
adata,
basis="spatial",
color="spatial_region",
ax=axes[0],
show=False,
title="All spatial regions (based on connectivities)",
)
# Plot main vs island
sc.pl.embedding(
adata,
basis="spatial",
color="region_type",
ax=axes[1],
show=False,
title="Main vs Islands",
)
if (
remove_specific_regions
is not None
):
if (
sample_id
in remove_specific_regions
):
regions_to_remove = remove_specific_regions[
sample_id
]
adata = adata[
~adata.obs[
"spatial_region"
].isin(
regions_to_remove
)
]
logger.info(
f"Removed user defined regions from {sample_id}: {regions_to_remove}"
)
cells_before_filtering = (
adata.n_obs
)
# Filter to keep only main regions
adata = adata[
adata.obs["region_type"]
== "main"
]
cells_after_filtering = (
adata.n_obs
)
print(
f"Filtered: {cells_before_filtering} -> {cells_after_filtering} cells"
)
sc.pl.embedding(
adata,
basis="spatial",
color="region_type",
ax=axes[2],
show=False,
title="After filtering isolated islands",
)
plt.tight_layout()
if run_dir is not None:
plt.savefig(
f"{run_dir}/spatial_regions_{sample_id}.png",
dpi=300,
bbox_inches="tight",
)
else:
plt.savefig(
f"spatial_regions_{sample_id}.png",
dpi=300,
bbox_inches="tight",
)
plt.show()
adatas[i] = adata.copy()
return adatas
This is an example of the output from the code above, where one identify and remove the “islands” of cells that are not connected to the main regions of cells. This image is not related to the Visium HD dataset for this tutorial, and is simply an example of how this can be used to remove isolated islands of cells that don’t meed certain criteria.
With segmented cells and pre-processed data, the next step is to find spatial domains: tissue regions that share similar expression patterns and spatial organization (think tumor, stroma, immune infiltrates, normal mucosa).
We’ll use Novae
An alternative worth mentioning is SpatialLeiden, a spatially-aware version of the Leiden algorithm. It uses a multiplex graph combining gene expression with spatial topology, and a layer_ratio parameter lets you balance the two. It integrates with the scverse ecosystem and works across Visium, Stereo-seq, and imaging-based platforms. However, it does not (yet) perform batch correction out of the box. Thus, a batch correction method such as Harmony or scVI can be used.
Check out this contribution I posted on the SpatialLeiden GitHub repository for more details: link. Edit March 19th, 2026: This contribution was turned into a tutorial, using Harmony with SpatialLeiden here.
Another interesting option is SpatialFusion, a deep-learning multimodal model for niche discovery using pathway-informed spatial niche mapping. Check the priprint here: SpatialFusion: A lightweight multimodal foundation model for pathway-informed spatial niche mapping.
For this tutorial we go with novae for its foundation model capabilities and hierarchical domains. In addition, it is easy to get lost with all the different methods being released on a weekly basis (especially for recent methods such as spatial transcriptomics), so it is important to focus on those tools/packages that 1) work/generalize well enough, 2) are easy to install, and 3) have a good documentation.
The first step with Novae is computing spatial neighbors via spatial_neighbors, which takes a radius parameter. For this cropped region, we use a radius of 50.
Note that neighbors/connectivities need to be computed per sample. If you have multiple samples in one AnnData object, use the slide_key parameter to compute them separately.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
logger.info(
"Computing spatial neighbors."
)
novae.spatial_neighbors(
[adata],
radius=params["radius"],
coord_type="generic",
)
novae.plot.connectivities([adata])
plt.savefig(
f"{run_dir}/connectivities.png",
dpi=600,
bbox_inches="tight",
)
plt.close()
This radius parameter is important as too low of a radius will result in very sparse connectivities, missing important spatial relationships, while too high of a radius will start connecting cells that are not actually connected. See example of images below:
Radius of 20
Radius of 150
Novae also provides some useful utilities for normalization of the data. We can do it using this:
1
2
3
4
5
6
logger.info("Preprocessing with Novae")
novae.utils.prepare_adatas([adata]) # performs lognorm, highly variable genes, among others.
# Copy the lognorms computed from novae to adata layers.
adata.layers["lognorm_counts"] = (
adata.X.copy()
)
Let’s also compute the PCA using the highly variable genes computed from Novae in the previous prepare_adatas step:
1
2
3
4
logger.info("Running PCA")
sc.pp.pca(
adata, use_highly_variable=True
)
Now, one can use the Novae model to fine-tune it on our own dataset. We load a pre-trained model and fine-tune it as follows:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
# Novae clustering
logger.info(
"Fine-tuning Novae model..."
)
model = novae.Novae.from_pretrained(
params["novae_model"]
)
# See more on the Novae documentation: https://mics-lab.github.io/novae/
model.fine_tune(
[adata],
max_epochs=params[
"novae_max_epochs"
],
)
model.save_pretrained(
str(run_dir / "model")
)
You can find more detailed information on fine-tuning and how to use H&E embeddings to improve model performance in the Novae documentation. For this tutorial, we will use the zero_shot parameter to compute representations. This leverages the pre-trained model’s learned representations to cluster cells without requiring task-specific training.
1
2
3
4
5
6
7
logger.info(
"Computing Novae representations..."
)
# Inference based on zero_shot
model.compute_representations(
[adata], zero_shot=True
)
We can then use the model to assign domains from a range of domains we wish to test. We will also save these domain assignments to the adata.obs table and save both spatial plots and domain proportions plots.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
# Assign domains across a range of resolutions
domain_min, domain_max = params[
"domain_range"
]
for n_domains in range(
domain_min, domain_max + 1
):
col = f"novae_domains_{n_domains}"
model.assign_domains(
[adata], level=n_domains
)
adata.obsm[
f"novae_latent_{n_domains}"
] = adata.obsm[
"novae_latent"
].copy()
novae.plot.domains(
[adata],
cell_size=8,
show=False,
obs_key=col,
)
plt.savefig(
str(
run_dir
/ f"domains_{n_domains}.png"
),
dpi=600,
bbox_inches="tight",
)
plt.close()
novae.plot.domains_proportions(
[adata], obs_key=col, show=False
)
plt.savefig(
str(
run_dir
/ f"domains_proportions_{n_domains}.png"
),
dpi=600,
bbox_inches="tight",
)
plt.close()
We will use the novae_domains_8 domain assignments from adata.obs for this tutorial. We get the following plots:
Domain proportions for the 8 novae domains:
Since the previous steps of segmentation and domain identification can take a while to run, let’s save the adata object to a file.
1
2
3
4
5
6
7
8
9
10
# Save clustered AnnData
adata.write_h5ad(
str(
run_dir
/ f"{sample_id}_clustered.h5ad"
)
)
logger.info(
f"Saved clustered AnnData: {adata.shape}"
)
For cell type annotation, you can either use reference-based methods like RCTD
We’ll use Enrichmap scanpy.tl.score_genes, decoupler.mt.ulm or gseapy, Enrichmap adds three things that matter for spatial data:
It’s been benchmarked against other scoring methods using Moran’s I and shows better spatial coherence. It also works across Visium, Xenium, MERFISH, and imaging mass cytometry.
We start by filtering our marker genes to those present in the dataset, then run em.tl.score to generate per-cell enrichment scores for each cell type signature.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
logger.info(
"=== STEP 3: Cell type annotation with Enrichmap ==="
)
clustering_col = "novae_domains_8"
logger.info(
f"Annotating using clustering column: {clustering_col}"
)
logger.info(
f"Marker genes: {marker_genes}"
)
# Ensure we use lognorm counts for enrichmap
if "lognorm_counts" in adata.layers:
adata.X = adata.layers[
"lognorm_counts"
].copy()
# Filter marker genes to those present in the data
marker_genes_filtered = {
ct: [
g
for g in genes
if g in adata.var_names
]
for ct, genes in marker_genes.items()
}
marker_genes_filtered = {
ct: genes
for ct, genes in marker_genes_filtered.items()
if len(genes) > 0
}
logger.info(
f"Marker genes after filtering: { {k: len(v) for k, v in marker_genes_filtered.items()} }"
)
# Run Enrichmap scoring
logger.info(
"Running Enrichmap scoring..."
)
em.tl.score(
adata=adata,
gene_set=marker_genes_filtered,
smoothing=True,
correct_spatial_covariates=True,
)
Let’s plot some of the scores to see how they look:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
coords = adata.obsm["spatial"]
scores = adata.obs["Neutrophils_score"] # replace this with the appropriate colum name from our computed scores above
fig, ax = plt.subplots(figsize=(10, 10))
scatter = ax.scatter(
coords[:, 0],
coords[:, 1],
c=scores,
cmap="seismic",
s=3,
vmin=-scores.abs().max(),
vmax=scores.abs().max(),
)
ax.set_aspect("equal")
ax.invert_yaxis()
ax.axis("off")
ax.set_title("Tumor")
plt.colorbar(scatter, ax=ax, shrink=0.5)
plt.show()
Neutrophil enrichment scores:
Fibroblasts enrichment scores:
Goblet cells enrichment scores:
Macrophages enrichment scores:
Tumor enrichment scores:
If you compare these scores distributions to the markers from the original 10X tutorial, we can see that the scores follow similar distribution. We also see that there is an overlap between Tumor and Goblet cell scores.
To assign a cell type, one can then do a simple idxmax to assign the max scores for each cell ID to a given cell type. We can also use decoupler (repository link) to run statistical tests and perform one-vs-rest comparisons between the scores of a given cell type in a given domain compared to all other domains in that sample.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
score_cols = [
col
for col in adata.obs.columns
if col.endswith("_score")
]
logger.info(
f"Score columns: {score_cols}"
)
score_matrix = adata.obs[
score_cols
].values
score_adata = AnnData(
X=score_matrix,
obs=adata.obs[
[clustering_col]
].copy(),
)
score_adata.var_names = [
col.replace("_score", "")
for col in score_cols
]
ranking_df = dc.tl.rankby_group(
score_adata,
groupby=clustering_col,
reference="rest",
method="wilcoxon",
)
ranking_df = ranking_df.sort_values(
by=["group", "meanchange"],
ascending=[True, False],
)
logger.info(
f"Ranking shape: {ranking_df.shape}"
)
ranking_df.to_csv(
str(
run_dir
/ f"enrichment_ranking_{clustering_col}.csv"
),
index=False,
)
Here is the example of the ranking dataframe (available in the github repo here):
| group | reference | name | stat | meanchange | pval | padj |
|---|---|---|---|---|---|---|
| D1003 | rest | Goblet_cells | 17.71772568858253 | 0.2058900842521403 | 3.06037424683307e-70 | 1.530187123416535e-69 |
| D1003 | rest | Tumor | 6.524446096548585 | 0.08057131782724745 | 6.825329006609873e-11 | 1.1375548344349789e-10 |
| D1003 | rest | Fibroblasts | -2.8177362507011354 | -0.005792962570818874 | 0.004836351981595639 | 0.004836351981595639 |
| D1003 | rest | Macrophages | 5.867352722361021 | -0.007929933567018977 | 4.42808028964004e-09 | 5.5351003620500505e-09 |
| D1003 | rest | Neutrophils | -15.045558367012578 | -0.014965320227218267 | 3.692052243146449e-51 | 9.230130607866121e-51 |
| D1009 | rest | Fibroblasts | 11.581138126903658 | 0.3457794348546475 | 5.1359605420638064e-31 | 1.2839901355159517e-30 |
| D1009 | rest | Macrophages | 10.169842717776737 | 0.04183143564320674 | 2.7034484990128016e-24 | 4.5057474983546696e-24 |
| D1009 | rest | Neutrophils | 11.670558188973292 | 0.02683085310666919 | 1.8023675673478208e-31 | 9.011837836739104e-31 |
| D1009 | rest | Goblet_cells | -7.798139307666739 | -0.016959587070388787 | 6.282663280243531e-15 | 7.853329100304415e-15 |
| D1009 | rest | Tumor | -7.669749266711418 | -0.05728980467007975 | 1.723328895820832e-14 | 1.723328895820832e-14 |
| D1010 | rest | Fibroblasts | 19.745919842884664 | 0.2403534565845365 | 8.696678111413742e-87 | 4.348339055706871e-86 |
| D1010 | rest | Macrophages | -12.73580528164666 | 0.023629517191564146 | 3.739675027432825e-37 | 9.349187568582063e-37 |
| D1010 | rest | Goblet_cells | 2.72647137857913 | 0.0023891458429599896 | 0.006401550421739988 | 0.006401550421739988 |
We can then use these scores together with an elbow-gap method to assign cell types to each cell ID based on the computed scores. The rationale behind this approach is that a given cell ID might still be capturing multiple cells (cell segmentation works well but is not perfect). If the scores are close to each other (e.g., the meanchanges from the ranking dataframe above) for a given domain, then we can infer that the domain may contain a mixture of cell types and assign it to the most enriched cell type(s).
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
enrichmap_gap_annotations = {}
elbow_data = {}
for domain in ranking_df[
"group"
].unique():
domain_data = ranking_df.loc[
ranking_df["group"] == domain
].copy()
# Filter to significant, positively enriched
domain_data = domain_data.loc[
(domain_data["meanchange"] > 0)
& (domain_data["padj"] < 0.05)
]
if len(domain_data) == 0:
enrichmap_gap_annotations[
domain
] = "Unknown"
continue
if len(domain_data) == 1:
enrichmap_gap_annotations[
domain
] = domain_data["name"].iloc[0]
elbow_data[domain] = domain_data
continue
elbow_data[domain] = domain_data
sorted_mc = domain_data[
"meanchange"
].values
gaps = (
sorted_mc[:-1] - sorted_mc[1:]
)
biggest_gap_idx = int(
np.argmax(gaps)
)
selected = (
domain_data["name"]
.iloc[: biggest_gap_idx + 1]
.tolist()
)
enrichmap_gap_annotations[
domain
] = "/".join(selected)
logger.info(
f"Domain -> cell type mapping:\n{enrichmap_gap_annotations}"
)
# Assign domain-level cell type
adata.obs["cell_type_domain"] = (
adata.obs[clustering_col].map(
enrichmap_gap_annotations
)
)
# Per-cell high granularity: argmax over scores
score_df = adata.obs[score_cols].copy()
score_df.columns = [
col.replace("_score", "")
for col in score_cols
]
Let’s also save the “high granularity” cell type assignments to the adata.obs table by simply taking the argmax over the scores for each cell ID.
1
2
3
adata.obs["cell_type_enrichmap"] = (
score_df.idxmax(axis=1)
)
1
2
3
4
5
6
7
8
9
10
logger.info(
f"Cell type distribution:\n{adata.obs['cell_type_domain'].value_counts()}"
)
Fibroblasts 6577
Goblet_cells 2442
Macrophages 1898
Tumor 1069
Macrophages/Neutrophils 230
Unknown 2
Now let’s plot the elbow plot using the meanchange from the ranking dataframe above:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
plottable_domains = [
d
for d in elbow_data
if len(elbow_data[d]) > 0
]
n_domains_plot = len(plottable_domains)
n_cols_plot = min(4, n_domains_plot)
n_rows_plot = int(
np.ceil(
n_domains_plot / n_cols_plot
)
)
fig_elbow, axes_elbow = plt.subplots(
n_rows_plot,
n_cols_plot,
figsize=(
5 * n_cols_plot,
4 * n_rows_plot,
),
)
axes_flat = np.array(
axes_elbow
).flatten()
for ax_idx, domain in enumerate(
plottable_domains
):
ax = axes_flat[ax_idx]
domain_data = elbow_data[domain]
mc_vals = domain_data[
"meanchange"
].values
ct_names = domain_data[
"name"
].values
x_pos = np.arange(len(mc_vals))
selected_names = set(
enrichmap_gap_annotations[
domain
].split("/")
)
ax.plot(
x_pos,
mc_vals,
color="gray",
lw=1.5,
zorder=2,
)
for i, (xp, mc, name) in enumerate(
zip(x_pos, mc_vals, ct_names)
):
if name in selected_names:
color = "steelblue"
ec = "black"
else:
color = "silver"
ec = "gray"
ax.scatter(
xp,
mc,
color=color,
s=60,
zorder=4,
edgecolors=ec,
linewidths=0.5,
)
if len(mc_vals) > 1:
gaps = (
mc_vals[:-1] - mc_vals[1:]
)
gap_idx = int(np.argmax(gaps))
gap_x = gap_idx + 0.5
gap_val = gaps[gap_idx]
ax.axvline(
gap_x,
color="red",
ls="--",
lw=1,
alpha=0.8,
label=f"gap = {gap_val:.3f}",
)
ax.axvspan(
-0.5,
gap_x,
alpha=0.08,
color="steelblue",
)
y_top = mc_vals[gap_idx]
y_bot = mc_vals[gap_idx + 1]
bracket_x = gap_x + 0.3
ax.annotate(
"",
xy=(bracket_x, y_bot),
xytext=(bracket_x, y_top),
arrowprops=dict(
arrowstyle="<->",
color="red",
lw=1.2,
),
)
ax.text(
bracket_x + 0.15,
(y_top + y_bot) / 2,
f"{gap_val:.3f}",
fontsize=6,
color="red",
va="center",
)
ax.set_xticks(x_pos)
ax.set_xticklabels(
ct_names,
rotation=60,
ha="right",
fontsize=5,
)
ax.set_ylabel(
"meanchange", fontsize=7
)
ax.set_title(
f"Domain {domain}", fontsize=8
)
ax.legend(
fontsize=5, loc="upper right"
)
ax.tick_params(labelsize=6)
for ax_idx in range(
n_domains_plot, len(axes_flat)
):
axes_flat[ax_idx].axis("off")
fig_elbow.suptitle(
f"Gap-based elbow — {clustering_col}",
fontsize=11,
)
plt.tight_layout()
plt.savefig(
str(
run_dir
/ f"elbow_plot_{clustering_col}.png"
),
dpi=400,
bbox_inches="tight",
)
plt.close()
logger.info("Saved elbow plot")
We see, for example, that domain D999 is enriched for both Macrophages and Neutrophils, although the meanchanges are quite small. With a larger set of cell type signatures (10+), we would expect to see more robust differences and more confidently assign each domain to one or more cell types based on this gap method.
Let’s now plot the cell type annotations:
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
logger.info("=== Generating plots ===")
cell_type_palette = {
"Tumor": "#E64B35",
"Fibroblasts": "#4DBBD5",
"Macrophages": "#3C5488",
"Neutrophils": "#F39B7F",
"Goblet_cells": "#00599F",
"Macrophages/Neutrophils": "#FFD700",
"Unknown": "#B09C85",
}
sc.pl.spatial(
adata,
color="cell_type_domain",
spot_size=9,
show=False,
title=f"{sample_name} - Cell Types",
palette=[
cell_type_palette.get(
ct, "#999999"
)
for ct in adata.obs[
"cell_type_domain"
].cat.categories
],
)
plt.savefig(
str(
run_dir
/ "cell_types_spatial.png"
),
dpi=600,
bbox_inches="tight",
)
plt.close()
Domains overlapped with the HE image:
Cell Type assignments overlapped with the HE image:
We went from raw outputs to a (cropped) annotated spatial dataset:
We reproduced the main findings from the 10x Loupe Browser CRC tutorial, but in Python and at single-cell resolution. This plugs directly into the scverse ecosystem (Scanpy, SpatialData, AnnData) for whatever comes next. Other visualizations can include matrixplots and dotplots as well to visualize domain scores per cell type to further explore domain/niche assignemnts. Example of how it could look like (using sc.pl.matrixplot with domain assignments from Novae and cell type scores from Enrichmap):
With a segmented and annotated dataset, there are several interesting directions:
Cell-cell communication. LIANA+ can infer ligand-receptor interactions using the spatial proximity information we now have.
Pathway and TF activity. decoupler estimates pathway activities and transcription factor regulon scores, going beyond individual markers.
Spatially variable genes. Tools like SpatialDE, Squidpy, or Novae itself can identify genes with significant spatial expression patterns.
Multi-sample integration and 3D reconstruction. For studies with multiple sections, spatial alignment methods
Niche characterization. You can dig into the cellular composition and functional state of specific niches (tumor-stroma interface, immune regions) to understand how the microenvironment affects disease progression.
sopa, Novae, and Enrichmap are all under active development, so check their docs for the latest features.
Here are some more articles you might like to read next: