Cancer subtyping using cNMF and literature markers

An example of usage of cNMF to identify cancer subtypes

Purpose of this tutorial

Cancer is a complex disease that arises from the accumulation of genetic mutations. These mutations are vast, and differ from one patient to another. This makes it so that cancer is generally seen as a heterogeneous disease. This heterogeneity is reflected in the different subtypes of cancer, which are defined by the mutations that are present in the tumor. These subtypes generally have different clinical features, such as the presence of different mutations, the size of the tumor, and the presence of certain genes (key transcription factors). Additionally, these subtypes can be associated with different responses to treatment. Thus, understanding the molecular biology of different cancer types is key to developing effective treatments and, ultimately, improving patient outcomes. Here, I want to show a brief example of how we can use methods such as cNMF to identify cancer subtypes, particularly in Small-cell lung cancer (SCLC).


What do we know about SCLC?

While this isn’t intended to be a comprehensive overview of SCLC, it is important to understand what we already know about the disease, in order to subtype it.

SCLC has long been recognized as one of the most aggressive and heterogeneous cancers. For decades, researchers tried to make sense of this complexity by dividing tumors into “classic” and “variant” forms. But as molecular profiling advanced, a more refined classification emerged: one that highlights the role of dominant transcription factors in shaping tumor identity and behavior.

Thus, in this tutorial, we can define molecular subtypes as being biologically consistent groups of tumors characterized by three key features:

  1. a dominant transcriptional regulator or signaling pathway,
  2. a relatively stable molecular signature across different disease stages and experimental models,
  3. clear association with distinct biological behaviors or therapeutic responses observed in clinical settings

Today, four major molecular subtypes are recognized:

What about SCLC-Y?

Before SCLC-I was proposed, some researchers suggested a fourth subtype called SCLC-Y, supposedly driven by YAP1. YAP1 is a transcriptional co-activator involved in proliferation, survival, and drug resistance. Early studies hinted that it might define a distinct group of tumors. However, subsequent work revealed that YAP1 expression was inconsistent and often low, failing to meet the criteria for a stable, subtype-defining regulator. In other words, YAP1 activity was biologically relevant, but it didn’t reliably carve out a separate molecular identity for SCLC tumors.

The inflamed alternative: SCLC-I

In 2021, Gay et al. proposed the new SCLC-I (Inflamed) subtype. Unlike the neuroendocrine-driven groups, SCLC-I is distinguished by its immune-active microenvironment. These tumors show signatures of interferon signaling, antigen presentation, and T-cell infiltration, features that set them apart biologically and therapeutically. This inflamed phenotype is particularly exciting because it opens the door to immunotherapy strategies. While classic neuroendocrine SCLC tends to be “immune cold,” SCLC-I tumors may be more responsive to checkpoint inhibitors and other immune-targeted treatments.

For this tutorial, we will attempt to replicate the subtyping done in Gay et al. using a dataset from George et al. (2015) which can be found here. This same dataset was used by Gay et al to perform their subtyping.

Here is a summary of what we know about SCLC, from a slide I have created and use to present at conferences:


What is cNMF?

cNMF is a non-negative matrix factorization (NMF) method that is particularly useful for identifying cancer subtypes. The “c” in cNMF stands for “consensus” which means that the algorithm will run the analysis multiple times and find the most stable, reliable patterns.

cNMF can be used in a variety of settings, including single-cell RNA (scRNA) in order to look for groups of genes that work together, called gene expression programs. This can be extended to bulk RNA, this time looking for cancer subtypes (here we are talking about samples and not single cells).

The strength of cNMF is in its ability to solve compositional problems. Imagine sitting in a concert hall, listening to an orchestra. If you focus carefully, you start to notice recurring harmonies from the different instruments. That’s exactly what cNMF does with gene expression data. It picks a part the recurring harmonies of genes, the programs that consistently play together. These GEPs (gene expression programs) reveal the functional states of cells or the molecular identity of tumors, turning noise into meaningful biological insight.

In this tutorial, we will be using the python cNMF package which although has been used for scRNA, can also be used for bulk RNA data.


Reading the data

First, let’s import the necessary libraries.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
from cnmf import cNMF
from scipy.stats import zscore
from matplotlib.colors import rgb2hex
import scanpy as sc
import seaborn as sns
import os
from pathlib import Path
import pandas as pd
import anndata as ad
import numpy as np

plt.style.use("bmh")
plt.rcParams.update({"figure.figsize" : (12, 8),
                     "axes.facecolor" : "white",
                     "axes.edgecolor":  "black"})

Now, lets read the count data. In this case, I am already providing the TPM counts, which were converted from FPKM (original data can be found here ). Check this repository link to find all the data necessary to follow this tutorial .

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
try:
    _base_dir = Path(__file__).resolve().parent
except NameError:
    _base_dir = Path.cwd()

data_file = _base_dir / "data" / "files" / "tpm_counts.csv"

count_data = pd.read_csv(data_file,sep="\t",
    index_col=0,)


clinical_metadata_path = _base_dir / \
    / "data" / "files" / "data_clinical_patient.txt"

clinical_metadata = pd.read_csv(clinical_metadata_path, sep="\t")

clinical_metadata = clinical_metadata.iloc[4:, :]

# Subset clinical metadata with the samples from count_data
clinical_metadata_subset = clinical_metadata[
    clinical_metadata["#Patient Identifier"].isin(count_data.index)
].copy()

clinical_metadata_subset.rename(
    columns={
        "#Patient Identifier": "ID_Sample",
        "UICC Tumor Stage": "tnm_staging",
        "Sex": "sex",
        "Diagnosis Age": "age_at_sample", 
        "Overall Survival (Months)": "time_since_t0_death",
        "Overall Survival Status": "status_os",
    },
    inplace=True,
)

clinical_metadata_subset["status_os"].replace(
    {"1:DECEASED": 1, "0:LIVING": 0}, inplace=True
)

# Add the staging (< III, LS, IV then ES)
clinical_metadata_subset["staging"] = clinical_metadata_subset["tnm_staging"].apply(
    lambda x: "ES" if x == "IV" else "LS" if x != "nan" else "LS"
)

try:
    clinical_metadata_subset.set_index("ID_Sample", inplace=True)
except Exception as e:
    print(f"Error setting index: {e}")

print(
    f"Number of samples per stage: {clinical_metadata_subset.staging.value_counts()}")

We see that we have 72 patients that are in LS (limited stage, <= TNM stage III) and 9 ES (extensive stage, TNM stage IV).

And the count_data has this shape (zoom out if unable to see the entire table):

1
count_data
B2M FTL HLA-B ANXA1 HLA-DRA PFN1 CTSD CD74 IGLL5 LYZ ... CYP4F22 NDST4 CLEC12B IL17C AXDND1 SCN2A CNTNAP5 RNF133 DNAH8 YAP1
sclc_ucologne_2015_S00022 2162.154403 2298.411925 1001.751408 23.633396 86.991391 678.728366 472.136116 27.912742 42.978858 28.775967 ... 2.750689 8.490031 2.725959 2.992372 2.715768 3.514570 8.136591 2.696780 2.728837 3.277595
sclc_ucologne_2015_S00035 3160.088999 3745.273955 648.107862 102.795278 276.450852 615.955668 524.715629 50.000057 46.280251 84.170639 ... 2.192853 8.469487 2.246005 3.080709 2.148493 7.893646 2.475983 2.318928 2.155952 3.605534
sclc_ucologne_2015_S00050 2052.014302 2118.938263 879.408577 36.147493 423.594166 532.336017 489.297038 109.557077 244.091360 212.364860 ... 3.146193 6.516990 2.984626 3.215373 3.042088 4.725947 18.991687 3.011319 3.134531 3.915876
sclc_ucologne_2015_S00213 9640.167384 14605.657288 2289.823486 121.938662 2079.767782 884.827811 1074.222337 176.478753 5425.059394 2610.552043 ... 2.681442 15.147577 2.421056 3.124963 2.386332 4.895384 2.558535 2.451218 2.389190 4.352976
sclc_ucologne_2015_S00356 5004.217605 4981.908336 835.222854 119.778900 983.900900 502.215145 514.830956 122.104755 1418.010653 499.754643 ... 2.829266 3.128775 2.719074 3.650863 2.776094 10.940923 4.448620 3.396203 2.740640 5.575360
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
sclc_ucologne_2015_S02375 6812.104434 11667.759405 1509.701805 1282.925898 2003.241765 1048.792635 1064.024598 183.230711 318.635008 572.531677 ... 2.093379 2.280902 2.106701 2.041678 2.030944 2.036819 3.085045 1.997111 2.072963 2.719448
sclc_ucologne_2015_S02376 8251.028221 3238.868715 3740.925445 83.274742 501.582874 1147.828117 389.587217 58.089660 541.181302 293.667019 ... 2.179461 6.646879 2.105514 2.843782 2.154875 5.445269 6.453248 2.294543 2.117667 3.164042
sclc_ucologne_2015_S02378 9520.367291 6724.043881 2958.637473 209.366759 1717.925348 677.814857 1175.556034 183.766750 267.717238 523.855194 ... 2.772599 2.877168 2.858701 3.147568 2.824992 2.806729 3.585936 2.965653 2.768039 6.153518
sclc_ucologne_2015_S02382 3779.392094 2525.732381 577.241843 161.040659 226.311589 307.053940 236.202220 49.486211 557.569124 478.677608 ... 3.289202 3.953085 3.289202 6.149953 3.346079 4.930345 8.773256 3.703050 3.311947 7.840243
sclc_ucologne_2015_S02397 4621.667938 5300.872063 1401.574063 82.634577 444.025069 779.949035 413.593408 72.070549 338.586292 227.066808 ... 2.991421 14.824241 3.086601 4.139198 2.982478 13.696145 4.118115 3.192017 2.981595 4.707087

81 rows × 1294 columns

Let’s now create an AnnData object and add the log_tpm_counts to it. This will be useful later on for plotting of the main TFs (transcription factors) for each group or subtype.

1
2
3
4
5
6
7
8
9
10
11
adata = ad.AnnData(count_data)

adata.obs = clinical_metadata_subset.copy()

adata.var = pd.DataFrame(count_data.columns)

adata.layers["tpm_counts"] = adata.X.copy()

adata.layers["log_tpm_counts"] = np.log1p(count_data.values).copy()

adata.var.set_index(0, inplace=True)

While one could run cNMF using all the genes from the bulk dataset, only a few thousand genes are needed to identify the subtypes. There are several methods to select the most relevant genes, but for this tutorial, we will use the discovery set from the original publication (Gay et al. 2021). These can be found in the supplementary material of the paper (see DOI above).

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
# Load genes 1300 discovery set
import openpyxl
final_genes = pd.read_excel(
    _base_dir / "data" / "files" / "1-s2.0-S1535610820306620-mmc2.xlsx",
    header=1,
)
final_genes = final_genes["Unnamed: 0"].tolist()
final_genes.append("YAP1") # adding YAP1 to this list just to test the expression  of YAP1 across the samples to see that it shows a somewhat even expression across identified subtypes. 

missing_genes = [gene for gene in final_genes if gene not in adata.var_names]
print("Missing genes:", missing_genes)

final_genes_filtered = [
    gene for gene in final_genes if gene in adata.var_names]
final_adata = adata[:, final_genes_filtered].copy()
final_adata.X = final_adata.layers["tpm_counts"].copy()

Lets now crate the TPM counts file that will be used by cNMF.

1
2
3
4
5
6
counts_file_new = "./temp_counts_subtype_identification.txt"
pd.DataFrame(
	adata.X,
	index=adata.obs_names,
	columns=adata.var_names,
).to_csv(counts_file_new, sep="\t")

Let’s now run cNMF. Here, we will test a range of components (from 2 to 8). This is generally done if you don’t yet know the optimal number of components/groups/subtypes to use. However, from the literature, we already know for SCLC that the field recognizes this number to be 4, so we will already use this number here.

Running cNMF

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
subtype_col_name="nmf-group-de-novo"
k: int = 4
density_threshold: float = 0.1
components: range = range(2, 8)
mandatory_genes: list = ["ASCL1", "NEUROD1", "POU2F3", "YAP1"] 
n_iter: int = 150
total_workers: int = 1
num_highvar_genes: int = 2000 # number 
ntop_genes: int = 5000
data_save_path: str = "./"
layer = "tpm_counts"

cnmf_obj_new = cNMF(
	output_dir="./cnmf_results_selected_genes", name="nmf_selected_genes"
)

cnmf_obj_new.prepare(
	counts_fn=counts_file_new,
	tpm_fn=f"{_base_dir}/temp_counts_subtype_identification.txt", # note that we can load the TPM counts file directly. Generally, for this package, one can run the raw counts and then it will convert to TPM, but since we don't have access to the raw counts here, we will load the TPM counts directly, which have been calculated from the original FPKM values. 
	components=components,
	n_iter=n_iter,
	num_highvar_genes=num_highvar_genes, # The package may already use this number to use just the top 2000 genes that are highly variable across the samples. 2000 is the default number, but you can change it to something else. In this case, the total number of genes is around 1300, so this number being 2000 won't do much harm.
)

cnmf_obj_new.factorize(worker_i=0, total_workers=total_workers)

cnmf_obj_new.combine()

This plot will show the stability and error for each of the K tested (in our case, from 2 to 7 components). We see that the optimal number of components is 4, which offers the best stability and lowest error, which once again confirms the literature.

1
2
3
4
5
6
7
8
cnmf_obj_new.consensus(
	k=k,
	density_threshold=density_threshold,
	show_clustering=True,
	close_clustergram_fig=False,
)

cnmf_obj_new.k_selection_plot()
1
2
3
usage_matrix_nmf_var, gep_scores_nmf_var, gep_tpm_nmf_var, topgenes_nmf_var = (
	cnmf_obj_new.load_results(K=k, density_threshold=density_threshold)
) # capture the usage matrix, GEP scores, etc.

Now that we have calculated the usage matrix, we can merge it with the original AnnData object. This is a normalized value, that sums to 1 across our selected K. In this case, we will have 4 new columns, from 1 to 4, representing each of the 4 components.

We can also then take the maximum value of the usage matrix, and give a categorical assignment to each sample based on the maximum value. This is useful for other downstream analysis, such as differential expression analysis, boxplot visualizations, etc.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
adata_nmf = adata.copy()
adata_nmf.obs = pd.merge(
	left=adata_nmf.obs,
	right=usage_matrix_nmf_var,
	how="left",
	left_index=True,
	right_index=True,
)
top_genes_nmf = []

print(f"Selecting top {ntop_genes} genes for each subtype")
for gep in gep_scores_nmf_var.columns:
	top_genes_nmf.append(
		list(
			gep_scores_nmf_var.sort_values(by=gep, ascending=False).index[
				:ntop_genes
			]
		)
	)

top_genes_nmf = pd.DataFrame(top_genes_nmf, index=gep_scores_nmf_var.columns).T

for col in top_genes_nmf.columns:
	print([g for g in top_genes_nmf[col] if g in mandatory_genes])

adata_nmf.obs[subtype_col_name] = usage_matrix_nmf_var.idxmax(
	axis=1
).astype(str)

print(
	f"Count of groups: {adata_nmf.obs[subtype_col_name].value_counts()}"
)

gene_df = pd.DataFrame({"gene": adata_nmf.var_names.tolist()})

print(
	f"Saving top {ntop_genes} NMF genes to {data_save_path + 'nmf_genes.csv'}"
)
gene_df.to_csv(data_save_path + "nmf_genes.csv")


adata_nmf.obs[["nmf-group-de-novo",1,2,3,4]]

Our adata.obs now has the subtype scores and the nmf-group-de-novo which is essentially the maximum of the subtype scores, so we can associate a subtype for each sample (zoom out if unable to see the entire table). Doing a simple .value_counts() on this column will give us the number of samples per subtype which goes as follows:

Index Value
1 29
2 21
3 18
4 13

Additionally, the subtype scores, as these are normalized, they can be interpreted as proportions of a given subtype across each sample. This is particularly useful, as we can do correlations with other biomarkers we may have, or to simply indicate the proportion of subtypes for each sample since, realistically, one patient is likely to have multiple subtypes at once, within a single sample.

One could also combine these scores and define thresholds to define “mixed” subtypes (e.g. A/N, A/I, etc). This can be particularly useful if we want to be a bit more specific about the subtypes we define, especially in SCLC, as research suggests that the subtypes are not always clear-cut and an otherwise “non-inflamed” SCLC subtype may still have some imune components that give it an improved treatment response in combination with immune-based therapies . Considering these nuances is important if we want to use this information to guide treatment decisions.

nmf-group-de-novo 1 2 3 4
ID_Sample
sclc_ucologne_2015_S00022 2 0.472633 0.506275 0.006064 0.015028
sclc_ucologne_2015_S00035 1 0.836964 0.000000 0.106799 0.056237
sclc_ucologne_2015_S00050 2 0.174711 0.798140 0.027149 0.000000
sclc_ucologne_2015_S00213 3 0.417614 0.000000 0.539206 0.043180
sclc_ucologne_2015_S00356 1 0.427478 0.282972 0.198701 0.090849
... ... ... ... ... ...
sclc_ucologne_2015_S02375 4 0.034423 0.000000 0.384087 0.581490
sclc_ucologne_2015_S02376 1 0.656548 0.029480 0.260617 0.053356
sclc_ucologne_2015_S02378 3 0.175559 0.215421 0.337799 0.271221
sclc_ucologne_2015_S02382 2 0.228773 0.581242 0.023989 0.165996
sclc_ucologne_2015_S02397 1 0.622625 0.079786 0.144879 0.152710

81 rows × 5 columns

Plotting the results

Now that we have our subtypes, it is time to plot some of the known markers from SCLC and see which group has highest expression for each of the key markets. To explore this, let’s do some boxplots, showing the groups (i.e. subtypes) against key genes using the log_tpm_counts that we calculated earlier.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
# Start by creating a dataframe with the log_tpm_counts and the ID_Samples
key_markers_boxplot = pd.DataFrame(
	adata_nmf.layers["log_tpm_counts"],
	columns=adata_nmf.var_names,
	index=adata_nmf.obs_names,
)

# Subset just the key genes
key_markers_boxplot = key_markers_boxplot[["ASCL1", "NEUROD1", "POU2F3", "YAP1"]]

# Add the NMF grouping we calculated earlier using cNMF
merged_data = key_markers_boxplot.merge(adata_nmf.obs[["nmf-group-de-novo"]], on="ID_Sample", how="inner")

# Now, plot each of the key genes against the NMF groups
for key_genes in ["ASCL1", "NEUROD1", "YAP1", "POU2F3"]:
	ax = sns.boxplot(x="nmf-group-de-novo", y=key_genes, data=merged_data, palette="Set2")

	ax.set(xlabel="NMF groups", ylabel="log TPM counts")
	ax.set_title(f"Boxplot of {key_genes} expression per NMF group", fontsize=20)
	ax.set_ylabel("log TPM counts")
	ax.set_xlabel("NMF groups")	
	plt.show()
	plt.close()

This will give the following boxplots:

This gives us a good idea of which subtypes are associated with each of the key genes. Now, ideally, if we have the raw counts, the next step is to do some differential expression analysis to see which of the key markers from SCLC (or another cancer being analyzed) are differentially expressed between the subtypes. To do this, one can use the pydeseq2 package, which is a popular tool for differential expression analysis in python. While it is not the focus of this tutorial, I will bring another tutorial later on showing an example of how to use pydeseq2 and a modification to do both one-vs-one and one-vs-all comparisons.

The next step is to do some more visualizations that will help us understand the subtypes better, this time using gene sets and a radial plot. The gene sets we will use (see code below) are NE (neuroendocrine), nonNE (non-neuroendocrine), T-eff (tumor-effector), B/PC (B-cell and plasma cell), and APM (antigen processing and presentation) and Checkpt (immune checkpoint).

Additionally, we will also use a mapping between the numerical NMF groups, and the subtypes based on the boxplots made earlier. We see that cluster 1 has higher expression for ASCL1, so we associate this cluster to A subtype. Cluster 2 has higher expression for NEUROD1, so we associate this cluster to N subtype. Cluster 4 has higher expression for POU2F3, so we associate this cluster to P subtype. The last cluster, cluster 3 can, can be associated with the Inflammed subtype. We see that the expression of YAP1 is not sufficient to define a subtype on its own.

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
adata_analysis_radial = adata_nmf.copy()
adata_analysis_radial.obs = adata_analysis_radial.obs[
    [
        1,
        2,
        3,
        4,
        "nmf-group-de-novo",
    ]
]

# Since we already have the log_tpm_counts, we can use them directly, and then only scale them
adata_analysis_radial.X = adata_analysis_radial.layers["log_tpm_counts"].copy()
sc.pp.scale(adata_analysis_radial)

# These will be the gene sets we will use
gene_categories = {
    "NE": ["CHGA", "DLL3", "NEUROD1", "INSM1", "ASCL1"],
    "nonNE": ["YAP1", "POU2F3", "MYC", "REST"],
    "T-eff": [
        "CD8A",
        "GZMA",
        "GZMB",
        "PRF1",
        "IFNG",
        "CXCL9",
        "CXCL10",
        "TBX21",
    ],
    "B/PC": ["CD79A", "MS4A1", "MZB1", "JCHAIN"],
    "APM": ["TAP1", "TAP2", "B2M", "HLA-A", "HLA-C"],
    "Checkpt": ["PDCD1", "CD274", "LAG3", "CTLA4", "BTLA", "TIGIT"],
}

# Compute scores for each category
for category, genes in gene_categories.items():
    sc.tl.score_genes(adata_analysis_radial, genes,
                      score_name=f"{category}_score")

# Calculate mean scores for each NMF group
nmf_scores = adata_analysis_radial.obs.groupby("nmf-group-de-novo").mean()

# Normalize scores (Z-score across subsets)
score_columns = [f"{cat}_score" for cat in gene_categories.keys()]
nmf_scores_norm = (
    nmf_scores[score_columns] - nmf_scores[score_columns].mean()
) / nmf_scores[score_columns].std()

# Define NMF labels
nmf_labels = {
    "1": "A",
    "2": "N",
    "3": "I",
    "4": "P",
}

# Colors for each NMF group
colors = ["#8A7EB5", "#67C28E", "#E75480", "#FFA500"]
color_map = {
    "1": "#4169E1",  # A - RoyalBlue (cold)
    "2": "#008080",  # N - Teal (cold)
    "3": "#DC143C",  # I-nNE - Crimson (warm)
    "4": "#FF8C00",  # I-NE - DarkOrange (warm)
}

fig = plt.figure(figsize=(8, 8))
ax = plt.subplot(111, projection="polar")

# Prepare angles
categories = list(gene_categories.keys())
N = len(categories)
angles = [n / float(N) * 2 * np.pi for n in range(N)]
angles += angles[:1]  # Complete the loop

# Plot each group
for nmf_group in sorted(nmf_scores_norm.index):
    data = nmf_scores_norm.loc[nmf_group, score_columns].to_frame().T
    values = data.values.flatten().tolist()
    values += values[:1]  # Complete the loop

    label = nmf_labels.get(nmf_group, nmf_group)
    color = color_map[nmf_group]

    ax.plot(
        angles,
        values,
        linewidth=2.5,
        linestyle="solid",
        label=f"NMF{nmf_group}: {label}",
        color=color,
    )
    ax.fill(angles, values, color=color, alpha=0.2)

# Format the plot
ax.set_theta_offset(np.pi / 2)
ax.set_theta_direction(-1)

ax.set_xticks(angles[:-1])
ax.set_xticklabels(categories, size=15)
ax.set_yticklabels([])  # Hide radial labels
ax.grid(True, color="grey", alpha=0.3)
ax.spines["polar"].set_visible(False)

ax.legend(loc="upper right", bbox_to_anchor=(1.3, 1.1), fontsize=12)
ax.set_title("Radial plot of NMF groups", size=20, y=1.1)
plt.tight_layout()


plt.show()

Conclusion

And there you have it! We have now identified the molecular subtypes of SCLC using cNMF, with matching distributions and TFs expression profiles, compared to the original publication.

There is a lot we can do with this information. Cell type deconvolution on the bulk RNA data to further evaluate cell type composition, survival analysis, differential expression analysis, functional enrichment analysis, etc. I plan on bringing more tutorials across all things related to transcriptomics and machine learning, so stay tuned!

Enjoy Reading This Article?

Here are some more articles you might like to read next:

  • Practical Guide to Cell Segmentation, Domain Identification and Cell Type Annotation in Single-Cell Spatial Transcriptomics (Visium HD)