Abstract

Unravelling the origin and evolution of precancerous lesions is crucial for preventing malignant transformation, yet our current knowledge remains limited. Here we used a base editor-enabled DNA barcoding system to comprehensively map single-cell phylogenies in mouse models of intestinal tumorigenesis induced by inflammation or loss of the Apc gene. Through quantitative analysis of high-resolution phylogenies, we identified tens of independent cell lineages undergoing parallel clonal expansions within each lesion. We also found polyclonal origins of human sporadic colorectal polyps through bulk whole-exome sequencing and single-gland whole-genome sequencing. Genomic and clinical data support a model of polyclonal-to-monoclonal transition, with monoclonal lesions representing a more advanced stage.

1. Bibliographic Information

1.1. Title

Polyclonal-to-monoclonal transition in colorectal precancerous evolution

1.2. Authors

Zhaolian Lu, Shanlan Mo, Duo Xie, Xiangwei Zhai, Shanjun Deng, Kantian Zhou, Kun Wang, Xueling Kang, Hao Zhang, Juanzhen Tong, Liangzhen Hou, Huijuan Hu, Xuefei Li, Da Zhou, Leo Tsz On Lee, Li Liu, Yaxi Zhu, Jing Yu, Ping Lan, Jiguang Wang, Zhen He, Xionglei He & Zheng Hu

1.3. Journal/Conference

Published online in Nature. Nature is one of the most prestigious and influential scientific journals globally. Its reputation stems from publishing groundbreaking, high-impact research across all fields of science and technology. Publication in Nature signifies exceptional scientific rigor, novelty, and broad significance, making it a top-tier venue in academic research.

1.4. Publication Year

2024

1.5. Abstract

The paper investigates the origin and evolution of precancerous lesions in colorectal cancer, a critical area for prevention but with limited current understanding. The authors utilized a base editor-enabled DNA barcoding system, SMALT, to map single-cell phylogenies in mouse models of intestinal tumorigenesis (induced by inflammation or Apc gene loss). Through quantitative analysis of these high-resolution phylogenies, encompassing 260,922 single cells, they identified numerous independent cell lineages undergoing parallel clonal expansions within each lesion. Complementing this, they found polyclonal origins in human sporadic colorectal polyps using bulk whole-exome sequencing (WES) and single-gland whole-genome sequencing (WGS). Genomic and clinical data support a model of polyclonal-to-monoclonal transition, where monoclonal lesions represent a more advanced stage of disease. Single-cell RNA sequencing (scRNA-seq) revealed extensive intercellular interactions in early polyclonal lesions, which significantly diminished during monoclonal transition. These findings suggest that colorectal precancer often originates from multiple lineages, with cooperative intercellular interactions being crucial in the earliest stages of cancer formation, offering insights for earlier intervention strategies.

1.6. Original Source Link

/files/papers/691c4e2b25edee2b759f32e3/paper.pdf Publication Status: Published online in Nature on 30 October 2024, with DOI: 10.1038/s41586-024-08133-1.

2. Executive Summary

2.1. Background & Motivation

The core problem the paper addresses is the limited understanding of how precancerous lesions originate and evolve into malignant tumors, particularly in colorectal cancer (CRC). This knowledge gap is critical because unraveling these initial phases is essential for effective early screening, intervention, and ultimately, preventing malignant transformation.

CRC is a significant global health burden, being the third most prevalent cancer and the second leading cause of cancer-related deaths. The well-established adenoma-carcinoma sequence in CRC development provides an ideal model to study early tumorigenesis. While genomic sequencing of malignant tumors typically reveals a single founding clone, previous studies using somatic markers have suggested that premalignant adenomas or even some CRCs can have polyclonal origins, meaning multiple distinct cell lineages contribute to the lesion. However, a comprehensive understanding of the prevalence, temporal dynamics, and underlying mechanisms of these polyclonal origins and their subsequent evolution in the earliest stages of lesion formation remains largely unknown. The challenges include the long span of precancerous stages and the difficulty in accurately identifying the earliest events of tumor initiation.

The paper's entry point and innovative idea lie in leveraging a novel, high-resolution lineage tracing technique called SMALT (substitution mutation-aided lineage-tracing), which is a base editor-enabled DNA barcoding system. This system allows for systematically mapping the origin and evolution of intestinal precancers at single-cell resolution in mouse models. By integrating this advanced lineage tracing with other multi-omics datasets (like WES, WGS, and scRNA-seq) in both mouse and human samples, the authors aim to comprehensively delineate the early evolutionary trajectory of colorectal tumorigenesis.

2.2. Main Contributions / Findings

The paper makes several significant contributions and reaches key conclusions regarding the early evolution of colorectal cancer:

Development and Application of High-Resolution Lineage Tracing: The study used SMALT, a base editor-enabled DNA barcoding system, to generate high-resolution single-cell phylogenies in mouse models of intestinal tumorigenesis. The system offers far more mutable sites (836 on average) than previous methods, enabling more detailed tracking of cell lineages.
Identification of Polyclonal Origins in Precancerous Lesions: Through quantitative analysis of single-cell phylogenies covering more than 260,000 cells from mouse models (AOM/DSS and $Apc^{Min/+}$), the study found that a majority of precancerous lesions (e.g., 66.7% of AOM/DSS neoplasms) originate polyclonally, meaning they are founded by multiple independent cell lineages undergoing parallel clonal expansions. The estimated number of founding progenitors (Np) varied from 2 to 33 in AOM/DSS lesions and from 4 to 100 in $Apc^{Min/+}$ polyps.
Validation of Polyclonal Origins in Human Sporadic Colorectal Polyps: The research extended these findings to human biology by demonstrating polyclonal origins in sporadic colorectal polyps through bulk WES and single-gland WGS from patient cohorts. This confirms that the polyclonal initiation observed in mouse models is biologically relevant to human disease.
Elucidation of a Polyclonal-to-Monoclonal Transition Model: The study provides compelling genomic and clinical evidence supporting a "polyclonal-to-monoclonal transition" model in colorectal tumorigenesis. Monoclonal lesions were found to be more advanced, characterized by significantly higher barcode mutations (indicating more cell divisions), higher genome-wide mutation burdens, more putative driver mutations, larger size, and higher-grade dysplasia compared to polyclonal lesions. This transition is driven by stringent subclonal selection.
Discovery of Critical Intercellular Interactions in Early Stages: Single-cell RNA sequencing (scRNA-seq) revealed extensive intercellular interactions, particularly via extracellular matrix (ECM) organization and cell adhesion pathways, in early polyclonal lesions. These interactions were observed to significantly decrease as lesions transitioned towards monoclonality. This suggests that intercellular cooperation plays a crucial role in promoting neoplastic growth during the earliest phases of tumorigenesis.

These findings address the limited understanding of the cellular origin and early evolutionary dynamics of colorectal precancer. They provide a conceptual framework for the initial phases of cancer formation, highlighting that CRC often begins with multiple cooperating lineages before a dominant, more aggressive clone takes over. This points to opportunities for earlier intervention strategies that target these initial cooperative interactions.

3.1. Foundational Concepts

To fully grasp the contents of this paper, a foundational understanding of several key biological and technical concepts is essential:

Precancerous Lesions: These are abnormal cells or tissues that have undergone some genetic or cellular changes, making them more likely to develop into cancer. They are not yet malignant but represent an increased risk. In the context of the colon, adenomas or polyps are common precancerous lesions.
Malignant Transformation: This is the process by which normal cells or precancerous cells acquire cancerous properties, becoming invasive and capable of metastasis. It involves accumulation of genetic mutations that drive uncontrolled cell growth and survival.
DNA Barcoding (Lineage Tracing): This technique involves tagging individual cells with unique, heritable genetic sequences (barcodes). As these cells divide, their progeny inherit the barcode, allowing researchers to track cell lineages, understand developmental relationships, and reconstruct phylogenetic trees. It's akin to giving each founding cell a unique ID that is passed down.
Base Editor: A sophisticated gene-editing tool derived from CRISPR-Cas technology. Unlike CRISPR-Cas9 which typically creates double-strand breaks in DNA, a base editor chemically modifies a single DNA base into another (e.g., C-to-T) without causing DNA breakage. This allows for precise, targeted, and programmable point mutations, which are crucial for generating diverse barcodes in lineage tracing.
Single-Cell Phylogenies: These are evolutionary "family trees" constructed for individual cells within a tissue or population. By analyzing the accumulation of heritable somatic mutations (like the barcode mutations induced by SMALT), researchers can infer the ancestral relationships between cells and their clonal descendants. This helps visualize how different cell lineages expanded and diversified.
Clonal Expansion: This refers to the proliferation of a group of cells that all originate from a single common ancestral cell. When a cell acquires a mutation that confers a growth advantage, it can undergo clonal expansion, leading to a population of genetically identical (or nearly identical) cells.
Polyclonal Origin: A lesion or tumor is considered to have a polyclonal origin if it develops from multiple distinct progenitor cells that independently initiate growth and expand. These multiple lineages coexist and contribute to the overall lesion.
Monoclonal Origin: A lesion or tumor is considered to have a monoclonal origin if it arises from a single progenitor cell that acquires mutations and then expands to form the entire lesion. All cells in the lesion are descendants of that one founding cell.
Colorectal Cancer (CRC): Cancer that develops in the colon or rectum. It typically begins as small, noncancerous (benign) clumps of cells called polyps that form on the inside of the colon.
Adenoma-Carcinoma Sequence: A widely accepted model describing the progression of CRC. It posits that most CRCs develop from benign adenomatous polyps that gradually accumulate genetic mutations, leading to malignant transformation.
Whole-Exome Sequencing (WES): A genomic technique for sequencing all the protein-coding regions of genes in a genome (the exome). While it only covers about 1-2% of the human genome, it is cost-effective and covers the most functionally relevant parts where disease-causing mutations are concentrated.
Whole-Genome Sequencing (WGS): A comprehensive genomic technique that determines the entire DNA sequence of an organism's genome at a single time. It provides a complete picture of all genetic variations, including those in non-coding regions.
Single-Cell RNA Sequencing (scRNA-seq): A powerful technique that measures the gene expression profiles of individual cells. Unlike bulk RNA-seq which averages expression across a population of cells, scRNA-seq reveals cell-to-cell heterogeneity, identifies distinct cell types, and uncovers cell states and functions within a complex tissue.
Apc Gene: The Adenomatous Polyposis Coli gene. It is a tumor suppressor gene, and mutations in Apc are a common initiating event in colorectal cancer development, particularly in familial adenomatous polyposis (FAP). Apc is a key component of the Wnt signaling pathway, which regulates cell proliferation and differentiation.
AOM/DSS Model: A commonly used mouse model for inducing colorectal cancer. AOM (azoxymethane) is a genotoxic carcinogen that induces DNA damage, and DSS (dextran sodium sulfate) induces colonic inflammation. The combination mimics inflammation-driven CRC, similar to that seen in inflammatory bowel disease (IBD) patients.
$Apc^{Min/+}$ Mice: A genetically engineered mouse model that carries a germline mutation in the Apc gene. These mice spontaneously develop multiple intestinal polyps (adenomas) and are widely used to study APC-driven intestinal tumorigenesis, particularly modeling FAP.

3.2. Previous Works

The paper builds upon a foundation of previous research, integrating various techniques and concepts:

Clonality of Tumors: The prevailing view from early cancer genomic sequencing was that malignant tumors typically arise from a single founding clone (monoclonal origin) (Ref. 8). However, this paper acknowledges that earlier studies using somatic markers had already suggested that premalignant adenomas or even some CRCs could have polyclonal origins, where multiple distinct lineages expand concurrently (Ref. 9, 10). This paper aims to provide a more comprehensive and high-resolution picture of this debate, especially in the earliest stages.
Lineage Tracing Techniques: The study significantly advances lineage tracing, which has revolutionized how cell lineages are recorded. Earlier methods included:
- CRISPR-Cas9-based lineage tracing: Techniques like those by Chan et al. (Ref. 12) and Yang et al. (Ref. 14) use CRISPR-Cas9 to introduce mutations at specific target sites in DNA barcodes. These mutations accumulate over cell divisions, allowing phylogenetic reconstruction. The current paper differentiates itself by using SMALT, a base editor-based system, which offers higher mutation diversity and resolution.
Inflammation-Induced Clonal Expansions: The paper notes that previous work by Olafsson et al. (Ref. 23) had already observed significant clonal expansions in the inflamed colon of patients with IBD. The current study's AOM/DSS mouse model further substantiates and quantifies these inflammation-driven clonal expansions with high-resolution lineage data.
Selection in Cancer Evolution: The dN/dS ratio (ratio of non-synonymous to synonymous mutations) as a metric for quantifying selective pressure in cancer was established by Martincorena et al. (Ref. 24). This study applies this metric to assess selection in polyclonal versus monoclonal lesions.
Computational Modeling of Tumor Evolution: Agent-based tumor simulations and approximate Bayesian computation are mentioned as methods used in prior studies (Ref. 25, e.g., by Hu et al. in colorectal cancer) to infer spatial dynamics and selection coefficients, which this paper also utilizes.
Single-Gland Analysis in Colon: The concept that individual glands (crypts) in the colorectal epithelium often represent clonal cell populations was established by studies such as Lee-Six et al. (Ref. 31) and Saini et al. (Ref. 32). This forms the basis for the paper's single-gland WGS to assess clonality.
Tumor-Associated Macrophage (TAM) Classification: The unified nomenclature for TAMs based on single-cell transcriptomics (Ref. 34) is referenced for classifying macrophage subtypes within the tumor microenvironment.
Cell-Cell Communication Inference: Tools like CellChat (Ref. 36) are used to infer ligand-receptor interactions between cell types, building on methods designed to understand intercellular communication from scRNA-seq data.
Models of Polyclonal Initiation: The paper references the "recruitment model" involving short-range interactions (Ref. 38, by Thliveris et al.) as a potential explanation for polyclonal initiation. It also cites Reeves et al. (Ref. 41) who showed that Hras-mutant clones can recruit neighboring wild-type epithelial cells, and the Allee effect (Ref. 42), which describes growth barriers for small populations, suggesting that recruitment could be selectively favored.

3.3. Technological Evolution

The field of cancer evolution research has seen a rapid technological advancement, particularly in high-throughput sequencing and single-cell analysis. This paper's work fits within this timeline by integrating several cutting-edge technologies:

From Bulk to Single-Cell Genomics: Early cancer genomics relied on bulk sequencing, which provided an averaged view of tumor mutations. The evolution to single-cell genomics (scRNA-seq, single-cell lineage tracing) allowed for unprecedented resolution in dissecting tumor heterogeneity and evolutionary trajectories.
From Somatic Markers to Programmable DNA Barcodes: Initial clonality studies often relied on endogenous somatic mutations or genetic markers. The development of synthetic, programmable DNA barcodes (e.g., using CRISPR-Cas9) enabled more precise and scalable lineage tracing. The SMALT system represents an evolution of this, moving from CRISPR-Cas9 (which primarily introduces indels) to base editors (which introduce point mutations), offering higher diversity and potentially finer resolution.
Integration of Multi-Omics Data: The trend has moved towards integrating multiple types of data (genomics, transcriptomics, lineage information) from the same or matched samples. This paper exemplifies this by combining SMALT lineage tracing, WGS/WES, and scRNA-seq to provide a holistic view of both genetic evolution and cellular microenvironment changes.

3.4. Differentiation Analysis

Compared to the main methods in related work, the core differences and innovations of this paper's approach are:

High-Resolution Lineage Tracing with SMALT: While CRISPR-Cas9-based lineage tracing methods (e.g., Ref. 12, 14) are powerful, they typically offer a limited number of mutable sites (10-60). The SMALT system, utilizing a base editor, significantly increases this capacity to an average of 836 mutable sites. This higher diversity provides superior resolution for phylogenetic mapping, yielding 3.3 times more internal branching events in phylogenetic trees compared to CRISPR-Cas9 systems, allowing for a much finer reconstruction of clonal relationships and expansion dynamics.
Comprehensive Integration of Multi-Omics Data: The study doesn't rely solely on lineage tracing. It rigorously integrates high-resolution SMALT data with bulk WES, single-gland WGS, and scRNA-seq data from both mouse models and human patient samples. This multi-omics approach provides a robust and multifaceted view of tumor evolution, allowing for correlation between genetic changes, cellular states, intercellular communication, and clinical parameters.
Focus on the Polyclonal-to-Monoclonal Transition: While previous studies acknowledged polyclonal origins, this paper specifically proposes and provides extensive evidence for a dynamic polyclonal-to-monoclonal transition as a common trajectory in early colorectal tumorigenesis. It quantifies Np (number of founding progenitors) and characterizes the genomic and microenvironmental features that distinguish these stages.
Insights into Intercellular Interactions in Early Stages: The scRNA-seq component uniquely reveals that early polyclonal lesions are characterized by extensive intercellular interactions, particularly through ECM organization and cell adhesion. This highlights a cooperative aspect of early tumorigenesis that is lost as lesions become monoclonal, suggesting a mechanism beyond just independent clonal expansion. This provides a novel mechanistic understanding of the early tumor microenvironment.

4. Methodology

This section provides a detailed, step-by-step deconstruction of the paper's methodology, integrating the principles with the specific technical implementation and mathematical formulas where applicable.

4.1. Principles

The core idea of the method used in this paper is to leverage DNA barcoding to track cell lineages at single-cell resolution, coupled with genomic and transcriptomic profiling, to understand the evolutionary dynamics of precancerous lesions. The theoretical basis rests on the principle that mutations accumulate over cell divisions and are heritable. By engineering a DNA barcode that can be continuously mutated, each cell's lineage can be uniquely marked. Reconstructing phylogenetic trees from these mutations allows for discerning clonal relationships and expansion histories.

The SMALT (substitution mutation-aided lineage-tracing) system is central to this. It employs HsAID (an optimized Homo sapiens activation-induced cytidine deaminase), which induces C-to-T mutations in a targeted 3-kb DNA barcode. Crucially, these C-to-T mutations only become fixed after DNA replication, meaning the number of barcode mutations tracks the number of cell divisions. iSceI (an inactive variant of the homing nuclease I-SceI) is fused to HsAID and guides it to specific 18-bp DNA motifs within the barcode, ensuring targeted mutagenesis. The expression of this HsAID-iSceI fusion protein is inducible by doxycycline, allowing temporal control over barcode mutagenesis.

Once lineage trees are constructed, quantitative methods are applied to assess clonality. TarCA (targeting coalescent analysis) is used to estimate Np (the number of founding progenitor cells) by analyzing the probability of two random cells sharing a common ancestor within a monophyletic clade. WES and WGS are used to identify somatic mutations (SNVs, indels, SCNAs) and calculate CCFs (Cancer Cell Fractions), which help determine whether a lesion is monoclonal (one dominant clone) or polyclonal (multiple clones). Finally, scRNA-seq is employed to dissect the cellular states and intercellular communication networks within these lesions, providing insights into the microenvironmental factors that accompany clonal evolution.

4.2. Core Methodology In-depth (Layer by Layer)

The methodology involves a comprehensive multi-omics approach, spanning DNA barcoding, mouse modeling, sequencing, computational analysis of genomic and transcriptomic data, and quantitative phylogenetic inference.

4.2.1. SMALT System and Mouse Model Establishment

The study first establishes the SMALT lineage tracing system in mice.

SMALT System Components (Fig. 1a, Extended Data Fig. 1a,b):
- HsAID: An optimized human activation-induced cytidine deaminase with 10 amino acid substitutions relative to the wild type. It induces cytosine (C)-to-thymine (T) mutations at a rate approximately 30-fold faster than the wild type. The key property is that the deaminated $C$ (uracil, $U$ ) is converted to $T$ only after DNA replication, meaning the accumulated C-to-T barcode mutations directly measure the relative number of cell divisions a cell has undergone.
- iSceI: An inactive variant of the homing nuclease I-SceI. It binds specifically to an 18-bp DNA motif, thereby guiding HsAID to target regions within the DNA barcode.
- 3-kb DNA Barcode: This engineered barcode consists of 16 different tandem targets. Each target contains an 18-bp iSceI binding site and a 156-bp editing region where HsAID can induce mutations.
- Inducible Expression: The expression of the HsAID-iSceI fusion protein is under the control of a doxycycline-inducible promoter. This allows for precise temporal control over when barcode mutagenesis begins.
- Integration: The SMALT system (HsAID, iSceI, and the 3-kb DNA barcode) was knocked into the H11 locus of C57BL/6J mouse embryonic stem cells to generate SMALT mice.
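As a quick consistency check on the barcode dimensions listed above (a back-of-the-envelope sum, not a figure quoted from the paper), 16 targets of 18 bp plus 156 bp each give: $ 16 \times (18 + 156)\ \mathrm{bp} = 16 \times 174\ \mathrm{bp} = 2{,}784\ \mathrm{bp} \approx 3\ \mathrm{kb} $ which is consistent with the nominal 3-kb barcode length. By the same arithmetic, the editing regions alone span $ 16 \times 156\ \mathrm{bp} = 2{,}496\ \mathrm{bp} $.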
Mouse Models of Intestinal Tumorigenesis (Fig. 1b, Extended Data Fig. 1c-e):
- AOM/DSS Model (Inflammation-driven): Transgenic male mice ($Rosa26^{rtTA};H11^{SMALT}$) were used. At 6-8 weeks, mice were fed doxycycline for 3 days. Then, a single intraperitoneal injection of AOM ($12.5\ \mathrm{mg\,kg^{-1}}$ body weight) was administered. Three days later, mice underwent three cycles of 2% DSS treatment (7 days of DSS-dissolved water followed by 14 days of tap water per cycle). This model is used to study CRC induced by inflammation.
- $Apc^{Min/+}$ Model (APC-driven): $Apc^{Min/+}$ male mice were mated with $Rosa26^{rtTA};H11^{SMALT}$ female mice to generate $Rosa26^{rtTA};H11^{SMALT};Apc^{Min/+}$ males. Female mice were fed doxycycline for 3 days before mating. This model is used for APC-driven intestinal tumorigenesis.
- Sample Collection: At the experimental endpoint (e.g., around 30 weeks postnatal), normal colon, inflamed colon, AOM/DSS neoplasms ($IBD_T$), $Apc^{Min/+}$ polyps ($Apc_P$), and other unaffected organs (blood, liver, lung) were collected.
- Cell Sorting: Cells were sorted using MojoSort nanobeads. CD45+ cells (immune cells) and CD45- EpCAM+ cells (epithelial cells) were isolated, and the sorting was confirmed by FACS.

4.2.2. Barcode Sequencing and Data Processing

Library Preparation for PacBio Sequencing:
- Genomic DNA was isolated from sorted cells.
- A three-step PCR strategy was employed to amplify the 3-kb target barcode:
  1. Lineage Amplification: One cycle of PCR using a P1 primer to incorporate 14-nucleotide Unique Molecular Identifiers (UMIs) onto each original DNA molecule.
  2. Nested PCR Enrichment: Ten cycles of nested PCR amplification using P2 and P3 primers to enrich the indexed target molecules.
  3. Sample Multiplexing: Final PCR amplification using P4 and P5 primers, which contain 6-nucleotide symmetric sample barcodes to enable multiplexing of samples.
- PacBio SMRTbell library preparation was performed, followed by sequencing on the PacBio Sequel IIe platform to generate HiFi reads.
Long-read Sequencing Data Processing:
- Read Processing: Adapters were removed, and Circular Consensus Sequencing (CCS) reads were generated using pbccs v6.2.0 (with parameters --min-length=1000 --num-threads=10 --by-strand --min-passes=3).
- Alignment: CCS reads were mapped to the reference 3-kb target barcode sequence using minimap2 v2.17 (with parameters -t 10 -A4 -B12 -O10,15 -E2,1 --score-N 0 --end-bonus 10 -a --MD -x map-pb). samtools was used for sam file manipulation.
- UMI Dereplication and Variant Calling: UMIs, the 3-kb barcode sequence, and sample barcodes were annotated. UMIs were grouped using usearch v11.0.667 (with parameters -id 0.95 -gapopen 3.0I/2.0E -gapext 1.0I/0.5E -match +2.0 -mismatch 20.0 -sizeout), and groups with fewer than 3 CCS reads were removed. For each remaining group, reads were realigned and collapsed into a consensus CCS read, and nucleotide substitutions were called with an overall mapping quality > 50 and a mutation frequency > 0.6.
- Filtering for Downstream Analysis:
  - Only high-quality CCS reads with at least two mutations on the barcode were retained.
  - To remove normal cells from neoplastic samples:
    1. If a neoplastic sample showed a bimodal distribution of mutation counts, cells in the lower mutation cluster were removed.
    2. If not bimodal, cells with mutation counts lower than the 75th percentile of mutation counts in adjacent normal cells were removed.
  - After filtering, 260,922 cells were used for analysis.
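The consensus and filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names and the string encoding of substitutions (e.g. "C105T") are hypothetical, the mapping-quality check and the bimodality test are omitted, and the 75th percentile is computed with the standard library's inclusive quantile method, which is an assumption about how the percentile was defined.

```python
from collections import Counter
from statistics import quantiles


def consensus_mutations(reads, min_reads=3, min_freq=0.6):
    """Collapse one UMI group into a consensus set of substitutions.

    `reads` is a list of sets; each set holds the substitutions seen in one
    CCS read (hypothetical encoding such as "C105T").
    Returns None when the group has fewer than `min_reads` reads.
    """
    if len(reads) < min_reads:
        return None
    counts = Counter(mut for read in reads for mut in read)
    # Keep a substitution only if it appears in more than `min_freq` of reads.
    return {mut for mut, n in counts.items() if n / len(reads) > min_freq}


def drop_likely_normal_cells(neoplastic_counts, normal_counts):
    """Non-bimodal rule: remove cells whose barcode mutation count is lower
    than the 75th percentile of counts in adjacent normal cells."""
    # quantiles(..., n=4) returns the three quartile cut points; index 2 is Q3.
    q3 = quantiles(normal_counts, n=4, method="inclusive")[2]
    return [count for count in neoplastic_counts if count >= q3]
```

In this sketch a UMI group with too few reads is discarded by returning None, so the caller can skip it before any downstream filtering.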

4.2.3. Phylogenetic Reconstruction and Clonality Assessment

Phylogenetic Tree Reconstruction:
- Phylogenetic trees were reconstructed using a maximum-likelihood method implemented in IQ-TREE v2.2.2.7 (with parameters -T5 -oref -m GTR2+FO+R10).
- The original reference 3-kb barcode sequence was used as the phylogenetic root.
- GTR2+FO+R10 was selected as the optimal substitution model.
- Robustness Evaluation: 1,000 rounds of ultrafast bootstrap approximation (-B 1000) and 1,000 rounds of SH-like approximate likelihood ratio test (-alrt 1000) were performed.
- Hotspot Sites: Sites with a mutation frequency > 0.04 and present in at least two normal samples were identified as hotspots (14 sites).
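The hotspot criterion can be illustrated with a short Python sketch. It takes one reading of the rule, namely that a site counts as a hotspot when its mutation frequency exceeds 0.04 in at least two normal samples; the paper's exact definition may differ, and the function name and input layout are hypothetical.

```python
def find_hotspot_sites(normal_samples, min_freq=0.04, min_samples=2):
    """Return barcode sites that are recurrently mutated in normal samples.

    `normal_samples` maps a sample name to a dict of
    {site position: mutation frequency in that sample} (assumed layout).
    """
    hits = {}
    for freqs in normal_samples.values():
        for site, freq in freqs.items():
            if freq > min_freq:
                hits[site] = hits.get(site, 0) + 1
    # A site qualifies only if it passed the threshold in enough samples.
    return sorted(site for site, n in hits.items() if n >= min_samples)
```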
Clonality Determination:
- Monoclonal: Characterized by the presence of a single dominant monophyletic clade of neoplastic cells with shared clonal mutations.
- Polyclonal: Identified when neoplastic cells were dispersed into multiple phylogenetic clades, intermixed with normal cells, and lacked clonal mutations.
Quantifying Founding Progenitors (TarCA - Targeting Coalescent Analysis):
- The TarCA v0.1.0 method (Ref. 22) was used to estimate the number of founding progenitor cells (Np) for each neoplastic lesion.
- A "progenitor" is defined as an ancestral cell capable of founding a clonally expanding population in the lesion.
- The effective number of progenitors (Np) is calculated as the inverse of Pr, the probability of two random neoplastic cells sharing a common progenitor in a monophyletic clade.
- The formula for Np is: $ \mathrm{Np} = 1 / \mathrm{Pr} $ where $\mathrm{Np}$ is the effective number of progenitors.
- The formula for Pr (probability of two random neoplastic cells sharing a common progenitor in a monophyletic clade of the phylogenetic tree) is: $ \mathrm{Pr} = \frac{\sum C_{m_i}^2}{C_{N_s}^2} = \frac{\sum (m_i \times (m_i - 1))}{N_s \times (N_s - 1)} $ where:
  - $m_i$ is the number of sampled cells of the $i$ -th monophyletic clade of neoplastic cells.
  - $N_s$ is the total number of neoplastic cells in the sample.
  - $C_{X}^2$ represents "X choose 2", or the number of ways to choose 2 items from X without replacement, which is $\frac{X(X-1)}{2}$ . The formula can be simplified by canceling the $1/2$ terms in both numerator and denominator.
- Robustness Evaluation: Np estimates were evaluated by downsampling: if a sample had >1,000 cells, 1,000 neoplastic cells were downsampled 20 times (including all normal cells); if <1,000 cells, $m$ normal or neoplastic cells were downsampled 20 times ( $m$ being the smaller cell count).
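The Pr and Np formulas above translate directly into code. The following is a minimal sketch of the arithmetic only, not the TarCA package's API: the function names are hypothetical, and the clade sizes are assumed to have already been read off a reconstructed tree.

```python
def pair_sharing_probability(clade_sizes, n_total):
    """Pr: probability that two random neoplastic cells fall in the same
    monophyletic clade, i.e. sum(m_i * (m_i - 1)) / (N_s * (N_s - 1))."""
    if n_total < 2:
        raise ValueError("need at least two neoplastic cells")
    shared_pairs = sum(m * (m - 1) for m in clade_sizes)
    return shared_pairs / (n_total * (n_total - 1))


def effective_progenitors(clade_sizes, n_total=None):
    """Np = 1 / Pr. `n_total` defaults to the sum of the clade sizes."""
    if n_total is None:
        n_total = sum(clade_sizes)
    pr = pair_sharing_probability(clade_sizes, n_total)
    # If no two cells share a clade, Pr is 0 and Np is unbounded.
    return float("inf") if pr == 0 else 1 / pr
```

For example, a lesion whose sampled cells all sit in one clade gives Np = 1, while two equally sized clades of 50 cells give Np slightly above 2, because pairs are drawn without replacement.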
Estimating Timing of Progenitors (Extended Data Fig. 5a):
- This method infers the start time of clonal expansions for polyp initiation.
- The average barcode mutation burden in normal cells ( $m_0$ ) is expressed as: $ m_0 = \mu_0 T $ where $\mu_0$ is the normal cell mutation rate per cell division, and $T$ is the total time from fertilized egg to polyp sampling.
- The average mutation burden for neoplastic cells within each monophyletic clade in the phylogenetic tree is $m_I$ .
- The ratio of the neoplastic cell mutation rate to the normal cell mutation rate is $r = \mu_I / \mu_0$ . This ratio $r$ is estimated from mutation accumulation data obtained from in vitro organoid cultures (see below).
- The timing of progenitor initiation ( $\tau$ ) is then inferred using these values.
- Organoid Experiments: Organoids derived from normal small intestine and neoplastic polyps of $Apc^{Min/+}$ SMALT mice were cultured for 30 days with doxycycline. SMALT barcodes were sequenced at Day 0, 15, and 30. Linear regression of barcode mutations over time allowed estimation of mutation rates ($\mu_0$ and $\mu_I$) and thus the ratio $r$.
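The regression step in the organoid experiment can be sketched as an ordinary least-squares slope of mean barcode mutations against culture time. This is an illustration under stated assumptions: the function names are hypothetical, the slopes are per day rather than per cell division, and only their ratio $r = \mu_I / \mu_0$ is used.

```python
def least_squares_slope(days, mean_mutations):
    """Ordinary least-squares slope of mean barcode mutations versus time."""
    n = len(days)
    mean_x = sum(days) / n
    mean_y = sum(mean_mutations) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(days, mean_mutations))
    sxx = sum((x - mean_x) ** 2 for x in days)
    return sxy / sxx


def mutation_rate_ratio(days, normal_means, neoplastic_means):
    """r = mu_I / mu_0, with each rate taken as a regression slope."""
    mu_0 = least_squares_slope(days, normal_means)
    mu_i = least_squares_slope(days, neoplastic_means)
    return mu_i / mu_0
```

With sampling at Day 0, 15, and 30, each organoid line contributes three points per regression, so the slope is only a coarse estimate; any input values used with this sketch are placeholders, not data from the paper.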

4.2.4. Genomic Analysis of Human and Mouse Samples

Whole-Genome Sequencing (WGS) of Mouse Samples:
- High-quality genomic DNA was extracted from 34 mouse samples (normal, polyps, tumors).
- Sequencing was performed on the Illumina Novaseq PE150 platform, generating an average of 90 Gb of data (approximately 30x coverage) per sample.
Whole-Exome Sequencing (WES) of Human Sporadic Polyp/CRC Cohort:
- WES data (mean depth >200×) were collected from 107 treatment-naive patients with synchronous sporadic premalignant polyps and CRCs.
- Synchronous tumor ( $T$ ), polyp ( $P$ ), and adjacent normal ( $N$ ) samples were collected.
- Genomic DNA was extracted, NEBNext Ultra DNA Library Prep Kit was used for library construction, and SureSelect XT Human All Exon V6 kit for exome capture.
- Sequencing was performed on the Illumina NovaSeq 6000 platform (150 bp paired-end).
- Histological Grading: Polyps were graded as low-grade or high-grade dysplasia based on H&E-stained images, considering architectural and cytological features.
Detection of Somatic SNVs and SCNAs:
- Preprocessing: Raw fastq files were preprocessed with fastp v0.19.7.
- Alignment: Cleaned reads were aligned to the reference genome (mm10 for mouse, GRCh38 for human) using the BWA-MEM algorithm (BWA v0.7.17-r1188).
- GATK Best Practices: Aligned reads were processed with MarkDuplicates, BaseRecalibrator, and ApplyBQSR (GATK v4.2.0.0).
- SNV/Indel Calling: Mutect2 (Ref. 58) was used to identify SSNVs and indels for each tumor/normal or polyp/normal pair. A panel-of-normals (PoN) filter removed artefactual/germline variants. FilterMutectCalls extracted high-confidence variants.
- Annotation: Somatic mutations were annotated using ANNOVAR v20200608.
- SCNA Detection and Purity/Ploidy Estimation: TitanCNA v1.28.0 (Ref. 60) was used to detect SCNAs and estimate tumor purity and ploidy. Samples with purity $\geq 0.25$ were retained.
- CCF Calculation: The cancer cell fraction (CCF) of each somatic mutation was calculated by adjusting the observed variant allele frequency (VAF) for tumor purity, local copy number, and multiplicity (Ref. 28). The paper's exact estimator is not reproduced here; the standard point estimate, which assumes diploid normal cells, is: $ \mathrm{CCF} = \frac{\mathrm{VAF} \times \left[ 2 (1 - \mathrm{Purity}) + \mathrm{Purity} \times \mathrm{CopyNumber} \right]}{\mathrm{Purity} \times \mathrm{Multiplicity}} $ where:
  - $\mathrm{CCF}$ is the Cancer Cell Fraction.
  - $\mathrm{VAF}$ is the Variant Allele Frequency.
  - $\mathrm{Purity}$ is the estimated tumor cell purity.
  - $\mathrm{CopyNumber}$ is the total copy number of the genomic locus in the tumor.
  - $\mathrm{Multiplicity}$ is the number of copies of the locus that carry the mutation in a mutated tumor cell.
- Clonal vs. Subclonal: Mutations were classified as clonal if the upper bound of the 95% confidence interval of their CCF was $\geq 1$ ; otherwise, they were classified as subclonal.
- Filtering SSNVs: SSNVs were retained if they had $\geq 5$ variant reads in WES (or $\geq 4$ in WGS) in the polyp/tumor sample, total reads $\geq 10$ , and $\mathrm{VAF} < 0.01$ in the normal sample.
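
The purity and copy-number adjustment and the read-count filter can be illustrated with a short sketch. It is not the pipeline used in the paper: the function names are hypothetical, the CCF expression is the standard point estimate with diploid normal cells, the default multiplicity of 1 is an assumption, and the sample in which total reads are counted is not specified in the text.

```python
def cancer_cell_fraction(vaf, purity, tumor_cn, multiplicity=1, normal_cn=2):
    """Point estimate of CCF from an observed VAF.

    Based on the expected VAF of a mutation present in a fraction CCF of
    tumor cells:
        VAF = purity * multiplicity * CCF
              / (purity * tumor_cn + (1 - purity) * normal_cn)
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must be in (0, 1]")
    total_copies = purity * tumor_cn + (1 - purity) * normal_cn
    return vaf * total_copies / (purity * multiplicity)


def is_clonal(ccf_upper_95):
    """Clonal if the upper bound of the 95% CI of the CCF reaches 1."""
    return ccf_upper_95 >= 1


def passes_read_filter(alt_reads, total_reads, normal_vaf, assay="WES"):
    """Read-support thresholds quoted above (5 reads for WES, 4 for WGS)."""
    min_alt = 5 if assay == "WES" else 4
    return alt_reads >= min_alt and total_reads >= 10 and normal_vaf < 0.01


# A heterozygous mutation at a diploid locus in a 60%-pure sample:
# a VAF of 0.30 corresponds to a CCF of about 1, i.e. a clonal mutation.
print(cancer_cell_fraction(vaf=0.30, purity=0.6, tumor_cn=2))
```

The same VAF at lower purity would give a CCF above 1, which is why point estimates are usually capped or accompanied by a confidence interval, as in the clonal/subclonal rule above.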
Single-Gland WGS (Human):
- Isolated 29 neoplastic glands from 5 polyps and 3 adjacent normal crypts from one sporadic patient (B139).
- WGS (mean depth approximately 21x) was performed using Vazyme TruePrep DNA Library Prep Kit V2 and Illumina NovaSeq.
- Variant Calling: Mutect2 and Strelka v2.9.2 were used, and VariantFilter obtained a consensus call set.
- High-Confidence Mutations: Mutations were retained if they had $\mathrm{VAF} \geq 0.15$ , variant reads $\geq 4$ , and total reads $\geq 7$ in the gland, and no variant reads in the matched bulk normal sample.
- Phylogenetic Trees: Reconstructed using a neighbour-joining method (Ref. 66) in Biopython (Ref. 67), with the reference sequence as the root.
- SCNA Estimation: Sequenza v3.0.0 R package (Ref. 68) was used to estimate SCNAs relative to matched normal tissue. Cutoffs for amplification (AMP) and deletion (DEL) were $\log_2(2.5/2)$ and $\log_2(1.5/2)$ , respectively.
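
The two cutoffs correspond to segment copy numbers of 2.5 and 1.5 relative to a diploid baseline. A minimal sketch of the resulting three-way call is shown below; the function name is hypothetical, and treating the boundaries as strict inequalities is an assumption.

```python
import math

AMP_CUTOFF = math.log2(2.5 / 2)  # about 0.32
DEL_CUTOFF = math.log2(1.5 / 2)  # about -0.42


def classify_segment(log2_ratio):
    """Label a segment from its log2(tumor/normal) depth ratio."""
    if log2_ratio > AMP_CUTOFF:
        return "AMP"
    if log2_ratio < DEL_CUTOFF:
        return "DEL"
    return "NEUTRAL"


# A single-copy gain (3 copies) and a single-copy loss (1 copy).
print(classify_segment(math.log2(3 / 2)), classify_segment(math.log2(1 / 2)))
```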
Clonal Relatedness for Polyp/CRC Pairs:
- Breakclone v0.3.3 (Ref. 69) was used to assess clonal relatedness, incorporating population frequency and allele frequency.
- It calculates a P-value (from a permutation test) and a clonal relatedness score.
- Classification: related ( $P < 0.01$ and clonality score $> 0.1$ ), unrelated ( $P > 0.05$ and clonality score $< 0.05$ ), or ambiguous (remaining pairs).
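
The three-way classification rule can be written as a small function. This only restates the thresholds quoted above; the function name is hypothetical, and the P-value and clonality score are assumed to come from a prior Breakclone run.

```python
def classify_pair(p_value, clonality_score):
    """Classify a polyp/CRC pair from its permutation P-value and score."""
    if p_value < 0.01 and clonality_score > 0.1:
        return "related"
    if p_value > 0.05 and clonality_score < 0.05:
        return "unrelated"
    return "ambiguous"


print(classify_pair(0.001, 0.4))   # related
print(classify_pair(0.30, 0.01))   # unrelated
print(classify_pair(0.03, 0.08))   # ambiguous
```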
dN/dS Analysis:
- The dN/dS ratio (ratio of non-synonymous to synonymous mutation rates) was estimated using the dndscv v0.0.1.0 R package (Ref. 24).
- This analysis was applied separately to mutation sets within polyclonal polyps, monoclonal polyps, and CRCs.
- dN/dS for putative CRC driver genes was extracted using the genesetdnds function.
- Conceptual Definition: The dN/dS ratio, often denoted as $\omega$ , is a measure of selective pressure acting on protein-coding genes. It quantifies the ratio of the rate of non-synonymous (amino acid changing) substitutions (dN) to the rate of synonymous (silent, non-amino acid changing) substitutions (dS). A dN/dS ratio $>1$ suggests positive selection (mutations provide a fitness advantage), $<1$ suggests purifying selection (deleterious mutations are removed), and $=1$ suggests neutral evolution. In cancer, a high dN/dS can indicate driver mutations are being selected for.
- Mathematical Formula: $ \omega = \frac{dN}{dS} $ where dN is the number of non-synonymous substitutions per non-synonymous site, and dS is the number of synonymous substitutions per synonymous site. These rates are calculated by comparing sequences and accounting for the genetic code and codon usage.
- Symbol Explanation:
  - $\omega$ : The dN/dS ratio.
  - dN: The rate of non-synonymous substitutions, calculated as the number of non-synonymous mutations divided by the number of non-synonymous sites.
  - dS: The rate of synonymous substitutions, calculated as the number of synonymous mutations divided by the number of synonymous sites.
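
As a worked illustration of the ratio, the sketch below computes a naive $\omega$ from raw counts. It is deliberately simplified: dndscv estimates the rates with a trinucleotide-context substitution model rather than by simple division, and the counts and site numbers here are invented.

```python
def naive_dnds(n_nonsyn, n_syn, nonsyn_sites, syn_sites):
    """Naive omega = (nonsynonymous per site) / (synonymous per site)."""
    dn = n_nonsyn / nonsyn_sites
    ds = n_syn / syn_sites
    return dn / ds


# With three times as many nonsynonymous sites as synonymous sites,
# a 3:1 mutation ratio is what neutrality predicts (omega = 1).
print(naive_dnds(n_nonsyn=30, n_syn=10, nonsyn_sites=750, syn_sites=250))

# Doubling the nonsynonymous count doubles omega, a signature of
# positive selection.
print(naive_dnds(n_nonsyn=60, n_syn=10, nonsyn_sites=750, syn_sites=250))
```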

4.2.5. Single-Cell RNA Sequencing and Interaction Analysis

scRNA-seq Sample Preparation:
- Colons from nine AOM/DSS mice were dissected, and tumors were cut into pieces.
- Single-cell suspension was prepared using MACS Tissue Dissociation Kits, followed by digestion, filtration, and resuspension.
- Library generation was performed using 10x Genomics v2 chemistry and Chromium Single Cell 5' Reagent Kits.
scRNA-seq Data Processing and Clustering:
- Alignment and Quantification: Raw scRNA-seq data were aligned to the mm10 reference genome, and UMIs were quantified using Cell Ranger v7.1.
- Integration: 2,619 additional normal colon cells from a previous study (GSE134255, Ref. 33) were integrated as normal controls.
- Quality Control (QC): Cells were retained if they had $\geq 500$ detected genes, $<15\%$ mitochondrial gene expression, and were identified as singlets by DoubletFinder v2.0.3 (Ref. 74). Genes expressed in $<3$ cells were filtered out.
- Normalization: sctransform v2 (Ref. 75, 76) was used for normalization.
- Dimensionality Reduction and Integration: PCA was performed, and the top 50 significant principal components were selected. Harmony v1.1.0 (Ref. 77) was used to remove batch effects and integrate data.
- Clustering: FindNeighbors and FindClusters functions (Louvain algorithm) were used (resolution 0.1 for first round, 0.4 for second round). This resulted in 8 main cell types and 26 subclusters (10 for epithelial, 7 for macrophages).
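
The QC thresholds in this pipeline amount to a simple per-cell and per-gene filter. The sketch below is a language-neutral illustration in plain Python, not the Seurat-based code used in the study; the `CellQC` record and function names are hypothetical, and the doublet call is assumed to have been made upstream.

```python
from dataclasses import dataclass


@dataclass
class CellQC:
    barcode: str
    n_genes: int       # number of genes detected in the cell
    pct_mito: float    # percentage of counts from mitochondrial genes
    is_singlet: bool   # doublet call made upstream (e.g. by DoubletFinder)


def keep_cell(cell, min_genes=500, max_pct_mito=15.0):
    """Apply the three cell-level QC criteria described above."""
    return (cell.n_genes >= min_genes
            and cell.pct_mito < max_pct_mito
            and cell.is_singlet)


def keep_genes(cells_per_gene, min_cells=3):
    """Drop genes expressed in fewer than `min_cells` cells."""
    return {gene for gene, n in cells_per_gene.items() if n >= min_cells}


cells = [
    CellQC("AAAC", 1200, 4.2, True),    # passes
    CellQC("AAAG", 320, 3.0, True),     # too few genes
    CellQC("AACT", 900, 22.5, True),    # high mitochondrial fraction
    CellQC("AAGG", 1500, 5.1, False),   # flagged as a doublet
]
print([c.barcode for c in cells if keep_cell(c)])
```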
Differential Gene Expression and Annotation:
- Seurat FindAllMarkers and FindMarkers functions (Wilcoxon rank-sum test) identified DEGs.
- Canonical markers and DEGs were used for cell type annotation.
- Gene Set Enrichment Analysis (GSEA) was performed using clusterProfiler package (Ref. 78) with Gene Ontology and MSigDB hallmark gene sets.
- Gene Set Variation Analysis (GSVA) package (Ref. 79) was used to score gene sets for individual cells, particularly for macrophage classification.
Differential Cell Abundance Analysis:
- miloR v1.8.1 (Ref. 80) was used to identify changes in cell abundance across cell types.
- A KNN graph was constructed on the Harmony space. A sampling refinement algorithm selected 10% of cells, and neighborhoods were formed. Spatial FDR (corrected P-value) was used.
Cell-Cell Communication Analysis:
- CellChat v1.6.1 (Ref. 36) inferred interactions between epithelial cell types within each sample. It considers expression levels, structural components, soluble agonists, etc.
- Quantification: The intensity of interactions was measured by the total number of inferred ligand-receptor pairs.
- Downsampling: To compare across samples, 50 downsamplings were performed, each with 689 epithelial cells.
- Correlation: The average number of ligand-receptor pairs per sample was correlated with lesion clonality ( $1/Np$ ).
- Differential Interaction Analysis: 1,543 candidate ligand-receptor pairs from CellChatDB were compared between early low-clonality lesions ( $Np > 3$ , 6 samples) and late high-clonality lesions ( $Np \leq 3$ , 3 samples) using the Wilcoxon rank-sum test ( $\mathrm{FDR} < 0.05$ ).
- Validation: MultiNicheNetR v1.0.3 (Ref. 81) was used as an orthogonal method to identify significantly altered ligand-receptor pairs, considering downstream target gene regulation.
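
The downsampling step equalizes cell numbers so that samples with more epithelial cells do not appear to have more interactions simply because of their size. A minimal sketch of that logic follows. The `count_pairs` argument is a hypothetical callable standing in for a CellChat run on the sampled cells, and the seed is an arbitrary choice for reproducibility.

```python
import random
from statistics import mean


def mean_over_downsamples(cells, count_pairs, n_draws=50, n_cells=689, seed=0):
    """Average a per-sample statistic over repeated equal-size downsamples.

    cells       : list of cell identifiers from one sample
    count_pairs : callable returning the number of inferred ligand-receptor
                  pairs for a list of cells (hypothetical stand-in)
    """
    if len(cells) < n_cells:
        raise ValueError("sample has fewer cells than the downsampling size")
    rng = random.Random(seed)
    return mean(count_pairs(rng.sample(cells, n_cells)) for _ in range(n_draws))


def clonality(np_value):
    """Lesion clonality as defined above: the reciprocal of Np."""
    return 1 / np_value


# Toy check with a stand-in statistic that just counts the sampled cells.
toy_cells = list(range(1000))
print(mean_over_downsamples(toy_cells, len), clonality(4))
```

The per-sample averages can then be correlated with `clonality(Np)` across lesions.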

5. Experimental Setup

5.1. Datasets

The study utilized a combination of mouse and human tissue samples, processed with various high-throughput techniques to generate multi-omics datasets.

Mouse Models:
- SMALT Lineage Tracing:
  - AOM/DSS neoplasms ( $IBD_T$ ): 30 samples.
  - $ApcMin/+$ polyps ( $Apc_P$ ): 17 samples.
  - Normal and inflamed intestinal tissues: 26 samples (including normal colon, inflamed colon, $ApcMin/+$ small intestine).
  - Unaffected organs: 18 samples from 6 organs (including blood, liver, and lung) in 3 SMALT mice.
  - Total single cells for phylogenetic analysis: 260,922 cells across 112 normal and neoplastic samples.
- Whole Genome Sequencing (WGS):
  - 11 $IBD_T$ samples and 4 $Apc_P$ samples, each with matched healthy samples.
  - Total of 34 mouse samples for WGS.
- Single-Cell RNA Sequencing (scRNA-seq):
  - 9 $IBD_T$ samples.
  - Integrated with 2,619 normal colon cells from a public dataset (GSE134255, Ref. 33).
  - Total high-quality cells for scRNA-seq analysis: 45,620.
- Organoid Cultures:
  - Organoids derived from normal small intestine and neoplastic polyps of $ApcMin/+$ SMALT mice for in vitro mutation rate estimation.
Human Samples:
- Whole-Exome Sequencing (WES):
  - A cohort of 107 treatment-naive patients with sporadic premalignant polyps and synchronous CRCs.
  - For each patient, synchronous tumor ( $T$ ), polyp ( $P$ ), and adjacent normal ( $N$ ) samples were collected.
  - Sample processing led to 102 polyps and 86 CRCs retained for analysis after purity filtering.
- Single-Gland Whole-Genome Sequencing (WGS):
  - One additional sporadic patient (B139, male, 73 years old).
  - 29 neoplastic glands isolated from 5 separate polyps.
  - 3 adjacent normal crypts.
Data Characteristics and Domain:
- The datasets cover various stages of colorectal tumorigenesis, from normal tissue to precancerous lesions (polyps/adenomas) and malignant tumors, across both genetically engineered mouse models and human sporadic cases.
- This allows for comparative analysis between species and different etiologies (inflammation-driven vs. APC-driven).
- The inclusion of multi-omics data (DNA barcodes, bulk DNA mutations, single-cell RNA expression) provides complementary information on genetic evolution and cellular phenotypes.
  
These datasets suit the study's aims. Mouse models provide a controlled environment for SMALT lineage tracing, allowing precise tracking from initiation. Human WES and WGS data from polyps and CRCs offer direct clinical relevance and test the findings in a human context. The single-gland WGS provides a way to confirm clonality at a micro-anatomical level in human tissue.

5.2. Evaluation Metrics

The metrics used in the paper are defined below.

Barcode Mutation Count:
- Conceptual Definition: This metric quantifies the absolute number of C-to-T point mutations identified within the engineered 3kb SMALT DNA barcode sequence of a single cell. Since these mutations are induced by HsAID and fixed upon DNA replication, a higher barcode mutation count directly correlates with a greater number of cell divisions or generations a cell has undergone. It serves as a proxy for proliferative activity or cellular age.
- Mathematical Formula: Not a single formula, but derived by counting unique C-to-T substitutions in the 3kb barcode sequence of each CCS read (representing a single cell) after bioinformatics processing.
- Symbol Explanation: Not applicable as it's a direct count.
Number of Unique Barcode Alleles:
- Conceptual Definition: This metric assesses the genetic diversity within a sample by counting how many distinct 3kb barcode sequences (alleles) are present. A high number of unique alleles indicates extensive genetic heterogeneity and a diverse population of cells, suggesting a lack of strong clonal selection or recent common ancestry.
- Mathematical Formula: Not a single formula, but a count of distinct barcode sequences identified in the filtered CCS reads from a given sample.
- Symbol Explanation: Not applicable.
Number of Founding Progenitors (Np):
- Conceptual Definition: Np, estimated using TarCA (targeting coalescent analysis), quantifies the effective number of ancestral cells that successfully initiated and contributed to the clonally expanding population within a lesion. A Np value of 1 indicates a monoclonal origin, while values greater than 1 suggest a polyclonal origin. It reflects the breadth of initial cellular involvement in tumorigenesis.
- Mathematical Formula: $ \mathrm{Np} = 1 / \mathrm{Pr} $ where $\mathrm{Np}$ is the effective number of progenitors. The probability $\mathrm{Pr}$ is calculated as: $ \mathrm{Pr} = \frac{\sum C_{m_i}^2}{C_{N_s}^2} = \frac{\sum (m_i \times (m_i - 1))}{N_s \times (N_s - 1)} $
- Symbol Explanation:
  - $\mathrm{Np}$ : The effective number of founding progenitor cells.
  - $\mathrm{Pr}$ : The probability that two randomly selected neoplastic cells from the sample share a common progenitor within a monophyletic clade in the phylogenetic tree.
  - $m_i$ : The number of sampled cells belonging to the $i$ -th monophyletic clade of neoplastic cells within the phylogenetic tree.
  - $N_s$ : The total number of neoplastic cells sampled from the lesion.
  - $C_X^2$ : Denotes the number of combinations of choosing 2 items from a set of $X$ items, which equals $\frac{X(X-1)}{2}$ . The term $\frac{1}{2}$ cancels out in the numerator and denominator of the full formula.
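
The formula can be checked with a few lines of code. The sketch below implements only the pair-counting expression given above, not the TarCA software itself; the function name is hypothetical, and the clade sizes are assumed to have already been read off the phylogenetic tree.

```python
def effective_progenitors(clade_sizes, n_neoplastic):
    """Np = 1 / Pr, with Pr = sum(m_i * (m_i - 1)) / (N_s * (N_s - 1)).

    clade_sizes  : sampled cell counts m_i of each monophyletic clade
    n_neoplastic : total number of sampled neoplastic cells N_s
    """
    if n_neoplastic < 2:
        raise ValueError("need at least two sampled cells")
    if sum(clade_sizes) > n_neoplastic:
        raise ValueError("clade sizes exceed the number of sampled cells")
    same_clade_pairs = sum(m * (m - 1) for m in clade_sizes)
    if same_clade_pairs == 0:
        raise ValueError("no pair of cells shares a clade; Np is undefined")
    return n_neoplastic * (n_neoplastic - 1) / same_clade_pairs


# One clade containing every sampled cell: a monoclonal lesion.
print(effective_progenitors([100], 100))             # 1.0

# Four equally sized clades: Np is close to 4.
print(effective_progenitors([25, 25, 25, 25], 100))  # 4.125
```

The second result is slightly above 4 because cell pairs are drawn without replacement; the gap shrinks as the number of sampled cells grows.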
Genome-wide Mutation Burden:
- Conceptual Definition: This metric represents the total number of somatic mutations (single nucleotide variants (SNVs), short insertions/deletions (indels)) identified across the entire sequenced genome (for WGS) or exome (for WES) of a tumor or polyp sample, relative to its matched normal tissue. A higher burden indicates more genetic instability and accumulated damage.
- Mathematical Formula: Not a single formula; it's a direct count of detected SSNVs and indels after filtering false positives.
- Symbol Explanation: Not applicable.
Putative Driver Mutation Burden:
- Conceptual Definition: This metric specifically counts the number of somatic mutations that occur in genes known or strongly suspected to play a causal role in cancer development (driver genes). It indicates the accumulation of functionally significant genetic alterations that drive tumor growth and progression.
- Mathematical Formula: Not a single formula; it's a count of mutations found in a curated list of cancer driver genes.
- Symbol Explanation: Not applicable.
Clonal Expansion Score:
- Conceptual Definition: This score quantifies the degree of clonal expansion within a tissue by measuring the phylogenetic similarity between randomly chosen pairs of cells. A higher score indicates that cells within the sample are more closely related phylogenetically, suggesting significant proliferation from common ancestors.
- Mathematical Formula: Not explicitly provided in the main text, but typically involves calculating a measure of genetic distance or shared ancestry between cell pairs within the reconstructed phylogenetic tree. For example, it could be based on the proportion of shared mutations or the depth of their most recent common ancestor.
- Symbol Explanation: Not applicable.
Proliferative Fitness:
- Conceptual Definition: This metric reflects the relative growth advantage of a cell lineage or clone. In the context of SMALT data, it can be estimated from the rate of barcode mutation accumulation or the size and growth dynamics of a specific clade within the phylogenetic tree. Higher fitness implies a greater capacity for rapid proliferation and expansion.
- Mathematical Formula: Not explicitly provided, but often inferred from the growth dynamics of clades or relative increases in cell numbers associated with certain mutations. In the paper, it is likely derived from the barcode mutation burden relative to time or the proportion of cells belonging to a given clone.
- Symbol Explanation: Not applicable.
dN/dS Ratio ( $\omega$ ):
- Conceptual Definition: The dN/dS ratio, or $\omega$ , is a measure of selective pressure on protein-coding genes. It compares the rate of non-synonymous (amino acid-altering) substitutions (dN) to the rate of synonymous (silent, non-amino acid-altering) substitutions (dS). In cancer, $\omega > 1$ for a gene is evidence of positive selection and is consistent with a driver role.
  - $\omega > 1$ : Positive selection (mutations provide a fitness advantage and are favored).
  - $\omega < 1$ : Purifying selection (deleterious mutations are removed).
  - $\omega = 1$ : Neutral evolution (mutations are neither favored nor disfavored).
- Mathematical Formula: $ \omega = \frac{dN}{dS} $ where:
  - dN is the number of non-synonymous substitutions per non-synonymous site.
  - dS is the number of synonymous substitutions per synonymous site.
- Symbol Explanation:
  - $\omega$ : The dN/dS ratio.
  - dN: The rate of non-synonymous substitutions (number of non-synonymous mutations normalized by the number of possible non-synonymous sites).
  - dS: The rate of synonymous substitutions (number of synonymous mutations normalized by the number of possible synonymous sites).
Cancer Cell Fraction (CCF):
- Conceptual Definition: The CCF represents the proportion of tumor cells in a given sample that harbor a specific somatic mutation. It's a crucial metric for distinguishing between clonal mutations (present in virtually all cancer cells, CCF close to 1) and subclonal mutations (present in a subset of cancer cells, $CCF < 1$ ). CCF accounts for tumor purity, ploidy, and local copy number.
- Mathematical Formula: CCF is derived from the VAF (variant allele frequency) together with tumor purity and local DNA copy number (both estimated with TitanCNA in this paper) and the mutation multiplicity. A standard point estimate, assuming diploid normal cells, is: $ \mathrm{CCF} = \frac{\mathrm{VAF} \times \left[ 2 (1 - \mathrm{Purity}) + \mathrm{Purity} \times \mathrm{CopyNumber} \right]}{\mathrm{Purity} \times \mathrm{Multiplicity}} $
- Symbol Explanation:
  - $\mathrm{CCF}$ : Cancer Cell Fraction.
  - $\mathrm{VAF}$ : The observed Variant Allele Frequency from sequencing reads for a specific mutation.
  - $\mathrm{Purity}$ : The estimated proportion of tumor cells within the total cells sampled (ranging from 0 to 1).
  - $\mathrm{CopyNumber}$ : The estimated total DNA copy number of the genomic region where the mutation is located in the tumor cells.
  - $\mathrm{Multiplicity}$ : The number of copies of the locus that carry the mutation in a mutated tumor cell.
Ligand-Receptor Interactions (Inferred by CellChat):
- Conceptual Definition: This metric quantifies the potential for communication between different cell types (or subtypes) based on the expression levels of known ligand-receptor pairs. CellChat infers these interactions by considering the expression of ligand genes in "sender" cells and receptor genes in "receiver" cells, along with complex factors like multimeric complexes and co-receptors. A higher number or strength of inferred interactions indicates more active intercellular communication.
- Mathematical Formula: CellChat uses a statistical model to infer interactions. While no single formula is given for the "number" of interactions, the process involves computing a communication probability for each ligand-receptor pair based on cell group expression.
- Symbol Explanation: Not applicable as it's an output of a software tool based on complex models.

5.3. Baselines

The paper compared its findings and methods against several baselines and control groups to establish significance and novelty:

Mouse Models:
- Normal Tissue Controls: Wild-type normal colon ( $WT_N$ ), $ApcMin/+$ normal small intestine ( $Apc_N$ ), and IBD normal tissue ( $IBD_N$ ) were consistently used as controls to establish baseline barcode mutation rates, cell type proportions, and clonal expansion levels in non-diseased states.
- Adjacent Normal Cells: For both AOM/DSS and $ApcMin/+$ lesions, adjacent normal cells from the same mice were used to differentiate neoplastic from normal cellular properties and to filter out plausible normal cells within neoplastic samples during barcode processing.
- Unaffected Organs: Samples from blood, liver, and lung in SMALT mice served as further controls for background mutation rates and $CD45+$ cell characterization.
- CRISPR-Cas9 Lineage Tracing: The branching index of SMALT phylogenetic trees was quantitatively compared to those generated by previous CRISPR-Cas9 lineage tracing studies (Ref. 12, 14) to highlight SMALT's superior resolution.
Human Sporadic Polyp/CRC Cohort:
- Monoclonal Polyps and CRCs: Polyclonal polyps were compared against monoclonal polyps and malignant CRCs within the human cohort to characterize differences in mutation burden, driver mutations, size, and dysplasia grade, supporting the polyclonal-to-monoclonal transition model.
- Matched Normal Tissue: For WES and single-gland WGS, matched normal tissue (adjacent normal colon or normal crypts) was used as a baseline for somatic mutation calling and SCNA detection.
scRNA-seq Analysis:
- Public scRNA-seq Data: scRNA-seq data from wild-type mouse normal colon (GSE134255, Ref. 33) was integrated into the study's dataset as a normal control, allowing for comparison of cell type composition and intercellular interactions between normal and neoplastic tissues.
- Late High-Clonality Lesions: Early low-clonality lesions ( $Np > 3$ ) were compared to late high-clonality lesions ( $Np \leq 3$ ) in AOM/DSS neoplasms to identify differences in cell state dynamics and intercellular communication during the transition.
Computational Methods:
- CellChat was used to infer ligand-receptor interactions, and its results were validated using an orthogonal method, MultiNicheNetR, which serves as a cross-validation baseline for the communication analysis.
  
These baselines are representative: they include healthy controls, established disease models, human clinical samples, and comparisons with existing techniques, which helps show that the findings are robust, specific to the disease, and an advance over previous methods.

6. Results & Analysis

6.1. Core Results Analysis

The paper presents a comprehensive set of results across mouse models and human samples, integrating diverse experimental techniques to support the central hypothesis of a polyclonal-to-monoclonal transition in colorectal precancer.

6.1.1. High-Resolution Lineage Tracing with SMALT in Mouse Models

SMALT System Validation: The SMALT system demonstrated high in vivo mutagenesis activity and lineage barcoding capacity. More than 90% of mutations were C/G-to-T/A, as expected from HsAID. Mutations were widely distributed across the 3kb barcode, with an average of 836 mutable sites per sample, far surpassing CRISPR-Cas9 methods (typically 10-60 sites). This high diversity was reflected in approximately 90% of cells exhibiting a unique combination of mutations (Fig. 1f,g,h).
Neoplastic vs. Normal Cell Proliferation: Neoplastic cells consistently showed significantly higher barcode mutation counts than normal cells from adjacent tissues or other organs (Fig. 1f, Extended Data Fig. 2b), which allowed neoplastic cells to be separated effectively. Relative to normal cells, the mutation burden rose more in AOM/DSS lesions (4.3-fold) than in $ApcMin/+$ lesions (2.8-fold), likely owing to the mutagenic effect of AOM.

The following figure (Figure 1 from the original paper) illustrates the SMALT lineage tracing system and key initial findings:

Fig. 1 | SMALT lineage tracing of mouse intestinal tumorigenesis. a, Schematic of the SMALT lineage tracing system. b, Intestinal tumorigenesis with AOM/DSS or ApcMin/+ mice carrying the engineered SMALT system in the germline. Normal and neoplastic samples were collected for long-read sequencing of lineage barcodes (barcode-seq), WGS and scRNA-seq. WT, wild type. c, The relative proportions of distinct substitution types in barcodes across all samples. Apc_N, normal small intestine in ApcMin/+ mice; Apc_P, polyps in ApcMin/+ mice; IBD_N, normal tissue in AOM/DSS mice; IBD_T, neoplasms in AOM/DSS mice; WT_N, wild-type normal colon. d, Per-site mutation frequency on barcodes across all samples. e, Correlation of per-site mutation frequency between mouse and fruit fly. Pearson's r and P value are shown. f, Violin plot showing the number of barcode mutations per cell in different tissues. The mean number of mutations and the number of cells are shown. P values are by two-sided Wilcoxon rank-sum test. g, The proportion of unique barcode alleles and the number of cells with the unique barcode in WT_N (n = 4), Apc_N (n = 4), IBD_N (n = 19), Apc_P (n = 21) and IBD_T (n = 30), respectively. Data are mean ± s.e.m. h, The proportion of unique barcode alleles and the number of samples with the unique barcode. In box plots, the horizontal line is the median, the box delineates the 25th to 75th centiles, and whiskers extend to 1.5 times the interquartile range.

6.1.2. Polyclonal Origins and Transition in Mouse Models

Inflammation-Driven Lesions (AOM/DSS):
- SMALT lineage tracing revealed that the majority (66.7%, 20 out of 30) of AOM/DSS neoplasms had polyclonal origins (Fig. 2a,b, Extended Data Fig. 3). This indicates parallel expansions of multiple distinct cell lineages within the same lesion.
- The Np (number of founding progenitors) varied from 2 to 33 for polyclonal lesions, robustly estimated by TarCA (Fig. 2e).
- Monoclonal lesions (10 out of 30) exhibited significantly more barcode mutations, higher genome-wide mutation burdens, and more putative driver mutations compared to polyclonal lesions (Fig. 2f,g,h). This suggests that monoclonal lesions represent a more advanced stage of tumorigenesis.
- Inflamed colons showed greater clonal expansion than normal colons (Fig. 2i,j). Monoclonal lesions had greater expansion and higher proliferative fitness than polyclonal lesions (Fig. 2i,k). A higher dN/dS ratio in monoclonal lesions indicated stringent selection during somatic evolution (Supplementary Fig. 14c).
- Spatial computational inference further suggested strong subclonal selection ( $s = 0.57–0.87$ ) even after monoclonal transition.
  
The following figure (Figure 2 from the original paper) presents single-cell phylogenies of inflammation-driven neoplasms:

Fig. 2 | Single-cell phylogenies reveal the origin of inflammation-driven neoplasms. a,b, Single-cell phylogeny (left) and corresponding barcode mutations (right) for a representative monoclonal lesion (a; lesion IBD4_T) and a representative polyclonal lesion (b; lesion IBD50_T). c, Bootstrapping values of the phylogenetic tree for IBD4_T (top) and IBD50_T (bottom). d, The branching index of SMALT trees in this study (n = 77) compared with CRISPR-Cas9 lineage trees from two previous studies (ref. 12, n = 4; and ref. 14, n = 85). e, The number of founding progenitors (Np) estimated from single-cell phylogeny. For each lesion, Np was estimated 20 times using downsampled cells. f, The barcode mutation count per cell in monoclonal (n = 22,766 cells) versus polyclonal (n = 20,882 cells) lesions. g,h, Total somatic mutation burden (g) or putative driver mutation burden (h) in WGS data of monoclonal (n = 7) versus polyclonal (n = 9) lesions. i, Clonal expansion scores calculated using 1,000 downsampled cell pairs ranked by the median clonal expansion scores within each sample type. j, A representative single-cell phylogeny for inflamed normal colon. Lineages exhibiting clonal expansions are highlighted in colour. k, Single-cell fitness scores in monoclonal versus polyclonal lesions. d,f-h, P values by two-sided Wilcoxon rank-sum test.
$ApcMin/+$ Lesions:
- All 17 individual polyps examined from $ApcMin/+$ mice exhibited polyclonal origins (Extended Data Fig. 4a). Np ranged from 4 to approximately 100, indicating even higher polyclonality than in AOM/DSS lesions (Extended Data Fig. 5).
- Some regions within a polyp showed stronger, independent clonal expansions (e.g., P5-1 and P5-5 in Apc68_P5), characterized by monophyletic subtrees and higher proliferative fitness (Extended Data Fig. 4b,c,e).
- Timing analysis suggested that neoplastic initiation in $ApcMin/+$ mice occurred early (59-130 postnatal days), corresponding to infancy in human FAP patients (Extended Data Fig. 5g).

6.1.3. Polyclonal-to-Monoclonal Transition in Human Sporadic Polyps

WES Analysis of Human Cohort:
- Analysis of WES data from 107 patients with sporadic polyps and synchronous CRCs revealed that polyclonality was more common in polyps (29.4%, 30/102) than in CRCs (8.1%, 7/86) (Fig. 3b).
- Polyclonal polyps had fewer SSNVs and SCNAs than monoclonal polyps, and both had lower mutational burdens than CRCs (Fig. 3c, Extended Data Fig. 6).
- Polyclonal polyps were typically smaller, more commonly exhibited low-grade dysplasia, and were found in younger patients (Fig. 3d,e,g). These clinical features further supported that monoclonality represents a more advanced stage.
- KRAS mutations were significantly more common in monoclonal polyps (34.7%) and CRCs (40.5%) than in polyclonal polyps (6.7%), suggesting KRAS mutations confer a selective advantage for clonal outgrowth and monoclonal transition (Fig. 3h,i).
- The overall selective strength (dN/dS) for driver mutations was higher in monoclonal polyps than in polyclonal ones (Supplementary Fig. 23).
- These human genomic and clinical data strongly validate the polyclonal-to-monoclonal transition trajectory observed in mouse models (Fig. 3j).
  
The following figure (Figure 3 from the original paper) summarizes the polyclonal-to-monoclonal transition in human sporadic polyps:

Fig. 3 | Polyclonal-to-monoclonal transition in human sporadic polyps. a, A human cohort with sporadic premalignant polyps, including 107 patients with synchronous polyps and CRC. b, The distribution of CCFs reveals the clonality of each lesion. One representative polyclonal polyp (P_poly, B046P) and one monoclonal polyp (P_mono, B002P) are shown. c, The total somatic mutation burden in P_poly (n = 30), P_mono (n = 72) or CRCs (n = 86) after removing samples with low purity (< 0.25). d, Distribution of small (<1 cm) and large (≥1 cm) polyps. e, Distribution of low-grade and high-grade dysplasias. f, Representative images of haematoxylin and eosin (H&E) staining. Scale bar, 100 μm. g, Age distribution of participants. h, Percentage of patients carrying indicated putative driver mutations. *P < 0.05, **P < 0.01, ***P < 0.001, one-sided Fisher's exact test. Highlighted genes have Benjamini-Hochberg FDR < 0.1. i, The CCFs of putative driver genes. j, Schematic of polyclonal-to-monoclonal transition in the somatic evolution of premalignant polyp and its subsequent malignant transformation. k, Schematic of WGS of single glands from five premalignant polyps (P1-P4 and P6) in a patient with sporadic polyps (B139). N, normal tissue; R, regions within polyps. l, Images of individually isolated glands. m, Integrated phylogenetic tree including 3 normal glands and 29 neoplastic glands from 5 polyps. Putative driver mutations shared by multiple glands are labelled. P values by two-sided Wilcoxon rank-sum test (c,g) or Fisher's exact test (d,e). Graphics in a,k adapted from Servier Medical Art (CC BY 4.0).
Single-Gland WGS Validation:
- WGS of 29 neoplastic glands from 5 polyps in one patient (B139) further confirmed polyclonal origins. For polyps P1, P2, and P4, the most recent common ancestor of their glands was near the phylogenetic root, with no putative driver mutations in the trunk, indicating polyclonal initiation (Fig. 3m).
- Partially shared driver mutations (e.g., FAT3 in P1, JAK1 in P4) suggested clonal expansion within polyclonal lesions.
- Many neoplastic glands (18/29) exhibited non-APC driver mutations (CTNNB1, JAK1, FAT3), implying diverse early drivers beyond APC in sporadic polyps.

6.1.4. Evolution of Cell States and Intercellular Interactions

scRNA-seq Landscape: scRNA-seq of 9 AOM/DSS neoplasms identified 8 main cell types and 26 subclusters (Extended Data Fig. 8a-c).
Cell Type Abundance Changes: As lesion clonality ($1/Np$) increased (i.e., as lesions became more monoclonal), the proportions of macrophages, neutrophils, and endothelial cells increased markedly, while the proportion of neoplastic epithelial cells decreased (Extended Data Fig. 8d,e).
- $Trem2+$ and $Chil3+$ macrophages (overlapping with immunosuppressive LA_TAMs and Reg_TAMs) were enriched with increasing clonality (Fig. 4b,c), suggesting an immune-suppressive microenvironment contributes to monoclonal transition.
- High-clonality lesions showed upregulation of cancer-associated hallmarks (e.g., MYC targets, KRAS signaling, EMT) (Extended Data Fig. 8g).
Intercellular Communication:
- Cell-cell communication analysis revealed a significant elevation of ligand-receptor interactions between epithelial subtypes in early polyclonal lesions compared to normal colons and late monoclonal lesions (Fig. 4d,e, Extended Data Fig. 9).
- 14 ligand-receptor interactions were significantly enriched in early polyclonal lesions, primarily involved in extracellular matrix (ECM) organization and cell adhesion (e.g., laminins, CDH1, SEMA4) (Fig. 4f).
- $Krt20+$ neoplastic cells contributed about 40% of these enriched interactions and exhibited pro-inflammatory characteristics (Extended Data Fig. 10a-c).
- These findings strongly suggest that extensive intercellular cooperation, potentially via ECM and cell adhesion, is a hallmark of early inflammation-driven intestinal tumorigenesis.
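
Fig. 4e summarizes the relationship between lesion clonality ($1/Np$) and the average number of ligand-receptor interactions with Spearman's $\rho$. The sketch below shows how such a rank correlation can be computed with the standard library alone. The `np_values` and `interactions` lists are hypothetical numbers chosen only to illustrate the calculation; they are not data from the paper, and the function names are illustrative.

```python
def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-lesion values (NOT from the paper).
np_values = [10, 8, 6, 5, 4, 3, 2, 1, 1]             # founding progenitors
interactions = [40, 38, 30, 33, 25, 20, 15, 10, 12]  # mean L-R interactions
clonality = [1 / n for n in np_values]               # 1/Np

rho = spearman_rho(clonality, interactions)
print(f"Spearman rho = {rho:.3f}")
```

A negative $\rho$ in this toy setup corresponds to the pattern described above: fewer inferred interactions as lesions become more monoclonal.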
  
  The following figure (Figure 4 from the original paper) shows the model of intercellular interactions and polyclonal-to-monoclonal evolution:
  
  This image is a chart showing evolutionary maps and higher-resolution cell phylogenies of normal cells and tumor precursor cells in the Apc-mutant mouse model. It includes microscopy images of intestinal organoids at different time points, illustrates the polyclonal-to-monoclonal transition, and uses the formula $m_t = \frac{m_\tau}{r-1}$ to describe changes in mutation burden. Fig. 4 | Intercellular interactions and polyclonal-to-monoclonal evolution model. a, scRNA-seq identifies 26 cell subclusters. b, Beeswarm plot of differential abundance for the cell subclusters along the increase of lesion clonality (measured by $1/Np$). Each point represents a neighbourhood that contains a group of cells with similar transcriptomes. Cell neighbourhoods with spatial FDR < 0.1 are highlighted in red for decreased abundance and blue for increased abundance. c, Subclusters of macrophages and the expression of Trem2, signatures of LA_TAMs and Reg_TAMs. d, Cell-cell communication between neoplastic epithelial subclusters inferred by CellChat. The nodes represent epithelial subclusters. The thickness of edges represents the average number of ligand-receptor interactions between every two subclusters from the 50 downsamplings (Methods). e, Correlation between lesion clonality ($1/Np$) and the average number of ligand-receptor interactions. Spearman's $\rho$ and P value are shown. f, A total of 14 ligand-receptor interactions significantly enriched (FDR < 0.05) in the early polyclonal lesions ($Np > 3$) relative to late lesions ($Np \leq 3$) belong to 4 pathways: laminin, desmocollin (DSC), CDH1 and SEMA4. Data are mean $\pm$ s.e.m. g, Schematic of polyclonal origins and monoclonal transition in early intestinal tumorigenesis. Each precancerous lesion is founded by many lineages undergoing parallel clonal expansions and strong inter-clonal interactions. The gradual loss of inter-clonal interactions and microenvironmental changes might facilitate subsequent polyclonal-to-monoclonal transition. Subclonal selection remains stringent after monoclonal transition, where malignant transformation requires a further clonal sweep.

6.2. Data Presentation (Tables)

The provided text of the research paper does not include any tables in the main text or figures. All tabular data are referenced as Supplementary Tables 1-16, which are not part of the provided text. Therefore, no tables can be transcribed here.

6.3. Ablation Studies / Parameter Analysis

While the paper does not present traditional ablation studies where specific components of a proposed model are removed to assess their contribution, it performs several analyses to validate the robustness of its methods and explore the impact of specific features:

Robustness of Np Estimation: The number of founding progenitors (Np) estimated by TarCA was tested for robustness by performing 20 downsamplings of cells for each lesion. The Np estimates were consistent across these downsamplings (Fig. 2e), indicating that the quantification is reliable and not sensitive to cell sampling depth.
Impact of Hotspot Mutations: The robustness of Np estimates was also confirmed against hotspot mutation events (Supplementary Fig. 13b), demonstrating that regions with unusually high mutation rates did not unduly influence the overall Np calculation.
Comparison with Previous Lineage Tracing Methods: The branching index of SMALT trees was quantitatively compared to CRISPR-Cas9 lineage trees from previous studies (Ref. 12, 14). This comparison (Fig. 2d) highlighted SMALT's superior resolution (3.3 times more internal branching events), justifying the choice of SMALT as an advanced tool for this research.
Cross-Validation of Cell-Cell Communication: The findings from CellChat analysis regarding ligand-receptor interactions were validated using an orthogonal method, MultiNicheNetR (Supplementary Fig. 28). This serves as a strong validation that the observed intercellular communication patterns are not an artifact of a single computational tool, enhancing confidence in the findings.
Correlation Analyses: The paper extensively uses correlation analyses (e.g., Pearson's $r$ and Spearman's $\rho$) to explore relationships between various parameters:
- Correlation of per site mutation frequency between mouse and Drosophila (Fig. 1e).
- Correlation between proportions of different cell types (macrophages, neutrophils, endothelial, epithelial) and lesion clonality ( $1/Np$ ) (Extended Data Fig. 8e).
- Correlation between lesion clonality ( $1/Np$ ) and the average number of ligand-receptor interactions (Fig. 4e).

These analyses help to understand how different cellular and genomic features co-vary with the progression of clonality.
Spatial Computational Inference: The use of agent-based tumor simulations and approximate Bayesian computation (Supplementary Fig. 15) to estimate selection coefficients ( $s$ ) in monoclonal AOM/DSS tumors (Supplementary Fig. 16) provides a computational validation of the strength of subclonal selection, which is a key driver of the polyclonal-to-monoclonal transition.
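
The downsampling check described above for Np can be illustrated with a short sketch. Note the assumptions: TarCA's actual estimator operates on phylogenies and is not described in this section, so the code substitutes a simple stand-in (the inverse Simpson index over hypothetical founder-lineage labels) purely to show the shape of the procedure: repeat the estimate on 20 random subsamples and inspect the spread. The names `effective_lineages` and `downsample_estimates`, and the simulated lesion, are all hypothetical.

```python
import random
from collections import Counter


def effective_lineages(labels):
    """Stand-in estimator (NOT TarCA): inverse Simpson index of lineage labels."""
    counts = Counter(labels)
    n = len(labels)
    return 1.0 / sum((c / n) ** 2 for c in counts.values())


def downsample_estimates(labels, sample_size, n_rounds=20, seed=0):
    """Re-estimate on `n_rounds` random subsamples drawn without replacement."""
    rng = random.Random(seed)
    return [
        effective_lineages(rng.sample(labels, sample_size))
        for _ in range(n_rounds)
    ]


# Hypothetical lesion: 5 founder lineages with 200 sampled cells each.
cells = [f"lineage_{i}" for i in range(5) for _ in range(200)]

full_estimate = effective_lineages(cells)
estimates = downsample_estimates(cells, sample_size=500)

print(f"full data: {full_estimate:.2f}")
print(f"downsampled range: {min(estimates):.2f} to {max(estimates):.2f}")
```

A narrow range across the subsamples, close to the full-data value, is the kind of consistency that the paper reports for Np in Fig. 2e; a wide range would instead suggest sensitivity to sampling depth.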

These analyses collectively serve to demonstrate the robustness of the methods, validate key findings against alternative approaches or conditions, and explore the quantitative relationships between different biological parameters, much like traditional ablation studies would for specific model components.

7. Conclusion & Reflections

7.1. Conclusion Summary

This paper presents a groundbreaking study that comprehensively maps the single-cell phylogenies of colorectal precancerous lesions, revealing a consistent polyclonal-to-monoclonal transition in both mouse models and human sporadic polyps. Through the innovative use of the high-resolution SMALT DNA barcoding system, integrated with WES, WGS, and scRNA-seq, the authors demonstrate that early precancerous lesions are often founded by multiple independent cell lineages undergoing parallel clonal expansions (polyclonal origin). Monoclonal lesions, in contrast, represent a more advanced stage, characterized by higher mutation burdens, more driver mutations, and increased selective pressure. Crucially, the study uncovers extensive intercellular interactions, particularly involving ECM organization and cell adhesion, in these early polyclonal lesions, which significantly diminish during the transition to a monoclonal state. This highlights the vital role of intercellular cooperation in the initial phases of tumorigenesis. The findings provide a novel conceptual framework for understanding early cancer evolution, suggesting opportunities for intervention by targeting these initial cooperative interactions.

7.2. Limitations & Future Work

The authors acknowledge several limitations and suggest future research directions:

Underestimation of Progenitors: The Np estimates might be lower than the actual number of progenitors, as many founding lineages could be lost during growth due to random drift and competition.
Unclear Mechanisms of Driver Mutations: The paper notes that some driver mutations found frequently in polyps (e.g., BCL9L, SOX9) are less common in CRCs. The underlying mechanisms by which these early drivers might potentially impede malignant transformation remain unclear and require further investigation.
Further Elucidation of Intercellular Interactions: While extensive intercellular interactions were identified in early polyclonal lesions, the precise mechanisms by which these interactions promote neoplastic growth require further study.
Linking Cell States and Lineage Information: Future research should focus on unraveling the molecular crosstalk within the microenvironment by linking cellular states (e.g., using single-cell multiomics) with lineage information in the same cells. This would provide a more direct causal link between cellular phenotype and evolutionary trajectory.
Predictive Modeling for Cancer Risk: Efforts are needed to formulate a predictive model that can forecast cancer risk based on the molecular and evolutionary features of premalignant lesions.

7.3. Personal Insights & Critique

This paper is a tour de force in cancer evolutionary biology, representing a significant leap forward in understanding early tumorigenesis.

Innovations:
- The SMALT system: This is a major technical innovation. The high number of mutable sites compared to CRISPR-Cas9 based methods (~800 vs. 10-60) provides an unprecedented resolution for reconstructing cell phylogenies, which is absolutely critical for deciphering the nuances of polyclonal origins and early clonal dynamics. This technological advancement underpins much of the paper's success.
- Multi-omics Integration: The seamless integration of high-resolution lineage tracing with WES, WGS, and scRNA-seq in both mouse and human samples is exceptionally rigorous. This comprehensive approach allows for robust validation across species and provides both genetic and microenvironmental insights, offering a holistic view of the evolutionary process.
- The "Polyclonal-to-Monoclonal Transition" Framework: This model provides a compelling and empirically supported conceptual framework that refines our understanding of early cancer evolution. It moves beyond a simple monoclonal/polyclonal dichotomy to illustrate a dynamic process driven by selection, offering a more realistic representation of disease progression.
Implications & Transferability:
- The finding that early lesions are often polyclonal, relying on extensive intercellular cooperation, is particularly insightful. It suggests that cancer initiation is not always a solitary event driven by a single advantageous clone, but can involve a "community effect" where multiple lineages collaborate to overcome initial growth barriers. This implies that early intervention strategies could focus on disrupting these cooperative interactions rather than solely targeting individual driver mutations.
- The methods, particularly the high-resolution SMALT lineage tracing and the integrated multi-omics approach, are highly transferable. They could be adapted to study clonal dynamics in various other contexts, such as normal tissue homeostasis, regeneration, aging, and other diseases involving complex cellular interactions and evolution (e.g., inflammatory diseases, neurological disorders).
Potential Issues & Areas for Improvement:
- Causality of Intercellular Interactions: While scRNA-seq reveals strong correlations between extensive intercellular interactions and early polyclonal lesions, the precise causal mechanisms remain to be fully elucidated. Future experiments (e.g., in vitro co-culture systems or in vivo manipulation of specific ligand-receptor pathways) would be crucial to functionally demonstrate how these interactions promote polyclonal growth and how their loss facilitates monoclonal transition. The paper hints at a recruitment model, but direct evidence of how non-driver-mutated cells are supported by neighboring clones is an exciting next step.
- Initial Cellular State: The paper defines progenitors as cells capable of founding clonal populations, but the specific cellular identity (e.g., stem cells, transit-amplifying cells, differentiated cells undergoing dedifferentiation) that gives rise to these initial polyclonal lineages is not deeply explored. Understanding this could further refine the initiation model.
- Generalizability of the AOM/DSS Model: While a good model for inflammation-driven cancer, the specific inflammatory context might influence the observed intercellular interactions. It would be valuable to explore if similar interaction patterns are seen in other non-inflammatory CRC models or other cancer types.
- Predictive Model Development: The suggested future work of formulating a predictive model for cancer risk based on precancerous lesion features is ambitious but critical. Translating these complex single-cell evolutionary insights into clinically actionable biomarkers or risk assessment tools will be a significant challenge, requiring robust validation in larger human cohorts.
  
  Overall, this paper provides a robust and deeply insightful view into the nascent stages of colorectal cancer, fundamentally reshaping our understanding of how these complex diseases begin and evolve. Its rigorous methodology and compelling findings lay a strong foundation for future research aimed at truly early cancer prevention.