Polyclonal-to-monoclonal transition in colorectal precancerous evolution
TL;DR Summary
This study uses a base editor-enabled DNA barcoding system to reveal the origin and evolution of colorectal precancerous lesions, showcasing the transition from polyclonal to monoclonal states. High-resolution single-cell phylogenies indicate numerous independent lineages undergo parallel clonal expansions within each lesion.
Abstract
Unravelling the origin and evolution of precancerous lesions is crucial for preventing malignant transformation, yet our current knowledge remains limited. Here we used a base editor-enabled DNA barcoding system to comprehensively map single-cell phylogenies in mouse models of intestinal tumorigenesis induced by inflammation or loss of the Apc gene. Through quantitative analysis of high-resolution phylogenies, we identified tens of independent cell lineages undergoing parallel clonal expansions within each lesion. We also found polyclonal origins of human sporadic colorectal polyps through bulk whole-exome sequencing and single-gland whole-genome sequencing. Genomic and clinical data support a model of polyclonal-to-monoclonal transition, with monoclonal lesions representing a more advanced stage.
1. Bibliographic Information
1.1. Title
Polyclonal-to-monoclonal transition in colorectal precancerous evolution
1.2. Authors
Zhaolian Lu, Shanlan Mo, Duo Xie, Xiangwei Zhai, Shanjun Deng, Kantian Zhou, Kun Wang, Xueling Kang, Hao Zhang, Juanzhen Tong, Liangzhen Hou, Huijuan Hu, Xuefei Li, Da Zhou, Leo Tsz On Lee, Li Liu, Yaxi Zhu, Jing Yu, Ping Lan, Jiguang Wang, Zhen He, Xionglei He & Zheng Hu
1.3. Journal/Conference
Published online in Nature.
Nature is one of the most prestigious and influential scientific journals globally. Its reputation stems from publishing groundbreaking, high-impact research across all fields of science and technology. Publication in Nature signifies exceptional scientific rigor, novelty, and broad significance, making it a top-tier venue in academic research.
1.4. Publication Year
2024
1.5. Abstract
The paper investigates the origin and evolution of precancerous lesions in colorectal cancer, a critical area for prevention but with limited current understanding. The authors utilized a base editor-enabled DNA barcoding system, SMALT, to map single-cell phylogenies in mouse models of intestinal tumorigenesis (induced by inflammation or Apc gene loss). Through quantitative analysis of these high-resolution phylogenies, encompassing 260,922 single cells, they identified numerous independent cell lineages undergoing parallel clonal expansions within each lesion. Complementing this, they found polyclonal origins in human sporadic colorectal polyps using bulk whole-exome sequencing (WES) and single-gland whole-genome sequencing (WGS). Genomic and clinical data support a model of polyclonal-to-monoclonal transition, where monoclonal lesions represent a more advanced stage of disease. Single-cell RNA sequencing (scRNA-seq) revealed extensive intercellular interactions in early polyclonal lesions, which significantly diminished during monoclonal transition. These findings suggest that colorectal precancer often originates from multiple lineages, with cooperative intercellular interactions being crucial in the earliest stages of cancer formation, offering insights for earlier intervention strategies.
1.6. Original Source Link
/files/papers/691c4e2b25edee2b759f32e3/paper.pdf
Publication Status: Published online at Nature on 30 October 2024.
2. Executive Summary
2.1. Background & Motivation
The core problem the paper addresses is the limited understanding of how precancerous lesions originate and evolve into malignant tumors, particularly in colorectal cancer (CRC). This knowledge gap is critical because unraveling these initial phases is essential for effective early screening, intervention, and ultimately, preventing malignant transformation.
CRC is a significant global health burden, being the third most prevalent cancer and the second leading cause of cancer-related deaths. The well-established adenoma-carcinoma sequence in CRC development provides an ideal model to study early tumorigenesis. While genomic sequencing of malignant tumors typically reveals a single founding clone, previous studies using somatic markers have suggested that premalignant adenomas or even some CRCs can have polyclonal origins, meaning multiple distinct cell lineages contribute to the lesion. However, a comprehensive understanding of the prevalence, temporal dynamics, and underlying mechanisms of these polyclonal origins and their subsequent evolution in the earliest stages of lesion formation remains largely unknown. The challenges include the long span of precancerous stages and the difficulty in accurately identifying the earliest events of tumor initiation.
The paper's entry point and innovative idea lie in leveraging a novel, high-resolution lineage tracing technique called SMALT (substitution mutation-aided lineage-tracing), which is a base editor-enabled DNA barcoding system. This system allows for systematically mapping the origin and evolution of intestinal precancers at single-cell resolution in mouse models. By integrating this advanced lineage tracing with other multi-omics datasets (like WES, WGS, and scRNA-seq) in both mouse and human samples, the authors aim to comprehensively delineate the early evolutionary trajectory of colorectal tumorigenesis.
2.2. Main Contributions / Findings
The paper makes several significant contributions and reaches key conclusions regarding the early evolution of colorectal cancer:
- Development and Application of High-Resolution Lineage Tracing: The study used SMALT, a base editor-enabled DNA barcoding system, to generate high-resolution single-cell phylogenies. The system has far more mutable sites (836 on average) than previous methods, enabling more detailed tracking of cell lineages in mouse models of intestinal tumorigenesis.
- Identification of Polyclonal Origins in Precancerous Lesions: Through quantitative analysis of phylogenies covering more than 260,000 single cells from mouse models (AOM/DSS and Apc-mutant), the study found that a majority of precancerous lesions (e.g., 66.7% of AOM/DSS neoplasms) originate polyclonally, meaning they are founded by multiple independent cell lineages undergoing parallel clonal expansions. The estimated number of founding progenitors (Np) varied from 2 to 33 in AOM/DSS lesions and 4 to 100 in polyps.
- Validation of Polyclonal Origins in Human Sporadic Colorectal Polyps: The research extended these findings to human biology by demonstrating polyclonal origins in sporadic colorectal polyps through bulk WES and single-gland WGS from patient cohorts. This supports the view that the polyclonal initiation observed in mouse models is relevant to human disease.
- Elucidation of a Polyclonal-to-Monoclonal Transition Model: The study provides genomic and clinical evidence supporting a "polyclonal-to-monoclonal transition" model in colorectal tumorigenesis. Monoclonal lesions were found to be more advanced, characterized by significantly higher barcode mutations (indicating more cell divisions), higher genome-wide mutation burdens, more putative driver mutations, larger size, and higher-grade dysplasia compared to polyclonal lesions. This transition is driven by stringent subclonal selection.
- Discovery of Critical Intercellular Interactions in Early Stages: Single-cell RNA sequencing (scRNA-seq) revealed extensive intercellular interactions, particularly via extracellular matrix (ECM) organization and cell adhesion pathways, in early polyclonal lesions. These interactions decreased significantly as lesions transitioned towards monoclonality. This suggests that intercellular cooperation plays a crucial role in promoting neoplastic growth during the earliest phases of tumorigenesis.

These findings address the limited understanding of the cellular origin and early evolutionary dynamics of colorectal precancer. They provide a conceptual framework for the initial phases of cancer formation, highlighting that CRC often begins with multiple cooperating lineages before a dominant, more aggressive clone takes over. This offers opportunities for developing earlier intervention strategies that target these initial cooperative interactions.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
To fully grasp the contents of this paper, a foundational understanding of several key biological and technical concepts is essential:
- Precancerous Lesions: These are abnormal cells or tissues that have undergone some genetic or cellular changes, making them more likely to develop into cancer. They are not yet malignant but represent an increased risk. In the context of the colon, adenomas or polyps are common precancerous lesions.
- Malignant Transformation: This is the process by which normal cells or precancerous cells acquire cancerous properties, becoming invasive and capable of metastasis. It involves accumulation of genetic mutations that drive uncontrolled cell growth and survival.
- DNA Barcoding (Lineage Tracing): This technique involves tagging individual cells with unique, heritable genetic sequences (barcodes). As these cells divide, their progeny inherit the barcode, allowing researchers to track cell lineages, understand developmental relationships, and reconstruct phylogenetic trees. It's akin to giving each founding cell a unique ID that is passed down.
- Base Editor: A gene-editing tool derived from CRISPR-Cas technology. Unlike CRISPR-Cas9, which typically creates double-strand breaks in DNA, a base editor chemically modifies a single DNA base into another (e.g., C-to-T) without causing DNA breakage. This allows for precise, targeted, and programmable point mutations, which are crucial for generating diverse barcodes in lineage tracing.
- Single-Cell Phylogenies: These are evolutionary "family trees" constructed for individual cells within a tissue or population. By analyzing the accumulation of heritable somatic mutations (like the barcode mutations induced by SMALT), researchers can infer the ancestral relationships between cells and their clonal descendants. This helps visualize how different cell lineages expanded and diversified.
- Clonal Expansion: This refers to the proliferation of a group of cells that all originate from a single common ancestral cell. When a cell acquires a mutation that confers a growth advantage, it can undergo clonal expansion, leading to a population of genetically identical (or nearly identical) cells.
- Polyclonal Origin: A lesion or tumor is considered to have a polyclonal origin if it develops from multiple distinct progenitor cells that independently initiate growth and expand. These multiple lineages coexist and contribute to the overall lesion.
- Monoclonal Origin: A lesion or tumor is considered to have a monoclonal origin if it arises from a single progenitor cell that acquires mutations and then expands to form the entire lesion. All cells in the lesion are descendants of that one founding cell.
- Colorectal Cancer (CRC): Cancer that develops in the colon or rectum. It typically begins as small, noncancerous (benign) clumps of cells called polyps that form on the inside of the colon.
- Adenoma-Carcinoma Sequence: A widely accepted model describing the progression of CRC. It posits that most CRCs develop from benign adenomatous polyps that gradually accumulate genetic mutations, leading to malignant transformation.
- Whole-Exome Sequencing (WES): A genomic technique for sequencing all the protein-coding regions of genes in a genome (the exome). While it only covers about 1-2% of the human genome, it is cost-effective and covers the most functionally relevant parts where disease-causing mutations are concentrated.
- Whole-Genome Sequencing (WGS): A comprehensive genomic technique that determines the entire DNA sequence of an organism's genome at a single time. It provides a complete picture of all genetic variations, including those in non-coding regions.
- Single-Cell RNA Sequencing (scRNA-seq): A technique that measures the gene expression profiles of individual cells. Unlike bulk RNA-seq, which averages expression across a population of cells, scRNA-seq reveals cell-to-cell heterogeneity, identifies distinct cell types, and uncovers cell states and functions within a complex tissue.
- Apc Gene: The Adenomatous Polyposis Coli gene. It is a tumor suppressor gene, and mutations in Apc are a common initiating event in colorectal cancer development, particularly in familial adenomatous polyposis (FAP). Apc is a key component of the Wnt signaling pathway, which regulates cell proliferation and differentiation.
- AOM/DSS Model: A commonly used mouse model for inducing colorectal cancer. AOM (azoxymethane) is a genotoxic carcinogen that induces DNA damage, and DSS (dextran sodium sulfate) induces colonic inflammation. The combination mimics inflammation-driven CRC, similar to that seen in inflammatory bowel disease (IBD) patients.
- Apc-mutant Mice: A genetically engineered mouse model that carries a germline mutation in the Apc gene. These mice spontaneously develop multiple intestinal polyps (adenomas) and are widely used to study APC-driven intestinal tumorigenesis, particularly modeling FAP.
3.2. Previous Works
The paper builds upon a foundation of previous research, integrating various techniques and concepts:
- Clonality of Tumors: The prevailing view from early cancer genomic sequencing was that malignant tumors typically arise from a single founding clone (monoclonal origin) (Ref. 8). However, this paper acknowledges that earlier studies using somatic markers had already suggested that premalignant adenomas or even some CRCs could have polyclonal origins, where multiple distinct lineages expand concurrently (Ref. 9, 10). This paper aims to provide a more comprehensive and high-resolution picture of this debate, especially in the earliest stages.
- Lineage Tracing Techniques: The study significantly advances lineage tracing, which has revolutionized how cell lineages are recorded. Earlier methods included:
  - CRISPR-Cas9-based lineage tracing: Techniques like those by Chan et al. (Ref. 12) and Yang et al. (Ref. 14) use CRISPR-Cas9 to introduce mutations at specific target sites in DNA barcodes. These mutations accumulate over cell divisions, allowing phylogenetic reconstruction. The current paper differentiates itself by using SMALT, a base editor-based system, which offers higher mutation diversity and resolution.
- Inflammation-Induced Clonal Expansions: The paper notes that previous work by Olafsson et al. (Ref. 23) had already observed significant clonal expansions in the inflamed colon of patients with IBD. The current study's AOM/DSS mouse model further substantiates and quantifies these inflammation-driven clonal expansions with high-resolution lineage data.
- Selection in Cancer Evolution: The dN/dS ratio (ratio of non-synonymous to synonymous mutations) as a metric for quantifying selective pressure in cancer was established by Martincorena et al. (Ref. 24). This study applies this metric to assess selection in polyclonal versus monoclonal lesions.
- Computational Modeling of Tumor Evolution: Agent-based tumor simulations and approximate Bayesian computation are mentioned as methods used in prior studies (Ref. 25, e.g., by Hu et al. in colorectal cancer) to infer spatial dynamics and selection coefficients, which this paper also utilizes.
- Single-Gland Analysis in Colon: The concept that individual glands (crypts) in the colorectal epithelium often represent clonal cell populations was established by studies such as Lee-Six et al. (Ref. 31) and Saini et al. (Ref. 32). This forms the basis for the paper's single-gland WGS to assess clonality.
- Tumor-Associated Macrophage (TAM) Classification: The unified nomenclature for TAMs based on single-cell transcriptomics (Ref. 34) is referenced for classifying macrophage subtypes within the tumor microenvironment.
- Cell-Cell Communication Inference: Tools like CellChat (Ref. 36) are used to infer ligand-receptor interactions between cell types, building on methods designed to understand intercellular communication from scRNA-seq data.
- Models of Polyclonal Initiation: The paper references the "recruitment model" involving short-range interactions (Ref. 38, by Thliveris et al.) as a potential explanation for polyclonal initiation. It also cites Reeves et al. (Ref. 41), who showed that Hras-mutant clones can recruit neighboring wild-type epithelial cells, and the Allee effect (Ref. 42), which describes growth barriers for small populations, suggesting that recruitment could be selectively favored.
3.3. Technological Evolution
The field of cancer evolution research has seen a rapid technological advancement, particularly in high-throughput sequencing and single-cell analysis. This paper's work fits within this timeline by integrating several cutting-edge technologies:
- From Bulk to Single-Cell Genomics: Early cancer genomics relied on bulk sequencing, which provided an averaged view of tumor mutations. The evolution to single-cell genomics (scRNA-seq, single-cell lineage tracing) allowed for unprecedented resolution in dissecting tumor heterogeneity and evolutionary trajectories.
- From Somatic Markers to Programmable DNA Barcodes: Initial clonality studies often relied on endogenous somatic mutations or genetic markers. The development of synthetic, programmable DNA barcodes (e.g., using CRISPR-Cas9) enabled more precise and scalable lineage tracing. The SMALT system represents an evolution of this, moving from CRISPR-Cas9 (which primarily introduces indels) to base editors (which introduce point mutations), offering higher diversity and potentially finer resolution.
- Integration of Multi-Omics Data: The trend has moved towards integrating multiple types of data (genomics, transcriptomics, lineage information) from the same or matched samples. This paper exemplifies this by combining SMALT lineage tracing, WGS/WES, and scRNA-seq to provide a holistic view of both genetic evolution and cellular microenvironment changes.
3.4. Differentiation Analysis
Compared to the main methods in related work, the core differences and innovations of this paper's approach are:
- High-Resolution Lineage Tracing with SMALT: While CRISPR-Cas9-based lineage tracing methods (e.g., Ref. 12, 14) are powerful, they typically offer a limited number of mutable sites (10-60). The SMALT system, utilizing a base editor, significantly increases this capacity to an average of 836 mutable sites. This higher diversity provides superior resolution for phylogenetic mapping, yielding 3.3 times more internal branching events in phylogenetic trees compared to CRISPR-Cas9 systems, allowing for a much finer reconstruction of clonal relationships and expansion dynamics.
- Comprehensive Integration of Multi-Omics Data: The study doesn't rely solely on lineage tracing. It integrates high-resolution SMALT data with bulk WES, single-gland WGS, and scRNA-seq data from both mouse models and human patient samples. This multi-omics approach provides a robust and multifaceted view of tumor evolution, allowing for correlation between genetic changes, cellular states, intercellular communication, and clinical parameters.
- Focus on the Polyclonal-to-Monoclonal Transition: While previous studies acknowledged polyclonal origins, this paper specifically proposes and provides extensive evidence for a dynamic polyclonal-to-monoclonal transition as a common trajectory in early colorectal tumorigenesis. It quantifies Np (number of founding progenitors) and characterizes the genomic and microenvironmental features that distinguish these stages.
- Insights into Intercellular Interactions in Early Stages: The scRNA-seq component reveals that early polyclonal lesions are characterized by extensive intercellular interactions, particularly through ECM organization and cell adhesion. This highlights a cooperative aspect of early tumorigenesis that is lost as lesions become monoclonal, suggesting a mechanism beyond just independent clonal expansion. This provides a novel mechanistic understanding of the early tumor microenvironment.
4. Methodology
This section provides a detailed, step-by-step deconstruction of the paper's methodology, integrating the principles with the specific technical implementation and mathematical formulas where applicable.
4.1. Principles
The core idea of the method used in this paper is to leverage DNA barcoding to track cell lineages at single-cell resolution, coupled with genomic and transcriptomic profiling, to understand the evolutionary dynamics of precancerous lesions. The theoretical basis rests on the principle that mutations accumulate over cell divisions and are heritable. By engineering a DNA barcode that can be continuously mutated, each cell's lineage can be uniquely marked. Reconstructing phylogenetic trees from these mutations allows for discerning clonal relationships and expansion histories.
The SMALT (substitution mutation-aided lineage-tracing) system is central to this. It employs HsAID (an optimized Homo sapiens activation-induced cytidine deaminase), which induces C-to-T mutations in a targeted 3 kb DNA barcode. Crucially, these C-to-T mutations only become fixed after DNA replication, so the number of barcode mutations tracks the number of cell divisions. iSceI (an inactive variant of the homing nuclease I-SceI) is fused to HsAID and guides it to specific motifs within the barcode, ensuring targeted mutagenesis. Expression of this HsAID-iSceI fusion protein is inducible by doxycycline, allowing temporal control over barcode mutagenesis.
Once lineage trees are constructed, quantitative methods are applied to assess clonality. TarCA (targeting coalescent analysis) is used to estimate Np (the number of founding progenitor cells) by analyzing the probability of two random cells sharing a common ancestor within a monophyletic clade. WES and WGS are used to identify somatic mutations (SNVs, indels, SCNAs) and calculate CCFs (Cancer Cell Fractions), which help determine whether a lesion is monoclonal (one dominant clone) or polyclonal (multiple clones). Finally, scRNA-seq is employed to dissect the cellular states and intercellular communication networks within these lesions, providing insights into the microenvironmental factors that accompany clonal evolution.
4.2. Core Methodology In-depth (Layer by Layer)
The methodology involves a comprehensive multi-omics approach, spanning DNA barcoding, mouse modeling, sequencing, computational analysis of genomic and transcriptomic data, and quantitative phylogenetic inference.
4.2.1. SMALT System and Mouse Model Establishment
The study first establishes the SMALT lineage tracing system in mice.
- SMALT System Components (Fig. 1a, Extended Data Fig. 1a,b):
  - HsAID: An optimized human activation-induced cytidine deaminase with 10 amino acid substitutions relative to the wild type. It induces cytosine (C)-to-thymine (T) mutations at a rate approximately 30-fold faster than the wild type. The key property is that the deaminated cytosine (uracil, U) is converted to T only after DNA replication, so the accumulated C-to-T barcode mutations measure the relative number of cell divisions a cell has undergone.
  - iSceI: An inactive variant of the homing nuclease I-SceI. It binds specifically to an 18 bp motif, thereby guiding HsAID to target regions within the DNA barcode.
  - 3-kb DNA Barcode: This engineered barcode consists of 16 different tandem targets. Each target contains an 18 bp iSceI binding site and a 156 bp editing region where HsAID can induce mutations.
  - Inducible Expression: The expression of the HsAID-iSceI fusion protein is under the control of a doxycycline-inducible promoter. This allows for precise temporal control over when barcode mutagenesis begins.
  - Integration: The SMALT system (HsAID, iSceI, and the 3 kb DNA barcode) was knocked into the H11 locus of mouse embryonic stem cells to generate SMALT mice.
- Mouse Models of Intestinal Tumorigenesis (Fig. 1b, Extended Data Fig. 1c-e):
  - AOM/DSS Model (Inflammation-driven): Transgenic male mice were used. At 6-8 weeks, mice were fed doxycycline for 3 days. Then, a single intraperitoneal injection of AOM (dosed by body weight) was administered. Three days later, mice underwent three cycles of 2% DSS treatment (7 days of DSS in drinking water followed by 14 days of tap water per cycle). This model is used to study CRC induced by inflammation.
  - Apc-mutant Model (APC-driven): Male mice were mated with female mice to generate the experimental males. Female mice were fed doxycycline for 3 days before mating. This model is used for APC-driven intestinal tumorigenesis.
  - Sample Collection: At the experimental endpoint (e.g., around 30 weeks postnatal), normal colon, inflamed colon, AOM/DSS neoplasms, polyps, and other unaffected organs (blood, liver, lung) were collected.
  - Cell Sorting: Cells were sorted using MojoSort nanobeads. CD45+ cells (immune cells) and CD45- EpCAM+ cells (epithelial cells) were isolated, confirmed by FACS.
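As a quick consistency check on the barcode layout described under SMALT System Components, the stated target structure can be multiplied out. This assumes the 16 targets are contiguous and that no spacer or flanking sequence is counted, which the text does not state:

$ 16 \times (18 + 156)\ \mathrm{bp} = 16 \times 174\ \mathrm{bp} = 2784\ \mathrm{bp} \approx 3\ \mathrm{kb} $

Under that assumption the targets alone account for roughly 2.8 kb, in line with the "3-kb" label used throughout.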
4.2.2. Barcode Sequencing and Data Processing
- Library Preparation for PacBio Sequencing:
  - Genomic DNA was isolated from sorted cells.
  - A three-step PCR strategy was employed to amplify the 3-kb target barcode:
    - Lineage Amplification: One cycle of PCR using a P1 primer to incorporate 14-nucleotide Unique Molecular Identifiers (UMIs) onto each original DNA molecule.
    - Nested PCR Enrichment: Ten cycles of nested PCR amplification using P2 and P3 primers to enrich the indexed target molecules.
    - Sample Multiplexing: Final PCR amplification using P4 and P5 primers, which contain 6-nucleotide symmetric sample barcodes to enable multiplexing of samples.
  - PacBio SMARTbell library preparation was performed, followed by sequencing on the PacBio Sequel IIe platform, generating HiFi reads.
- Long-read Sequencing Data Processing:
  - Read Processing: Adapters were removed, and Circular Consensus Sequencing (CCS) reads were generated using pbccs v6.2.0.
  - Alignment: CCS reads were mapped to the reference 3-kb target barcode sequence using minimap2 v2.17 (with parameters -t 10 -A4 -B12 -O10,15 -E2,1 --score-N 0 --end-bonus 10 -a --MD -x map-pb). samtools was used for SAM file manipulation.
  - UMI Dereplication and Variant Calling: UMIs, 3-kb barcode sequence, and sample barcodes were annotated. UMIs were grouped using usearch v11.0.667 (with parameters -id 0.95 -gapopen 3.0l/2.0E -gapext 1.0l/0.5E -match +2.0 -mismatch 20.0 -sizeout), removing groups with fewer than 3 CCS reads. For each CCS group, reads were realigned and collapsed into a consensus CCS read, calling nucleotide substitutions with an overall mapping quality > 50 and a mutation frequency > 0.6.
  - Filtering for Downstream Analysis:
    - Only high-quality CCS reads with at least two mutations on the barcode were retained.
    - To remove normal cells from neoplastic samples:
      - If a neoplastic sample showed a bimodal distribution of mutation counts, cells in the lower mutation cluster were removed.
      - If not bimodal, cells with mutation counts lower than the 75th percentile of mutation counts in adjacent normal cells were removed.
    - After filtering, 260,922 cells were used for analysis.
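The consensus-calling and filtering rules above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' pipeline: the function names and the `(position, alt_base)` representation of a substitution are hypothetical, the mapping-quality check and the bimodal-distribution branch are omitted, and the percentile is computed with the standard library's inclusive method, which may differ from whatever the paper used.

```python
from collections import Counter
from statistics import quantiles

MIN_READS_PER_UMI = 3   # UMI groups with fewer CCS reads are dropped (from the text)
MIN_MUT_FREQ = 0.6      # a substitution must exceed this fraction of reads in the group
MIN_MUTATIONS = 2       # consensus reads need at least two barcode mutations


def consensus_mutations(reads):
    """Collapse one UMI group into a consensus set of substitutions.

    `reads` is a list of sets, one per CCS read, each holding hypothetical
    (position, alt_base) tuples. Returns None if the group is too small.
    """
    if len(reads) < MIN_READS_PER_UMI:
        return None
    counts = Counter(mut for read in reads for mut in read)
    return {mut for mut, n in counts.items() if n / len(reads) > MIN_MUT_FREQ}


def passes_mutation_filter(consensus):
    """Keep only consensus reads carrying at least MIN_MUTATIONS substitutions."""
    return consensus is not None and len(consensus) >= MIN_MUTATIONS


def drop_normal_like(neoplastic_counts, normal_counts):
    """Non-bimodal case: remove cells whose mutation count is below the
    75th percentile of mutation counts in adjacent normal cells."""
    cutoff = quantiles(normal_counts, n=4, method="inclusive")[2]
    return [c for c in neoplastic_counts if c >= cutoff]


if __name__ == "__main__":
    group = [{(10, "T"), (50, "T")}, {(10, "T"), (50, "T")}, {(10, "T"), (99, "T")}]
    cons = consensus_mutations(group)
    print(sorted(cons), passes_mutation_filter(cons))
    print(drop_normal_like([2, 4, 7, 3], [1, 2, 3, 4, 5]))
```

In the toy group, the substitution seen in only one of three reads falls below the 0.6 frequency cutoff and is discarded, while the other two survive into the consensus.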
4.2.3. Phylogenetic Reconstruction and Clonality Assessment
- Phylogenetic Tree Reconstruction:
  - Phylogenetic trees were reconstructed using a maximum-likelihood method implemented in IQ-TREE v2.2.2.7 (with parameters -T5 -oref -m GTR2+FO+R10).
  - The original reference 3-kb barcode sequence was used as the phylogenetic root.
  - GTR2+FO+R10 was selected as the optimal substitution model.
  - Robustness Evaluation: 1,000 rounds of ultrafast bootstrap approximation and 1,000 rounds of SH-like approximate likelihood ratio test (-alrt 1000) were performed.
  - Hotspot Sites: Sites with a mutation frequency > 0.04 and present in at least two normal samples were identified as hotspots (14 sites).
- Clonality Determination:
  - Monoclonal: Characterized by the presence of a single dominant monophyletic clade of neoplastic cells with shared clonal mutations.
  - Polyclonal: Identified when neoplastic cells were dispersed into multiple phylogenetic clades, intermixed with normal cells, and lacked clonal mutations.
- Quantifying Founding Progenitors (TarCA - Targeting Coalescent Analysis):
  - The TarCA v0.1.0 method (Ref. 22) was used to estimate the number of founding progenitor cells (Np) for each neoplastic lesion.
  - A "progenitor" is defined as an ancestral cell capable of founding a clonally expanding population in the lesion.
  - The effective number of progenitors (Np) is calculated as the inverse of Pr, the probability of two random neoplastic cells sharing a common progenitor in a monophyletic clade.
  - The formula for Np is: $ \mathrm{Np} = 1 / \mathrm{Pr} $ where $\mathrm{Np}$ is the effective number of progenitors.
  - The formula for Pr (probability of two random neoplastic cells sharing a common progenitor in a monophyletic clade of the phylogenetic tree) is: $ \mathrm{Pr} = \frac{\sum_i C_{m_i}^2}{C_{N_s}^2} = \frac{\sum_i m_i (m_i - 1)}{N_s (N_s - 1)} $ where:
    - $m_i$ is the number of sampled cells in the $i$-th monophyletic clade of neoplastic cells.
    - $N_s$ is the total number of neoplastic cells in the sample.
    - $C_X^2$ represents "X choose 2", the number of ways to choose 2 items from X without replacement, which is $X(X-1)/2$. The factor of $1/2$ cancels between numerator and denominator.
  - Robustness Evaluation: Np estimates were evaluated by downsampling: if a sample had >1,000 cells, 1,000 neoplastic cells were downsampled 20 times (including all normal cells); if <1,000 cells, $n$ normal or neoplastic cells were downsampled 20 times ($n$ being the smaller cell count).
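The Np calculation described for TarCA reduces to a short computation once the sizes of the monophyletic neoplastic clades are known. The sketch below is an illustration of the formula only, not the TarCA package itself: the function name is hypothetical, clade detection on the tree is assumed to have been done elsewhere, and the clade sizes in the usage example are invented.

```python
def effective_progenitors(clade_sizes, n_neoplastic=None):
    """Return (Pr, Np) from monophyletic neoplastic clade sizes.

    Pr = sum_i m_i (m_i - 1) / (N_s (N_s - 1)) and Np = 1 / Pr.
    If n_neoplastic (N_s) is not given, it is taken as the sum of clade sizes.
    """
    if n_neoplastic is None:
        n_neoplastic = sum(clade_sizes)
    if n_neoplastic < 2:
        raise ValueError("need at least two neoplastic cells")
    # Ordered pairs of cells drawn from the same clade.
    same_clade_pairs = sum(m * (m - 1) for m in clade_sizes)
    # Ordered pairs of cells drawn from the whole sample.
    all_pairs = n_neoplastic * (n_neoplastic - 1)
    pr = same_clade_pairs / all_pairs
    if pr == 0:
        raise ValueError("no two sampled cells share a clade; Np is undefined")
    return pr, 1.0 / pr


if __name__ == "__main__":
    # One clade holding every cell: Pr = 1, so Np = 1 (monoclonal).
    print(effective_progenitors([100]))
    # Four equally sized clades: Np comes out close to 4.
    print(effective_progenitors([25, 25, 25, 25]))
```

Because pairs are drawn without replacement, equal-sized clades give an Np slightly above the clade count (about 4.1 for four clades of 25 cells), and unequal clade sizes pull Np below the raw number of clades.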
- Estimating Timing of Progenitors (Extended Data Fig. 5a):
  - This method infers the start time of clonal expansions for polyp initiation.
  - The average barcode mutation burden in normal cells ($m_0$) is expressed as: $ m_0 = \mu_0 T $ where $\mu_0$ is the normal cell mutation rate per cell division, and $T$ is the total time from fertilized egg to polyp sampling.
  - The average mutation burden of neoplastic cells is computed within each monophyletic clade in the phylogenetic tree.
  - The ratio of the neoplastic cell mutation rate to the normal cell mutation rate is estimated from mutation accumulation data obtained from in vitro organoid cultures (see below).
  - The timing of progenitor initiation is then inferred using these values.
  - Organoid Experiments: Organoids derived from normal small intestine and neoplastic polyps of SMALT mice were cultured for 30 days with doxycycline. SMALT barcodes were sequenced at Day 0, 15, and 30. Linear regression of barcode mutations over time allowed estimation of the normal and neoplastic mutation rates and thus their ratio.
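The organoid step, in which mutation rates are read off as regression slopes at Day 0, 15, and 30, can be illustrated with an ordinary least-squares fit. This is a sketch under stated assumptions: the helper name is hypothetical, the mutation counts are invented purely to show the arithmetic, and the paper may have used a different regression procedure.

```python
def ols_slope(times, values):
    """Ordinary least-squares slope of values against times."""
    n = len(times)
    mean_t = sum(times) / n
    mean_v = sum(values) / n
    num = sum((t - mean_t) * (v - mean_v) for t, v in zip(times, values))
    den = sum((t - mean_t) ** 2 for t in times)
    return num / den


if __name__ == "__main__":
    days = [0, 15, 30]                 # sampling days from the text
    normal_burden = [10, 13, 16]       # invented mean barcode mutations per cell
    neoplastic_burden = [20, 26, 32]   # invented mean barcode mutations per cell

    rate_normal = ols_slope(days, normal_burden)         # mutations per day
    rate_neoplastic = ols_slope(days, neoplastic_burden)
    print(rate_normal, rate_neoplastic, rate_neoplastic / rate_normal)
```

With these made-up values the neoplastic organoids accumulate mutations twice as fast as the normal ones; it is this kind of rate ratio that the timing inference takes as input.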
4.2.4. Genomic Analysis of Human and Mouse Samples
- Whole-Genome Sequencing (WGS) of Mouse Samples:
  - High-quality genomic DNA was extracted from 34 mouse samples (normal, polyps, tumors).
  - Sequencing was performed on the Illumina NovaSeq platform (PE150), generating an average of 90 Gb of data (approximately 30x coverage) per sample.
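As a rough check on the mouse WGS figures, raw yield can be divided by genome size. The genome size of about 2.7 Gb used here is an approximate value for the mouse reference and is not taken from the source:

$ \frac{90\ \mathrm{Gb}}{2.7\ \mathrm{Gb}} \approx 33\times $

This is in line with the quoted figure of approximately 30x once some loss to duplicates, trimming, and unmapped reads is allowed for.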
- Whole-Exome Sequencing (WES) of Human Sporadic Polyp/CRC Cohort:
  - WES data were collected from 107 treatment-naive patients with synchronous sporadic premalignant polyps and CRCs.
  - Synchronous tumor, polyp, and adjacent normal samples were collected.
  - Genomic DNA was extracted, the NEBNext Ultra DNA Library Prep Kit was used for library construction, and the SureSelect XT Human All Exon V6 kit for exome capture.
  - Sequencing was performed on the Illumina NovaSeq 6000 platform (150 bp paired-end).
  - Histological Grading: Polyps were graded as low-grade or high-grade dysplasia based on H&E-stained images, considering architectural and cytological features.
- Detection of Somatic SNVs and SCNAs:
  - Preprocessing: Raw fastq files were preprocessed with fastp v0.19.7.
  - Alignment: Cleaned reads were aligned to the reference genome (mm10 for mouse, GRCh38 for human) using the BWA-MEM algorithm.
  - GATK Best Practices: Aligned reads were processed with MarkDuplicates, BaseRecalibrator, and ApplyBQSR (GATK v4.2.0.0).
  - SNV/Indel Calling: Mutect2 (Ref. 58) was used to identify SSNVs and indels for each tumor/normal or polyp/normal pair. A panel-of-normals (PoN) filter removed artefactual/germline variants. FilterMutectCalls extracted high-confidence variants.
  - Annotation: Somatic mutations were annotated using ANNOVAR v20200608.
  - SCNA Detection and Purity/Ploidy Estimation: TitanCNA v1.28.0 (Ref. 60) was used to detect SCNAs and estimate tumor purity and ploidy. Only samples passing the purity threshold were retained.
  - CCF Calculation: The Cancer Cell Fraction (CCF) for each somatic mutation was calculated by adjusting the observed VAF (variant allele frequency) based on tumor purity, local copy number, and multiplicity (Ref. 28). The exact procedure is tool-dependent, but a standard point estimate, assuming diploid normal cells, is: $ \mathrm{CCF} = \frac{\mathrm{VAF} \times (2(1 - \mathrm{Purity}) + \mathrm{Purity} \times \mathrm{CopyNumber})}{\mathrm{Purity} \times \mathrm{Multiplicity}} $ where:
    - $\mathrm{CCF}$ is the Cancer Cell Fraction.
    - $\mathrm{VAF}$ is the Variant Allele Frequency.
    - $\mathrm{Purity}$ is the estimated tumor cell purity.
    - $\mathrm{CopyNumber}$ is the total copy number of the genomic locus in the tumor.
    - $\mathrm{Multiplicity}$ is the number of mutated copies per tumor cell carrying the mutation.
  - Clonal vs. Subclonal: Mutations were classified as clonal if the upper bound of the 95% confidence interval of their CCF met the clonality cutoff; otherwise they were classified as subclonal.
  - Filtering SSNVs: Variants were retained based on a minimum number of variant reads in the polyp/tumor (with separate cutoffs for WES and WGS), a minimum total read depth, and a cutoff on variant support in the normal sample.
- Single-Gland WGS (Human):
  - Isolated 29 neoplastic glands from 5 polyps and 3 adjacent normal crypts from one sporadic patient (B139).
  - WGS (mean depth approximately 21x) was performed using the Vazyme TruePrep DNA Library Prep Kit V2 and Illumina NovaSeq.
  - Variant Calling: Mutect2 and Strelka v2.9.2 were used, and VariantFilter obtained a consensus call set.
  - High-Confidence Mutations: Retained if they met the study's thresholds, including cutoffs on variant reads and total reads in the gland, and had no reads in the matched bulk normal sample.
  - Phylogenetic Trees: Reconstructed using a neighbour-joining method (Ref. 66) in Biopython (Ref. 67), with the reference sequence as the root.
  - SCNA Estimation: The Sequenza v3.0.0 R package (Ref. 68) was used to estimate SCNAs relative to matched normal tissue, with separate cutoffs for amplification (AMP) and deletion (DEL).
- Clonal Relatedness for Polyp/CRC Pairs:
  - Breakclone v0.3.3 (Ref. 69) was used to assess clonal relatedness, incorporating population frequency and allele frequency.
  - It calculates a P-value (from a permutation test) and a clonal relatedness score.
  - Classification: Pairs were called related or unrelated according to cutoffs on the P-value and the clonality score, or ambiguous (remaining pairs).
- dN/dS Analysis:
  - The dN/dS ratio (ratio of non-synonymous to synonymous mutation rates) was estimated using the dndscv v0.0.1.0 R package (Ref. 24).
  - This analysis was applied separately to mutation sets within polyclonal polyps, monoclonal polyps, and CRCs.
  - dN/dS for putative CRC driver genes was extracted using the genesetdnds function.
  - Conceptual Definition: The dN/dS ratio, often denoted as $\omega$, is a measure of selective pressure acting on protein-coding genes. It quantifies the ratio of the rate of non-synonymous (amino acid changing) substitutions (dN) to the rate of synonymous (silent, non-amino acid changing) substitutions (dS). A dN/dS ratio $> 1$ suggests positive selection (mutations provide a fitness advantage), $< 1$ suggests purifying selection (deleterious mutations are removed), and $\approx 1$ suggests neutral evolution. In cancer, a high dN/dS can indicate that driver mutations are being selected for.
  - Mathematical Formula: $ \omega = \frac{dN}{dS} $ where dN is the number of non-synonymous substitutions per non-synonymous site, and dS is the number of synonymous substitutions per synonymous site. These rates are calculated by comparing sequences and accounting for the genetic code and codon usage.
  - Symbol Explanation:
    - $\omega$: The dN/dS ratio.
    - dN: The rate of non-synonymous substitutions, calculated as the number of non-synonymous mutations divided by the number of non-synonymous sites.
    - dS: The rate of synonymous substitutions, calculated as the number of synonymous mutations divided by the number of synonymous sites.
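The purity and copy-number adjustment behind the CCF step in this subsection can be sketched as a point estimate. This is a simplification, not TitanCNA's probabilistic model: it assumes diploid normal cells (copy number 2) and the standard relation between expected VAF and CCF. The function name, parameter names, and example numbers are illustrative assumptions.

```python
def cancer_cell_fraction(vaf, purity, tumor_copy_number,
                         multiplicity=1, normal_copy_number=2):
    """Point estimate of CCF from an observed variant allele frequency.

    Expected VAF = purity * multiplicity * CCF
                   / (purity * tumor_copy_number
                      + (1 - purity) * normal_copy_number)
    Solving for CCF gives the expression computed below.
    """
    if not 0 < purity <= 1:
        raise ValueError("purity must lie in (0, 1]")
    if multiplicity < 1:
        raise ValueError("multiplicity must be at least 1")
    # Average number of copies of the locus per cell in the mixed sample.
    total_copies = (purity * tumor_copy_number
                    + (1 - purity) * normal_copy_number)
    ccf = vaf * total_copies / (purity * multiplicity)
    # Sampling noise can push the estimate above 1, so cap it.
    return min(ccf, 1.0)


# Hypothetical heterozygous mutations at a diploid locus in a 60%-pure sample.
clonal_like = cancer_cell_fraction(vaf=0.30, purity=0.6, tumor_copy_number=2)
subclonal_like = cancer_cell_fraction(vaf=0.15, purity=0.6, tumor_copy_number=2)
```

In this toy case a VAF of 0.30 maps to a CCF of about 1 (consistent with a clonal mutation), while a VAF of 0.15 maps to about 0.5.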
4.2.5. Single-Cell RNA Sequencing and Interaction Analysis
- scRNA-seq Sample Preparation:
  - Colons from nine AOM/DSS mice were dissected, and tumors were cut into pieces.
  - Single-cell suspension was prepared using MACS Tissue Dissociation Kits, followed by digestion, filtration, and resuspension.
  - Library generation was performed using 10x Genomics v2 chemistry and Chromium Single Cell 5' Reagent Kits.
- scRNA-seq Data Processing and Clustering:
  - Alignment and Quantification: Raw scRNA-seq data were aligned to the mm10 reference genome, and UMIs were quantified using Cell Ranger v7.1.
  - Integration: 2,619 additional normal colon cells from a previous study (GSE134255, Ref. 33) were integrated as normal controls.
  - Quality Control (QC): Cells were retained if they passed thresholds on the number of detected genes and on mitochondrial gene expression, and were identified as singlets by DoubletFinder v2.0.3 (Ref. 74). Genes expressed in too few cells were filtered out.
  - Normalization: sctransform v2 (Ref. 75, 76) was used for normalization.
  - Dimensionality Reduction and Integration: PCA was performed, and the top 50 significant principal components were selected. Harmony v1.1.0 (Ref. 77) was used to remove batch effects and integrate data.
  - Clustering: The FindNeighbors and FindClusters functions (Louvain algorithm) were used (resolution 0.1 for the first round, 0.4 for the second round). This resulted in 8 main cell types and 26 subclusters (10 for epithelial, 7 for macrophages).
- Differential Gene Expression and Annotation:
  - The Seurat FindAllMarkers and FindMarkers functions (Wilcoxon rank-sum test) identified DEGs.
  - Canonical markers and DEGs were used for cell type annotation.
  - Gene Set Enrichment Analysis (GSEA) was performed using the clusterProfiler package (Ref. 78) with Gene Ontology and MSigDB hallmark gene sets.
  - The Gene Set Variation Analysis (GSVA) package (Ref. 79) was used to score gene sets for individual cells, particularly for macrophage classification.
- Differential Cell Abundance Analysis:
  - miloR v1.8.1 (Ref. 80) was used to identify changes in cell abundance across cell types.
  - A KNN graph was constructed on the Harmony space. A sampling refinement algorithm selected 10% of cells, and neighborhoods were formed. Spatial FDR (corrected P-value) was used.
- Cell-Cell Communication Analysis:
  - CellChat v1.6.1 (Ref. 36) inferred interactions between epithelial cell types within each sample. It considers expression levels, structural components, soluble agonists, etc.
  - Quantification: The intensity of interactions was measured by the total number of inferred ligand-receptor pairs.
  - Downsampling: To compare across samples, 50 downsamplings were performed, each with 689 epithelial cells.
  - Correlation: The average number of ligand-receptor pairs per sample was correlated with lesion clonality.
  - Differential Interaction Analysis: 1,543 candidate ligand-receptor pairs from CellChatDB were compared between early low-clonality lesions (6 samples) and late high-clonality lesions (3 samples) using the Wilcoxon rank-sum test.
  - Validation: MultiNicheNetR v1.0.3 (Ref. 81) was used as an orthogonal method to identify significantly altered ligand-receptor pairs, considering downstream target gene regulation.
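The equal-size downsampling used to compare interaction counts across samples can be sketched generically. CellChat itself is an R package, so the Python below only illustrates the resampling logic; `count_interactions` is a hypothetical placeholder for whatever routine returns the number of inferred ligand-receptor pairs for a set of cells, and the default sizes simply mirror the 50 rounds of 689 cells described above.

```python
import random


def mean_over_downsamples(cells, count_interactions,
                          size=689, n_rounds=50, seed=0):
    """Average a per-sample statistic over repeated equal-size downsamples.

    cells: a sequence (e.g. list) of cell identifiers from one sample.
    count_interactions: hypothetical callable mapping a list of cells
        to a number, such as the count of inferred ligand-receptor pairs.
    """
    if len(cells) < size:
        raise ValueError("sample has fewer cells than the downsampling size")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    counts = []
    for _ in range(n_rounds):
        subset = rng.sample(cells, size)  # draw without replacement
        counts.append(count_interactions(subset))
    return sum(counts) / len(counts)
```

Fixing the number of cells per draw keeps samples with many epithelial cells from showing more interactions purely because of their size.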
5. Experimental Setup
5.1. Datasets
The study utilized a combination of mouse and human tissue samples, processed with various high-throughput techniques to generate multi-omics datasets.
- Mouse Models:
  - SMALT Lineage Tracing:
    - AOM/DSS neoplasms (IBD_T): 30 samples.
    - ApcMin/+ polyps (Apc_P): 17 samples.
    - Normal and inflamed intestinal tissues: 26 samples (including normal colon, inflamed colon, small intestine).
    - Unaffected organs: 18 samples from 6 organs (blood, liver, lung) in 3 SMALT mice.
    - Total single cells for phylogenetic analysis: 260,922 cells across 112 normal and neoplastic samples.
  - Whole Genome Sequencing (WGS):
    - 11 samples and 4 samples, each with matched healthy samples.
    - Total of 34 mouse samples for WGS.
  - Single-Cell RNA Sequencing (scRNA-seq):
    - 9 AOM/DSS neoplasm samples.
    - Integrated with 2,619 normal colon cells from a public dataset (GSE134255, Ref. 33).
    - Total high-quality cells for scRNA-seq analysis: 45,620.
  - Organoid Cultures:
    - Organoids derived from normal small intestine and neoplastic polyps of SMALT mice for in vitro mutation rate estimation.
- Human Samples:
  - Whole-Exome Sequencing (WES):
    - A cohort of 107 treatment-naive patients with sporadic premalignant polyps and synchronous CRCs.
    - For each patient, synchronous tumor, polyp, and adjacent normal samples were collected.
    - After purity filtering, 102 polyps and 86 CRCs were retained for analysis.
  - Single-Gland Whole-Genome Sequencing (WGS):
    - One additional sporadic patient (B139, male, 73 years old).
    - 29 neoplastic glands isolated from 5 separate polyps.
    - 3 adjacent normal crypts.
- Data Characteristics and Domain:
  - The datasets cover various stages of colorectal tumorigenesis, from normal tissue to precancerous lesions (polyps/adenomas) and malignant tumors, across both genetically engineered mouse models and human sporadic cases.
  - This allows for comparative analysis between species and different etiologies (inflammation-driven vs. APC-driven).
  - The inclusion of multi-omics data (DNA barcodes, bulk DNA mutations, single-cell RNA expression) provides complementary information on genetic evolution and cellular phenotypes.

These datasets suit the study's aims. Mouse models provide a controlled environment for SMALT lineage tracing, allowing precise tracking from initiation. Human WES and WGS data from polyps and CRCs offer direct clinical relevance and validation of findings in a human context. The single-gland WGS provides a way to confirm clonality at a micro-anatomical level in human tissue.
5.2. Evaluation Metrics
For every evaluation metric mentioned in the paper, here is a complete explanation:
- Barcode Mutation Count:
  - Conceptual Definition: This metric quantifies the absolute number of C-to-T point mutations identified within the engineered 3kb SMALT DNA barcode sequence of a single cell. Since these mutations are induced by HsAID and fixed upon DNA replication, a higher barcode mutation count indicates a greater number of cell divisions or generations a cell has undergone. It serves as a proxy for proliferative activity or cellular age.
  - Mathematical Formula: Not a single formula, but derived by counting unique C-to-T substitutions in the 3kb barcode sequence of each CCS read (representing a single cell) after bioinformatics processing.
  - Symbol Explanation: Not applicable as it's a direct count.
- Number of Unique Barcode Alleles:
  - Conceptual Definition: This metric assesses the genetic diversity within a sample by counting how many distinct 3kb barcode sequences (alleles) are present. A high number of unique alleles indicates extensive genetic heterogeneity and a diverse population of cells, suggesting a lack of strong clonal selection or recent common ancestry.
  - Mathematical Formula: Not a single formula, but a count of distinct barcode sequences identified in the filtered CCS reads from a given sample.
  - Symbol Explanation: Not applicable.
- Number of Founding Progenitors (Np):
  - Conceptual Definition: Np, estimated using TarCA (targeting coalescent analysis), quantifies the effective number of ancestral cells that successfully initiated and contributed to the clonally expanding population within a lesion. An Np value of 1 indicates a monoclonal origin, while values greater than 1 suggest a polyclonal origin. It reflects the breadth of initial cellular involvement in tumorigenesis.
  - Mathematical Formula: $ \mathrm{Np} = 1 / \mathrm{Pr} $ where $\mathrm{Np}$ is the effective number of progenitors. The probability $\mathrm{Pr}$ is calculated as: $ \mathrm{Pr} = \frac{\sum C_{m_i}^2}{C_{N_s}^2} = \frac{\sum (m_i \times (m_i - 1))}{N_s \times (N_s - 1)} $
  - Symbol Explanation:
    - $\mathrm{Np}$: The effective number of founding progenitor cells.
    - $\mathrm{Pr}$: The probability that two randomly selected neoplastic cells from the sample share a common progenitor within a monophyletic clade in the phylogenetic tree.
    - $m_i$: The number of sampled cells belonging to the $i$-th monophyletic clade of neoplastic cells within the phylogenetic tree.
    - $N_s$: The total number of neoplastic cells sampled from the lesion.
    - $C_X^2$: Denotes the number of combinations of choosing 2 items from a set of $X$ items, which equals $X(X-1)/2$. The factor of $1/2$ cancels between the numerator and the denominator of the full formula.
- Genome-wide Mutation Burden:
  - Conceptual Definition: This metric represents the total number of somatic mutations (single nucleotide variants (SNVs), short insertions/deletions (indels)) identified across the entire sequenced genome (for WGS) or exome (for WES) of a tumor or polyp sample, relative to its matched normal tissue. A higher burden indicates more genetic instability and accumulated damage.
  - Mathematical Formula: Not a single formula; it's a direct count of detected SSNVs and indels after filtering false positives.
  - Symbol Explanation: Not applicable.
- Putative Driver Mutation Burden:
  - Conceptual Definition: This metric specifically counts the number of somatic mutations that occur in genes known or strongly suspected to play a causal role in cancer development (driver genes). It indicates the accumulation of functionally significant genetic alterations that drive tumor growth and progression.
  - Mathematical Formula: Not a single formula; it's a count of mutations found in a curated list of cancer driver genes.
  - Symbol Explanation: Not applicable.
- Clonal Expansion Score:
  - Conceptual Definition: This score quantifies the degree of clonal expansion within a tissue by measuring the phylogenetic similarity between randomly chosen pairs of cells. A higher score indicates that cells within the sample are more closely related phylogenetically, suggesting significant proliferation from common ancestors.
  - Mathematical Formula: Not explicitly provided in the main text, but typically involves calculating a measure of genetic distance or shared ancestry between cell pairs within the reconstructed phylogenetic tree. For example, it could be based on the proportion of shared mutations or the depth of their most recent common ancestor.
  - Symbol Explanation: Not applicable.
- Proliferative Fitness:
  - Conceptual Definition: This metric reflects the relative growth advantage of a cell lineage or clone. In the context of SMALT data, it can be estimated from the rate of barcode mutation accumulation or the size and growth dynamics of a specific clade within the phylogenetic tree. Higher fitness implies a greater capacity for rapid proliferation and expansion.
  - Mathematical Formula: Not explicitly provided, but often inferred from the growth dynamics of clades or relative increases in cell numbers associated with certain mutations. In the paper, it is likely derived from the barcode mutation burden relative to time or the proportion of cells belonging to a given clone.
  - Symbol Explanation: Not applicable.
- dN/dS Ratio ($\omega$):
  - Conceptual Definition: The dN/dS ratio, or $\omega$, is a measure of selective pressure on protein-coding genes. It compares the rate of non-synonymous (amino acid-altering) substitutions (dN) to the rate of synonymous (silent, non-amino acid-altering) substitutions (dS).
    - $\omega > 1$: Positive selection (mutations provide a fitness advantage and are favored).
    - $\omega < 1$: Purifying selection (deleterious mutations are removed).
    - $\omega \approx 1$: Neutral evolution (mutations are neither favored nor disfavored). In cancer, $\omega > 1$ for a gene indicates it's a driver, undergoing positive selection.
  - Mathematical Formula: $ \omega = \frac{dN}{dS} $ where dN is the number of non-synonymous substitutions per non-synonymous site and dS is the number of synonymous substitutions per synonymous site.
  - Symbol Explanation:
    - $\omega$: The dN/dS ratio.
    - dN: The rate of non-synonymous substitutions (number of non-synonymous mutations normalized by the number of possible non-synonymous sites).
    - dS: The rate of synonymous substitutions (number of synonymous mutations normalized by the number of possible synonymous sites).
- Cancer Cell Fraction (CCF):
  - Conceptual Definition: The CCF represents the proportion of tumor cells in a given sample that harbor a specific somatic mutation. It's a crucial metric for distinguishing between clonal mutations (present in virtually all cancer cells, CCF close to 1) and subclonal mutations (present in a subset of cancer cells, CCF below 1). CCF accounts for tumor purity, ploidy, and local copy number.
  - Mathematical Formula: The calculation of CCF is often performed using computational tools (TitanCNA in this paper) that involve complex probabilistic modeling, integrating VAF (Variant Allele Frequency), tumor purity, and local DNA copy number. A simplified point estimate, assuming diploid normal cells, is: $ \mathrm{CCF} = \frac{\mathrm{VAF} \times (2(1 - \mathrm{Purity}) + \mathrm{Purity} \times \mathrm{CopyNumber})}{\mathrm{Purity} \times \mathrm{Multiplicity}} $
  - Symbol Explanation:
    - $\mathrm{CCF}$: Cancer Cell Fraction.
    - $\mathrm{VAF}$: The observed Variant Allele Frequency from sequencing reads for a specific mutation.
    - $\mathrm{Purity}$: The estimated proportion of tumor cells within the total cells sampled (ranging from 0 to 1).
    - $\mathrm{CopyNumber}$: The estimated total DNA copy number of the genomic region where the mutation is located in the tumor cells.
    - $\mathrm{Multiplicity}$: The number of mutated copies per tumor cell carrying the mutation.
- Ligand-Receptor Interactions (Inferred by CellChat):
  - Conceptual Definition: This metric quantifies the potential for communication between different cell types (or subtypes) based on the expression levels of known ligand-receptor pairs. CellChat infers these interactions by considering the expression of ligand genes in "sender" cells and receptor genes in "receiver" cells, along with complex factors like multimeric complexes and co-receptors. A higher number or strength of inferred interactions indicates more active intercellular communication.
  - Mathematical Formula: CellChat uses a statistical model to infer interactions. While no single formula is given for the "number" of interactions, the process involves computing a communication probability for each ligand-receptor pair based on cell group expression.
  - Symbol Explanation: Not applicable as it's an output of a software tool based on complex models.
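Of the metrics above, Np has a closed form that is easy to check by hand. The sketch below computes Np = 1/Pr from clade sizes. It assumes the monophyletic clade sizes have already been read off the phylogeny and that every sampled neoplastic cell belongs to exactly one clade; the clade-calling step of TarCA is not reproduced, and the function name and example sizes are illustrative.

```python
def effective_progenitors(clade_sizes):
    """Np = 1 / Pr, where Pr is the probability that two sampled neoplastic
    cells fall in the same monophyletic clade.

    clade_sizes: list of m_i, the number of sampled cells in each clade.
    Assumes sum(clade_sizes) equals N_s, the total sampled neoplastic cells.
    """
    n_cells = sum(clade_sizes)
    if n_cells < 2:
        raise ValueError("need at least two sampled cells")
    # The 1/2 in each 'choose 2' term cancels, leaving m*(m-1) over N*(N-1).
    same_clade_pairs = sum(m * (m - 1) for m in clade_sizes)
    if same_clade_pairs == 0:
        raise ValueError("no clade holds two cells, so Pr is 0 and Np is undefined")
    pr = same_clade_pairs / (n_cells * (n_cells - 1))
    return 1 / pr


# Hypothetical lesions with 100 sampled neoplastic cells each.
np_single_clade = effective_progenitors([100])
np_two_equal = effective_progenitors([50, 50])
np_uneven = effective_progenitors([90, 5, 5])
```

A single clade gives Np = 1, two equal clades give a value just above 2, and a lesion dominated by one clade gives a value close to 1 even though three clades are present, which is why Np is an effective rather than a raw count.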
5.3. Baselines
The paper compared its findings and methods against several baselines and control groups to establish significance and novelty:
- Mouse Models:
  - Normal Tissue Controls: Wild-type normal colon (WT_N), normal small intestine (Apc_N), and IBD normal tissue (IBD_N) were consistently used as controls to establish baseline barcode mutation rates, cell type proportions, and clonal expansion levels in non-diseased states.
  - Adjacent Normal Cells: For both AOM/DSS and ApcMin/+ lesions, adjacent normal cells from the same mice were used to differentiate neoplastic from normal cellular properties and to filter out plausible normal cells within neoplastic samples during barcode processing.
  - Unaffected Organs: Samples from blood, liver, and lung in SMALT mice served as further controls for background mutation rates and cell characterization.
  - CRISPR-Cas9 Lineage Tracing: The branching index of SMALT phylogenetic trees was quantitatively compared to those generated by previous CRISPR-Cas9 lineage tracing studies (Ref. 12, 14) to highlight SMALT's superior resolution.
- Human Sporadic Polyp/CRC Cohort:
  - Monoclonal Polyps and CRCs: Polyclonal polyps were compared against monoclonal polyps and malignant CRCs within the human cohort to characterize differences in mutation burden, driver mutations, size, and dysplasia grade, supporting the polyclonal-to-monoclonal transition model.
  - Matched Normal Tissue: For WES and single-gland WGS, matched normal tissue (adjacent normal colon or normal crypts) was used as a baseline for somatic mutation calling and SCNA detection.
- scRNA-seq Analysis:
  - Public scRNA-seq Data: scRNA-seq data from wild-type mouse normal colon (GSE134255, Ref. 33) was integrated into the study's dataset as a normal control, allowing for comparison of cell type composition and intercellular interactions between normal and neoplastic tissues.
  - Late Monoclonal Lesions: Early polyclonal lesions were compared to late monoclonal lesions in AOM/DSS neoplasms to identify differences in cell state dynamics and intercellular communication during the transition.
- Computational Methods:
  - CellChat was used to infer ligand-receptor interactions, and its results were validated using an orthogonal method, MultiNicheNetR, which serves as a cross-validation baseline for the communication analysis.

These baselines are representative: they include healthy controls, established disease models, human clinical samples, and comparisons with existing techniques, which helps show that the observed findings are robust, specific to the disease, and an advance over previous methodologies.
6. Results & Analysis
6.1. Core Results Analysis
The paper presents a comprehensive set of results across mouse models and human samples, integrating diverse experimental techniques to support the central hypothesis of a polyclonal-to-monoclonal transition in colorectal precancer.
6.1.1. High-Resolution Lineage Tracing with SMALT in Mouse Models
- SMALT System Validation: The SMALT system demonstrated high in vivo mutagenesis activity and lineage barcoding capacity. More than 90% of mutations were C/G-to-T/A, as expected from HsAID. Mutations were widely distributed across the 3kb barcode, with an average of 836 mutable sites per sample, far surpassing CRISPR-Cas9 methods (typically 10-60 sites). This high diversity was reflected in approximately 90% of cells exhibiting a unique combination of mutations (Fig. 1f,g,h).
- Neoplastic vs. Normal Cell Proliferation: Neoplastic cells consistently showed significantly higher barcode mutation counts than normal cells from adjacent tissues or other organs (Fig. 1f, Extended Data Fig. 2b). This allowed for effective separation of neoplastic cells. AOM/DSS lesions (4.3-fold increase) had a higher mutation burden increase than ApcMin/+ lesions (2.8-fold increase) compared to normal cells, likely due to AOM's mutagenic effect.

The following figure (Figure 1 from the original paper) illustrates the SMALT lineage tracing system and key initial findings:

This image is a schematic showing the application of the SMALT lineage tracing system to mouse intestinal tumorigenesis. It includes the proportions of sample types (c), the mutation frequencies across samples (d), and the mouse models used in the experiments (b). The accompanying values and related data show the results of comparisons between samples, emphasizing the importance of the polyclonal-to-monoclonal transition in tumor progression.

Fig. 1 | SMALT lineage tracing of mouse intestinal tumorigenesis. a, Schematic of the SMALT lineage tracing system. b, Intestinal tumorigenesis with AOM/DSS or ApcMin/+ mice carrying the engineered SMALT system in the germline. Normal and neoplastic samples were collected for long-read sequencing of lineage barcodes (barcode-seq), WGS and scRNA-seq. WT, wild type. c, The relative proportions of distinct substitution types in barcodes across all samples. Apc_N, normal small intestine in ApcMin/+ mice; Apc_P, polyps in ApcMin/+ mice; IBD_N, normal tissue in AOM/DSS mice; IBD_T, neoplasms in AOM/DSS mice; WT_N, wild-type normal colon. d, Per site mutation frequency on barcodes across all samples. e, Correlation of per site mutation frequency between mouse and fruit fly. Pearson's r and P value are shown. f, Violin plot showing the number of barcode mutations per cell in different tissues. The mean number of mutations and the number of cells are shown. P values are by two-sided Wilcoxon rank-sum test. g, The proportion of unique barcode alleles and the number of cells with the unique barcode in WT_N, Apc_N, IBD_N, Apc_P and IBD_T, respectively. Data are mean ± s.e.m. h, The proportion of unique barcode alleles and the number of samples with the unique barcode. In box plots, the horizontal line is the median, the box delineates the 25th to 75th centiles, and whiskers extend to 1.5 times the interquartile range.
6.1.2. Polyclonal Origins and Transition in Mouse Models
- Inflammation-Driven Lesions (AOM/DSS):
  - SMALT lineage tracing revealed that the majority (66.7%, 20 out of 30) of AOM/DSS neoplasms had polyclonal origins (Fig. 2a,b, Extended Data Fig. 3). This indicates parallel expansions of multiple distinct cell lineages within the same lesion.
  - The Np (number of founding progenitors) varied from 2 to 33 for polyclonal lesions, robustly estimated by TarCA (Fig. 2e).
  - Monoclonal lesions (10 out of 30) exhibited significantly more barcode mutations, higher genome-wide mutation burdens, and more putative driver mutations compared to polyclonal lesions (Fig. 2f,g,h). This suggests that monoclonal lesions represent a more advanced stage of tumorigenesis.
  - Inflamed colons showed greater clonal expansion than normal colons (Fig. 2i,j). Monoclonal lesions had greater expansion and higher proliferative fitness than polyclonal lesions (Fig. 2i,k). A higher dN/dS ratio in monoclonal lesions indicated stringent selection during somatic evolution (Supplementary Fig. 14c).
  - Spatial computational inference further suggested strong subclonal selection even after monoclonal transition.

The following figure (Figure 2 from the original paper) presents single-cell phylogenies of inflammation-driven neoplasms:

This image is a chart showing single-cell lineage analysis results for different intestinal tumor samples. The part labeled "Polyclonal" on the left shows polyclonal samples, and the part labeled "Monoclonal" on the right shows monoclonal samples. The tree for each sample shows the evolutionary relationships among its cells, annotated with the sample's total cell count (n) and its number of clones (Np). These data support the polyclonal-to-monoclonal transition model and indicate the stage of tumor development of each sample.

Fig. 2 | Single-cell phylogenies reveal the origin of inflammation-driven neoplasms. a,b, Single-cell phylogeny (left) and corresponding barcode mutations (right) for a representative monoclonal lesion (a; lesion IBD4_T) and a representative polyclonal lesion (b; lesion IBD50_T). c, Bootstrapping values of the phylogenetic tree for IBD4_T (top) and IBD50_T (bottom). d, The branching index of SMALT trees in this study compared with CRISPR-Cas9 lineage trees from two previous studies (ref. 12 and ref. 14). e, The number of founding progenitors (Np) estimated from single-cell phylogeny. For each lesion, Np was estimated 20 times using downsampled cells. f, The barcode mutation count per cell in monoclonal versus polyclonal lesions. g,h, Total somatic mutation burden or putative driver mutation burden in WGS data of monoclonal versus polyclonal lesions. i, Clonal expansion scores calculated using 1,000 downsampled cell pairs ranked by the median clonal expansion scores within each sample type. j, A representative single-cell phylogeny for inflamed normal colon. Lineages exhibiting clonal expansions are highlighted in colour. k, Single-cell fitness scores in monoclonal versus polyclonal lesions. d,f-h, P values by two-sided Wilcoxon rank-sum test.
- ApcMin/+ Lesions:
  - All 17 individual polyps examined from ApcMin/+ mice exhibited polyclonal origins (Extended Data Fig. 4a). Np ranged from 4 to approximately 100, indicating even higher polyclonality than in AOM/DSS lesions (Extended Data Fig. 5).
  - Some regions within a polyp showed stronger, independent clonal expansions (e.g., P5-1 and P5-5 in Apc68_P5), characterized by monophyletic subtrees and higher proliferative fitness (Extended Data Fig. 4b,c,e).
  - Timing analysis suggested that neoplastic initiation in ApcMin/+ mice occurred early (59-130 postnatal days), corresponding to infancy in human FAP patients (Extended Data Fig. 5g).
6.1.3. Polyclonal-to-Monoclonal Transition in Human Sporadic Polyps
- WES Analysis of Human Cohort:
  - Analysis of WES data from 107 patients with sporadic polyps and synchronous CRCs revealed that polyclonality was more common in polyps (29.4%, 30/102) than in CRCs (8.1%, 7/86) (Fig. 3b).
  - Polyclonal polyps had fewer SSNVs and SCNAs than monoclonal polyps, both having lower mutational burdens than CRCs (Fig. 3c, Extended Data Fig. 6).
  - Polyclonal polyps were typically smaller, more commonly exhibited low-grade dysplasia, and were found in younger patients (Fig. 3d,e,g). These clinical features further supported that monoclonality represents a more advanced stage.
  - KRAS mutations were significantly more common in monoclonal polyps (34.7%) and CRCs (40.5%) than in polyclonal polyps (6.7%), suggesting KRAS mutations confer a selective advantage for clonal outgrowth and monoclonal transition (Fig. 3h,i).
  - The overall selective strength (dN/dS) for driver mutations was higher in monoclonal polyps than in polyclonal ones (Supplementary Fig. 23).
  - These human genomic and clinical data support the polyclonal-to-monoclonal transition trajectory observed in mouse models (Fig. 3j).

The following figure (Figure 3 from the original paper) summarizes the polyclonal-to-monoclonal transition in human sporadic polyps:

This image is a schematic showing single-cell lineage trees of mouse intestinal tumors under different genotypes (Apc68 and Apc72), illustrating the dynamics of the polyclonal-to-monoclonal transition. It contains multiple branches representing parallel expansions of independent cell lineages. The overall layout provides a high-resolution lineage reconstruction that supports the model in which polyclonal tumors develop into monoclonal ones.

Fig. 3 | Polyclonal-to-monoclonal transition in human sporadic polyps. a, A human cohort with sporadic premalignant polyps, including 107 patients with synchronous polyps and CRC. b, The distribution of CCFs reveals the clonality of each lesion. One representative polyclonal polyp (P_poly, B046P) and one monoclonal polyp (P_mono, B002P) are shown. c, The total somatic mutation burden in P_poly, P_mono or CRCs after removing samples with low purity. d, Distribution of small (<1 cm) and large (≥1 cm) polyps. e, Distribution of low-grade and high-grade dysplasias. f, Representative images of haematoxylin and eosin (H&E) staining. g, Age distribution of participants. h, Percentage of patients carrying indicated putative driver mutations (one-sided Fisher's exact test). Highlighted genes pass the Benjamini-Hochberg FDR cutoff. i, The CCFs of putative driver genes. j, Schematic of polyclonal-to-monoclonal transition in the somatic evolution of premalignant polyp and its subsequent malignant transformation. k, Schematic of WGS of single glands from five premalignant polyps (P1-P4 and P6) in a patient with sporadic polyps (B139). N, normal tissue; R, regions within polyps. l, Images of individually isolated glands. m, Integrated phylogenetic tree including 3 normal glands and 29 neoplastic glands from 5 polyps. Putative driver mutations shared by multiple glands are labelled. P values by two-sided Wilcoxon rank-sum test (c,g) or Fisher's exact test (d,e). Graphics in a,k adapted from Servier Medical Art (CC BY 4.0).
- Single-Gland WGS Validation:
  - WGS of 29 neoplastic glands from 5 polyps in one patient (B139) further confirmed polyclonal origins. For polyps P1, P2, and P4, the most recent common ancestor of their glands was near the phylogenetic root, with no putative driver mutations in the trunk, indicating polyclonal initiation (Fig. 3m).
  - Partially shared driver mutations (e.g., FAT3 in P1, JAK1 in P4) suggested clonal expansion within polyclonal lesions.
  - Many neoplastic glands (18/29) exhibited non-APC driver mutations (CTNNB1, JAK1, FAT3), implying diverse early drivers beyond APC in sporadic polyps.
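
The single-gland reasoning above (no driver in the trunk implies polyclonal initiation; drivers shared by only some glands imply expansion within the lesion) can be illustrated with a minimal Python sketch. This is not the authors' pipeline: the gland names, mutation labels, and helper functions are hypothetical, and the trunk is approximated simply as the set of mutations shared by every gland of a polyp.

```python
from collections import Counter
from typing import Dict, Set


def trunk_mutations(glands: Dict[str, Set[str]]) -> Set[str]:
    """Mutations present in every gland (a simple proxy for the phylogenetic trunk)."""
    gland_sets = list(glands.values())
    if not gland_sets:
        return set()
    return set.intersection(*gland_sets)


def partially_shared(glands: Dict[str, Set[str]], drivers: Set[str]) -> Set[str]:
    """Driver mutations found in more than one gland but not in all of them."""
    counts = Counter(m for muts in glands.values() for m in muts if m in drivers)
    return {m for m, c in counts.items() if 1 < c < len(glands)}


# Hypothetical toy polyp with three glands (labels are illustrative only).
polyp = {
    "gland_1": {"m1", "m2", "FAT3_mut"},
    "gland_2": {"m1", "m3", "FAT3_mut"},
    "gland_3": {"m1", "m4"},
}
drivers = {"FAT3_mut", "APC_mut"}

trunk = trunk_mutations(polyp)
clonal_driver_in_trunk = bool(trunk & drivers)    # False: consistent with polyclonal initiation
subclonal_drivers = partially_shared(polyp, drivers)  # drivers marking expansion within the lesion
```

In this toy case the trunk carries only a passenger mutation, while the driver is shared by two of three glands, mirroring the pattern described for P1 and P4.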
6.1.4. Evolution of Cell States and Intercellular Interactions
- scRNA-seq Landscape: scRNA-seq of 9 AOM/DSS neoplasms identified 8 main cell types and 26 subclusters (Extended Data Fig. 8a-c).
- Cell Type Abundance Changes: As lesion clonality (1/Np) increased (i.e., lesions became more monoclonal), there were marked increases in macrophage, neutrophil, and endothelial cell proportions, and decreases in neoplastic epithelial cells (Extended Data Fig. 8d,e).
  - Specific subclusters, including macrophages overlapping with immunosuppressive LA_TAMs and Reg_TAMs, were enriched with increasing clonality (Fig. 4b,c), suggesting that an immunosuppressive microenvironment contributes to monoclonal transition.
  - High-clonality lesions showed upregulation of cancer-associated hallmarks (e.g., MYC targets, KRAS signaling, EMT) (Extended Data Fig. 8g).
- Intercellular Communication:
  - Cell-cell communication analysis revealed a significant elevation of ligand-receptor interactions between epithelial subtypes in early polyclonal lesions compared to normal colons and late monoclonal lesions (Fig. 4d,e, Extended Data Fig. 9).
  - 14 ligand-receptor interactions were significantly enriched in early polyclonal lesions, primarily involved in extracellular matrix (ECM) organization and cell adhesion (e.g., laminins, CDH1, SEMA4) (Fig. 4f).
  - Neoplastic cells contributed about 40% of these enriched interactions and exhibited pro-inflammatory characteristics (Extended Data Fig. 10a-c).
  - These findings strongly suggest that extensive intercellular cooperation, potentially via ECM and cell adhesion, is a hallmark of early inflammation-driven intestinal tumorigenesis.

The following figure (Figure 4 from the original paper) shows the model of intercellular interactions and polyclonal-to-monoclonal evolution:
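The image is a chart showing, for Apc-mutant mouse models, the evolutionary map of normal and pre-neoplastic cells together with higher-resolution cell lineages. It includes microscopy images of intestinal organoids at different time points, depicting the polyclonal-to-monoclonal transition, and uses a formula to describe changes in mutation burden.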
Fig. 4 | Intercellular interactions and polyclonal-to-monoclonal evolution model. a, scRNA-seq identifies 26 cell subclusters. b, Beeswarm plot of differential abundance for the cell subclusters along the increase of lesion clonality (measured by 1/Np). Each point represents a neighbourhood that contains a group of cells with similar transcriptomes. Cell neighbourhoods passing the spatial FDR threshold are highlighted in red for decreased abundance and blue for increased abundance. c, Subclusters of macrophages and the expression of Trem2, signatures of LA_TAMs and Reg_TAMs. d, Cell-cell communication between neoplastic epithelial subclusters inferred by CellChat. The nodes represent epithelial subclusters. The thickness of edges represents the average number of ligand-receptor interactions between every two subclusters from the 50 downsamplings (Methods). e, Correlation between lesion clonality and the average number of ligand-receptor interactions. Spearman's ρ and P value are shown. f, A total of 14 ligand-receptor interactions significantly enriched in the early polyclonal lesions relative to late lesions belong to 4 pathways: laminin, desmocollin (DSC), CDH1 and SEMA4. Data are mean ± s.e.m. g, Schematic of polyclonal origins and monoclonal transition in early intestinal tumorigenesis. Each precancerous lesion is founded by many lineages undergoing parallel clonal expansions and strong inter-clonal interactions. The gradual loss of inter-clonal interactions and microenvironmental changes might facilitate subsequent polyclonal-to-monoclonal transition. Subclonal selection remains stringent after monoclonal transition, where malignant transformation requires a further clonal sweep.
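
Panel e relates lesion clonality (1/Np) to the average number of ligand-receptor interactions across downsamplings using Spearman's rank correlation. The sketch below shows that calculation in plain Python under stated assumptions: the lesion names, Np values, and interaction counts are invented for illustration, and the helper functions are not taken from the paper or from CellChat.

```python
from statistics import mean
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """Rank values from 1..n, giving tied values the mean of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        tied_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = tied_rank
        i = j + 1
    return ranks


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman_rho(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman's rho is the Pearson correlation of the rank-transformed data."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical lesions: Np and interaction counts from repeated downsamplings.
lesions = {
    "lesion_A": {"Np": 40, "interactions": [30, 34, 32]},
    "lesion_B": {"Np": 20, "interactions": [25, 27, 26]},
    "lesion_C": {"Np": 5, "interactions": [12, 15, 12]},
    "lesion_D": {"Np": 2, "interactions": [8, 7, 9]},
}
clonality = [1 / v["Np"] for v in lesions.values()]
avg_interactions = [mean(v["interactions"]) for v in lesions.values()]
rho = spearman_rho(clonality, avg_interactions)
```

With these toy numbers the interaction count falls monotonically as clonality rises, so rho is strongly negative, which is the direction of the relationship the caption describes.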
6.2. Data Presentation (Tables)
The provided full content of the research paper does not include any tables within the main text or figures. All tabular data are referenced as Supplementary Tables (Supplementary Tables 1-16), which are not part of the provided text. Therefore, no tables can be transcribed here.
6.3. Ablation Studies / Parameter Analysis
While the paper does not present traditional ablation studies where specific components of a proposed model are removed to assess their contribution, it performs several analyses to validate the robustness of its methods and explore the impact of specific features:
- Robustness of Np Estimation: The number of founding progenitors (Np) estimated by TarCA was tested for robustness by performing 20 downsamplings of cells for each lesion. The consistency of Np estimates across these downsamplings (Fig. 2e) indicates that the quantification is reliable and not sensitive to cell sampling depth.
- Impact of Hotspot Mutations: The robustness of Np estimates was also confirmed against hotspot mutation events (Supplementary Fig. 13b), demonstrating that sites with unusually high mutation rates did not unduly influence the overall Np calculation.
- Comparison with Previous Lineage Tracing Methods: The branching index of SMALT trees was quantitatively compared to CRISPR-Cas9 lineage trees from previous studies (Refs. 12, 14). This comparison (Fig. 2d) highlighted SMALT's superior resolution (3.3 times more internal branching events), justifying the choice of SMALT for this research.
- Cross-Validation of Cell-Cell Communication: The findings from the CellChat analysis of ligand-receptor interactions were validated using an orthogonal method, MultiNicheNetR (Supplementary Fig. 28). This indicates that the observed intercellular communication patterns are not an artifact of a single computational tool.
- Correlation Analyses: The paper extensively uses correlation analyses (e.g., Pearson's r and Spearman's ρ) to explore relationships between various parameters:
  - Correlation of per-site mutation frequency between mouse and Drosophila (Fig. 1e).
  - Correlation between proportions of different cell types (macrophages, neutrophils, endothelial, epithelial) and lesion clonality (1/Np) (Extended Data Fig. 8e).
  - Correlation between lesion clonality (1/Np) and the average number of ligand-receptor interactions (Fig. 4e).

  These analyses help to show how different cellular and genomic features co-vary with the progression of clonality.
- Spatial Computational Inference: The use of agent-based tumor simulations and approximate Bayesian computation (Supplementary Fig. 15) to estimate selection coefficients (s) in monoclonal AOM/DSS tumors (Supplementary Fig. 16) provides a computational validation of the strength of subclonal selection, which is a key driver of the polyclonal-to-monoclonal transition.

These analyses collectively demonstrate the robustness of the methods, validate key findings against alternative approaches or conditions, and explore the quantitative relationships between different biological parameters, much as traditional ablation studies would for specific model components.
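
The downsampling check described for Np can be sketched as follows. This is only an illustration of the procedure, not of TarCA itself: the estimator used here is the inverse Simpson index over lineage labels, chosen as a simple stand-in, and the lesion composition, sample size, and function names are assumptions.

```python
import random
from collections import Counter
from statistics import mean, pstdev
from typing import List


def effective_lineages(labels: List[str]) -> float:
    """Inverse Simpson index over lineage labels (a stand-in, NOT the TarCA estimator)."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 / sum((c / total) ** 2 for c in counts.values())


def downsample_estimates(labels: List[str], n_cells: int,
                         n_repeats: int = 20, seed: int = 0) -> List[float]:
    """Re-estimate the lineage count on repeated random subsets of cells."""
    rng = random.Random(seed)
    return [effective_lineages(rng.sample(labels, n_cells)) for _ in range(n_repeats)]


# Hypothetical lesion: 10 founder lineages with 100 sampled cells each.
lesion = [f"lineage_{i}" for i in range(10) for _ in range(100)]

estimates = downsample_estimates(lesion, n_cells=300)
relative_spread = pstdev(estimates) / mean(estimates)  # small value = robust to sampling depth
clonality = 1 / mean(estimates)                        # clonality expressed as 1/Np
```

A small relative spread across the 20 subsets is the kind of consistency that Fig. 2e is described as showing; a large spread would indicate that the estimate depends on how many cells were sampled.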
7. Conclusion & Reflections
7.1. Conclusion Summary
This paper presents a groundbreaking study that comprehensively maps the single-cell phylogenies of colorectal precancerous lesions, revealing a consistent polyclonal-to-monoclonal transition in both mouse models and human sporadic polyps. Through the innovative use of the high-resolution SMALT DNA barcoding system, integrated with WES, WGS, and scRNA-seq, the authors demonstrate that early precancerous lesions are often founded by multiple independent cell lineages undergoing parallel clonal expansions (polyclonal origin). Monoclonal lesions, in contrast, represent a more advanced stage, characterized by higher mutation burdens, more driver mutations, and increased selective pressure. Crucially, the study uncovers extensive intercellular interactions, particularly involving ECM organization and cell adhesion, in these early polyclonal lesions, which significantly diminish during the transition to a monoclonal state. This highlights the vital role of intercellular cooperation in the initial phases of tumorigenesis. The findings provide a novel conceptual framework for understanding early cancer evolution, suggesting opportunities for intervention by targeting these initial cooperative interactions.
7.2. Limitations & Future Work
The authors acknowledge several limitations and suggest future research directions:
- Underestimation of Progenitors: The Np estimates might be lower than the actual number of progenitors, as many founding lineages could be lost during growth due to random drift and competition.
- Unclear Mechanisms of Driver Mutations: The paper notes that some driver mutations found frequently in polyps (e.g., BCL9L, SOX9) are less common in CRCs. The underlying mechanisms by which these early drivers might potentially impede malignant transformation remain unclear and require further investigation.
- Further Elucidation of Intercellular Interactions: While extensive intercellular interactions were identified in early polyclonal lesions, the precise mechanisms by which these interactions promote neoplastic growth require further study.
- Linking Cell States and Lineage Information: Future research should focus on unraveling the molecular crosstalk within the microenvironment by linking cellular states (e.g., using single-cell multiomics) with lineage information in the same cells. This would provide a more direct causal link between cellular phenotype and evolutionary trajectory.
- Predictive Modeling for Cancer Risk: Efforts are needed to formulate a predictive model that can forecast cancer risk based on the molecular and evolutionary features of premalignant lesions.
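
The first limitation above, that founding lineages can be lost to random drift so that Np is underestimated, can be illustrated with a toy neutral simulation. The sketch below is a Moran-type process with a fixed population size and no selection; it is an assumption-laden illustration and not the agent-based model used in the paper.

```python
import random


def surviving_lineages(n_founders: int, n_steps: int, seed: int = 0) -> int:
    """Neutral Moran process: at each step one random cell is replaced by a copy of another.

    Returns the number of founder lineages still present after n_steps.
    """
    rng = random.Random(seed)
    cells = list(range(n_founders))  # start with one cell per founder lineage
    for _ in range(n_steps):
        parent = rng.randrange(len(cells))
        dead = rng.randrange(len(cells))
        cells[dead] = cells[parent]  # the dead cell's slot inherits the parent's lineage
    return len(set(cells))


true_founders = 50
observed = surviving_lineages(true_founders, n_steps=500)
# 'observed' counts only lineages that survived drift, so it cannot exceed true_founders.
```

Even without any fitness differences, some lineages disappear by chance, so a count based on surviving lineages is a lower bound on the number of founders.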
7.3. Personal Insights & Critique
This paper is a tour de force in cancer evolutionary biology, representing a significant leap forward in understanding early tumorigenesis.
- Innovations:
  - The SMALT system: This is a major technical innovation. The high number of mutable sites compared to CRISPR-Cas9 based methods (~800 vs. 10-60) provides an unprecedented resolution for reconstructing cell phylogenies, which is critical for deciphering the nuances of polyclonal origins and early clonal dynamics. This technological advancement underpins much of the paper's success.
  - Multi-omics Integration: The integration of high-resolution lineage tracing with WES, WGS, and scRNA-seq in both mouse and human samples is exceptionally rigorous. This comprehensive approach allows for robust validation across species and provides both genetic and microenvironmental insights, offering a holistic view of the evolutionary process.
  - The "Polyclonal-to-Monoclonal Transition" Framework: This model provides a compelling and empirically supported conceptual framework that refines our understanding of early cancer evolution. It moves beyond a simple monoclonal/polyclonal dichotomy to illustrate a dynamic process driven by selection, offering a more realistic representation of disease progression.
- Implications & Transferability:
  - The finding that early lesions are often polyclonal, relying on extensive intercellular cooperation, is particularly insightful. It suggests that cancer initiation is not always a solitary event driven by a single advantageous clone, but can involve a "community effect" where multiple lineages collaborate to overcome initial growth barriers. This implies that early intervention strategies could focus on disrupting these cooperative interactions rather than solely targeting individual driver mutations.
  - The methods, particularly the high-resolution SMALT lineage tracing and the integrated multi-omics approach, are highly transferable. They could be adapted to study clonal dynamics in various other contexts, such as normal tissue homeostasis, regeneration, aging, and other diseases involving complex cellular interactions and evolution (e.g., inflammatory diseases, neurological disorders).
- Potential Issues & Areas for Improvement:
  - Causality of Intercellular Interactions: While scRNA-seq reveals strong correlations between extensive intercellular interactions and early polyclonal lesions, the precise causal mechanisms remain to be fully elucidated. Future experiments (e.g., in vitro co-culture systems or in vivo manipulation of specific ligand-receptor pathways) would be crucial to functionally demonstrate how these interactions promote polyclonal growth and how their loss facilitates monoclonal transition. The paper hints at a recruitment model, but direct evidence of how non-driver-mutated cells are supported by neighboring clones is an exciting next step.
  - Initial Cellular State: The paper defines progenitors as cells capable of founding clonal populations, but the specific cellular identity (e.g., stem cells, transit-amplifying cells, differentiated cells undergoing dedifferentiation) that gives rise to these initial polyclonal lineages is not deeply explored. Understanding this could further refine the initiation model.
  - Generalizability of the AOM/DSS Model: While a good model for inflammation-driven cancer, the specific inflammatory context might influence the observed intercellular interactions. It would be valuable to explore whether similar interaction patterns are seen in other non-inflammatory CRC models or other cancer types.
  - Predictive Model Development: The suggested future work of formulating a predictive model for cancer risk based on precancerous lesion features is ambitious but critical. Translating these complex single-cell evolutionary insights into clinically actionable biomarkers or risk assessment tools will be a significant challenge, requiring robust validation in larger human cohorts.

Overall, this paper provides a robust and deeply insightful view into the nascent stages of colorectal cancer, fundamentally reshaping our understanding of how these complex diseases begin and evolve. Its rigorous methodology and compelling findings lay a strong foundation for future research aimed at truly early cancer prevention.