Convergent genome evolution shaped the emergence of terrestrial animals

Published: 12 November 2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study compares 154 genomes from 21 animal phyla to weigh convergence against contingency across 11 independent terrestrialization events. Each transition shows its own pattern of gene gain and loss, yet similar biological functions recur, pointing to adaptations that are key to life on land. The study also establishes a timeline for these transitions.

Abstract

The challenges associated with the transition of life from water to land are profound; yet they have been met in many distinct animal lineages. This constitutes a series of independent evolutionary experiments from which we can decipher the role of contingency versus convergence in the adaptation of animal genomes. Here we compare 154 genomes from 21 animal phyla and their outgroups to reconstruct the protein-coding content of the ancestral genomes linked to 11 animal terrestrialization events, and to produce a timescale of terrestrialization. We uncover distinct patterns of gene gain and loss underlying each transition to land, but similar biological functions emerged recurrently, pointing to specific adaptations as key to life on land.

1. Bibliographic Information

1.1. Title

Convergent genome evolution shaped the emergence of terrestrial animals

1.2. Authors

Jialin Wei, Davide Pisani, Philip C. J. Donoghue, Marta Álvarez-Presas, Jordi Paps

1.3. Journal/Conference

Nature. The journal Nature is one of the most prestigious and highly influential scientific journals globally, publishing original research across all fields of science and technology. Its reputation ensures rigorous peer review and widespread recognition within the scientific community.

1.4. Publication Year

2025

1.5. Abstract

The transition of life from water to land presents significant evolutionary challenges, which have been overcome independently by multiple animal lineages. These events serve as natural experiments to understand the roles of contingency versus convergence in genomic adaptation. This study compares 154 genomes from 21 animal phyla and their outgroups to reconstruct the protein-coding content of ancestral genomes associated with 11 independent animal terrestrialization events, and to establish a timeline for these transitions. The research uncovers distinct patterns of gene gain and loss for each transition, yet recurrent emergence of similar biological functions suggests specific adaptations are crucial for life on land. Semi-terrestrial species are found to have evolved convergent functional patterns, in contrast to fully terrestrial lineages which followed more divergent paths. The timeline proposed identifies three major temporal windows for animal land colonization over the last 487 million years, each linked to specific ecological contexts. The study concludes that while each lineage exhibits unique adaptations, strong evidence of convergent genome evolution across the animal kingdom implies a largely predictable adaptive response to terrestrial life, connecting genes to ecosystems.

Publication Status: Officially published (Published online: 12 November 2025).

2. Executive Summary

2.1. Background & Motivation

The transition of life from water to land is one of the most profound evolutionary events, fundamentally shaping Earth's ecosystems and biodiversity. This terrestrialization has occurred multiple times independently across diverse animal lineages (e.g., arthropods, vertebrates, molluscs, nematodes). Each such event represents a unique "evolutionary experiment" in overcoming universal challenges like desiccation, temperature fluctuations, new modes of locomotion, respiration, and reproduction outside of water.

The core problem the paper aims to solve is to decipher the genomic underpinnings of these independent transitions. Prior research has noted widespread phenotypic convergence (e.g., water-retentive skin, adapted immune systems, changes in skeletal design) across terrestrial lineages, suggesting predictable responses to similar environmental pressures. At the genomic level, studies have linked gene innovation, duplication, and loss to major evolutionary transitions and identified specific genes (e.g., aquaporin-coding genes) or genomic changes associated with terrestrialization in individual lineages. However, compared to land plants, the comprehensive genomic basis of terrestrialization across multiple animal lineages remains largely uncharacterized.

The specific challenge is to move beyond single-lineage studies and conduct a broad, comparative genomic analysis across diverse animal phyla to determine whether terrestrialization primarily leads to lineage-specific, contingent genomic adaptations (unique solutions due to chance or historical context) or convergent (predictable, parallel solutions due to similar environmental pressures) changes. The paper's entry point is to leverage the independent nature of these terrestrialization events as a natural laboratory to explore this fundamental question of evolutionary biology: the role of contingency versus convergence in shaping animal genomes during adaptation to land.

2.2. Main Contributions / Findings

The paper makes several primary contributions:

  • Development of the InterEvo Framework: The study introduces and applies an intersection framework for convergent evolution (InterEvo) (Extended Data Fig. 1), a comparative genomics pipeline used to analyze 154 genomes from 21 animal phyla and their outgroups. This framework systematically identifies the intersection of biological functions between different sets of genes that were independently gained or reduced across 11 distinct animal terrestrialization events.
  • Identification of Convergent Functional Adaptations: Despite distinct patterns of gene gain and loss in each lineage, similar biological functions emerged recurrently across independent transitions. These convergent functions, driven by gene gains (novel, novel core, and expanded HGs), primarily involve osmoregulation (water transport), metabolism (especially fatty acids, linked to diet), reproduction, detoxification, sensory reception, and reaction to stimuli. Gene reductions also show convergent patterns, notably the loss of Dbl-homology domain and pleckstrin-homology domain gene families (related to Rho GTPases and regeneration) and contraction of chloride channel protein genes (osmoregulation).
  • Differentiation Between Semi- and Fully Terrestrial Lineages: The study reveals that semi-terrestrial species (e.g., rotifers, nematodes) evolved convergent functional patterns, characterized by an "expansive and versatile toolkit" for environmental flexibility (e.g., cuticle remodelling, visual development, stress response). In contrast, fully terrestrial lineages (e.g., land gastropods, arachnids, hexapods, tetrapods) followed more diverse genomic paths, displaying a "small and streamlined set" centered on neuronal development and ion membrane homeostasis, with limited functional convergence among themselves outside of arthropods.
  • Establishment of a Temporal Framework for Terrestrialization: The paper reconstructs a molecular evolutionary timescale (Fig. 1) that supports three major temporal windows of animal land colonization during the last 487 million years:
    • First Window (Middle Cambrian - Middle Ordovician): Associated with early land plants, including nematodes, myriapods, hexapods, and arachnids.
    • Second Window (Late Devonian - Early Carboniferous): Linked to episodic flooding and deepening soils, involving clitellate annelids and tetrapods.
    • Third Window (Cretaceous): Characterized by greenhouse landscapes, leading to the terrestrialization of bdelloid rotifers and land gastropods.
  • Interplay of Contingency and Predictability: The study concludes that adaptation to life on land involves both predictable (convergent) molecular responses to common challenges and lineage-specific (contingent) adaptations shaped by unique evolutionary histories, genomic backgrounds, and ecological contexts. This highlights the repeatability and uniqueness of evolutionary innovation.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

  • Comparative Genomics: A field of biological research in which the genomic features of different organisms are compared. It reveals similarities and differences in DNA, RNA, and protein sequences, as well as gene order, regulation, and other genomic structural features. This comparison helps understand evolutionary relationships, identify functionally important genes, and uncover adaptations.
  • Homology Groups (HGs): Groups of genes (or proteins) that share a common evolutionary ancestor. These can include orthologs (genes in different species that evolved from a common ancestral gene by speciation) and paralogs (genes within the same species that arose from a common ancestral gene by gene duplication). The paper uses OrthoFinder to cluster protein sequences into HGs.
  • Gene Ontology (GO): A collaborative bioinformatics initiative that provides a controlled vocabulary (ontology) of terms for describing gene products in any organism. It covers three domains: molecular function (what a gene product does at the molecular level), cellular component (where a gene product is active), and biological process (the larger processes or pathways to which a gene product contributes). GO terms allow for standardized functional annotation and comparison across species.
  • Pfam Protein Domains: A large collection of protein families, each represented by multiple sequence alignments and hidden Markov models (HMMs). Pfam domains are common, conserved parts of proteins that can function independently or in combination with other domains. Identifying Pfam domains in novel proteins can provide clues about their function.
  • Phylogenetic Tree (Phylogeny): A branching diagram or "tree" showing the inferred evolutionary relationships among various biological species or other entities—their phylogeny—based upon similarities and differences in their physical or genetic characteristics. It helps to visualize common ancestry and divergence over evolutionary time.
  • Gene Gain and Loss: Fundamental evolutionary processes where new genes emerge (gene gain, often through gene duplication or de novo gene birth) or existing genes are eliminated from a genome (gene loss). These processes are crucial drivers of genomic evolution and adaptation.
  • Gene Expansion and Contraction: Refers to changes in the number of copies of genes within a homology group in a particular lineage. Gene expansion means an increase in copy number, often suggesting a gene family has become more important or adapted to new functions. Gene contraction means a decrease in copy number.
  • Principal Component Analysis (PCA): A statistical procedure that transforms a set of possibly correlated variables into a set of linearly uncorrelated variables called principal components. It is used for dimensionality reduction and visualizing high-dimensional data, revealing underlying patterns and groupings.
  • Principal Coordinates Analysis (PCoA): Similar to PCA, PCoA is a method used to visualize the similarity or dissimilarity of data points. Unlike PCA, which operates on a raw data matrix, PCoA operates on a dissimilarity matrix (e.g., Jaccard distance), allowing it to handle non-Euclidean distances. It finds a low-dimensional ordination that best represents the distances between objects.
  • Permutational Multivariate Analysis of Variance (PERMANOVA): A non-parametric multivariate statistical test used to compare groups of objects based on a distance matrix. It tests whether the centroids of two or more groups differ, without assuming multivariate normality. Because the test is also sensitive to differences in within-group dispersion, it is usually paired with a separate test of dispersion homogeneity.
  • Molecular Clock: A technique in molecular evolution that uses the rate of accumulation of molecular changes (e.g., mutations in DNA or protein sequences) to estimate the time when two species diverged from a common ancestor. It assumes a relatively constant rate of molecular evolution over time, allowing scientists to date evolutionary events.

3.2. Previous Works

The paper contextualizes its research by referencing several key prior studies:

  • Phenotypic Adaptations:
    • Studies noting widespread convergent phenotypic adaptations in terrestrial animals, such as water-retentive skin or cuticle, adapted immune systems, changes in skeletal design, elevated metabolic rates, developmental adaptations (e.g., encapsulated larvae), and vision adaptation in aerial environments. These observations form the basis for hypothesizing genomic convergence.
  • Genomic Changes in Metazoan Evolution:
    • Research demonstrating that genomic changes, including gene innovation (e.g., Paps & Holland, 2018), duplication (e.g., Fernandez & Gabaldon, 2020), and loss (e.g., Guijarro-Clarke et al., 2020), were crucial to major metazoan evolutionary transitions. This sets the stage for investigating these genomic dynamics during terrestrialization.
  • Specific Genes and Lineage-Specific Terrestrialization:
    • Work linking specific genes, like aquaporin-coding genes, to terrestrialization in several clades (e.g., Martinez-Redondo et al., 2023).
    • Studies associating genomic changes with terrestrialization in individual lineages, such as molluscs (e.g., Aristide & Fernández, 2023), arthropods (e.g., Thomas et al., 2020), beetles (e.g., Balart-Garcia et al., 2023), and annelids (e.g., Vargas-Chavez et al., 2025). These studies often highlighted roles for genes related to metabolism, stress response, osmoregulation, and immunity.
  • Land Plants Terrestrialization:
    • Research on the genomic basis of terrestrialization in land plants (e.g., Bowles et al., 2020), which identified two bursts of genomic novelty linked to this transition. This serves as a comparative benchmark, noting that the genomic basis in animals is largely uncharacterized in comparison.

3.3. Technological Evolution

The field of evolutionary genomics, particularly concerning large-scale comparative analyses, has seen significant advancements driven by:

  • Next-Generation Sequencing (NGS): The exponential decrease in sequencing costs and increase in throughput has made it possible to sequence hundreds of genomes across diverse taxa, providing the raw data for comparative studies.

  • Bioinformatics Tools for Orthology Inference: Algorithms like OrthoFinder have been developed to accurately identify homology groups and orthologs across many species, overcoming challenges of gene duplication and loss.

  • Ancestral State Reconstruction: Methods to infer the gene content or other characteristics of ancestral genomes have become more sophisticated, allowing researchers to track gene evolution over deep time.

  • Phylogenomic Methods: Advancements in constructing highly resolved phylogenetic trees using genomic-scale datasets (e.g., IQ-TREE with complex models) provide a robust evolutionary framework for interpreting genomic changes.

  • Gene Family Evolution Models: Tools like CAFE (Computational Analysis of Gene Family Evolution) allow for statistical inference of gene family expansions and contractions along a phylogeny, accounting for varying evolutionary rates.

  • Functional Annotation Resources: Comprehensive databases like Gene Ontology (GO), Pfam, eggNOG, and PANTHER provide standardized classifications of gene functions, enabling large-scale functional enrichment analyses.

    This paper's work fits into the current state of technological evolution by integrating these advanced tools and large genomic datasets to perform a comprehensive, genome-wide comparative analysis across a broad range of animal phyla, a scale previously challenging to achieve for such a specific evolutionary transition.

3.4. Differentiation Analysis

Compared to previous works, the core differences and innovations of this paper's approach are:

  • Scope and Breadth of Comparison: While prior studies focused on genomic changes in specific lineages undergoing terrestrialization, this paper performs a genome-wide comparative analysis across 11 independent animal terrestrialization events spanning 21 animal phyla using 154 genomes. This unprecedented breadth allows for a systematic and robust investigation into widespread patterns of convergence versus contingency.

  • InterEvo Framework: The paper introduces a novel Intersection Framework for Convergent Evolution (InterEvo). This approach specifically looks for the intersection of biological functions across independently evolved gene sets (gained or reduced), rather than just identifying changes within individual lineages. This design directly addresses the question of functional convergence.

  • Combined Analysis of Gene Turnover: The study comprehensively analyzes gene gains (novel, novel core, expanded) and gene reductions (contracted, lost) simultaneously across all terrestrialization events. This integrated view provides a more complete picture of genome plasticity than focusing solely on one type of genomic change.

  • Distinction Between Semi- and Fully Terrestrial: A novel aspect is the explicit comparison of genomic adaptations between semi-terrestrial and fully terrestrial lineages, revealing distinct patterns of functional convergence and divergence between these two categories.

  • Molecular Timescale Integration: The study integrates a molecular clock analysis to establish a temporal framework for animal terrestrialization, linking genomic changes to specific paleoecological contexts (e.g., early land plants, seasonal wetlands, Cretaceous greenhouse).

    In essence, the paper moves beyond identifying "what changed" in individual lineages to systematically address "what converged functionally" across many independent events, and "when" these events occurred within Earth's history, offering a deeper understanding of the predictability of evolutionary adaptation.

4. Methodology

4.1. Principles

The core idea of the method used in this paper, termed the intersection framework for convergent evolution (InterEvo), is to systematically identify patterns of convergent evolution at the genomic level across multiple independent terrestrialization events in animals. The theoretical basis is that if similar environmental pressures drive adaptation, then independent lineages facing these pressures might evolve similar biological functions, even if the specific genes or genomic changes (gains, losses, expansions, contractions) are distinct. The intuition is that by comparing the protein-coding content of ancestral genomes before and after terrestrialization in many lineages, and then analyzing the functional annotations of the genes that changed, one can uncover recurrent adaptive strategies critical for life on land. The InterEvo workflow, depicted in Extended Data Fig. 1, integrates comparative genomics, homology inference, and functional annotation to detect these convergent evolution patterns.
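The intersection idea can be illustrated with a minimal sketch. It is not the authors' pipeline: the function name `convergent_functions`, the `min_lineages` threshold, and the toy lineages and terms are all hypothetical, chosen only to show how lineages that gained different genes can still share functional annotations.

```python
from collections import Counter


def convergent_functions(gained_terms_by_lineage, min_lineages=2):
    """Return functional terms gained independently in at least `min_lineages` lineages.

    gained_terms_by_lineage maps a lineage name to the set of functional terms
    (for example GO terms) annotated on the genes that lineage gained.
    """
    counts = Counter()
    for terms in gained_terms_by_lineage.values():
        counts.update(set(terms))  # count each term once per lineage
    return {term: n for term, n in counts.items() if n >= min_lineages}


# Hypothetical toy input: each lineage gained different genes,
# but some of their functional annotations overlap.
toy = {
    "lineage_A": {"water transport", "fatty acid metabolism"},
    "lineage_B": {"water transport", "detoxification"},
    "lineage_C": {"water transport", "detoxification", "sensory perception"},
}
shared = convergent_functions(toy, min_lineages=2)
```

In this toy case, `shared` keeps only the terms that recur across lineages, which is the sense in which the framework looks for convergence at the level of function rather than of individual genes.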

The following figure (Extended Data Fig. 1 from the original paper) shows the InterEvo (Intersection Framework for Convergent Evolution) workflow:

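This image is a schematic showing the classification of homology groups in animal genomes into novel, expanded, and lost/contracted homology groups. It also shows how Gene Ontology terms relate to functions, with the aim of reconstructing the ancestral genomes at the transition nodes.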

4.2. Core Methodology In-depth (Layer by Layer)

The methodology involves several sequential steps, from genome acquisition to functional analysis and timescale reconstruction.

4.2.1. Taxon Sampling and Homology Groups (HGs) Inference

  1. Genome Acquisition: The researchers compiled 154 genomes from public databases including UniProt, NCBI, and Ensembl. This dataset comprises 151 metazoan (animal) genomes and 3 unicellular holozoan genomes, serving as outgroups. These genomes collectively contained 3,934,362 predicted proteins. The sampling focused on species that flank 11 key terrestrialization events identified across 21 animal phyla (Fig. 1, Extended Data Fig. 2).
  2. Protein Processing:
    • A side script from OrthoFinder (primary_transcript.py) and Cd-hit v.4.8.1 were used to extract canonical proteins (the longest representative transcript for each gene) from the raw data.
    • Cd-hit was run with a similarity threshold of 1.00 to cluster identical protein sequences, ensuring that only unique canonical proteins are used.
  3. Genome Quality Assessment: The quality of the canonical proteins from all 154 genomes was assessed using BUSCO v.5.4.7 (Benchmarking Universal Single-Copy Orthologs). The preference was for genomes with completeness greater than 85% and fragmentation less than 15%. However, some genomes not perfectly meeting these criteria were included based on their habitat and phylogenetic importance.
  4. Homology Group (HG) Inference: HGs were inferred using OrthoFinder v.2.5.5. OrthoFinder is a widely used bioinformatics tool that identifies orthologs and paralogs (i.e., homology groups) across multiple species. It relies on MAFFT v.7.505 for multiple sequence alignment and DIAMOND v.2.1.8 for sequence similarity searches. The output HGs are groups of proteins that have distinctly diverged from other groups, comprising orthologues and/or paralogues.
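The protein processing and quality filtering steps above can be sketched in a few lines. This is a simplified stand-in, not the behaviour of primary_transcript.py, Cd-hit, or BUSCO: the function names, the record layout, and the toy sequences are hypothetical, and exact-duplicate removal only approximates clustering at a 1.00 similarity threshold.

```python
def longest_isoform_per_gene(records):
    """Keep the longest protein per gene.

    records: iterable of (gene_id, protein_id, sequence) tuples.
    Returns {gene_id: (protein_id, sequence)}.
    """
    best = {}
    for gene_id, protein_id, seq in records:
        if gene_id not in best or len(seq) > len(best[gene_id][1]):
            best[gene_id] = (protein_id, seq)
    return best


def collapse_identical(proteins):
    """Drop exact duplicate sequences, keeping the first gene that carries each one."""
    seen = set()
    unique = {}
    for gene_id, (protein_id, seq) in proteins.items():
        if seq not in seen:
            seen.add(seq)
            unique[gene_id] = (protein_id, seq)
    return unique


def passes_quality(complete_pct, fragmented_pct,
                   min_complete=85.0, max_fragmented=15.0):
    """Apply the stated preference: completeness > 85% and fragmentation < 15%."""
    return complete_pct > min_complete and fragmented_pct < max_fragmented


# Hypothetical toy records: gene g1 has two isoforms, g2 duplicates g1's longest.
records = [
    ("g1", "p1a", "MKT"),
    ("g1", "p1b", "MKTAYL"),
    ("g2", "p2", "MKTAYL"),
    ("g3", "p3", "MSS"),
]
canonical = collapse_identical(longest_isoform_per_gene(records))
```

Note that the source describes the quality thresholds as a preference rather than a hard cutoff, so a real workflow would flag failing genomes for manual review instead of discarding them automatically.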

4.2.2. Guide Tree Construction

  1. Conserved Gene Identification: The researchers started by identifying conserved single-copy genes within the Metazoa_odb10 dataset from BUSCO v.5.4.7. For this, Homo sapiens was used as a reference, identifying 943 such genes.

  2. Alignment and Trimming: The identified conserved protein sequences were aligned using MAFFT v.7.505 and then trimmed using trimAl v.1.4.rev.15 to remove poorly aligned regions that could introduce noise into phylogenetic inference.

  3. Supermatrix Concatenation: The trimmed alignments were concatenated into a single supermatrix using FASconCAT-G v.1.05.1. A supermatrix combines multiple gene alignments into a single, larger alignment, which is then used to infer a species tree.

  4. Phylogenetic Tree Building: The concatenated supermatrix was used to build the phylogeny (guide tree) with IQ-TREE v.2.2.2.6. The C60+G+I model was used for phylogenetic inference, and 1,000 bootstrap replicates were performed to assess tree robustness. The guide tree was used as a constraint, incorporating species positions inferred from previous literature to ensure a phylogenetically sound backbone. This resulting phylogeny, with branch lengths representing genetic changes, served as the input for CAFE5 and the molecular clock analysis.

    The following figure (Extended Data Fig. 2 from the original paper) shows the Species tree of the 154 sampled taxa:

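    This image is a phylogenetic tree showing the sampled animal phyla and their relationships. It displays the evolutionary relationships among the 154 genomes and the changes associated with the 11 terrestrialization events, making the patterns of gene gain and loss and the related biological adaptations visible.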
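To make the supermatrix step in the guide tree construction concrete, the following sketch concatenates per-gene alignments and pads taxa that lack a gene with gap characters. It only illustrates the concept; it is not how FASconCAT-G is implemented, and the function name, the data layout, and the toy alignments are hypothetical.

```python
def concatenate_alignments(alignments, missing_char="-"):
    """Concatenate per-gene alignments into one supermatrix.

    alignments: list of dicts {taxon: aligned_sequence}, one dict per gene.
    Returns (supermatrix, partitions), where supermatrix maps each taxon to its
    concatenated sequence and partitions lists 1-based (start, end) columns per gene.
    """
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    pieces = {taxon: [] for taxon in taxa}
    partitions = []
    start = 1
    for aln in alignments:
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within one alignment must have equal length")
        length = lengths.pop()
        for taxon in taxa:
            # Taxa missing from this gene are filled with gap characters.
            pieces[taxon].append(aln.get(taxon, missing_char * length))
        partitions.append((start, start + length - 1))
        start += length
    supermatrix = {taxon: "".join(parts) for taxon, parts in pieces.items()}
    return supermatrix, partitions


# Hypothetical toy alignments for two genes and three taxa.
gene1 = {"sp1": "MK-T", "sp2": "MKAT"}
gene2 = {"sp1": "LLV", "sp3": "LIV"}
matrix, parts = concatenate_alignments([gene1, gene2])
```

Recording the partition boundaries alongside the matrix is a common design choice because it lets downstream tools apply per-gene models if needed.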

4.2.3. Gene Content Analysis

The HG content for key nodes in the tree (including the 11 terrestrialization events and their ancestors) was reconstructed using a previously described approach (Paps and Holland, 2018; Guijarro-Clarke et al., 2020; Bowles et al., 2020). HGs were classified based on their mode of evolution:

  1. Novel HGs: These are HGs that are present in at least one species descending from the last common ancestor (LCA) of a lineage (referred to as a "node"), while being completely absent in all species of the outgroup (sister groups and other more distantly related aquatic relatives).

  2. Novel Core HGs: A more stringent category of novel HGs. These are HGs that are present in all species within a node (allowing for one absence if the node contains more than three species), while being absent in all species of the outgroup. For nodes with only two species, novel HGs are equivalent to novel core HGs.

  3. Lost HGs: These are HGs that are absent in all species within a specific node, but were present in its sister groups and other species in the outgroup.

  4. Expanded HGs: These HGs show a statistically significant increase in the number of gene copies within a lineage, often due to gene duplication events.

  5. Contracted HGs: These HGs show a statistically significant reduction in the number of gene copies within a lineage.

  6. Ancestral HGs: All HGs inferred to be present in a given ancestral node.

    The inference of novel, novel core, and lost HGs was performed using the Phylogenetically Aware Parsing Script pipeline developed by Paps and Holland.
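The presence/absence rules for novel, novel core, and lost HGs can be written as a small classifier. This is a sketch under stated assumptions, not the Phylogenetically Aware Parsing Script: the function name and species labels are hypothetical, and the "lost" rule is interpreted here as requiring a hit in at least one sister-group species and at least one other outgroup species.

```python
def classify_hg(present_in, ingroup, sister, other_outgroup):
    """Classify one homology group from the set of species that contain it.

    present_in: set of species in which the HG is found.
    ingroup: species within the node; sister / other_outgroup: outgroup species.
    Returns a set of labels drawn from {"novel", "novel core", "lost"}.
    """
    in_hits = present_in & ingroup
    out_hits = present_in & (sister | other_outgroup)
    labels = set()
    if in_hits and not out_hits:
        labels.add("novel")
        if len(ingroup) == 2:
            # Source: for two-species nodes, novel and novel core are equivalent.
            is_core = True
        else:
            # One absence is tolerated only if the node has more than three species.
            allowed_absences = 1 if len(ingroup) > 3 else 0
            is_core = len(ingroup) - len(in_hits) <= allowed_absences
        if is_core:
            labels.add("novel core")
    # Assumed reading of "lost": absent from the whole node, present in both
    # the sister group and the wider outgroup.
    if not in_hits and (present_in & sister) and (present_in & other_outgroup):
        labels.add("lost")
    return labels


# Hypothetical species sets for one node.
ingroup = {"a", "b", "c", "d"}
sister = {"s1", "s2"}
other = {"o1", "o2"}
core_example = classify_hg({"a", "b", "c"}, ingroup, sister, other)
lost_example = classify_hg({"s1", "o1"}, ingroup, sister, other)
```

Expanded and contracted HGs are not covered here because they depend on a statistical model of copy-number change rather than on presence/absence alone.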

For expanded and contracted HGs, the CAFE5 software (v.5.22) was used:

  • An ultrametric phylogenetic tree (a tree in which all tips are equidistant from the root, so that branch lengths are proportional to time) was generated from the IQ-TREE phylogeny using the ape, TreeTools, and phytools packages in R.
  • CAFE5 was run with a Poisson distribution and an error model.
  • Due to the large dataset, the phylogeny was split into three smaller trees: Lophotrochozoa, Ecdysozoa, and Deuterostomia.
  • For each smaller tree, CAFE5 was run with two-lambda and three-lambda models (models that allow for different rates of gene family evolution across different parts of the tree) ten times each to test for convergence of Model Base Final Likelihood (-lnL).
  • The model with the highest -lnL was selected, and a likelihood ratio test with a chi-squared distribution was used (via the lmtest package in R) to compare the two-lambda and three-lambda models. This indicated that three-lambda models were generally a better fit ($P < 0.001$) for Lophotrochozoa and Ecdysozoa. However, simulation tests within CAFE5 revealed that the three-lambda model for Deuterostomia fluctuated, making the two-lambda model more stable; it was therefore chosen as the better fit for this phylogeny.
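A likelihood ratio test between nested models such as these can be sketched as follows. The assumptions are explicit: -lnL is taken to be a negative log-likelihood, the three-lambda model is assumed to add exactly one free parameter over the two-lambda model (one degree of freedom), and the numeric values are hypothetical rather than taken from the paper. The source used the lmtest package in R; this sketch uses only Python's standard library.

```python
import math


def lrt_pvalue(neg_lnl_simple, neg_lnl_complex):
    """Likelihood ratio test for nested models differing by one parameter.

    Both arguments are negative log-likelihoods (-lnL).
    Returns (statistic, p_value) using a chi-squared distribution with 1 df.
    """
    # 2 * (lnL_complex - lnL_simple), written in terms of -lnL values.
    statistic = 2.0 * (neg_lnl_simple - neg_lnl_complex)
    if statistic <= 0:
        return statistic, 1.0  # the complex model does not fit better
    # For 1 degree of freedom, the chi-squared survival function
    # reduces to erfc(sqrt(x / 2)).
    return statistic, math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical -lnL values for a two-lambda and a three-lambda run.
stat, p = lrt_pvalue(neg_lnl_simple=105000.0, neg_lnl_complex=104990.0)
```

The closed form with `math.erfc` only holds for one degree of freedom; comparing models that differ by more parameters would need a general chi-squared survival function.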

4.2.4. Novel Core HG Validation

To ensure the robustness of the identified novel core HGs, a validation step was performed using BLASTp v.2.14.0+.

  • Novel core HGs were searched against the NCBI RefSeq database (downloaded 23 August 2023), which contains a broad range of high-quality molecular sequences.
  • The crucial step was to exclude protein sequences from the in-groups (the terrestrial nodes themselves) from the RefSeq search using the "-negative_taxidlist" option. This ensures that any significant hits found are from species outside the target lineage, confirming the novelty of the HGs to the specific terrestrial node.
  • The results showed very weak hits for the vast majority of sequences, with e-values > $10^{-10}$ and identity < 50%, confirming the true novelty of these HGs to the terrestrial lineages.

4.2.5. Permutation Test Analysis

Two permutation tests were conducted to statistically validate key findings:

  1. Novel HGs Gain Rate:

    • Objective: To determine if the rate of novel gene emergence per million years (Myr) is significantly higher in terrestrial nodes compared to aquatic nodes.
    • Method:
      • The rate of novel HGs (total novel HGs / total divergence time) was calculated for the 11 terrestrial nodes.
      • 11 aquatic nodes (e.g., Actinopterygii, Bivalvia) were randomly selected as a comparison group.
      • The observed total evolutionary rate for terrestrial nodes ($R_{terr} = 4.900$) was recorded.
      • A permutation test with 10,000 bootstrap draws was performed. In each permutation, 11 aquatic nodes were sampled (with replacement), their evolutionary rate ($R_{boot}$) was recalculated, and the value recorded. This generated a null distribution of novel gene rates for aquatic nodes.
      • The empirical one-tailed P-value was the proportion of bootstraps where $R_{boot} \ge R_{terr}$.
    • Result: The observed novel gene rates found in terrestrial lineages were significantly higher than in aquatic nodes ($P = 0.0015$) (Extended Data Fig. 3a).
  2. Functional Repertoire:

    • Objective: To assess if the GO term composition of terrestrial lineages (derived from novel genes) significantly differs from that of aquatic lineages.

    • Method:

      • Aquatic lineages with the largest taxon sampling, chosen at random (e.g., Actinopterygii, Bivalvia), were included as the comparison group.
      • The GO term profile derived from novel genes for each lineage was encoded as a binary presence/absence matrix.
      • The dissimilarity between terrestrial and aquatic GO term profiles was quantified using Jaccard distance (proportion of non-shared terms).
      • A permutation test was run 10,000 times. In each permutation, the aquatic/terrestrial labels were randomly reshuffled across lineages, two group profiles were rebuilt, and the Jaccard distance between them was recalculated. This generated a null distribution.
      • The empirical P-value was the proportion of permutations where the distance was greater than or equal to the observed distance.
    • Result: The biological functions in terrestrial nodes were significantly different from those in other nodes (observed Jaccard distance = 0.583, $P = 0$, meaning that none of the 10,000 permutations reached the observed distance) (Extended Data Fig. 3b).

      Both analyses were conducted in R using the vegan, car, and ggplot2 packages.
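The two resampling tests described above can be sketched with the standard library. This is an illustration, not the authors' R code: all function names and toy inputs are hypothetical, and building a group profile as the union of its lineages' GO terms is an assumption about how the "two group profiles were rebuilt".

```python
import random


def bootstrap_rate_pvalue(aquatic_nodes, observed_rate, n_draws=10_000, seed=0):
    """One-tailed empirical P-value for the novel-HG rate.

    aquatic_nodes: list of (novel_hg_count, divergence_time_myr) tuples.
    Each draw resamples the nodes with replacement and recomputes
    total novel HGs / total divergence time.
    """
    rng = random.Random(seed)
    k = len(aquatic_nodes)
    hits = 0
    for _ in range(n_draws):
        sample = [rng.choice(aquatic_nodes) for _ in range(k)]
        rate = sum(n for n, _ in sample) / sum(t for _, t in sample)
        if rate >= observed_rate:
            hits += 1
    return hits / n_draws


def jaccard_distance(a, b):
    """Proportion of non-shared terms between two term sets."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)


def label_permutation_pvalue(terrestrial, aquatic, n_perm=10_000, seed=0):
    """Shuffle habitat labels across lineages and compare group-level Jaccard distances.

    terrestrial, aquatic: lists of GO-term sets, one set per lineage.
    Group profile = union of its lineages' terms (an assumption of this sketch).
    """
    rng = random.Random(seed)
    observed = jaccard_distance(set().union(*terrestrial), set().union(*aquatic))
    pooled = list(terrestrial) + list(aquatic)
    n_terr = len(terrestrial)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        d = jaccard_distance(set().union(*pooled[:n_terr]),
                             set().union(*pooled[n_terr:]))
        if d >= observed:
            hits += 1
    return observed, hits / n_perm


# Hypothetical toy inputs.
p_rate = bootstrap_rate_pvalue([(10, 10.0), (20, 20.0)], observed_rate=4.9, n_draws=1000)
obs, p_func = label_permutation_pvalue(
    [{"x"}, {"x", "y"}], [{"z"}, {"z", "w"}], n_perm=1000
)
```

Fixing the random seed makes the empirical P-values reproducible, which is useful because they otherwise vary slightly between runs.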

4.2.6. Functional Annotation and Enrichment Analysis

  1. Representative Species Selection: For each of the 11 terrestrial events, a single representative species was chosen for detailed functional annotation (e.g., Homo sapiens for Tetrapoda, Drosophila melanogaster for Hexapoda).
  2. Pfam and GO Annotation: eggNOG-mapper v.2 was applied online with default parameters to annotate Pfam domains and GO terms for the HGs of interest. UniProt was used to confirm gene names and PANTHER 19.0 to classify genes by protein class.
  3. GO Enrichment Analysis:
    • Objective: To find overrepresented GO terms in novel and expanded HGs of terrestrial events.
    • Background: The GO terms hitting all HGs present in the LCA of Bilateria were used as the background. This ensures a consistent and broad evolutionary context for comparison.
    • Method: A Fisher's exact test was performed to compare the number of HGs hitting each GO term between the terrestrial events and the bilaterian background.
    • Correction: P-values for multiple comparisons were corrected using the Benjamini-Hochberg method.
    • Significance: GO terms with adjusted P-values < 0.05 were considered significantly enriched.
  4. Differential Functional Term Presence (Semi vs. Fully Terrestrial):
    • Objective: To identify biological functions that significantly differentiate semi-terrestrial and fully terrestrial groups after the PCoA analysis.
    • Method: Using binary presence/absence matrices of GO terms or Pfams, a two-tailed Fisher's Exact Test was conducted in R for every feature (term). Functional terms lacking variability (present in all or none) were discarded.
    • Background: The marginal totals across the entire pool of species served as the background for the test.
    • Correction: P-values were corrected using the Benjamini-Hochberg method.
    • Significance: Adjusted P-values < 0.05 indicated significant enrichment, with terms at $P < 0.01$ specifically reported. Terms present in at most 10% of both groups were excluded for biological relevance.
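The enrichment procedure (a Fisher's exact test per term followed by Benjamini-Hochberg correction) can be sketched without external libraries. The function names and toy counts are hypothetical, and the test is written as one-sided (overrepresentation only), which is an assumption of this sketch; the semi- versus fully terrestrial comparison in the source is two-tailed.

```python
from math import comb


def fisher_enrichment_p(fg_with, fg_without, bg_with, bg_without):
    """One-sided Fisher's exact test for overrepresentation of a term.

    fg_with / fg_without: foreground HGs with / without the term.
    bg_with / bg_without: background HGs with / without the term.
    Returns P(X >= fg_with) under the hypergeometric distribution.
    """
    n_fg = fg_with + fg_without
    n_term = fg_with + bg_with
    total = n_fg + bg_with + bg_without
    upper = min(n_fg, n_term)
    favourable = sum(
        comb(n_term, i) * comb(total - n_term, n_fg - i)
        for i in range(fg_with, upper + 1)
    )
    return favourable / comb(total, n_fg)


def benjamini_hochberg(pvalues):
    """Return Benjamini-Hochberg adjusted P-values in the original order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        idx = order[rank - 1]
        running_min = min(running_min, pvalues[idx] * m / rank)
        adjusted[idx] = running_min
    return adjusted


# Hypothetical toy counts for one term, and three raw P-values to adjust.
p_term = fisher_enrichment_p(fg_with=3, fg_without=0, bg_with=0, bg_without=3)
adj = benjamini_hochberg([0.01, 0.04, 0.03])
significant = [p < 0.05 for p in adj]
```

Terms would then be reported as enriched when their adjusted value falls below the 0.05 threshold stated above.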

4.2.7. PCoA and PCA

  1. Principal Component Analysis (PCA):

    • Objective: To compare the distribution of GO terms linked to novel and ancestral HGs among semi-terrestrial and fully terrestrial lineages.
    • Method: PCA was conducted using the prcomp function in R. Species GO terms were plotted using the first two principal components (PC1 and PC2).
    • Statistical Analysis: ANOVA and Tukey's honest significant difference (HSD) test were performed on principal component scores to evaluate differences. MANOVA examined the combined effect on PC1 and PC2.
    • Visualization: Ellipses were generated (using normal distribution-based ellipse fitting) for semi-terrestrial and fully terrestrial groups to visualize clustering.
  2. Principal Coordinates Analysis (PCoA):

    • Objective: To quantify compositional differences in GO term and Pfam presence/absence profiles between semi-terrestrial and fully terrestrial species, especially considering that shared absences might bias Euclidean-based PCA.

    • Method:

      • Pairwise dissimilarities among species were computed using Jaccard distance (which focuses on shared presences) in the vegan R package.
      • PCoA was performed on the Jaccard distance matrix. The axes explain percentages of Jaccard distance variation.
    • Statistical Analysis: A PERMANOVA (adonis2 function) was performed on the Jaccard distances with 10,000 permutations to test for overall group differences. Homogeneity of multivariate dispersion was tested using the betadisper function to ensure PERMANOVA results were not driven by unequal within-group spread.

    • Visualization: Plots were generated using ggplot2, with group ellipses representing 95% concentration regions.

    • Result: PERMANOVA showed significant differences between semi- and fully terrestrial groups for GO terms ($R^2 = 0.0995$, $P < 0.01$) and Pfam domains ($R^2 = 0.0992$, $P < 0.01$), while group dispersions did not differ.

      The following figure (Fig. 5 from the original paper) shows the PCoA of GO terms and Pfam domains associated with novel genes in semi- and fully terrestrial species:

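      Fig. 5 | PCoA of GO terms and Pfam domains associated with novel genes in semi- and fully terrestrial species. a, PCoA of Jaccard dissimilarities based on GO terms presence/absence profiles. b, PCoA… The figure shows the PCoA results based on GO terms (left) and Pfam domains (right). Each point represents one of the 61 sampled terrestrial species, coloured by taxonomic group. Ellipses show the clustering of semi-terrestrial (orange) and fully terrestrial (green) species; the first and second principal coordinates explain 19.9% and 15.6% of the variation, respectively.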
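To make the PERMANOVA step concrete, the sketch below computes Jaccard distances from presence/absence profiles and then partitions the squared distances into among-group and within-group parts. It is a simplified standard-library illustration, not the vegan `adonis2` implementation used in the paper; the species profiles, function names, and permutation count are all hypothetical.

```python
import random

def jaccard_distance(a, b):
    """Jaccard distance between two sets of present terms (shared absences are ignored)."""
    union = a | b
    return 0.0 if not union else 1.0 - len(a & b) / len(union)

def permanova_r2(dist, groups):
    """Return (R^2, pseudo-F) for a one-way design from a square distance matrix."""
    n = len(groups)
    labels = sorted(set(groups))
    # Total sum of squares: squared distances over all pairs, divided by n.
    ss_total = sum(dist[i][j] ** 2 for i in range(n) for j in range(i + 1, n)) / n
    ss_within = 0.0
    for g in labels:
        idx = [i for i in range(n) if groups[i] == g]
        ss_within += sum(dist[i][j] ** 2 for k, i in enumerate(idx) for j in idx[k + 1:]) / len(idx)
    ss_among = ss_total - ss_within
    if ss_within == 0:
        return ss_among / ss_total, float("inf")
    pseudo_f = (ss_among / (len(labels) - 1)) / (ss_within / (n - len(labels)))
    return ss_among / ss_total, pseudo_f

def permanova(dist, groups, permutations=999, seed=0):
    """Permutation p-value for the pseudo-F statistic (labels shuffled across samples)."""
    rng = random.Random(seed)
    r2, f_obs = permanova_r2(dist, groups)
    shuffled = list(groups)
    hits = 0
    for _ in range(permutations):
        rng.shuffle(shuffled)
        if permanova_r2(dist, shuffled)[1] >= f_obs:
            hits += 1
    # The observed arrangement is counted as one of the permutations.
    return r2, f_obs, (hits + 1) / (permutations + 1)

# Hypothetical GO-term profiles for three semi- and three fully terrestrial species.
profiles = [{"a", "b", "c"}, {"a", "b", "d"}, {"a", "b"},
            {"x", "y", "z"}, {"x", "y"}, {"w", "x", "y", "z"}]
groups = ["semi", "semi", "semi", "full", "full", "full"]
dist = [[jaccard_distance(p, q) for q in profiles] for p in profiles]
r2, pseudo_f, p_value = permanova(dist, groups)
print(f"R^2 = {r2:.3f}, pseudo-F = {pseudo_f:.2f}, P = {p_value:.3f}")
```

With only six samples the permutation p-value cannot become very small, which is one reason the paper's 61-species dataset and 10,000 permutations matter.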

4.2.8. Molecular Clock

Molecular clock analysis was performed using a two-step approach in MCMCTree (part of the PAML package).

  1. Step 1: Branch Length Estimation (CODEML)

    • The previously described concatenated alignment of 943 conserved orthologous genes (generated from BUSCO genes using MAFFT, trimAl, and FASconCAT-G) was used.
    • CODEML was used to estimate branch lengths by maximum likelihood. This calculates the gradient and Hessian of the likelihood function at the maximum likelihood estimates.
    • The Empirical+F model (model = 3) and an independent rates clock model (clock = 2) were applied. The Empirical+F model is a protein substitution model that combines an empirical amino acid exchange-rate matrix with amino acid frequencies estimated from the alignment. An independent rates clock model allows evolutionary rates to vary across the branches of the phylogeny.
  2. Step 2: Divergence Time Estimation (MCMCTree)

    • MCMCTree was executed to estimate divergence times.

    • The same independent rates clock model was used.

    • A discrete gamma distribution with 4 categories and a shape parameter alpha = 0.5 was used to model rate variation among sites.

    • The prior for the substitution rate was determined based on the approximate root age (591.255 Ma), resulting in a gamma distribution with shape α = 2 and scale β = 5.1.

    • The Markov chain Monte Carlo (MCMC) was run for approximately 20 million generations, with the first 100,000 generations discarded as burn-in. Samples were collected every 1,000 generations to obtain 20,000 samples.

    • Six independent MCMC runs were performed to ensure convergence and reliability.

    • Tracer v.1.7.2 was used to assess convergence, with effective sample sizes (ESS) exceeding 200 for all parameters across all runs.

    • The fourth run was selected for final divergence time estimates based on consistency.

      The results of the molecular clock analysis are presented in Figure 1, illustrating the temporal windows of terrestrialization.
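The sampling schedule quoted above can be checked with a little arithmetic. The sketch assumes, as in MCMCTree's control file, that the chain length is the burn-in plus the sampling interval multiplied by the number of retained samples; the variable names are illustrative.

```python
burn_in = 100_000      # generations discarded before sampling starts
sample_every = 1_000   # sampling interval, in generations
n_samples = 20_000     # samples retained per run

total_generations = burn_in + sample_every * n_samples
burn_in_fraction = burn_in / total_generations

print(total_generations)          # 20100000, i.e. "approximately 20 million"
print(f"{burn_in_fraction:.2%}")  # share of the chain spent on burn-in
```

Under that assumption the burn-in is a very small share of each run, so convergence rests mainly on the six independent runs and the ESS check rather than on discarding a long initial segment.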

5. Experimental Setup

5.1. Datasets

The study used a dataset of 154 genomes (Supplementary Table 1 in the original paper).

  • Source: These genomes were compiled from publicly available databases including UniProt, NCBI, Ensembl, and other resources.
  • Scale and Characteristics: The dataset included 151 metazoan genomes and 3 unicellular holozoan genomes (serving as outgroups). These represent 21 animal phyla, covering a broad diversity of animals, with a specific focus on species that flank the 11 key terrestrialization events identified in the study.
  • Quality Control: The quality of the protein sequences derived from these genomes was assessed using BUSCO v.5.4.7. Genomes with completeness greater than 85% and fragmentation less than 15% were preferred, though some genomes not perfectly meeting these criteria were included if deemed important for the phylogenetic and habitat context.
  • Why these datasets were chosen: This extensive and diverse sampling allowed the researchers to:
    • Reconstruct ancestral genomes with high confidence.

    • Identify homology groups across a wide range of evolutionary distances.

    • Cover multiple independent terrestrialization events, providing the necessary comparisons for discerning convergent versus contingent adaptations.

    • Establish a robust molecular timescale for these events.

      The paper does not provide a concrete example of a data sample like a specific gene sequence or annotation entry, but the nature of the data is protein sequences from various animal genomes.

5.2. Evaluation Metrics

The paper employs a range of statistical and biological metrics to evaluate its findings:

  1. P-value (P):

    • Conceptual Definition: The probability of observing a test statistic as extreme as, or more extreme than, the one observed, assuming the null hypothesis is true. A small P-value (typically < 0.05) indicates that the observed result is unlikely under the null hypothesis, leading to its rejection.
    • Mathematical Formula: There is no single universal formula for a P-value, as it depends on the statistical test being performed. For a test statistic $T$ and observed value $t_{obs}$: $P = P(T \ge t_{obs} \mid H_0)$ (for a one-tailed test; two-tailed tests are analogous).
    • Symbol Explanation:
      • $P$: Probability.
      • $T$: The test statistic (e.g., F-statistic, t-statistic, chi-squared value, PERMANOVA pseudo-F).
      • $t_{obs}$: The observed value of the test statistic from the data.
      • $H_0$: The null hypothesis.
    • Usage in Paper:
      • Permutation test for novel gene rates: $P = 0.0015$ (terrestrial rates significantly higher).
      • Permutation test for functional repertoire: $P = 0$ (terrestrial GO term composition significantly different).
      • CAFE5 model selection: $P < 0.001$ (three-lambda models better fit than two-lambda).
      • GO enrichment analysis: Adjusted $P < 0.05$ (significantly enriched GO terms after Benjamini-Hochberg correction).
      • PERMANOVA: $P < 0.01$ (significant differences between semi- and fully terrestrial groups).
      • Differential functional term presence: Adjusted $P < 0.05$ (significantly enriched GO/Pfam terms between semi- and fully terrestrial).
  2. R-squared ($R^2$):

    • Conceptual Definition: In the context of PERMANOVA, $R^2$ represents the proportion of the total variance in the distance matrix that is explained by the grouping variable (e.g., habitat type). A higher $R^2$ indicates that the grouping variable accounts for a larger proportion of the observed differences between samples.
    • Mathematical Formula: For PERMANOVA, it is typically calculated as: $R^2 = \frac{SS_{among}}{SS_{total}}$
    • Symbol Explanation:
      • $R^2$: The proportion of variance explained.
      • $SS_{among}$: Sum of squares between groups (variation explained by the grouping factor).
      • $SS_{total}$: Total sum of squares (total variation in the distance matrix).
    • Usage in Paper:
      • PERMANOVA for GO terms: $R^2 = 0.0995$ (9.95% of GO term profile variance explained by habitat).
      • PERMANOVA for Pfam domains: $R^2 = 0.0992$ (9.92% of Pfam profile variance explained by habitat).
  3. Log-likelihood (-lnL):

    • Conceptual Definition: A measure used in maximum likelihood estimation to assess how well a statistical model fits the observed data. A higher log-likelihood (or a smaller negative log-likelihood, -lnL) indicates a better fit of the model to the data. It is used for model comparison, often through likelihood ratio tests.
    • Mathematical Formula: The likelihood function $L(\theta \mid x)$ represents the probability of observing the data $x$ given the model parameters $\theta$. The log-likelihood is $\ln L(\theta \mid x)$. -lnL is simply the negative of this value.
    • Symbol Explanation:
      • $L$: Likelihood function.
      • $\theta$: Model parameters.
      • $x$: Observed data.
    • Usage in Paper: Used in CAFE5 to compare the fit of two-lambda and three-lambda models.
  4. Jaccard Distance:

    • Conceptual Definition: A measure of dissimilarity between two sets. It is calculated as one minus the Jaccard similarity coefficient, which is the size of the intersection divided by the size of the union of the sets. A Jaccard distance of 0 indicates identical sets, while 1 indicates completely disjoint sets. It is particularly suitable for binary presence/absence data.
    • Mathematical Formula: For two sets $A$ and $B$, the Jaccard similarity coefficient $J(A,B)$ is: $J(A,B) = \frac{|A \cap B|}{|A \cup B|}$. The Jaccard distance $D_J(A,B)$ is then: $D_J(A,B) = 1 - J(A,B) = \frac{|A \cup B| - |A \cap B|}{|A \cup B|}$
    • Symbol Explanation:
      • $A$, $B$: The two sets being compared (e.g., GO term profiles of two lineages).
      • $|\cdot|$: Cardinality (number of elements) of a set.
      • $\cap$: Intersection of sets (elements common to both).
      • $\cup$: Union of sets (all unique elements in either set).
    • Usage in Paper: Used to quantify dissimilarity between GO term and Pfam presence/absence profiles of terrestrial and aquatic species, and between semi-terrestrial and fully terrestrial species for PCoA.
  5. Effective Sample Size (ESS):

    • Conceptual Definition: A measure used to assess the convergence and quality of samples generated by Markov Chain Monte Carlo (MCMC) algorithms. It estimates the number of independent samples equivalent to the correlated samples generated by the MCMC chain. An ESS value generally greater than 200 (as used in the paper) indicates good mixing and adequate sampling for reliable parameter estimation.
    • Mathematical Formula: While the exact formula is complex and involves autocorrelation functions, conceptually it is: $ESS = \frac{N}{1 + 2 \sum_{k=1}^{\infty} \rho_k}$
    • Symbol Explanation:
      • $N$: Total number of samples in the MCMC chain.
      • $\rho_k$: Autocorrelation at lag $k$.
    • Usage in Paper: Used to assess the convergence of MCMCTree runs for molecular clock analysis, ensuring ESS values exceeded 200 for all parameters.
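The ESS formula above has an infinite sum, so any practical estimate has to truncate it. The sketch below cuts the sum off at the first non-positive autocorrelation, which is one simple convention and an assumption here; Tracer's estimator differs in detail, and the function name and example chains are hypothetical.

```python
def effective_sample_size(samples):
    """Estimate ESS = N / (1 + 2 * sum of positive-lag autocorrelations)."""
    n = len(samples)
    mean = sum(samples) / n
    centered = [x - mean for x in samples]
    variance = sum(c * c for c in centered) / n
    if variance == 0:
        return float(n)  # a constant chain carries no autocorrelation signal
    rho_sum = 0.0
    for lag in range(1, n):
        rho = sum(centered[i] * centered[i + lag] for i in range(n - lag)) / (n * variance)
        if rho <= 0:  # truncate at the first non-positive autocorrelation
            break
        rho_sum += rho
    return n / (1 + 2 * rho_sum)

# A slowly drifting chain is strongly autocorrelated, so its ESS is far below N.
drifting = [i / 100 for i in range(100)]
# An alternating chain has negative lag-1 autocorrelation, so nothing is subtracted.
alternating = [1.0, -1.0] * 50

print(effective_sample_size(drifting))
print(effective_sample_size(alternating))
```

The ESS > 200 threshold reported in the paper is applied per parameter, so a check like this would be repeated for every column of the MCMC output.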

5.3. Baselines

The paper's methodology involves comparisons against different "baselines" depending on the analysis context, rather than a single baseline model for the entire framework:

  • Ancestral States: For analyzing gene gains and losses (novel, novel core, lost HGs), the implicitly used baseline is the gene content of the immediate ancestral nodes to the terrestrialization events. This allows the researchers to identify genes that emerged or disappeared during the transition.

  • Aquatic Nodes: For the permutation tests, randomly selected aquatic nodes (e.g., Actinopterygii, Bivalvia, Cnidaria) served as the null distribution baseline to assess whether the observed novel gene gain rates and functional repertoire of terrestrial lineages were significantly different from what would be expected in non-terrestrial (aquatic) evolution.

  • Bilaterian Ancestral Genes: For GO enrichment analysis, the GO terms hitting all HGs present in the LCA of Bilateria were used as the background. This provides a broad, evolutionarily ancient baseline against which to identify overrepresented functions in the terrestrial lineages.

  • Two-lambda Model: In CAFE5 analysis, the two-lambda model was used as a baseline to compare against the three-lambda model using likelihood ratio tests, determining which model better fit the gene family evolution data.

    These various baselines are representative because they allow the researchers to isolate the genomic and functional changes specifically associated with terrestrialization by comparing them against relevant ancestral or non-terrestrial states, or against alternative statistical models.
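The comparison of a two-lambda baseline against a three-lambda model is a likelihood ratio test between nested models. A minimal sketch follows, assuming the models differ by exactly one free parameter (one extra lambda) and that the usual chi-squared approximation holds; the -lnL values are invented for illustration and are not taken from the paper.

```python
from math import erfc, sqrt

def likelihood_ratio_test(neg_lnl_simple, neg_lnl_complex):
    """LRT for nested models that differ by one free parameter (df = 1)."""
    # The statistic is twice the gain in log-likelihood from the extra parameter.
    statistic = max(2.0 * (neg_lnl_simple - neg_lnl_complex), 0.0)
    # Survival function of the chi-squared distribution with 1 degree of freedom.
    p_value = erfc(sqrt(statistic / 2.0))
    return statistic, p_value

# Hypothetical -lnL values for a two-lambda and a three-lambda run.
statistic, p_value = likelihood_ratio_test(neg_lnl_simple=125_430.2,
                                           neg_lnl_complex=125_398.7)
print(f"LRT statistic = {statistic:.1f}, P = {p_value:.2e}")
```

A small P favours the richer model, which matches the direction of the reported result (three-lambda preferred at P < 0.001); model stability is a separate question that the likelihood alone does not answer.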

6. Results & Analysis

6.1. Core Results Analysis

The study's results comprehensively detail the genomic changes and functional adaptations associated with animal terrestrialization, highlighting patterns of convergence and contingency, and establishing a temporal framework.

  • High Gene Turnover Characterizes Terrestrialization:

    • All 11 terrestrial lineages showed significant gene turnover, characterized by both gene gains (novel genes and expansions) and gene reductions (losses and contractions) (Fig. 2). This indicates genome plasticity as animals adapted to new environmental challenges.
    • Novelty (new genes) was particularly high in bdelloid rotifers, nematodes, tetrapods, and land gastropods (the latter primarily in gene expansions).
    • Gene reduction was pervasive, with Nematoda, Tardigrada, and Onychophora exhibiting the largest gene losses.
    • A permutation test confirmed that the rate of novel gene emergence per million years in terrestrial nodes was significantly higher than in aquatic nodes ($P = 0.0015$) (Extended Data Fig. 3a), validating the importance of gene gains in this transition.
    • Conversely, arachnids and hexapods showed lower levels of plasticity, suggesting gene co-option (repurposing existing genes) might have played a more dominant role in their adaptation.
  • Convergent Functions via Gene Gains:

    • Despite distinct patterns of gene gain at the HG level, functional annotation revealed strong convergence in the types of biological functions that emerged.
    • Novel HGs shared by at least 10 terrestrial nodes (118 GO terms and 26 novel core HGs) were involved in osmosis (water transport), metabolism (fatty acids, linked to diet), reproduction, detoxification, sensory reception, and reaction to stimuli (Fig. 3a, b, c).
    • The 55 most specific GO functions included locomotion, membrane ion transport, transporter activity, response to stimulus, neuronal functions, and various metabolic, reproductive, and developmental processes (Fig. 3b).
    • Pfam domains echoed these findings, recovering functions related to osmoregulation (e.g., neurotransmitter-gated ion channel), stimulus/neuronal functions (transmembrane receptor), and detoxification (cytochrome P450) (Fig. 3c).
    • Expanded HGs (genes predating the transition but increasing in copy number) also showed convergence in functions related to detoxification (e.g., cytochrome P450, flavin-containing monooxygenases, glutathione S-transferase), oxidative stress, metabolism, and reception of stimuli (e.g., G-protein-coupled receptor family) (Fig. 4a).
  • Gene Reduction Marks Land Adaptation:

    • Gene loss was numerically high in most terrestrial nodes (Fig. 2).
    • Notably, the Dbl-homology domain gene family was lost in 8 out of 11 terrestrial events, and the pleckstrin-homology domain gene family in 7 out of 11. Both domains occur in the guanine nucleotide exchange factors that regulate Rho GTPases, a signalling system involved in regeneration and wound healing, suggesting a convergent reduction in regenerative capacity as an adaptation to land.
    • Chlorophyllase protein family loss indicated dietary shifts, and Shugoshin C-terminal domain loss pointed to changes in reproduction.
    • Convergent contractions in HGs included chloride channel protein members (critical for osmoregulation), carbohydrate sulfotransferases (extracellular communication), and melatonin-related receptors (circadian rhythms) (Fig. 4b).
  • Semi Versus Fully Terrestrial Lineages:

    • PCoA based on GO terms and Pfam domains showed significant separation between semi-terrestrial and fully terrestrial groups ($P < 0.01$ for both) (Fig. 5), indicating distinct functional adaptations.
    • Semi-terrestrial species (e.g., bdelloid rotifers, nematodes, tardigrades) evolved an "expansive and versatile toolkit" emphasizing cuticle remodelling, visual development, and stress response. They showed broad functional convergence in areas like circulatory system development, osmoregulation, nutrient processing, muscle function, energy metabolism, detoxification, and sensory response.
    • Fully terrestrial species (e.g., land gastropods, arachnids, hexapods, tetrapods) displayed a "small and streamlined set" of novel gene functions centered on neuronal development and ion membrane homeostasis. They showed limited convergence among themselves, with most shared adaptations within arthropods (Myriapoda, Armadillidium, Hexapoda, Arachnida) likely stemming from similar ancestral toolkits.
  • Unique Adaptations in Terrestrial Events:

    • Beyond convergence, each lineage also exhibited unique adaptations. Examples include stress-response genes in bdelloid rotifers, nervous system and muscle adaptations in clitellates, shell formation and estivation genes in land snails, and cuticle-related genes in nematodes.
    • For arthropods, examples included exoskeleton wax layer synthesis genes and retinol-binding protein genes for vision adaptation. Hexapods showed enrichment in moulting and vision genes.
    • Tetrapods exhibited enriched novel and expanded genes related to immunity functions (e.g., T cell co-stimulation, innate immunity, neutrophil degranulation, mucins), supporting the evolution of specialized skin barriers against terrestrial pathogens.
  • Temporal Windows of Terrestrialization:

    • The molecular clock timescale (Fig. 1) identified three major temporal windows for animal land conquest:
      1. Middle Cambrian - Middle Ordovician (~534-403 Ma): Coinciding with early land plants. Included nematodes, myriapods, hexapods, and arachnids. Adaptations focused on cuticle formation, exoskeleton maintenance, lipid metabolism, drought, light, and oxidative stress tolerance.
      2. Late Devonian - Early Carboniferous (~465-263 Ma for clitellates, ~351-338 Ma for tetrapods): A period of episodic flooding and seasonal wetlands. Clitellate annelids and tetrapods adapted. Adaptations included limbs, lungs, skin barriers (tetrapods), and nervous/muscular system enhancements (clitellates).
      3. Cretaceous Period (~181-39 Ma): A greenhouse landscape. Bdelloid rotifers and land gastropods diversified. Adaptations included extreme stress tolerance (rotifers) and shell formation, mucus secretion, estivation (snails). Gene expansions in ammonium transporters, NADP-dependent oxidoreductases, G-protein-coupled receptors were convergent in this window.

6.2. Data Presentation (Tables)

The following are the results from Extended Data Table 1 of the original paper:

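Novel genes associated with terrestrialisation-linked GOs in human

| Gene Symbol | Protein Name | Protein Class | Biological Functions |
| --- | --- | --- | --- |
| APOA2 | Apolipoprotein A-II | transfer/carrier protein (PC00219) | lipid metabolism |
| IL27 | Interleukin-27 subunit alpha | intercellular signal molecule (PC00207) | immunity and response to stimuli |
| OSM | Oncostatin-M | intercellular signal molecule (PC00207) | immunity and response to stimuli |
| XCL1 | Lymphotactin | intercellular signal molecule (PC00207) | immunity and response to stimuli |
| XCL2 | Cytokine SCM-1 beta | intercellular signal molecule (PC00207) | immunity and response to stimuli |
| CXCL16 | C-X-C motif chemokine 16 | intercellular signal molecule (PC00207) | immunity and response to stimuli |
| TNFSF18 | Tumor necrosis factor ligand superfamily member 18 | intercellular signal molecule (PC00207) | immunity and response to stimuli |
| FLT3LG | Fms-related tyrosine kinase 3 ligand | intercellular signal molecule (PC00207) | immunity and response to stimuli |
| CD1A, CD1B, CD1C, CD1E | T-cell surface glycoprotein | defense/immunity protein (PC00090) | immunity and response to stimuli |
| CD1D | Antigen-presenting glycoprotein CD1d | defense/immunity protein (PC00090) | immunity and response to stimuli |
| TMIGD2 | Transmembrane and immunoglobulin domain-containing protein 2 | cell adhesion molecule (PC00069) | immunity and response to stimuli |
| PLAUR | Urokinase plasminogen activator surface receptor | transmembrane signal receptor (PC00197) | immunity and response to stimuli |
| LYPD3 | Ly6/PLAUR domain-containing protein 3 | transmembrane signal receptor (PC00197) | immunity and response to stimuli |
| MPIG6B | Megakaryocyte and platelet inhibitory receptor G6b | | blood cell function regulation |
| SPP1 | Osteopontin | intercellular signal molecule (PC00207) | bone regeneration |
| ENAM | Enamelin | structural protein (PC00211) | teeth development |
| GPR152 | Probable G-protein coupled receptor 152 | transmembrane signal receptor (PC00197) | retinal cell-to-cell communication |
| AKAP3, AKAP, KAP5 | A-kinase anchor protein | scaffold/adaptor protein (PC00226) | reproductive strategies |
| DKKL1 | Dickkopf-like protein 1 | | reproductive strategies |
| ZNF239 | Zinc finger protein 239 | gene-specific transcriptional regulator (PC00264) | |
| TBC1D21 | TBC1 domain family member 21 | membrane traffic protein (PC00150) | |
| PPP1R3F | Protein phosphatase 1 regulatory subunit 3F | protein-binding activity modulator (PC00095) | neurodevelopment |
| HR | Lysine-specific demethylase | chromatin/chromatin-binding, or -regulatory protein | hair-cycle regulation (suggesting skin barrier) |

Novel genes associated with terrestrialisation-linked GOs in fruit fly

| Gene Symbol | Protein Name | Protein Class | Biological Functions |
| --- | --- | --- | --- |
| Pof | Protein painting of fourth | RNA metabolism protein (PC00031) | reproductive strategies |
| MESR4 | Misexpression suppressor of ras 4, isoform A | gene-specific transcriptional regulator (PC00264) | reproductive strategies |
| Ir64a, Ir75d, Ir31a, Ir84a | Ionotropic receptor | transmembrane signal receptor (PC00197) | sensory activity (response to stimuli) |
| Gr39b | Putative gustatory receptor 39b | transmembrane signal receptor (PC00197) | sensory activity (response to stimuli) |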

6.3. Ablation Studies / Parameter Analysis

While the paper does not present explicit ablation studies in the traditional sense (removing a component of a model to assess its impact), it does include several analyses that function as parameter choices or validation of methodologies:

  • CAFE5 Model Comparison: The choice between two-lambda and three-lambda models in CAFE5 for inferring gene expansions and contractions can be considered a parameter analysis. The researchers ran both models ten times, compared their fit using likelihood ratio tests, and performed simulation tests to check model stability (Supplementary Table 20). This rigorous process ensured that the most appropriate model (e.g., three-lambda for Lophotrochozoa and Ecdysozoa, two-lambda for Deuterostomia due to stability) was chosen for each phylogenetic group, thereby validating the inference of expanded and contracted HGs.

  • Permutation Tests: The permutation tests (Extended Data Fig. 3) for novel HG gain rates and functional repertoire differences serve to statistically validate the significance of the observed patterns against random chance. By showing that terrestrial lineages have significantly higher novel gene rates and distinct functional profiles compared to aquatic ones, these tests confirm that the observed genomic changes are not merely stochastic but linked to the terrestrialization process.

  • BLASTp Validation of Novel Core HGs: The validation of novel core HGs using BLASTp against RefSeq (excluding ingroup taxa) acts as a crucial check on the methodology for identifying truly novel genes. The results (very weak hits with high e-values and low identity) confirm that the identified novel core HGs are indeed specific to the terrestrial lineages, reinforcing the reliability of these findings.

  • PCoA and PERMANOVA for Habitat Types: The PCoA and subsequent PERMANOVA analyses to differentiate semi-terrestrial and fully terrestrial groups (Fig. 5) rigorously test the hypothesis that habitat dependency correlates with distinct genomic adaptation patterns. The significant P-values and R-squared values support the distinction in functional profiles, validating this categorization.

    These analyses, though not explicitly termed "ablation studies," systematically evaluate the robustness of key methodological choices and the statistical significance of the core findings.
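The permutation tests mentioned above reduce to comparing one observed value with a null distribution built from aquatic nodes. The sketch below shows the general idea under stated assumptions: the null is built by repeatedly averaging random subsets of aquatic rates, the test is one-sided, and every number and name is hypothetical rather than taken from the paper's procedure.

```python
import random

def null_distribution(aquatic_rates, sample_size, draws=10_000, seed=0):
    """Mean rate of randomly chosen aquatic nodes, repeated `draws` times."""
    rng = random.Random(seed)
    return [sum(rng.sample(aquatic_rates, sample_size)) / sample_size
            for _ in range(draws)]

def permutation_p_value(observed, null_values):
    """One-sided empirical P: share of null draws at least as large as observed."""
    hits = sum(1 for value in null_values if value >= observed)
    return hits / len(null_values)

# Hypothetical novel-gene rates (homology groups per million years).
aquatic_rates = [0.8, 1.1, 0.9, 1.4, 1.0, 0.7, 1.2, 0.6, 1.3, 0.95]
terrestrial_mean = 2.4

null_values = null_distribution(aquatic_rates, sample_size=4)
p_value = permutation_p_value(terrestrial_mean, null_values)
print(p_value)
```

When no null draw reaches the observed value, this estimator returns exactly 0, which is how a reported P = 0 can arise; it is better read as P < 1/draws than as a literal zero probability.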

7. Conclusion & Reflections

7.1. Conclusion Summary

This study provides a comprehensive comparative genomic analysis of 11 independent animal terrestrialization events, leveraging an InterEvo framework to explore convergent and contingent evolutionary paths. The key findings demonstrate that terrestrialization is universally characterized by extensive gene turnover, involving both gene gains (novel genes, expansions) and gene reductions (contractions, losses). Crucially, despite lineage-specific gene changes, convergent functional adaptations emerged recurrently, particularly in processes related to osmoregulation, stress response, immunity, sensory reception, metabolism, and reproduction.

The paper highlights a distinction between semi-terrestrial and fully terrestrial lineages: semi-terrestrial animals exhibit broader functional convergence with a versatile toolkit for environmental flexibility, whereas fully terrestrial animals show more contingent adaptations with a streamlined set of functions, indicating diverse genomic solutions for permanent land colonization.

Furthermore, the molecular timescale analysis identifies three major temporal windows for animal terrestrialization (Middle Cambrian-Ordovician, Late Devonian-Early Carboniferous, and Cretaceous), each coinciding with distinct paleoecological contexts. The study concludes that genomic adaptations to terrestrial life are a complex interplay of predictable (convergent) molecular responses to common environmental challenges and unique (contingent) solutions shaped by individual evolutionary histories.

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Habitat Classification Ambiguity: The classification of terrestriality (e.g., semi-terrestrial vs. fully terrestrial) lacks a universal consensus, and various definitions exist (e.g., cryptic forms, poikilohydric, homoiohydric). This could influence interpretations, and more comparisons using alternative classifications are needed.

  • Challenges in Annotating Lost and Contracted Genes: Many lost and contracted HGs are difficult to functionally annotate because they are absent or poorly characterized in common model organisms. Functional analysis often relies on distant homologs from humans or fruit flies, which may not accurately reflect functions due to sequence divergence, or become virtually impossible if HGs are lost in traditional models.

  • Inference of Gene Duplication Events: CAFE5 infers gene expansions based on copy number changes, not explicit gene trees. This means it does not pinpoint whether duplications occurred precisely at the terrestrial nodes or independently within descendant lineages. While observed expansions remain robust, more precise methods are needed.

  • Phylogenetic Position Incongruence: Certain phylogenetic relationships, such as the position of Chelicerata (specifically Xiphosura relative to Arachnida), remain debated. Such incongruences can complicate interpretations of terrestrial transitions and ancestral states.

  • Limited Taxon Sampling: For some lineages (e.g., tardigrades, onychophorans, woodlice), only one or two genomes were available. This limited sampling may lead to HG numbers that are not fully representative of the clade's entire gene content.

    Based on these limitations, the authors suggest future research directions:

  • Increased Taxon Sampling: Inclusion of more genomes for underrepresented lineages to improve representativeness.

  • Advanced Annotation Tools: Developing sophisticated annotation tools for lost and contracted genes, potentially using machine learning approaches (e.g., language models), to overcome challenges posed by sequence divergence and limited homologs.

  • Gene Tree-Based Expansion Inference: Improving gene family expansion inference by integrating gene tree-based approaches to precisely pinpoint duplication events.

7.3. Personal Insights & Critique

This paper offers a compelling and comprehensive analysis that significantly advances our understanding of animal terrestrialization. The InterEvo framework is a powerful conceptual and methodological innovation for studying convergent evolution on a genomic scale, moving beyond simple gene counts to shared biological functions. The sheer breadth of 154 genomes across 21 phyla makes the findings highly robust and generalizable across the animal kingdom.

The distinction between semi-terrestrial and fully terrestrial adaptations, while acknowledged by the authors as having classification ambiguities, provides a valuable layer of nuance to the convergence-contingency debate. It suggests that the "predictability" of evolutionary adaptation might vary depending on the degree of environmental independence achieved. Semi-terrestrial forms, constantly buffering between aquatic and terrestrial challenges, may be under stronger, more uniform selective pressures leading to broader functional convergence. Fully terrestrial forms, having overcome initial hurdles, might then explore more diverse and contingent genomic pathways as they specialize within drier niches.

The integration of molecular clock dates with paleoecological contexts is particularly insightful, transforming the genomic data into a narrative of co-evolution with Earth's changing environments and the rise of land plants. This highlights the profound interconnectedness of biological and geological history.

One area for potential critique, as the authors touched upon, lies in the functional annotation of novel and lost genes. The reliance on existing databases, often biased towards well-studied model organisms, means that truly novel or highly divergent functions might still be mis- or under-annotated. AI-driven methods, as suggested, could be transformative here. Additionally, while the paper meticulously analyzes gene turnover, a deeper dive into the regulatory changes (e.g., cis-regulatory elements, microRNAs) associated with these transitions could offer another layer of insight into how gene co-option and expression changes contribute to terrestrial adaptation, especially in lineages showing lower gene plasticity.

The paper's conclusions, particularly the statement that adaptation to life on land is predictable in large part, have significant implications. They suggest that fundamental biophysical and physiological constraints of terrestrial environments impose strong selective pressures that funnel evolution towards similar molecular solutions across vastly different lineages. This principle could potentially be applied to understanding other major evolutionary transitions (e.g., flight, deep-sea adaptation, endothermy) to discern generalizable molecular principles.
