Abstract

Standard genome-wide association studies (GWAS) and rare variant burden tests are essential tools for identifying trait-relevant genes. By analyzing association studies of 209 quantitative traits in the UK Biobank, we show that they systematically prioritize different genes. We propose prioritization criteria based on trait importance and trait specificity and find that GWAS prioritize genes near trait-specific variants, while burden tests prioritize trait-specific genes, revealing differences in trait biology and implications for interpretation and usage.

1. Bibliographic Information

1.1. Title

Specificity, length and luck drive gene rankings in association studies

1.2. Authors

Jeffrey P. Spence, Hakhamanesh Mostafavi, Mineto Ota, Nikhil Milind, Tamara Gjorgjieva, Courtney J. Smith, Yuval B. Simons, Guy Sella & Jonathan K. Pritchard. The corresponding authors are Jeffrey P. Spence, Hakhamanesh Mostafavi, Mineto Ota, and Jonathan K. Pritchard.

1.3. Journal/Conference

Nature. Nature is one of the world's most prestigious and highly-cited multidisciplinary scientific journals, known for publishing groundbreaking research across all fields of science and technology. Its reputation ensures rigorous peer review and high impact in the scientific community.

1.4. Publication Year

2025 (Published online: 05 November 2025).

1.5. Abstract

Standard genome-wide association studies (GWAS) and rare variant burden tests are fundamental tools for identifying genes relevant to specific traits. By analyzing association studies of 209 quantitative traits in the UK Biobank, the authors demonstrate that these two methods systematically prioritize different genes. To address this, they propose prioritization criteria based on trait importance (how much a gene quantitatively affects a trait) and trait specificity (the importance of a gene for the studied trait relative to its importance across all traits). Their findings indicate that GWAS prioritize genes near trait-specific variants, while burden tests prioritize trait-specific genes. This distinction reveals differences in the underlying trait biology and carries significant implications for the interpretation and practical application of association studies.

1.6. Original Source Link

https://doi.org/10.1038/s41586-025-09703-7 The paper is officially published online in Nature.

2. Executive Summary

2.1. Background & Motivation

The central goal of human genetics is to identify genes that influence traits and disease risk and to understand the extent of their effects. This knowledge is crucial for deciphering biological processes underlying trait variation, identifying critical genes and pathways, and discovering potential therapeutic targets.

The core problem the paper addresses is the observed discrepancy in gene prioritization between two essential tools in human genetics: Genome-Wide Association Studies (GWAS) and rare variant burden tests. Although the two methods are conceptually similar, previous anecdotal evidence and a systematic analysis by Weiner et al. (2023) suggested that they often identify distinct sets of genes, with only partial overlap. This raises critical questions:

How do these methods prioritize genes?
What underlying biological principles drive these differences?
Which method is more relevant for understanding trait biology or for downstream applications like drug discovery?

Existing challenges in interpreting these studies include:
GWAS do not directly pinpoint causal genes, as most associated variants are non-coding.
A large fraction of the genome contributes to heritability, and trait-associated variants often cannot be mapped to genes with clear phenotypic relevance.
Rare protein-coding variants, crucial for direct gene study, are often excluded or underpowered in standard GWAS but are the focus of burden tests.

The paper's innovative idea is to propose two distinct criteria for ideal gene prioritization—trait importance and trait specificity—and then to use population genetics models and empirical data from the UK Biobank to understand how GWAS and LoF burden tests align with these criteria, and what non-biological factors might also influence their rankings.

2.2. Main Contributions / Findings

The paper makes several primary contributions and key findings:

Systematic Quantification of Differences: The study systematically confirms that GWAS and LoF burden tests prioritize different genes for the same traits, even after conservatively accounting for power differences and issues in linking variants to genes.
Proposed Prioritization Criteria: It introduces two conceptually distinct criteria for ideal gene prioritization:
- Trait Importance: How much a gene quantitatively affects a trait.
- Trait Specificity: The importance of a gene for the trait under study relative to its importance across all traits.
Mechanisms of Prioritization:
- Burden Tests Prioritize Trait-Specific Genes: LoF burden tests tend to prioritize genes by their trait specificity ( $\Psi_G$ ) rather than their trait importance. This is because the strength of selection against Loss-of-Function (LoF) variants, which determines their aggregate frequency, is proportional to the total effect across all fitness-relevant traits.
- GWAS Prioritize Trait-Specific Variants: GWAS prioritize trait-specific variants ( $\Psi_V$ ). Variants can achieve specificity in two ways: by acting through a trait-specific gene or by having context-specific effects on a pleiotropic gene (e.g., regulating expression only in trait-relevant cell types).
Role of Non-Coding Variants: The difference between LoF burden tests and GWAS is largely driven by GWAS including non-coding variants, which can have context-specific effects and thus prioritize pleiotropic genes in a trait-specific manner, a capability burden tests generally lack.
Impact of Trait-Irrelevant Factors:
- Gene Length (for Burden Tests): LoF burden tests systematically prioritize longer genes, as more potential LoF positions lead to greater power, irrespective of the gene's true trait importance.
- Genetic Drift (for GWAS): Random genetic drift causes minor allele frequencies (MAFs) to vary widely around their expected values. This stochasticity significantly influences GWAS rankings, leading to GWAS hits appearing more pleiotropic than they truly are, as higher frequency variants (due to drift) increase power for multiple traits.
Method for Estimating Trait Importance: The paper suggests that non-standard GWAS approaches that aggregate signals across different types of variants (e.g., using AMM) can better estimate trait importance than standard P-value rankings, overcoming the flattening effect where highly important genes are harder to detect due to stronger purifying selection.
Implications: The findings underscore that LoF burden tests and GWAS reveal distinct but complementary aspects of trait biology. Understanding these differences is crucial for accurate interpretation, target discovery (e.g., trait-specific genes for drug targets to minimize side effects), and improving future association studies.

3.1. Foundational Concepts

Genome-Wide Association Studies (GWAS): A research approach that involves scanning markers across the complete sets of DNA (or genomes) of many people, looking for genetic variations associated with a particular disease or trait.
- How it works: Researchers collect DNA from individuals (e.g., thousands or hundreds of thousands). They then analyze single nucleotide polymorphisms (SNPs), which are common genetic variations where a single nucleotide in the genome differs between members of a species. For each SNP, they compare the allele frequencies (the proportion of a specific variant of a gene) between groups (e.g., people with a disease vs. healthy controls) or correlate allele dosage with quantitative traits.
- Output: GWAS typically generate P-values for millions of SNPs, indicating the statistical significance of their association with the trait. Effect sizes (e.g., beta coefficients) describe the magnitude and direction of the association. Genome-wide significant hits are SNPs with very small P-values (typically $< 5 \times 10^{-8}$ ), suggesting a strong association.
Rare Variant Burden Tests: A statistical method used to identify genes associated with complex traits or diseases by aggregating the effects of multiple rare variants within a specific gene.
- How it works: Instead of testing individual SNPs like GWAS (which is underpowered for rare variants), burden tests group rare variants (typically those with minor allele frequency (MAF) less than 1%) within a gene. These variants are often Loss-of-Function (LoF) or damaging missense variants. The aggregated presence of these rare variants in an individual creates a burden genotype. This burden is then tested for association with the phenotype. By aggregating (summing) rare variants within a gene, the method boosts statistical power.
- Output: Gene-level P-values and effect sizes for the aggregated rare variants within each gene.
UK Biobank: A large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants. It is a critical resource for GWAS and rare variant burden test analyses due to its large sample size and extensive phenotypic data.
Quantitative Traits: Traits that show continuous variation (e.g., height, blood pressure, body mass index) rather than discrete categories. Their variation is typically influenced by multiple genes and environmental factors.
Single Nucleotide Polymorphism (SNP): A variation in a single nucleotide that occurs at a specific position in the genome, where the nucleotide (A, C, G, or T) at that position can differ between individuals. SNPs are the most common type of genetic variation among people.
Loss-of-Function (LoF) Variants: Genetic variants (mutations) that are predicted to cause a complete or partial loss of function of the gene product (e.g., protein). These can include nonsense mutations (introducing a premature stop codon), frameshift mutations (altering the reading frame), or splice site mutations (affecting RNA splicing).
P-value: In hypothesis testing, the P-value is the probability of observing a test statistic (or something more extreme) if the null hypothesis were true. A small P-value (typically less than 0.05 or $5 \times 10^{-8}$ for GWAS) suggests that the observed data are unlikely under the null hypothesis, leading to its rejection and supporting the alternative hypothesis (e.g., an association exists).
Heritability: In genetics, heritability refers to the proportion of phenotypic variation in a population that is attributable to genetic variation among individuals. It estimates how much of the differences between people for a trait are due to genes, as opposed to environmental factors.
Genetic Drift: The change in the frequency of an existing gene allele in a population due to random sampling of organisms. It's a random process, not driven by selection, and can cause alleles to become more or less common over generations, especially in small populations.
Pleiotropy: The phenomenon where a single gene affects two or more seemingly unrelated phenotypic traits. A pleiotropic gene might have broad effects across multiple biological systems.
Linkage Disequilibrium (LD): The non-random association of alleles at different loci (genomic positions). Alleles are in LD when the frequency of association of their genotypes is higher or lower than what would be expected if the loci were independent and associated randomly. LD blocks are regions where SNPs are highly correlated.
Trait Importance (proposed by paper): The quantitative effect a gene (or variant) has on the trait under study. Formally, for a variant, it's its squared effect on the trait of interest ( $\alpha_t^2$ ); for a gene, it's the squared LoF burden effect size ( $\gamma_t^2$ ).
Trait Specificity (proposed by paper): The importance of a gene (or variant) for the trait under study relative to its importance across all fitness-relevant traits. Formally, for a variant, it's $\Psi_V := \alpha_1^2 / \Sigma_t \alpha_t^2$ ; for a gene, it's $\Psi_G := \gamma_1^2 / \Sigma_t \gamma_t^2$ , where trait 1 is the trait under study.
Minor Allele Frequency (MAF): The frequency at which the less common allele occurs in a given population. For rare variants, MAF is typically very low (e.g., <1%).
Assay for Transposase-Accessible Chromatin using sequencing (ATAC-seq): A molecular biology technique used to assess chromatin accessibility across the genome. Accessible chromatin regions are often regulatory elements (e.g., enhancers, promoters) and indicate active gene regulation. ATAC peaks denote regions of open chromatin.
S-LDSC (Stratified Linkage Disequilibrium Score Regression): A method used to partition the heritability of complex traits across different genomic annotations (e.g., gene bodies, enhancers, tissue-specific regulatory regions). It quantifies how much a given annotation contributes to heritability beyond what is expected by chance. The output $\tau/h^2$ represents the change in the proportion of heritability explained by a single variant when that variant is in a given annotation.
MAGMA (Multi-marker Analysis of GenoMic Annotation): A gene-set analysis tool that uses GWAS summary statistics to calculate gene-level P-values and then tests for enrichment of these gene-level P-values in predefined gene sets. It aggregates SNP-level P-values within genes to obtain a gene-level score.
PoPS (Polygenic Priority Score): A method that leverages polygenic enrichments of gene features (e.g., expression patterns, protein-protein interactions) to predict gene-level scores (like those from MAGMA) and prioritize genes underlying complex traits and diseases.
AMM (Abstract Mediation Model): A statistical method designed to partition gene-mediated disease heritability from GWAS data without requiring eQTLs (expression quantitative trait loci). It estimates the total heritability contributed by variants acting via a given set of genes.
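
As a small illustration of how the P-values and significance thresholds defined above relate to z-scores, the following sketch converts a z-score into a two-sided P-value. It assumes the usual normal approximation in which the z-score is the estimated effect divided by its standard error; the function names are hypothetical and are not taken from the paper or from any GWAS software.

```python
import math

GWAS_THRESHOLD = 5e-8  # genome-wide significance threshold quoted above


def two_sided_p(z: float) -> float:
    """Two-sided P-value for a z-score under a standard normal null."""
    return math.erfc(abs(z) / math.sqrt(2.0))


def is_genome_wide_significant(z: float, threshold: float = GWAS_THRESHOLD) -> bool:
    """True if the association passes the given significance threshold."""
    return two_sided_p(z) < threshold


if __name__ == "__main__":
    for z in (1.96, 5.0, 5.5):
        print(f"z = {z:4.2f}  P = {two_sided_p(z):.2e}  "
              f"significant = {is_genome_wide_significant(z)}")
```

Because the conversion depends only on the magnitude of z, ranking by P-value and ranking by $z^2$ give the same order.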

3.2. Previous Works

The paper builds upon and references several key prior studies:

Weiner et al. (2023) - "Polygenic architecture of rare coding variation across 394,783 exomes.": This study systematically analyzed rare coding variants and found that burden heritability is explained by fewer genes compared to SNP heritability, and burden tests tend to prioritize genes more closely related to trait biology. This observation of distinct gene sets identified by GWAS and burden tests forms a primary motivation for the current paper's investigation into why these differences occur. The current paper directly aims to explain the "why" behind the Weiner et al. findings.
Simons et al. (2018) - "A population genetic interpretation of GWAS findings for human quantitative traits.": This work, and subsequent extensions (ref 3, 33, 82), developed population genetics models of complex traits, often assuming stabilizing selection. The current paper explicitly utilizes these models (Supplementary Appendix B) to derive theoretical predictions about how natural selection influences the power of GWAS and burden tests to prioritize variants and genes based on their effect sizes and frequencies. Specifically, the concept of flattening (where selection makes it harder to detect very large effect variants) is central to Simons et al.'s work and is directly incorporated here to explain the decoupling of z-scores from trait importance in burden tests.
Backman et al. (ref 4): This refers to the source of the LoF burden test summary statistics used in the current study, highlighting the reliance on existing large-scale datasets.
Finucane et al. (2015) - "Partitioning heritability by functional annotation using genome-wide association summary statistics." (ref 9): This paper introduced S-LDSC, a foundational method for partitioning heritability across genomic annotations. The current paper uses S-LDSC extensively to quantify the contribution of trait-specific variants (coding and non-coding) to heritability in GWAS, thus leveraging a widely accepted methodology for interpreting GWAS signals.
Morgenthaler & Thilly (2007) - "A strategy to discover genes that carry multi-allelic or mono-allelic risk for common diseases: a cohort allelic sums test (CAST)." (ref 17): This is one of the earliest conceptual papers outlining the idea of burden tests for rare variants, demonstrating the historical foundation of the methods being analyzed.

3.3. Technological Evolution

The field of human genetics has evolved significantly:

Early Linkage Studies: Focused on large families to map disease-causing genes, primarily for Mendelian diseases (single gene disorders).
Candidate Gene Studies: Hypothesized specific genes and tested their association with traits, often with limited success for complex traits.
Rise of GWAS (mid-2000s): Revolutionized complex trait genetics by enabling unbiased, genome-wide scans for common variants. This was driven by advancements in SNP genotyping arrays and large cohorts. GWAS revealed the polygenic nature of most complex traits (many genes, each with small effects) and highlighted the importance of non-coding regulatory regions.
Post-GWAS Interpretation (late 2000s-present): The challenge shifted from finding GWAS hits to interpreting them. Methods like LDSC and S-LDSC emerged to link GWAS signals to functional annotations and tissues. Fine-mapping techniques aimed to pinpoint causal variants within LD regions.
Whole-Exome/Whole-Genome Sequencing (early 2010s-present): The advent of affordable sequencing technologies enabled the study of rare variants, which were missed by GWAS. This led to the development and widespread application of rare variant burden tests, initially for severe Mendelian disorders and now increasingly for complex traits. Large biobanks like the UK Biobank provide the necessary sample sizes for these studies.
Integration and Causal Inference (present): Current research, including this paper, focuses on integrating information from GWAS and burden tests, understanding their complementary nature, and moving towards causal inference and functional interpretation of genetic signals. Methods like AMM, MAGMA, and PoPS represent efforts to extract gene-level insights from SNP-level GWAS data.

This paper fits into the current era of integrating different genetic association approaches and providing a theoretical framework to understand their strengths and weaknesses in prioritizing genes for complex traits.

3.4. Differentiation Analysis

Compared to previous studies that anecdotally or systematically observed differences between GWAS and burden tests, this paper provides a novel theoretical and empirical framework to explain why these differences exist and how they relate to distinct biological properties of genes and variants.

Novel Prioritization Criteria: The introduction of trait importance and trait specificity as explicit, formal criteria for ideal gene prioritization is a core innovation. Previous work might have implicitly considered these, but this paper defines them rigorously and uses them as a lens to analyze existing methods.
Population Genetics Framework: The paper rigorously applies population genetics models (building on Simons et al.) to derive theoretical predictions for how natural selection shapes the observed P-values and z-scores in both GWAS and burden tests. This moves beyond purely statistical comparisons to a deeper biological explanation.
Explanation for Prioritization Mechanisms:
- It explicitly distinguishes two mechanisms: burden tests prioritize trait-specific genes (because aggregate LoF frequency is inversely related to the selection strength $s_{\mathrm{het}}$ , which itself scales with the sum of effects across all traits), while GWAS prioritize trait-specific variants (which can include context-specific effects on pleiotropic genes). This distinction, especially the role of non-coding variants in allowing GWAS to capture pleiotropic genes in a trait-specific manner, is a key insight.
- The paper uncovers and quantifies the impact of trait-irrelevant factors like gene length (for burden tests) and random genetic drift (for GWAS), which were previously less systematically understood as drivers of observed rankings.
Proposing Solutions for Trait Importance: While previous work noted the difficulty in identifying trait-important genes from P-value rankings, this paper proposes and empirically tests methods (e.g., aggregating signals with AMM) that can better estimate trait importance by overcoming the flattening effect.

In essence, while others observed what was different, this paper provides a robust theoretical and empirical explanation for why these differences arise and how to potentially leverage or mitigate them.

4. Methodology

4.1. Principles

The core idea of the methodology is to understand the drivers behind gene prioritization in two major types of genetic association studies: Genome-Wide Association Studies (GWAS) and Loss-of-Function (LoF) burden tests. The theoretical basis hinges on integrating population genetics models of complex traits with statistical genetics, particularly how natural selection influences the allele frequencies and effect sizes of variants, and consequently, the power of association tests.

The paper posits two ideal criteria for prioritizing genes: trait importance (the magnitude of a gene's effect on the trait) and trait specificity (how unique that effect is to the trait under study compared to other traits). The methodology then involves:

Theoretical Derivation: Using population genetics models to predict how the strength of association (z-score) in GWAS and burden tests is expected to relate to trait importance and trait specificity. This involves modeling the interplay between mutation rates, selection pressure ( $s_{\mathrm{het}}$ ), and allele frequencies.
Empirical Validation: Analyzing real GWAS and LoF burden test summary statistics from the UK Biobank for hundreds of quantitative traits to test these theoretical predictions. This includes comparing rankings, examining heritability enrichment in tissue-specific annotations, and investigating the influence of gene length and minor allele frequency (MAF).
Simulation Studies: Using simulated data to further explore the effects of genetic drift on GWAS variant rankings and apparent pleiotropy.
Proposing Improved Estimation: Investigating whether alternative approaches, such as aggregating signals across variants (e.g., using AMM), can better estimate trait importance compared to P-value based rankings.

The intuition is that if a gene or variant has a large effect on a trait (high trait importance), it might also affect many other traits (high pleiotropy). Natural selection tends to remove variants with large, negative effects across many traits, leading to lower frequencies. This interplay, coupled with technical aspects of each assay, dictates which genes rise to the top of association study rankings.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Data Acquisition and Preprocessing

GWAS Summary Statistics:
- Downloaded from the Neale Lab (http://www.nealelab.is/uk-biobank/; v3) for 305 continuous traits.
- These regressions were performed on inverse rank normal-transformed phenotypes in approximately 360,000 UK Biobank individuals.
- Covariates included age, age $^2$ , inferred sex, age $\times$ inferred sex, age $^2 \times$ inferred sex, and principal components 1-20.
- Genome-wide significance threshold was set at $5 \times 10^{-8}$ .
LoF Burden Test Summary Statistics:
- Downloaded from Backman et al. (ref 4) for 292 LoF burden tests.
- 209 traits overlapped with the GWAS data (Supplementary Table 1).
- Burden genotypes were calculated by categorizing individuals: homozygous non-LoF (at all sites), homozygous LoF (at any site), or heterozygotes.
- Burden tests were run using REGENIE (ref 59) on inverse rank normal-transformed phenotypes.
- Mask M1: Used for primary analyses, includes only LoF variants with stringent filtering criteria and allele frequency upper bound of 1%.
- Mask M3: Used for analyses including missense variants, also includes likely damaging missense variants with an allele frequency upper bound of 1%.
- Per-trait genome-wide significance threshold was $2.7 \times 10^{-6}$ , derived from Bonferroni correction for testing approximately 18,000 genes.
Subset of Genetically Uncorrelated Traits:
- For specific analyses (e.g., Figs. 3b-d, 4b,c, Extended Data Figs. 1a-c, 3a-c), a subset of 27 genetically uncorrelated traits was selected.
- This subset was formed by intersecting the 209 overlapping traits with those analyzed by Mostafavi et al. (ref 45), ensuring pairwise genetic correlations (from Neale Lab) were below 0.5 and prioritizing traits with higher heritability. Biomarkers were excluded. This minimizes results being driven by highly correlated phenotypes.
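
Two of the preprocessing details above can be sketched in a few lines: the collapsing of per-site LoF genotypes into a single burden category, and the Bonferroni arithmetic behind the burden test threshold. This is a minimal sketch, not the pipeline of Backman et al. The 0/1/2 numeric coding, the function names, and the family-wise error rate of 0.05 are assumptions; the gene count of 18,524 is the number of protein-coding genes in the LoF burden tests reported in Section 4.2.2.

```python
def burden_genotype(lof_dosages):
    """Collapse per-site LoF dosages (0, 1 or 2 copies) into one category.

    Returns 2 if any site is homozygous LoF, 0 if every site is
    homozygous non-LoF, and 1 otherwise (heterozygote).
    The 0/1/2 coding is an assumption made for illustration.
    """
    if any(d == 2 for d in lof_dosages):
        return 2
    if all(d == 0 for d in lof_dosages):
        return 0
    return 1


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test threshold for a family-wise error rate alpha (assumed 0.05)."""
    return alpha / n_tests


if __name__ == "__main__":
    print(burden_genotype([0, 0, 0]))   # non-carrier
    print(burden_genotype([0, 1, 0]))   # heterozygous carrier
    print(burden_genotype([0, 2, 1]))   # homozygous LoF at one site
    print(f"{bonferroni_threshold(18524):.1e}")
```

With 18,524 tests the threshold rounds to $2.7 \times 10^{-6}$ , matching the value stated above.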

4.2.2. Defining GWAS Loci and Ranking

To systematically compare GWAS and burden test discoveries and minimize technical artifacts from unknown causal genes or variant-to-gene mapping errors, a conservative approach for defining GWAS loci was used:

Locus Definition: For traits with at least one burden test hit and one GWAS hit (151 traits), LD-clumped hits ( $P < 5 \times 10^{-8}$ , clumping $r^2 < 0.1$ ) were used.
Starting with the most significant GWAS hit, a 1-Mb window was taken around it.
All independent hits with larger P-values (lower significance) within this 1-Mb window were included.
The locus size was then expanded to ensure no other hit was within 1 Mb of any variant already in the locus.
Overlapping loci were merged.
This process was repeated for the next most significant hit not yet assigned.
Gene Assignment: Overlapping protein-coding genes (18,524 genes in LoF burden tests) were assigned to each locus.
Ranking: GWAS loci were ranked by the minimum GWAS P-value within each locus. Burden test genes were ranked by their burden P-value.
Overlap Quantification (Fig. 1c): For genome-wide significant burden test hits, their rank was compared to the rank of the GWAS locus containing them. Top GWAS loci were defined by selecting a number of GWAS loci equal to the number of significant burden hits for a given trait (e.g., 82 for height).
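
The locus-building procedure can be sketched as follows. Under the reading adopted here, iteratively expanding a locus until no other hit lies within 1 Mb of any hit already in it is equivalent to grouping position-sorted hits on a chromosome whenever consecutive hits are at most 1 Mb apart. The class names, the 500-kb padding used to turn a group of hits into an interval, and the tuple layout for genes are all assumptions made for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    chrom: str
    pos: int
    p: float


@dataclass
class Locus:
    chrom: str
    start: int
    end: int
    min_p: float


def build_loci(hits, max_gap=1_000_000, pad=500_000):
    """Group LD-clumped hits into loci and rank loci by minimum P-value."""
    groups = []  # each group is a list of hits forming one locus
    for hit in sorted(hits, key=lambda h: (h.chrom, h.pos)):
        prev = groups[-1][-1] if groups else None
        if prev and prev.chrom == hit.chrom and hit.pos - prev.pos <= max_gap:
            groups[-1].append(hit)   # chains onto the current locus
        else:
            groups.append([hit])     # starts a new locus
    loci = [
        Locus(g[0].chrom, max(0, g[0].pos - pad), g[-1].pos + pad,
              min(h.p for h in g))
        for g in groups
    ]
    return sorted(loci, key=lambda locus: locus.min_p)


def genes_in_locus(locus, genes):
    """Names of genes, given as (name, chrom, start, end), overlapping a locus."""
    return [name for name, chrom, start, end in genes
            if chrom == locus.chrom and start <= locus.end and end >= locus.start]


if __name__ == "__main__":
    hits = [Hit("1", 1_000_000, 1e-20), Hit("1", 1_800_000, 1e-9),
            Hit("1", 2_700_000, 1e-12), Hit("1", 5_000_000, 1e-30),
            Hit("2", 1_000_000, 1e-10)]
    for rank, locus in enumerate(build_loci(hits), start=1):
        print(rank, locus)
```

In the toy input, the hits at 1.0 Mb and 2.7 Mb end up in the same locus even though they are 1.7 Mb apart, because the hit at 1.8 Mb links them. This chaining is why locus sizes vary.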

4.2.3. Prioritization Criteria: Trait Importance and Trait Specificity

The paper formally defines these concepts (illustrated in Figure 2):

The following figure (Figure 2 from the original paper) illustrates how genes should ideally be prioritized:

Fig. 3 | Burden tests prioritize trait-specific genes, not large-effect genes. a, Burden tests prioritize genes by trait specificity. $\mu$ is the per-site mutation rate, $L$ is the number of potential… (Image: selection strength $s_{\mathrm{het}}$ plotted against LoF frequency, showing a negative correlation with a LOESS trend line; panel e is a quantile-quantile plot of P-values across trait-tissue pairs.)

Trait Importance:
- For a variant: its squared effect on the trait of interest. If $\alpha_t$ is the effect size of a variant on trait $t$ , trait importance for trait 1 is $\alpha_1^2$ .
- For a gene: the trait importance of LoF variants in that gene. If $\gamma_t$ is the LoF burden effect size of a gene on trait $t$ , trait importance for trait 1 is $\gamma_1^2$ .
- The paper considers high-impact variants important regardless of their direction of effect.
Trait Specificity:
- Defined as the importance for the trait of interest relative to the importance across all fitness-relevant traits (measured in appropriate units).
- For a variant: $\Psi_V := \alpha_1^2 / \sum_t \alpha_t^2$ .
- For a gene: $\Psi_G := \gamma_1^2 / \sum_t \gamma_t^2$ .
- Here, trait 1 is always the trait under study.
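
The two criteria can be made concrete with a short numerical sketch. The effect vectors below are invented for illustration, the function names are hypothetical, and index 0 plays the role of trait 1 (the trait under study).

```python
def trait_importance(effects, trait=0):
    """Squared effect on the trait under study (alpha_1^2 or gamma_1^2)."""
    return effects[trait] ** 2


def trait_specificity(effects, trait=0):
    """Squared effect on the studied trait divided by the sum over all traits."""
    total = sum(e ** 2 for e in effects)
    if total == 0:
        raise ValueError("specificity is undefined when all effects are zero")
    return effects[trait] ** 2 / total


if __name__ == "__main__":
    specific = [0.5, 0.0, 0.1]          # modest effect, almost all on trait 1
    pleiotropic = [1.0, 1.0, 1.0, 1.0]  # large effect on every trait
    for name, effects in (("specific", specific), ("pleiotropic", pleiotropic)):
        print(f"{name:12s} importance = {trait_importance(effects):.2f}  "
              f"specificity = {trait_specificity(effects):.2f}")
```

The pleiotropic example has the larger importance but the smaller specificity, which is the sense in which the two criteria are conceptually distinct. Because effects enter only as squares, flipping the sign of an effect changes neither quantity.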

4.2.4. Theoretical Model for LoF Burden Test Prioritization

The paper analyzes population genetics models developed by Simons et al. (ref 3) to understand how LoF burden tests prioritize genes.

Expected Strength of Association ( $z^2$ ): For a gene, the strength of association in burden tests is proportional to its trait importance ( $\gamma_1^2$ ) and the aggregate frequency of LoFs ( $p_{\mathrm{LoF}}(1-p_{\mathrm{LoF}})$ ). $ E[z^2] \propto \gamma_1^2 p_{\mathrm{LoF}}(1-p_{\mathrm{LoF}}) $ Where:
- $E[z^2]$ is the expected squared z-score (strength of association).
- $\gamma_1^2$ is the trait importance of the gene for trait 1.
- $p_{\mathrm{LoF}}$ is the aggregate frequency of LoFs within the gene.
Relationship between $p_{\mathrm{LoF}}$ and Selection ( $s_{\mathrm{het}}$ ): Under stabilizing selection (where intermediate trait values are favored), the aggregate frequency of LoFs is inversely related to the strength of selection against heterozygous LoF carriers ( $s_{\mathrm{het}}$ ) and positively related to mutation rate ( $\mu$ ) and gene length ( $L$ ) (number of sites where an LoF can occur). $ p_{\mathrm{LoF}}(1-p_{\mathrm{LoF}}) \propto \frac{\mu L}{s_{\mathrm{het}}} $ Where:
- $\mu$ is the per-base mutation rate.
- $L$ is the number of potential LoF positions within the gene (proxy for gene length).
- $s_{\mathrm{het}}$ is the strength of selection against heterozygous LoF carriers.
Relationship between $s_{\mathrm{het}}$ and Total Trait Effects: For genes affecting complex traits under stabilizing selection, the strength of selection ( $s_{\mathrm{het}}$ ) is approximately proportional to the sum of trait importances across all fitness-relevant traits: $ s_{\mathrm{het}} \approx \sum_t \gamma_t^2 $ Where:
- $\sum_t \gamma_t^2$ is the sum of trait importances across all fitness-relevant traits.
Combining these, the expected strength of association for LoF burden tests becomes: $ E[z^2] \propto \gamma_1^2 \frac{\mu L}{\sum_t \gamma_t^2} = (\mu L) \frac{\gamma_1^2}{\sum_t \gamma_t^2} = (\mu L) \Psi_G $ This shows that LoF burden tests prioritize genes by their trait specificity ( $\Psi_G$ ) and gene length ( $\mu L$ ). They do not directly prioritize by trait importance ( $\gamma_1^2$ ).
"Flattening" Effect: For genes with sufficiently large effects (high trait importance), selection causes their LoF frequencies to be very low, leading to larger standard errors in effect size estimates. This flattening effect decouples the strength of association ( $z^2$ ) from the true trait importance, making rankings by significance independent of trait importance for the most important genes.
Empirical Testing for Burden Tests:
- Correlation between estimated $s_{\mathrm{het}}$ (from Zeng et al. (ref 35)) and aggregate LoF frequencies ( $p_{\mathrm{LoF}}$ ) (Fig. 3b).
- Correlation between estimated $s_{\mathrm{het}}$ and unbiased estimates of average trait importance ( $\sum_t \gamma_t^2$ ) across 27 genetically uncorrelated traits (Fig. 3c).
- Plotting mean squared z-scores ( $z^2$ ) against mean importance to show decoupling (Fig. 3d).
- Using gene expression specificity as a proxy for trait specificity ( $\Psi_G$ ). Genes were binned by expression specificity in nine trait-tissue pairs, and quantile-quantile plots of LoF burden test P-values were generated to see if more specific genes had stronger signals (Fig. 3e).
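
The chain of proportionalities in the theoretical model can be checked numerically. The sketch below plugs $s_{\mathrm{het}} \propto \sum_t \gamma_t^2$ into $p_{\mathrm{LoF}}(1-p_{\mathrm{LoF}}) \propto \mu L / s_{\mathrm{het}}$ and returns $E[z^2]$ up to a constant shared by all genes. All proportionality constants are set to 1, and the mutation rate, gene lengths and effect sizes are invented values, so only ratios between genes are meaningful. The function name is hypothetical.

```python
import math


def expected_burden_z2(gammas, mu, length, trait=0):
    """Relative E[z^2] for a LoF burden test, up to a shared constant."""
    s_het = sum(g ** 2 for g in gammas)   # selection scales with total effect
    heterozygosity = mu * length / s_het  # p_LoF(1 - p_LoF) up to a constant
    return gammas[trait] ** 2 * heterozygosity


if __name__ == "__main__":
    mu = 1e-8                    # hypothetical per-site mutation rate
    small = [0.1, 0.1]           # low importance, specificity 0.5
    large = [1.0, 1.0]           # 100x the importance, same specificity
    z2_small = expected_burden_z2(small, mu, length=1000)
    z2_large = expected_burden_z2(large, mu, length=1000)
    z2_long = expected_burden_z2(small, mu, length=2000)
    print("same specificity, same length:", math.isclose(z2_small, z2_large))
    print("doubling length doubles E[z^2]:", math.isclose(z2_long, 2 * z2_small))
```

The two genes differ 100-fold in trait importance yet have the same expected strength of association, because the larger effects are offset by proportionally lower LoF frequencies. Doubling $L$ doubles the expected signal regardless of biology, which is the gene-length dependence discussed in Section 4.2.6.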

4.2.5. Theoretical Model for GWAS Prioritization

A similar argument applies to GWAS at the variant level:

The expected strength of association for a variant is proportional to its trait importance ( $\alpha_1^2$ ) relative to its total trait importance across all fitness-relevant traits ( $\sum_t \alpha_t^2$ ). $ E[z^2] \propto \frac{\alpha_1^2}{\sum_t \alpha_t^2} = \Psi_V $ This means GWAS prioritize trait-specific variants ( $\Psi_V$ ).
Variant Specificity Types (Fig. 4a):
- Trait-specific gene: A variant affects a gene that primarily impacts the studied trait.
- Context-specific effects: A variant (often non-coding) has effects only in specific cellular contexts or developmental stages relevant to the trait, even if the underlying gene is pleiotropic.
  
  The following figure (Figure 4 from the original paper) illustrates how GWAS prioritizes trait-specific variants:
  
  Fig. 5 | Estimating trait importance by combining different variant types. a, Theoretical expected contributions to heritability, $h^2$ , as a function of the total effect of a variant on a tra… (Image: panel a shows the expected heritability contribution against variant specificity; panels b and d show the correlation between gene-level heritability enrichment and specificity; panel c is a schematic of variant contributions to heritability.)
Empirical Testing for GWAS:
- S-LDSC (Stratified Linkage Disequilibrium Score Regression): Used to quantify heritability enrichment (a proxy for how highly variants are prioritized on average) along axes of trait specificity.
  - Gene Trait Specificity (for coding variants): Restricted analysis to coding variants and used expression specificity of the gene they act on as a proxy for $\Psi_G$ . S-LDSC was run for nine trait-tissue pairs (Fig. 4b).
  - Context Specificity (for non-coding variants): Used non-coding variants and tissue specificity of ATAC-seq peaks as a proxy for context specificity. S-LDSC was run while controlling for ATAC peak strength (Fig. 4c).
- $\tau/h^2$ was reported, representing the change in the proportion of heritability explained by a single variant due to annotation.
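The proportionality $E[z^2] \propto \Psi_V$ from the theoretical model in this subsection can be made concrete with a small calculation. The sketch below is illustrative only; the function name and the effect values are hypothetical, and it computes just the specificity ratio, not an actual association statistic.

```python
import numpy as np

def trait_specificity(alpha_sq, focal=0):
    """Return Psi_V for each variant.

    alpha_sq: array of shape (n_variants, n_traits) holding squared effects.
    focal: column index of the studied trait.
    """
    alpha_sq = np.asarray(alpha_sq, dtype=float)
    return alpha_sq[:, focal] / alpha_sq.sum(axis=1)

effects = np.array([
    [0.04, 0.00, 0.00],  # small effect, but only on the studied trait
    [0.25, 0.25, 0.50],  # larger effect on the studied trait, but pleiotropic
])
psi_v = trait_specificity(effects)
print(psi_v)
```

In this toy example the first variant has the smaller effect on the studied trait yet the higher specificity, so the model expects it to rank higher in a GWAS.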

4.2.6. Impact of Trait-Irrelevant Factors

Gene Length on LoF Burden Tests

Theoretical Prediction: The expected strength of association in LoF burden tests is proportional to $\mu L$ . Longer genes (larger $L$ ) should have higher power, all else being equal, because they have more potential LoF positions and thus a higher aggregate LoF frequency.
Empirical Testing:
- Correlated gene length (proxied by the expected number of segregating LoFs from gnomAD (v2)) with unbiased estimates of squared trait importance ($\gamma^2$), squared standard errors, and squared z-scores ($z^2$) across 27 genetically uncorrelated traits (Extended Data Fig. 1).
  
  The following figure (Extended Data Fig. 1 from the original paper) illustrates how coding sequence length drives prioritization in LoF burden tests:
  
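  The image is a plot of frequency trajectories for identical mutations. The horizontal axis shows the number of generations since the mutation arose and the vertical axis shows allele frequency. The trajectories diverge as generations pass, highlighting how differently the same mutation can be distributed across generations.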
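The effect of gene length on standard errors can be sketched with the usual large-sample approximation for a regression on a rare genotype, $\mathrm{SE}^2 \approx \sigma^2 / (2Np(1-p))$. The example below assumes, as a simplification that is not taken from the paper's code, that the aggregate LoF frequency is $p = \mu L / s_{\mathrm{het}}$; the function name and parameter values are hypothetical.

```python
def burden_se_sq(mu_L, s_het=0.05, n=400_000, resid_var=1.0):
    """Approximate squared standard error of a burden effect estimate.

    mu_L: per-gene LoF mutation rate (mutation rate times number of LoF sites).
    """
    p = mu_L / s_het                      # assumed aggregate LoF frequency
    return resid_var / (2 * n * p * (1 - p))

short_gene = burden_se_sq(mu_L=1e-6)
long_gene = burden_se_sq(mu_L=2e-6)      # twice as many potential LoF positions
print(long_gene / short_gene)
```

Doubling $\mu L$ roughly halves the squared standard error in this approximation, which raises $z^2$ for the same true effect.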

Random Genetic Drift on GWAS

Theoretical Prediction: While the expected strength of association is proportional to trait specificity ($\Psi_V$), random genetic drift causes variant allele frequencies to deviate widely from their expected values (Extended Data Fig. 2a). GWAS considers variants individually, so this stochasticity in MAF can disproportionately affect rankings.
Realized Heritability: The realized heritability of a variant is $2 \alpha_1^2 p(1-p)$ , where $p$ is the variant allele frequency. Genetic drift makes $p$ highly variable.
Simulations: Simulated GWAS to show that for sufficiently trait-important variants, the ranking by realized heritability is largely random with respect to trait importance, driven by MAF differences (Extended Data Fig. 2b).
Counterintuitive Pleiotropy: This MAF randomness leads to a counterintuitive result: variants that are the strongest GWAS hits for one trait are more likely to be hits for other traits, even if they are, on average, more trait specific. This is because high-frequency variants (due to drift) have increased power across all traits (Extended Data Fig. 3).
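The role of drift can be sketched with a minimal Wright-Fisher simulation. This is not the paper's simulation (which uses fastDTWF); it is a simplified stand-in that assumes genic selection against the allele, a constant population size, and a shared starting frequency. All names and parameter values are hypothetical.

```python
import numpy as np

def simulate_drift(n_variants=2000, pop_size=1000, s_het=0.001,
                   start_freq=0.05, generations=200, seed=0):
    """Evolve identical variants forward in time and return their final frequencies."""
    rng = np.random.default_rng(seed)
    copies = 2 * pop_size                  # number of chromosomes in a diploid population
    freq = np.full(n_variants, start_freq, dtype=float)
    for _ in range(generations):
        # Deterministic change from selection (genic approximation).
        expected = freq * (1 - s_het) / (1 - s_het * freq)
        # Binomial sampling of the next generation is the drift step.
        freq = rng.binomial(copies, expected) / copies
    return freq

def realized_heritability(alpha_sq, freq):
    """Realized heritability 2 * alpha^2 * p * (1 - p)."""
    return 2 * alpha_sq * freq * (1 - freq)

freqs = simulate_drift()
h2 = realized_heritability(alpha_sq=0.01, freq=freqs)
print(np.quantile(h2, [0.1, 0.5, 0.9]))
```

Every simulated variant has the same effect size and the same selection coefficient, so any spread in realized heritability here comes from drift alone.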

The following figure (Extended Data Fig. 2 from the original paper) illustrates how GWAS variant rankings are driven largely by genetic drift:

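The image is a scatter plot showing the relationship between the normalized squared effect of simulated SNPs and their relative realized heritability. The colour bar indicates minor allele frequency (MAF), ranging from 0.1 to 0.4, reflecting the distribution of the different SNPs.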

The following figure (Extended Data Fig. 3 from the original paper) illustrates how genetic drift makes GWAS hits appear more pleiotropic:

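The image is a table presenting reporting information on socially relevant groupings such as sex and race, including population characteristics, recruitment, and ethics oversight. Entries marked N/A indicate that the information is not applicable to, or not reported in, this study.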

4.2.7. Estimating Trait Importance

The paper explores how to overcome the flattening effect and estimate trait importance more effectively.

Simplified Model: A variant has an effect ($\beta$) on a gene, which in turn has an effect ($\gamma$) on the trait, such that the overall variant effect is $\alpha = \beta\gamma$.
Flattening and Plateaus (Fig. 5a): The expected contribution to heritability first increases with the total squared effect, $\alpha^2 = (\beta\gamma)^2$, but then plateaus (decouples from effect size) for very large effects due to strong selection.
Aggregation Strategy (Fig. 5c): While individual variants experience flattening, genes with higher trait importance (large $\gamma$) will have more variants (even those with small $\beta$) that cross the heritability contribution threshold ($\tau$) and contribute to heritability. Thus, the total heritability contributed by variants acting on a given gene should correlate with its trait importance.
Empirical Testing with AMM:
- Used AMM (Abstract Mediation Model, ref 47) to estimate the total heritability of variants acting via a given set of genes using GWAS data.
- Genes were binned by s_het (a proxy for trait importance).
- Compared how AMM-estimated total heritability (Fig. 5d) tracks s_het versus LoF burden heritability (Fig. 5b).
  
  The following figure (Figure 5 from the original paper) illustrates how trait importance is estimated by combining different variant types:
  
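  The image is a set of charts relating gene length to trait effects. Panel A shows that longer genes do not have larger effects on traits, panel B shows that longer genes have smaller standard errors, and panel C shows that LoF burden tests prioritize longer genes, each shown in relation to the average expected number of LoFs.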
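The aggregation argument in this subsection can be illustrated by counting how many variants of a gene cross a fixed contribution threshold as the gene's effect $\gamma$ grows. The sketch below assumes normally distributed variant-to-gene effects $\beta$ and an arbitrary threshold; both are hypothetical choices made only for illustration.

```python
import numpy as np

def n_variants_above_threshold(gamma, betas, threshold):
    """Count variants whose total squared effect (beta * gamma)^2 exceeds the threshold."""
    alpha_sq = (np.asarray(betas) * gamma) ** 2
    return int(np.sum(alpha_sq > threshold))

rng = np.random.default_rng(1)
betas = rng.normal(0.0, 0.1, size=5000)   # assumed distribution of variant effects on the gene
counts = [n_variants_above_threshold(g, betas, threshold=1e-3) for g in (0.5, 1.0, 2.0)]
print(counts)
```

For a fixed set of $\beta$ values the count can only increase with $\gamma$, which is why summed heritability across a gene's variants is expected to track trait importance even when each individual variant is flattened.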

4.2.8. Unbiased Estimates of Trait Importance

To obtain reliable estimates of trait importance ( $\alpha^2$ for variants, $\gamma^2$ for genes), the paper used an unbiased estimator to correct for the inherent bias in simply squaring the observed effect size estimates ( $\hat{\gamma}^2$ ).

Assuming effect size estimates ( $\hat{\gamma}$ ) are approximately normally distributed about their true values ( $\gamma$ ) with noise dependent on their standard errors (s): $\hat{\gamma} \sim \mathrm{Normal}(\gamma, s^2)$ .
An unbiased estimator for $\gamma^2$ is:
$ \hat{\gamma}^2 - s^2 $
Where:
- $\hat{\gamma}^2$ is the squared estimated effect size.
- $s^2$ is the squared standard error of the estimate.
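The reason subtracting $s^2$ removes the bias follows in one line from the normality assumption above. The derivation below is a standard identity, written out here for convenience rather than quoted from the paper:

$$
\mathbb{E}[\hat{\gamma}^2] = \mathrm{Var}(\hat{\gamma}) + \left(\mathbb{E}[\hat{\gamma}]\right)^2 = s^2 + \gamma^2
\quad\Longrightarrow\quad
\mathbb{E}[\hat{\gamma}^2 - s^2] = \gamma^2 .
$$

Because the correction is applied to a noisy estimate, the value for an individual gene can be negative. The estimator is unbiased only on average, which fits with its use as an average over genes and traits.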

4.2.9. Specific Methodological Details

LoF Burden Summary Statistics Binned by $s_{\mathrm{het}}$ : s_het values were from Zeng et al. (ref 35). Genes were binned into 100 bins by s_het, and summary statistics (e.g., unbiased $\hat{\gamma}^2$ , $z^2$ ) were averaged within bins across 27 uncorrelated traits. Heritability enrichment was computed as the average $(z^2 - 1)$ in a bin relative to the overall average, then inverse-variance weighted across traits.
ATAC Peak Specificity: ATAC-seq files from ChIP-Atlas (ref 71) were grouped into 19 tissue/cell-type categories. A peak was 'present' if >5% of samples in a tissue contained it. Peak specificity was measured by the number of shared tissues a peak was present in (for peaks relevant to a trait-tissue pair) and peak intensity (fraction of samples within the focal tissue containing the peak).
Gene Expression Specificity: Average gene expression (TPM) from Human Protein Atlas (ref 73) and Gene Expression Omnibus (ref 74) for 17 tissues/cell types. A gene was 'expressed' if >10 TPM. Expression specificity score = expression in trait-relevant tissue / sum of expression across all 17 tissues. Genes were binned into quintiles based on this score.
Linking Traits to Tissues: S-LDSC was used to partition heritability for traits with $h^2 > 0.04$ using the 19 ATAC-seq annotations. Traits were assigned to a tissue if it had an LDSC $\tau$ with a z-score > 4.5 and >40% of heritability explained by ATAC-seq peaks in that tissue. Genetically uncorrelated traits ( $r^2 < 0.04$ ) were kept, resulting in nine trait-tissue pairs.
Regression of Burden $z^2$ on Expression Specificity: Linear regression of burden $z^2$ on expression specificity quintiles for genes expressed in the top tissue, controlling for unbiased estimates of trait importance.
*   **S-LDSC Analysis using ATAC-seq peaks:** `ATAC-seq peaks` were binned by `number of shared tissues` (5 bins) and `peak intensity` (5 bins). These annotations, along with `LDSC baseline v1.1 covariates`, were used in `S-LDSC v.1.0.1` on `HapMap3 SNPs`.
*   **S-LDSC Analysis using Coding Variants:** `Coding variants` were defined by `Ensembl Variant Effect Predictor (v85)` consequences. `S-LDSC` was run with `expression specificity bins` (5 bins), `gene expression level bins` (5 bins), and `baseline v1.1 covariates` on `HapMap3 SNPs`.
*   **LoF Burden Summary Statistics Binned by  $\mu L$ :** Used `expected number of segregating LoFs` from `gnomAD (v2)` as a proxy for  $\mu L$ . Binned genes into 100 bins, averaged `summary statistics` (e.g., unbiased  $\hat{\gamma}^2$ ,  $s^2$ ,  $z^2$ ) across 27 `uncorrelated traits`.
*   **Computing Frequency Spectra given  $s_{\mathrm{het}}$ :** Simulated `allele frequency distributions` under a `stabilizing selection model` (heterozygote fitness  $1-s_{\mathrm{het}}$ ) using `fastDTWF (ref 81)` for a population of 20,000 diploids.
*   **Simulating Realized Heritability:** Simulated 50,000 unlinked `variants`. For each of 1,000 `s_het` values, 50 `variants` were simulated by drawing `allele frequencies` from computed distributions. `GWAS sample allele counts` were drawn from a `Binomial` distribution ( $N=600,000$ ). `Realized heritability` was set to  $2 s_{\mathrm{het}} \widetilde{f}(1 - \widetilde{f})$ , where  $\widetilde{f}$  is the `GWAS sample allele frequency`.
*   **Computing Pleiotropy of GWAS Hits:** Considered 18 `uncorrelated traits` with at least 100 `GWAS hits`. Hits were grouped into `P-value quartiles`. For each hit, the number of traits for which it was a `hit` was counted and averaged within quartiles.
*   **Simulating Pleiotropy of GWAS Hits:** Simulated `GWAS summary statistics` for 18 traits and 10 million positions. `Squared effect sizes` ($\vec{\alpha^2_j}$) for variant  $j$  were drawn as:
     $\vec{\alpha^2_j} \sim \frac{10^{-7}}{f} \times \exp\{3f \times \mathrm{Normal}(0, p{\bf I} + (1-p){\bf 1}{\bf 1}^T)\}$ 
    Where:
    *   The `exponentiation` is element-wise.
    *    $f$  and  $p$  are parameters related to the overall effect magnitude and trait specificity distribution.
    *   The `strength of selection` was assumed to be  $||\vec{\alpha^2_j}||_1$ .
    *   `MAF` was drawn from the `frequency distribution` with the closest `s_het`.
    *   `Observed association statistic` for trait  $k$  and variant  $j$  was simulated as:
         $\hat{\alpha}_{jk} \sim \mathrm{Normal}\left( \sqrt{\vec{\alpha^2_j}}, \frac{1}{\sqrt{2N_{\mathrm{eff}} \times \mathsf{MAF}_j (1 - \mathsf{MAF}_j )}} \right)$ 
        Where:
        *    $N_{\mathrm{eff}}$  is a scaling factor for `environmental noise` and `sample size`.
        *   These were converted to `P-values` ( $2 N_{\mathrm{eff}} \mathsf{MAF}_j (1 - \mathsf{MAF}_j) \hat{\alpha}_{jk}^2$  as a `chi-squared` distributed `z-score squared`).
        *   A `variant` was a `hit` if its `P-value` was less than threshold  $t$ .
        *   Default parameters:  $f = 0.33, p = 0.5, N_{\mathrm{eff}} = 100,000,000, t = 10$ .
*   **AMM Analysis:** `AMM` (ref 47) was run to estimate `heritability enrichments` for `gene sets`. `Genes` were binned into 100 `s_het` bins. `AMM` estimates the probability that a `SNP` acts via the closest gene, etc., using probabilities from `ref 47`. `LDSC baseline covariates v2.3` and `HapMap3 variants` were used.
*   **Correlation of GWAS hit probability and  $s_{\mathrm{het}}$ :** Logistic regression was performed to differentiate `GWAS hits` from randomly sampled `SNPs`, using `s_het` of the nearest gene as a predictor, along with covariates (MAF, LD score, gene density, distance to `TSS`).
*   **Correlation of  $\hat{\gamma}^2$  and number of GWAS hits:** `LD-clumped GWAS hits` were assigned to the closest gene. The number of `GWAS hits` per gene was correlated with the `unbiased estimate of trait importance` ($\hat{\gamma}^2$) from `LoF burden tests`.
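The association-statistic step of the pleiotropy simulation described above (drawing $\hat{\alpha}_{jk}$, scaling it to a squared z-score, and converting to a P-value) can be sketched as follows. This is a minimal illustration under the stated normal model, not the authors' code; the function name, inputs, and example values are hypothetical, and the P-value uses the identity that the upper tail of a chi-squared variable with 1 degree of freedom equals $\mathrm{erfc}(\sqrt{z^2/2})$.

```python
import math
import numpy as np

def simulate_association(alpha_sq, maf, n_eff, seed=0):
    """Simulate squared z-scores and P-values for variants with given true squared effects."""
    rng = np.random.default_rng(seed)
    alpha_sq = np.asarray(alpha_sq, dtype=float)
    maf = np.asarray(maf, dtype=float)
    scale = 2 * n_eff * maf * (1 - maf)
    # Estimated effect: true effect plus sampling noise with SD 1 / sqrt(scale).
    alpha_hat = rng.normal(np.sqrt(alpha_sq), 1 / np.sqrt(scale))
    z_sq = scale * alpha_hat ** 2
    # Upper tail of a chi-squared distribution with 1 degree of freedom.
    p_values = np.array([math.erfc(math.sqrt(z / 2)) for z in z_sq])
    return z_sq, p_values

# One null variant and one variant with a true effect, both at MAF 0.3.
z_sq, p_values = simulate_association(alpha_sq=[0.0, 1e-4], maf=[0.3, 0.3], n_eff=1_000_000)
print(z_sq, p_values)
```

A variant would then be called a hit when its P-value falls below the chosen threshold $t$.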

# 5. Experimental Setup

## 5.1. Datasets
The study extensively uses data from the `UK Biobank` and public `genomic annotation` resources.

*   **UK Biobank GWAS Summary Statistics:**
    *   **Source:** Neale Lab (http://www.nealelab.is/uk-biobank/; v3).
    *   **Scale & Characteristics:** Summary statistics for 305 continuous traits. The underlying `GWAS` were performed on approximately 360,000 individuals from the `UK Biobank`. Phenotypes were `inverse rank normal-transformed`.
    *   **Domain:** A wide range of quantitative traits, including anthropometric (e.g., height), blood biomarkers, and others.
*   **UK Biobank LoF Burden Test Summary Statistics:**
    *   **Source:** `Backman et al. (ref 4)`.
    *   **Scale & Characteristics:** Summary statistics for 292 `LoF burden tests`. 209 of these traits overlapped with the `GWAS data`. `Burden genotypes` were constructed by aggregating `rare Loss-of-Function (LoF) variants` within genes.
    *   **Domain:** Covers similar quantitative traits as the `GWAS` data.
*   **Subset of Genetically Uncorrelated Traits:**
    *   For analyses requiring independence (e.g., in Figures 3b-d, 4b,c, and Extended Data Figures 1a-c, 3a-c), a subset of 27 `genetically uncorrelated traits` was used.
    *   **Source:** Derived from the overlapping 209 traits, ensuring pairwise `genetic correlations` were below 0.5 (from Neale Lab) and prioritizing higher `heritability` traits.
    *   **Domain:** Diverse quantitative traits, spanning blood cell traits, anthropometric measures, and biomarkers. Examples include mean corpuscular volume, reticulocyte percentage, eosinophil percentage, lymphocyte count, standing height, heel bone mineral density, glucose, creatinine, and alanine aminotransferase.
*   **ATAC-seq Data:**
    *   **Source:** `ChIP-Atlas (ref 71)`.
    *   **Scale & Characteristics:** All `ATAC-seq files` with >5,000,000 mapped reads and >5,000 identified `peaks`. Overlapping `peaks` were merged, yielding 2,131,526 unique `peaks`. Samples were grouped into 19 `tissue/cell-type categories` (e.g., adipocyte, bone, breast, T cell, erythroid). A `peak` was considered present in a `tissue` if >5% of samples showed it.
    *   **Domain:** `Chromatin accessibility` data for various human tissues and cell types, used to infer `regulatory regions` and their tissue specificity.
*   **Gene Expression Data:**
    *   **Source:** `Human Protein Atlas (ref 73)` (rna_tissue_hpa.tsv.zip, rna_single_cell_type.tsv.zip) and `Gene Expression Omnibus (GEO)` accession GSE106292 (`refs 75,76`) for human bone samples.
    *   **Scale & Characteristics:** Estimates of `gene expression` (transcripts per million, `TPM`) in 17 `tissue/cell types`. Genes with >10 `TPM` were considered expressed.
    *   **Domain:** `Gene expression levels`, used to infer `gene tissue specificity`.
*   **Gene Constraint Estimates ( $s_{\mathrm{het}}$ ):**
    *   **Source:** `Zeng et al. (ref 35)`, downloaded from `Zenodo (ref 70)`.
    *   **Characteristics:** Bayesian estimates of `gene constraint` (`s_het`), reflecting the `strength of purifying selection` against `LoF variants` in a gene.
    *   **Domain:** Measures of evolutionary constraint for human genes.
*   **Expected Number of Segregating LoFs:**
    *   **Source:** Calculated in `gnomAD (v2, ref 79)`, downloaded from `Zenodo (ref 70)`.
    *   **Characteristics:** Represents a proxy for `gene length` ($L$) and `mutation rate` ($\mu$) for `LoF variants`.
    *   **Domain:** `LoF variant` counts and genomic characteristics.

These datasets are effective for validating the method's performance because they provide:
1.  **Large Sample Sizes:** The `UK Biobank` data allows for well-powered `GWAS` and `burden tests`, enabling robust statistical inferences.
2.  **Diverse Phenotypes:** The wide range of quantitative traits allows for generalizable conclusions about gene prioritization across different biological systems.
3.  **Multimodal Genetic Information:** Combining `common variant GWAS` with `rare variant burden tests` provides a comprehensive view of genetic architecture.
4.  **Rich Functional Annotations:** `ATAC-seq` and `gene expression data` allow for empirical testing of `trait specificity` at both the `variant` and `gene levels` in a tissue-specific manner.
5.  **Population Genetics Parameters:** `s_het` and `LoF counts` provide critical inputs for testing the theoretical population genetics models.

## 5.2. Evaluation Metrics
The paper employs a range of statistical and biological metrics to evaluate its hypotheses and findings.

*   **P-value:**
    *   **Conceptual Definition:** The `P-value` is a statistical measure used in `hypothesis testing` to quantify the evidence against a null hypothesis. It represents the probability of observing test results at least as extreme as the results actually observed, assuming that the `null hypothesis` is true. A small `P-value` suggests that the observed data are inconsistent with the `null hypothesis`, providing evidence for the alternative hypothesis.
    *   **Mathematical Formula:** While there isn't a single universal formula for the `P-value` as it depends on the specific statistical test and its underlying distribution, for `GWAS` and `burden tests` which typically produce `z-scores` or `chi-squared statistics`, the `P-value` is derived from the tail probability of these distributions. For a two-sided test using a `z-score` ( $Z$ ):
         $P = 2 \times \mathrm{Pr}(|Z| \geq |z_{\mathrm{obs}}|)$ 
        Where:
        *    $P$  is the `P-value`.
        *    $Z$  is a random variable following the standard `normal distribution`.
        *    $z_{\mathrm{obs}}$  is the observed `z-score` from the association test.
    *   **Symbol Explanation:**
        *    $\mathrm{Pr}(|Z| \geq |z_{\mathrm{obs}}|)$ : The probability of observing a `z-score` absolute value greater than or equal to the observed absolute `z-score` under the `null hypothesis`.
*   **z-score squared ( $z^2$ ):**
    *   **Conceptual Definition:** The `z-score` measures how many `standard deviations` an element is from the mean. In `association studies`, the `z-score` for an effect size estimate ( $\hat{\gamma}$ ) is often  $\hat{\gamma} / \mathrm{SE}(\hat{\gamma})$ . The `squared z-score` ($z^2$) is a common measure of the `strength of association` and is approximately `chi-squared distributed` with 1 `degree of freedom` under the null hypothesis. It is directly related to statistical power.
    *   **Mathematical Formula:**
         $z^2 = \left( \frac{\hat{\gamma}}{\mathsf{SE}(\hat{\gamma})} \right)^2$ 
        Where:
        *    $z^2$  is the `squared z-score` (strength of association).
        *    $\hat{\gamma}$  is the `estimated effect size` (e.g., `LoF burden effect size`, `variant effect size`).
        *    $\mathsf{SE}(\hat{\gamma})$  is the `standard error` of the `estimated effect size`.
*   **Spearman's Rank Correlation Coefficient ( $\rho$ ):**
    *   **Conceptual Definition:** A `non-parametric measure` of the strength and direction of association between two ranked variables. It assesses how well the relationship between two variables can be described using a monotonic function. It is particularly useful for comparing rankings, as done for `GWAS` and `burden test P-values`.
    *   **Mathematical Formula:** (exact when there are no tied ranks)
         $\rho = 1 - \frac{6 \sum d_i^2}{n(n^2 - 1)}$ 
        Where:
        *    $\rho$  is `Spearman's rank correlation coefficient`.
        *    $d_i = \mathrm{rank}(X_i) - \mathrm{rank}(Y_i)$  is the difference between the ranks of the  $i$ -th observations for two variables  $X$  and  $Y$ .
        *    $n$  is the number of observations.
    *   **Symbol Explanation:**
        *    $\mathrm{rank}(X_i)$ : The rank of the  $i$ -th value of variable  $X$ .
        *    $\mathrm{rank}(Y_i)$ : The rank of the  $i$ -th value of variable  $Y$ .
*   **Pearson's Correlation Coefficient ( $r$ ):**
    *   **Conceptual Definition:** A measure of the `linear correlation` between two sets of data. It is the ratio between the `covariance` of the two variables and the product of their `standard deviations`. It indicates the strength and direction of a linear relationship.
    *   **Mathematical Formula:**
         $r = \frac{n \sum (X_i Y_i) - \sum X_i \sum Y_i}{\sqrt{\left[n \sum X_i^2 - (\sum X_i)^2\right] \left[n \sum Y_i^2 - (\sum Y_i)^2\right]}}$ 
        Where:
        *    $r$  is `Pearson's correlation coefficient`.
        *    $X_i, Y_i$  are the individual data points for variables  $X$  and  $Y$ .
        *    $n$  is the number of observations.
    *   **Symbol Explanation:**
        *    $\sum X_i Y_i$ : Sum of the products of each pair of values.
        *    $\sum X_i$ : Sum of all  $X$  values.
        *    $\sum Y_i$ : Sum of all  $Y$  values.
        *    $\sum X_i^2$ : Sum of the squared  $X$  values.
        *    $\sum Y_i^2$ : Sum of the squared  $Y$  values.
*   **Heritability Enrichment ( $\tau/h^2$  from S-LDSC):**
    *   **Conceptual Definition:** In `S-LDSC`, `heritability enrichment` for an `annotation` (e.g., `tissue-specific ATAC peaks`, `coding variants`) quantifies how much more `heritability` is explained by `SNPs` within that `annotation` than would be expected from the proportion of `SNPs` it contains. The  $\tau$  parameter in `S-LDSC` represents the change in `heritability` per `SNP` associated with toggling an `annotation` from 0 to 1. When normalized by `total heritability` ($h^2$),  $\tau/h^2$  can be interpreted as the increase in the *proportion* of `heritability` explained by a single `variant` when it falls within that `annotation`, conditional on other `annotations`. It is a key metric for understanding the functional architecture of `heritability`.
    *   **Mathematical Formula:** `S-LDSC` models the per-`SNP` heritability of `SNP`  $j$  as  $\mathrm{Var}(\beta_j) = \sum_k \tau_k a_k(j)$ , where  $a_k(j)$  is the value of `annotation`  $k$  at `SNP`  $j$ , and estimates each  $\tau_k$  from the regression  $E[\chi_j^2] = N \sum_k \tau_k \ell(j,k) + Na + 1$  of `chi-squared statistics` on annotation-stratified `LD scores`  $\ell(j,k)$ . The standard enrichment of a binary `annotation`  $k$  is the proportion of `heritability` it explains divided by the proportion of `SNPs` it contains:
         $\text{Enrichment}_k = \frac{h_k^2 / h^2}{M_k / M}$ 
        Where  $h_k^2$  is the `heritability` attributed to `SNPs` in `annotation`  $k$ ,  $M_k$  is the number of `SNPs` in `annotation`  $k$ , and  $M$  is the total number of `SNPs`. The paper specifically reports  $\tau/h^2$ , the `S-LDSC` coefficient normalized by `total heritability`, rather than this ratio.
    *   **Symbol Explanation:**
        *    $\tau$ : The `S-LDSC` parameter representing the per-`SNP` contribution to `heritability` from a given `annotation`.
        *    $h^2$ : The `total SNP heritability` of the trait.
        *    $\tau/h^2$ : The change in the proportion of `heritability` explained by a single `variant` due to being in the `annotation`.
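The first four metrics above can be computed directly from their formulas. The sketch below is a plain illustration with hypothetical helper names and toy inputs; it assumes no tied ranks for Spearman's coefficient and uses the complementary error function for the normal tail probability.

```python
import math
import numpy as np

def z_squared(effect, se):
    """Squared z-score of an effect estimate."""
    return (effect / se) ** 2

def two_sided_p(z):
    """Two-sided P-value for a standard normal z-score."""
    return math.erfc(abs(z) / math.sqrt(2))

def pearson_r(x, y):
    """Pearson correlation using the computational formula given above."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    n = x.size
    num = n * np.sum(x * y) - x.sum() * y.sum()
    den = math.sqrt((n * np.sum(x ** 2) - x.sum() ** 2) * (n * np.sum(y ** 2) - y.sum() ** 2))
    return num / den

def spearman_rho(x, y):
    """Spearman correlation from rank differences (assumes no ties)."""
    rank_x = np.argsort(np.argsort(x))
    rank_y = np.argsort(np.argsort(y))
    d = rank_x - rank_y
    n = len(x)
    return 1 - 6 * np.sum(d ** 2) / (n * (n ** 2 - 1))

x = [1, 2, 3, 4, 5]
y = [2, 1, 4, 3, 5]
print(z_squared(0.2, 0.1), two_sided_p(1.96), pearson_r(x, y), spearman_rho(x, y))
```

In real analyses with tied values, a rank-averaging implementation of Spearman's coefficient would be needed instead of the shortcut formula.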

## 5.3. Baselines
The paper primarily compares the performance and prioritization mechanisms of `GWAS` and `LoF burden tests` against **each other** and against **theoretical predictions from population genetics models**, rather than against a specific set of alternative `gene prioritization models`.
*   **Standard GWAS:** The `P-value` ranking from conventional `GWAS` is treated as one of the primary methods under investigation.
*   **Standard LoF Burden Tests:** Similarly, the `P-value` ranking from conventional `LoF burden tests` is the other primary method being analyzed.
*   **Theoretical Predictions:** The paper's own `population genetics models` serve as a theoretical baseline against which the empirical observations from `GWAS` and `burden tests` are compared (e.g., predictions about how `z-scores` should relate to `trait importance` and `specificity`, and the `flattening effect`).
*   **Alternative Prioritization Approaches:** When the paper discusses `estimating trait importance`, methods like `AMM (Abstract Mediation Model)` are introduced as an alternative/improved approach compared to `P-value` based rankings, effectively serving as a benchmark for how `trait importance` *could* be estimated.

The focus is less on outperforming existing `gene prioritization algorithms` and more on understanding the fundamental properties and biases of the two most common `genetic association study` designs.

# 6. Results & Analysis

## 6.1. Core Results Analysis

### 6.1.1. Burden Test and GWAS Gene Ranks Differ
The study begins by systematically quantifying the discrepancy in gene prioritization between `GWAS` and `LoF burden tests`.

The following figure (Figure 1 from the original paper) illustrates that `GWAS` and `LoF burden tests` prioritize different `loci`:

![Fig. 1| GWAS and LoF burden tests prioritize different loci. a,b, Schematics of GWAS (a) and LoF burden tests (b). c, Each cell is a genome-wide significant gene according to LoF burden tests, ordere…](/files/papers/6919ac5d110b75dcc59ae258/images/1.jpg)
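*The image is a chart showing how GWAS and LoF burden tests prioritize differently. Panels a and b are schematics of how genetic variants affect the phenotype; in panel c, each cell is a gene ranked by LoF burden test significance. Panel d compares GWAS and LoF burden test P-values, and panels e and f show GWAS results for different genomic regions.*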

*   **Overlap, but Discordant Ranking:** Across 151 traits with at least one `burden hit` and one `GWAS hit`, 74.6% (1,382 out of 1,852) of `genome-wide significant burden test hits` fall within a `GWAS locus` (Fig. 1c). This indicates a substantial overlap in terms of physical location. However, the *ranking* of these genes/loci is very different. Only 26% (480 out of 1,852) of genes with `burden support` fall in the `top GWAS loci` (Supplementary Fig. 1), where `top GWAS loci` are defined as a number of `GWAS loci` matching the number of significant `burden hits`.
*   **Example: Height Trait (Fig. 1d):** For height, with 382 `genome-wide significant GWAS loci`, the rankings show some concordance (`Spearman's` $\rho = 0.46$), but there is little overlap in the top hits. Many significant `GWAS loci` do not contain a single significant burden gene.
*   **Illustrative Loci (Fig. 1e,f):**
    *   **NPR2 locus (Fig. 1e):** NPR2 is the second most significant gene in `LoF burden tests` for height but is in the 243rd most significant `GWAS locus`. Mutations in NPR2 are known to cause short stature, making it a biologically plausible hit for both.
    *   **HHIP locus (Fig. 1f):** HHIP is in the third most significant `GWAS locus` for height, with `P-values` as small as $10^{-185}$. HHIP is biologically relevant to height through its role in osteogenesis and interaction with Hedgehog proteins. However, there is essentially no burden signal for HHIP or other genes in this locus.
*   **Interpretation:** These examples demonstrate that while both methods identify biologically relevant genes, their prioritization criteria lead to fundamentally different top-ranked discoveries. The results are robust to various analytical choices (Supplementary Appendix A, Figs. 4-31).
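The two overlap statistics reported at the start of this subsection (the share of burden hits inside any GWAS locus, and the share inside the top GWAS loci, where the number of top loci equals the number of burden hits) can be computed as in the sketch below. The function name, the data layout, and the gene labels are hypothetical placeholders, not the paper's data.

```python
def fraction_of_burden_hits_in_loci(burden_hits, gwas_loci_ranked, top_only=False):
    """Fraction of burden-test hit genes that fall inside GWAS loci.

    burden_hits: list of gene names that are significant in the burden test.
    gwas_loci_ranked: list of sets of gene names, one set per locus, most significant first.
    top_only: if True, keep only as many top loci as there are burden hits.
    """
    if not burden_hits:
        raise ValueError("need at least one burden hit")
    loci = gwas_loci_ranked[:len(burden_hits)] if top_only else gwas_loci_ranked
    genes_in_loci = set().union(*loci)
    return sum(gene in genes_in_loci for gene in burden_hits) / len(burden_hits)

burden_hits = ["GENE_A", "GENE_B", "GENE_C", "GENE_D"]
gwas_loci_ranked = [
    {"GENE_X"}, {"GENE_A", "GENE_Y"}, {"GENE_Z"},
    {"GENE_W"}, {"GENE_B"}, {"GENE_C"},
]
any_locus = fraction_of_burden_hits_in_loci(burden_hits, gwas_loci_ranked)
top_loci = fraction_of_burden_hits_in_loci(burden_hits, gwas_loci_ranked, top_only=True)
print(any_locus, top_loci)
```

The toy data mirror the qualitative pattern in the text: most burden hits sit in some GWAS locus, but far fewer sit in the highest-ranked loci.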

### 6.1.2. Burden Tests Favour Trait-Specific Genes

The theoretical model predicts that `LoF burden tests` prioritize genes by their `trait specificity` ($\Psi_G$) and `gene length` ($\mu L$), not primarily by `trait importance`.

The following figure (Figure 3 from the original paper) illustrates that burden tests prioritize trait-specific genes, not large-effect genes:

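Fig. 4 | GWAS prioritize trait-specific variants. a, Schematic of what determines trait specificity for variants, $\Psi_V$. $\Psi_V$ is determined by two components: the trait specificity o… The image is Figure 4, showing how GWAS prioritize specific variants. It includes a schematic illustrating that variant specificity is determined by two components, the specificity of the gene and the relative specificity of the variant's effect on the gene, and it shows how coding and non-coding variants act in different cellular contexts. Panels b and c show heritability enrichment results in specific tissues for coding and non-coding variants, respectively.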

*   **Inverse Relationship between  $s_{\mathrm{het}}$  and LoF Frequency (Fig. 3b):** Genes with higher estimated  $s_{\mathrm{het}}$  (stronger purifying selection, implying larger overall fitness effects) have lower `aggregate LoF frequencies`. This negative relationship is strong and significant (`Spearman's` $\rho = -0.547$, $P < 10^{-15}$). This confirms that highly constrained genes have fewer `LoF variants` in the population.
*   **$s_{\mathrm{het}}$  Proportional to Total Trait Importance (Fig. 3c):** The average `trait importance` across traits ( $\sum_t \gamma_t^2$ ) shows a significant positive relationship with `s_het` (`Pearson's` $r = 0.078$, $P < 10^{-15}$). This supports the model's assumption that `s_het` captures the total effect of a gene across all fitness-relevant traits.
*   **Decoupling of  $z^2$  from Trait Importance (Fig. 3d):** For genes with sufficiently large effects (high `trait importance`), the `strength of association` ($z^2$) in `LoF burden tests` is largely decoupled from their `trait importance`. The `Pearson's r` between mean importance and mean  $z^2$  for the 25 highest `s_het` bins is low and not significant ($r = 0.188$, $P = 0.368$). This is due to the `flattening effect`: highly constrained genes have very `rare LoFs`, leading to larger standard errors and thus weaker statistical signals despite their true importance.
*   **Prioritization by Expression Specificity (Fig. 3e):** Using `gene expression specificity` as a proxy for `trait specificity`, `LoF burden tests` show significantly stronger signals (lower `P-values`) in genes with higher expression specificity to the `trait-relevant tissue`. This holds true regardless of `s_het` (Supplementary Fig. 34) and using different `burden masks` (Supplementary Figs. 33, 35).
*   **Interpretation:** These results support the prediction that `LoF burden tests` prioritize genes based on their `trait specificity` ($\Psi_G$) and `gene length`, effectively selecting genes whose `LoFs` have relatively specific effects on the studied trait, rather than genes with the largest overall impact (`trait importance`).
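The expression specificity proxy used in these analyses (expression in the trait-relevant tissue divided by summed expression across tissues, with genes above 10 TPM counted as expressed and then binned into quintiles) can be sketched as follows. The function names and the toy expression matrix are hypothetical; the sketch also assumes every gene has non-zero total expression.

```python
import numpy as np

def expression_specificity(tpm, focal_tissue, min_tpm=10.0):
    """Return the specificity score and an 'expressed in focal tissue' mask per gene.

    tpm: array of shape (n_genes, n_tissues) of average expression in TPM.
    focal_tissue: column index of the trait-relevant tissue.
    """
    tpm = np.asarray(tpm, dtype=float)
    expressed = tpm[:, focal_tissue] > min_tpm
    score = tpm[:, focal_tissue] / tpm.sum(axis=1)
    return score, expressed

def quintile_bins(scores):
    """Assign each score to a quintile, labelled 0 (least specific) to 4 (most specific)."""
    edges = np.quantile(scores, [0.2, 0.4, 0.6, 0.8])
    return np.searchsorted(edges, scores, side="right")

tpm = np.array([[50.0, 25.0, 25.0], [20.0, 90.0, 90.0], [5.0, 0.0, 0.0]])
score, expressed = expression_specificity(tpm, focal_tissue=0)
bins = quintile_bins(np.arange(10) / 10)
print(score, expressed, bins)
```

The third toy gene shows why the expression filter matters: it has a perfect specificity score but is below the 10 TPM cutoff, so it would be excluded before binning.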

### 6.1.3. GWAS Prioritize Trait-Specific Variants
The theoretical model predicts that `GWAS` prioritize `trait-specific variants` (high $\Psi_V$). This specificity can arise from `variants` affecting `trait-specific genes` or having `context-specific effects` on `pleiotropic genes`.

*   **GWAS Prioritization of Coding Variants by Gene Specificity (Fig. 4b):** Analyzing `coding variants` (which act through specific genes), `heritability enrichment` ($\tau/h^2$) increases significantly in genes with higher `expression specificity` to the `trait-relevant tissue`. This indicates that `GWAS` prioritize `variants` acting on `trait-specific genes`.
*   **GWAS Prioritization of Non-coding Variants by Context Specificity (Fig. 4c):** For `non-coding variants` within `ATAC peaks`, `heritability enrichment` shows a significant trend of increasing contribution in more `tissue-specific ATAC peaks`. This holds even when conditioning on `s_het` (Supplementary Figs. 41, 42).
*   **Interpretation:** `GWAS` can prioritize `variants` that are `trait-specific` through two mechanisms: either they are in `trait-specific genes` (captured by `coding variants` and `gene expression specificity`) or they have `context-specific regulatory effects` in specific tissues (captured by `non-coding variants` in `tissue-specific ATAC peaks`). This means `GWAS` can highlight `pleiotropic genes` if their `non-coding regulatory variants` exhibit `context-specific effects`. This contrasts with `LoF burden tests` which primarily prioritize `trait-specific genes`.

### 6.1.4. LoF Burden Tests Prioritize Long Genes
The theoretical model indicated that `LoF burden tests` prioritize genes with more potential `LoF` positions (`gene length`, represented by  $\mu L$ ), as this increases the `aggregate frequency of LoFs` and thus power.

The following figure (Extended Data Fig. 1 from the original paper) shows how `coding sequence length` drives prioritization in `LoF burden tests`:

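![The image is a plot of frequency trajectories for identical mutations, with generations since the mutation arose on the horizontal axis and frequency on the vertical axis.](/files/papers/6919ac5d110b75dcc59ae258/images/7.jpg)
*The image is a plot of frequency trajectories for identical mutations. The horizontal axis shows the number of generations since the mutation arose and the vertical axis shows allele frequency. The trajectories diverge as generations pass, highlighting how differently the same mutation can be distributed across generations.*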

*   **No Correlation with Trait Importance (Extended Data Fig. 1a):** There is no substantial positive correlation between `gene length` (proxied by `expected number of unique LoFs`) and `unbiased estimates of squared trait importance` ($\gamma^2$) (`Pearson's` $r = 0.017$, $P = 0.023$). This means longer genes do not inherently have larger trait effects.
*   **Smaller Standard Errors for Longer Genes (Extended Data Fig. 1b):** Longer genes have considerably smaller `LoF burden test standard errors` (`Spearman's` $\rho = -0.255$, $P < 10^{-15}$). This is expected because more LoF sites lead to more observed `LoF variants`, reducing the statistical uncertainty.
*   **Significant Effect on Burden Signal (Extended Data Fig. 1c):** Consequently, `gene length` has a significant positive effect on the `LoF burden test` $z^2$ (`Pearson's` $r = 0.112$, $P < 10^{-16}$).
*   **Interpretation:** `LoF burden tests` systematically favor longer genes, even if those genes are not necessarily more `trait important` or `trait specific` in a biological sense. This is a technical artifact of the aggregation strategy, meaning `gene length` is a trait-irrelevant factor driving rankings. This also makes longer genes appear more pleiotropic in `burden tests` simply because they are more often detected across traits due to higher power.

### 6.1.5. Random Genetic Drift Affects GWAS

The paper demonstrates that random genetic drift significantly influences GWAS rankings, introducing a layer of "luck" beyond biological trait importance or specificity.

The following figure (Extended Data Fig. 2 from the original paper) shows how GWAS variant rankings are driven largely by genetic drift:

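*The image is a scatter plot showing the relationship between the standardized squared effects of simulated SNPs and their relative realized heritability. The color bar indicates minor allele frequency (MAF), ranging from 0.1 to 0.4, reflecting how the different SNPs are distributed.*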

*   **MAF Stochasticity (Extended Data Fig. 2a):** Genetic drift causes variant frequencies (even for identical mutations under the same selection pressure) to spread widely around their expected values over time.
*   **GWAS Ranking by Frequency, not Importance (Extended Data Fig. 2b):** In simulated GWAS, for sufficiently trait-important variants, the ranking by realized heritability ($2 \alpha_1^2 p(1-p)$) is largely random with respect to their true trait importance. This randomness is driven by differences in minor allele frequency (MAF) due to genetic drift. LoF burden tests largely ameliorate this by aggregating variants, which averages out MAF stochasticity.
*   **Apparent Pleiotropy (Extended Data Fig. 3):**
    *   **Real Data (Extended Data Fig. 3b):** Stronger GWAS hits (lower P-value rank) tend to have higher mean MAF.
    *   **Simulations (Extended Data Fig. 3d):** Variants that are stronger GWAS hits (lower P-value rank) also tend to be hits for a greater number of traits. This is because a variant that, by chance, drifts to a higher MAF will have increased power to be detected across all traits it affects.
*   **Interpretation:** Genetic drift introduces substantial noise into GWAS rankings. Strong GWAS hits are not necessarily the most trait important but are often variants that have, by chance, drifted to a higher MAF. This statistical artifact also explains why GWAS hits often appear surprisingly pleiotropic: high-frequency variants are more easily detected for any trait they influence, making them seem to affect more traits than their underlying biology might suggest.
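The role of drift described above can be sketched with a toy simulation. This is an illustration under stated assumptions, not the paper's simulation setup: it uses a neutral Wright-Fisher model (the paper's simulations include selection), a deliberately small population, unit trait variance, and an approximate genome-wide significance threshold of $z^2 \approx 29.7$ (roughly $P = 5 \times 10^{-8}$). All function names and numeric values are hypothetical.

```python
import random

def wright_fisher_final_freq(p0, n_copies, generations, rng):
    """Neutral Wright-Fisher drift: each generation resamples n_copies alleles."""
    p = p0
    for _ in range(generations):
        carriers = sum(1 for _ in range(n_copies) if rng.random() < p)
        p = carriers / n_copies
    return p

def realized_heritability(alpha, p):
    """Variance explained by a variant with effect alpha at frequency p."""
    return 2.0 * alpha ** 2 * p * (1.0 - p)

def traits_detected(alphas, p, n_samples, z2_threshold=29.7):
    """Count traits whose expected z^2 (1 + N * realized h2) passes the threshold."""
    return sum(
        1 for a in alphas
        if 1.0 + n_samples * realized_heritability(a, p) > z2_threshold
    )

# Identical mutations, identical starting frequency: drift alone spreads them out.
rng = random.Random(0)  # fixed seed so the sketch is reproducible
final_freqs = [wright_fisher_final_freq(0.2, 200, 50, rng) for _ in range(20)]
print("final frequencies:", sorted(final_freqs))

# The same variant effects on three hypothetical traits, two possible drift outcomes.
alphas = [0.05, 0.03, 0.02]
hits_low_maf = traits_detected(alphas, p=0.05, n_samples=300_000)
hits_high_maf = traits_detected(alphas, p=0.40, n_samples=300_000)
print("traits detected at MAF 0.05:", hits_low_maf)
print("traits detected at MAF 0.40:", hits_high_maf)

# A smaller-effect variant at high MAF can outrank a larger-effect one at low MAF.
small_effect_common = realized_heritability(0.10, 0.40)
large_effect_rare = realized_heritability(0.15, 0.05)
print("small effect, common:", small_effect_common)
print("large effect, rare:  ", large_effect_rare)
```

In this sketch the per-trait effects are held fixed and only the frequency changes, so any difference in the number of traits detected is due to power alone, which mirrors the apparent-pleiotropy argument.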

### 6.1.6. Estimating Trait Importance

Given that neither GWAS nor LoF burden tests directly rank genes by trait importance based on P-values, the paper investigated if aggregating signals could provide better estimates.

The following figure (Figure 5 from the original paper) illustrates how trait importance is estimated by combining different variant types:

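*The image is a chart showing the relationship between gene length and trait effects. Panel A shows that long genes do not have larger effects on traits, panel B shows that long genes have smaller standard errors, and panel C shows that LoF burden tests prioritize long genes; each panel is plotted against the average expected number of LoFs.*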

*   **Flattening in LoF Burden Test Heritability (Fig. 5b):** LoF burden test heritability enrichment (based on $z^2 - 1$) does not correlate well with $s_{\mathrm{het}}$ (a proxy for trait importance), especially for highly constrained genes (Pearson's $r = -0.337$, $P = 0.099$ across the 25 highest $s_{\mathrm{het}}$ bins). This again shows the flattening effect, where highly important genes are hard to detect by burden tests.
*   **AMM Better Tracks Trait Importance (Fig. 5d):** AMM (Abstract Mediation Model) heritability enrichment, which aggregates GWAS signals across variants for a gene, shows a strong positive correlation with $s_{\mathrm{het}}$ (Pearson's $r = 0.832$, $P = 2.57 \times 10^{-7}$ across the 25 highest $s_{\mathrm{het}}$ bins).
*   **Interpretation:** While individual variants and LoF aggregates experience flattening for highly important genes, aggregating GWAS signals across multiple variants for a given gene (some with small individual effects that contribute collectively) can overcome this. This implies that methods like AMM are more effective at prioritizing trait-important genes by leveraging the collective signal, even if individual variants are subject to flattening (Fig. 5c). This approach is robust across different aggregation methods (Supplementary Figs. 42, 49, 50).
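The flattening effect mentioned above can be made concrete with a heuristic derivation. This is a sketch under simplifying assumptions rather than the paper's full model: it assumes the expected burden signal scales with the squared effect times the aggregate LoF frequency $q$, that $q \approx \mu L / s_{\mathrm{het}}$ under mutation-selection balance, and that selection against LoFs grows with the squared effects summed over all traits, $s_{\mathrm{het}} \approx c \sum_t \gamma_t^2$. The constant $c$ and the trait index $t$ are notation introduced here for illustration.

$$
\mathbb{E}[z^2] - 1 \;\propto\; N \gamma_1^2 \, q \;\approx\; N \gamma_1^2 \, \frac{\mu L}{s_{\mathrm{het}}}
\quad\text{and}\quad
s_{\mathrm{het}} \approx c \sum_t \gamma_t^2
\;\;\Longrightarrow\;\;
\mathbb{E}[z^2] - 1 \;\propto\; \frac{N \mu L}{c} \cdot \frac{\gamma_1^2}{\sum_t \gamma_t^2}
$$

Under these assumptions the expected signal depends on $\gamma_1^2$ only through its share of the total $\sum_t \gamma_t^2$ (trait specificity) and on $\mu L$ (gene length), not on the absolute size of $\gamma_1^2$. Doubling every $\gamma_t$ doubles the effect on the studied trait but leaves the expected signal unchanged, because the LoFs become proportionally rarer.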

## 6.2. Data Presentation (Tables)

The paper does not contain any tables within its main article body that are presented in a format (like markdown or HTML tables) suitable for direct transcription. All quantitative results are presented within the main text or integrated into figures. For instance, statistical values like P-values, Pearson's $r$, and Spearman's $\rho$ are reported directly in the text or figure captions alongside their corresponding figures (e.g., Fig. 1c states "74.6% (1,382 out of 1,852) of genome-wide significant burden test hits fall within a GWAS locus").

## 6.3. Ablation Studies / Parameter Analysis

The paper primarily validates its theoretical models and empirical findings through robust sensitivity analyses and simulations rather than traditional ablation studies on a single proposed model.

*   **Robustness of GWAS and Burden Test Comparisons:**
    *   **Definition of GWAS Loci:** The study tests different approaches for defining GWAS loci (e.g., LD-clumping vs. COJO (ref 61) for conditionally independent SNPs). The main results regarding GWAS and burden test discrepancies remain robust (Supplementary Figs. 8-10).
    *   **MAF Thresholds for GWAS:** Comparisons are made by restricting GWAS to SNPs below various MAF thresholds (0.01, 0.1, 0.5). The discrepancy persists even when GWAS is restricted to lower-frequency variants (Supplementary Figs. 26-28), suggesting the difference is not merely due to the two methods sampling different parts of the allele frequency spectrum.
    *   **Ranking by Effect Size vs. P-value:** The analysis confirms that ranking loci by largest effect size instead of P-value does not change the qualitative differences in prioritization (Supplementary Figs. 29-31).
    *   **Burden Test Masks:** The findings for burden tests (e.g., the relationship between specificity and power) are consistent when likely damaging missense variants (mask M3) are included in addition to LoF variants (mask M1) (Supplementary Figs. 33, 35).
*   **Simulations of Genetic Drift Effects:**
    *   The paper conducts extensive simulations to illustrate how genetic drift affects GWAS (Extended Data Figs. 2, 3).
    *   **Parameter Sensitivity:** The sensitivity of these simulated pleiotropy results to the simulation parameters ($N_{\mathrm{eff}}$, effective population size; $t$, P-value threshold; $p$, trait specificity distribution; and $f$, overall effect magnitude) is explored in Supplementary Figs. 45-48. These analyses demonstrate that the qualitative conclusion, that genetic drift makes GWAS hits appear more pleiotropic, is not sensitive to the specific choice of these parameters.
*   **Controlling for Covariates:**
    *   When regressing burden $z^2$ on expression specificity, unbiased estimates of trait importance were included as a covariate to ensure that the observed effect of specificity was not driven by inherent differences in gene importance across specificity bins (Supplementary Fig. 36).
    *   In S-LDSC analyses for ATAC peaks, the strength of ATAC peaks was controlled for to isolate the effect of tissue specificity. Similarly, for coding variants, gene expression level bins were included as covariates.
    *   S-LDSC analyses for ATAC peaks were also conditioned on $s_{\mathrm{het}}$ to show that the effect of tissue specificity is independent of overall gene constraint (Supplementary Figs. 41, 42).
*   **Comparison of Trait Importance Estimation Methods:**
    *   The paper compares LoF burden heritability enrichment (Fig. 5b) with AMM-estimated heritability enrichment (Fig. 5d) against $s_{\mathrm{het}}$ (as a proxy for trait importance). This comparison serves as a form of ablation, showing that AMM's aggregation strategy tracks trait importance more effectively than standard P-value-based burden test signals. Additional analyses using GWAS hit probability and the correlation of $\hat{\gamma}^2$ with the number of GWAS hits further support these findings (Supplementary Figs. 49, 50).

These extensive analyses demonstrate the robustness of the paper's core findings and provide strong evidence for the proposed mechanisms driving gene prioritization in GWAS and LoF burden tests.
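The covariate adjustment described under "Controlling for Covariates" can be illustrated with a toy regression. This is a minimal sketch on synthetic, noise-free data, not the paper's analysis: the helper `ols`, the gene-level values, and the true coefficients (intercept 1, specificity slope 2, importance slope 3) are all hypothetical and chosen only so that the result is easy to check.

```python
def ols(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y,
    solved with Gauss-Jordan elimination and partial pivoting."""
    k = len(X[0])
    # Augmented matrix [X'X | X'y].
    A = [
        [sum(row[i] * row[j] for row in X) for j in range(k)]
        + [sum(row[i] * yi for row, yi in zip(X, y))]
        for i in range(k)
    ]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(A[r][col]))
        A[col], A[pivot] = A[pivot], A[col]
        pivot_value = A[col][col]
        A[col] = [v / pivot_value for v in A[col]]
        for r in range(k):
            if r != col:
                factor = A[r][col]
                A[r] = [vr - factor * vc for vr, vc in zip(A[r], A[col])]
    return [A[i][k] for i in range(k)]

# Synthetic genes: specificity and importance are mildly correlated.
specificity = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8]
importance = [0.5, 0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4]
burden_z2 = [1.0 + 2.0 * s + 3.0 * g for s, g in zip(specificity, importance)]

# Without the covariate, the specificity slope absorbs part of the importance effect.
naive_slope = ols([[1.0, s] for s in specificity], burden_z2)[1]

# With importance as a covariate, the specificity slope is recovered.
intercept, adjusted_slope, importance_slope = ols(
    [[1.0, s, g] for s, g in zip(specificity, importance)], burden_z2
)

print(f"naive specificity slope:    {naive_slope:.3f}")
print(f"adjusted specificity slope: {adjusted_slope:.3f}")
```

The point of the design is only that a slope estimated without the covariate can differ from the adjusted one whenever the two predictors are correlated, which is why the paper includes trait importance when testing the effect of specificity.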

# 7. Conclusion & Reflections

## 7.1. Conclusion Summary

This paper rigorously demonstrates that standard Genome-Wide Association Studies (GWAS) and rare variant Loss-of-Function (LoF) burden tests, while both crucial for identifying trait-relevant genes, systematically prioritize different sets of genes due to fundamental differences in their underlying mechanisms and what aspects of trait biology they are sensitive to.

The core findings are:

*   **Divergent Prioritization:** LoF burden tests primarily prioritize long, trait-specific genes (genes whose LoFs have relatively specific effects on the studied trait), while GWAS prioritize genes near trait-specific variants.
*   **Role of Non-coding Variants:** A key distinction is that GWAS can capture trait-relevant, pleiotropic genes if non-coding variants acting on these genes have context-specific effects (e.g., tissue-specific regulation). LoF burden tests, which focus on coding variants, generally cannot achieve this.
*   **Impact of Trait-Irrelevant Factors:** Both methods are influenced by factors unrelated to a gene's true trait importance or specificity:
    *   LoF burden tests are biased towards longer genes due to increased statistical power from more potential LoF sites.
    *   GWAS rankings are significantly affected by random genetic drift, which causes some variants to reach unexpectedly high minor allele frequencies (MAFs). These high-MAF variants then appear as strong GWAS hits and seem more pleiotropic than they truly are.
*   **Estimating Trait Importance:** Standard P-value rankings from neither method effectively capture trait importance, because of the flattening effect (strong purifying selection on highly important genes makes their variants rare and hard to detect). However, non-standard GWAS approaches that aggregate signals across multiple variants (like AMM) can estimate trait importance more accurately.

In essence, LoF burden tests and GWAS are complementary tools, each revealing distinct facets of trait biology. Understanding their specific biases and strengths is critical for accurate interpretation and application in human genetics.

## 7.2. Limitations & Future Work

The authors themselves highlight several limitations and suggest future research directions:

*   **Improving Burden Tests:** While larger sample sizes will reduce noise, the authors anticipate that Bayesian frameworks incorporating priors based on gene features (e.g., refs 3, 5) could be particularly effective at improving the accuracy and interpretation of burden tests, potentially mitigating the gene length bias.
*   **Enhancing GWAS for Trait Importance:** The paper suggests that non-standard GWAS approaches that aggregate signals across variants (refs 47, 48, 56, 57) are promising for prioritizing genes by trait importance, and that such methods need further development and refinement.
*   **Context-Specific Targeting of Pleiotropic Genes:** The paper notes that while trait-specific genes might be ideal drug targets due to reduced side effects, highly pleiotropic genes could still be impactful if they can be targeted in a context-specific way. This points to the ongoing challenge of understanding context-specific gene function and druggability.
*   **Differences in Experimental Systems:** The paper acknowledges that the effects of pleiotropic genes observed in knockout experimental systems might differ fundamentally from the phenotypic consequences of regulatory variants identified in GWAS, suggesting a need to integrate insights across different study designs.

## 7.3. Personal Insights & Critique

This paper offers profoundly valuable insights for anyone interpreting genetic association studies.

*   **Complementary Tools, Not Competing:** The most striking takeaway is the clear articulation that GWAS and burden tests are not competing but rather complementary. They are designed to detect different biological signals and are affected by distinct non-biological factors. This fundamentally shifts the perspective from asking "which method is better?" to "what specific biological question can each method best answer?".
*   **Beyond P-values:** The rigorous demonstration that P-values are decoupled from true trait importance by selection and genetic drift is a crucial message. It underscores that blindly ranking by P-value can be misleading for identifying truly important genes. This highlights the need for more sophisticated methods (like AMM) that aggregate information and account for evolutionary pressures.
*   **Implications for Drug Discovery:** The distinction between trait importance and trait specificity has direct practical implications for drug target discovery. Trait-specific genes (prioritized by burden tests) might indeed make better drug targets due to fewer off-target effects, aligning with observations that LoF burden evidence is more predictive of drug trial success. However, if context-specific targeting is possible, more trait-important (but pleiotropic) genes (potentially identified by GWAS through context-specific variants) could yield greater clinical impact. This provides a clear framework for evaluating targets.
*   **Reframing Pleiotropy:** The explanation that GWAS hits appear pleiotropic partly as a statistical artifact of genetic drift pushing variants to higher MAFs is a fascinating and important point. This challenges the naive interpretation of GWAS pleiotropy and suggests that some observed shared genetic influences across traits might be due to statistical power rather than deep biological interconnectedness at the variant level.
*   **Value of Non-coding Genome:** The paper reinforces the critical role of the non-coding genome in complex traits. Context-specific non-coding variants allow GWAS to pinpoint trait-specific effects of broadly pleiotropic genes, a capability burden tests lack. This emphasizes the continued need for sophisticated functional genomic annotation to interpret GWAS signals.
*   **Areas for Improvement/Further Research:**
    *   **Unified Framework for Prioritization:** While the paper provides criteria, a practical, unified framework or score that intelligently combines trait importance and trait specificity (and perhaps accounts for gene length and drift biases) across both GWAS and burden tests would be a valuable next step.
    *   **Quantifying "Context Specificity":** The proxies used for context specificity (ATAC-seq peak tissue specificity, gene expression specificity) are good, but a more direct and granular measure of variant-level context specificity (e.g., across cell types, developmental stages, environmental conditions) could refine these analyses further.
    *   **Beyond Quantitative Traits:** The study focuses on quantitative traits. Extending this framework to binary disease traits could reveal additional nuances, especially concerning the effects of selection and drift in case-control studies.
    *   **Dynamic Nature of Selection:** The models assume stabilizing selection. While this is a common assumption, investigating the implications of other forms of selection or time-varying selection on gene prioritization could add complexity and realism.

Overall, this paper is a landmark study that significantly advances our theoretical and practical understanding of how genetic association studies reveal trait biology. It provides a robust, evidence-based roadmap for interpreting current findings and designing future research.