Estimation and mapping of the missing heritability of human phenotypes

Published:11/12/2025
This analysis is AI-generated and may not be fully accurate. Please refer to the original paper.

TL;DR Summary

This study analyzes whole-genome sequencing data from the UK Biobank to quantify the impact of rare non-coding variants on heritability of 34 complex traits. It finds that WGS captures about 88% of narrow-sense heritability and identifies significant loci for lipid traits.

Abstract

Rare coding variants shape inter-individual differences in human phenotypes. However, the contribution of rare non-coding variants to those differences remains poorly characterized. Here we analyse whole-genome sequence (WGS) data from 347,630 individuals with European ancestry in the UK Biobank to quantify the relative contribution of 40 million single-nucleotide and short indel variants to the heritability of 34 complex traits and diseases. On average, we find that WGS captures approximately 88% of the pedigree-based narrow sense heritability, which is derived from 20% rare variants and 68% common variants. We identified 15 traits with no significant difference between WGS-based and pedigree-based heritability estimates. Overall, our study provides high-precision estimates of rare-variant heritability and demonstrates significant mapping of specific loci for lipid traits.

1. Bibliographic Information

1.1. Title

Estimation and mapping of the missing heritability of human phenotypes

1.2. Authors

Pierrick Wainschtein, Yuanxiang Zhang, Jeremy Schwartzentruber, Irfahan Kassam, Julia Sidorenko, Petko P. Fiziev, Huanwei Wang, Jeremy McRae, Richard Border, Noah Zaitlen, Sriram Sankararaman, Michael E. Goddard, Jian Zeng, Peter M. Visscher, Kyle Kai-How Farh & Loic Yengo.

The authors represent a collaborative effort from various institutions, including Illumina Inc., The University of Queensland, and others involved in academic research and biotechnology. Their diverse affiliations suggest expertise spanning genomics, statistical genetics, bioinformatics, and population genetics.

1.3. Journal/Conference

Nature. Nature is one of the world's most prestigious and highly cited multidisciplinary scientific journals, known for publishing significant original research across all fields of science and technology. Publication in Nature indicates that the research is considered to be of exceptional importance and broad interest.

1.4. Publication Year

2025 (Published online: 12 November 2025)

1.5. Abstract

The paper investigates the contribution of rare non-coding variants to inter-individual differences in human phenotypes, a poorly characterized area despite known effects of rare coding variants. Utilizing whole-genome sequence (WGS) data from 347,630 individuals of European ancestry in the UK Biobank, the study quantifies the relative contribution of 40 million single-nucleotide and short indel variants to the heritability of 34 complex traits and diseases. The key findings indicate that WGS captures approximately 88% of the pedigree-based narrow sense heritability, with 20% attributed to rare variants (minor allele frequency, MAF < 1%) and 68% to common variants (MAF ≥ 1%). It further delineates that coding and non-coding variants account for 21% and 79% of the rare-variant WGS-based heritability, respectively. For 15 traits, no significant difference was observed between WGS-based and pedigree-based heritability estimates, suggesting their heritability is largely explained by WGS data. The study also identified 11,243 common-variant associations and 886 rare-variant associations, demonstrating significant mapping of specific loci for lipid traits, where over 25% of rare-variant heritability can be mapped using fewer than 500,000 fully sequenced genomes.

/files/papers/6919acc6110b75dcc59ae266/paper.pdf (Published, DOI: 10.1038/s41586-025-09720-6)

2. Executive Summary

2.1. Background & Motivation

The paper addresses the long-standing missing heritability problem in human genetics. While it's known that human traits are heritable and influenced by many DNA variants, the full extent of this genetic contribution has been difficult to quantify. Previous studies using common genetic variants (like single-nucleotide polymorphisms or SNPs with minor allele frequency (MAF) > 1% or 5%) only explained a fraction of the heritability estimated from family studies (pedigree-based heritability). This discrepancy was termed still-missing heritability.

The core challenge is that current methods for estimating SNP-based heritability ($h^2_{SNP}$) have largely been restricted to common variants due to technological and sample size limitations. This leaves the contribution of rare variants (MAF < 1%) and especially rare non-coding variants (variants outside protein-coding regions) poorly characterized. Understanding the full genetic architecture, including the role of rare variants, is crucial for improving the statistical power of genome-wide association studies (GWAS) and enhancing the prediction accuracy of genetic risk scores for complex traits and diseases.

The paper's innovative entry point is to leverage the unprecedented scale of whole-genome sequencing (WGS) data from the UK Biobank (nearly 350,000 individuals with WGS data) to precisely quantify the contribution of both common and rare variants, including the often-overlooked rare non-coding variants, to trait heritability. This large sample size allows for high-precision estimates, which was a limitation of earlier, smaller WGS studies.

2.2. Main Contributions / Findings

The paper makes several significant contributions to the field of quantitative genetics:

  • High-precision estimates of WGS-based heritability ($h^2_{WGS}$): The study provides accurate estimates of heritability for 34 complex traits and diseases using WGS data, with low standard errors (as low as 0.6% for quantitative traits). This precision was previously unachievable due to smaller WGS sample sizes.

  • Quantification of rare variant contribution: It shows that, on average across traits, rare variants (MAF < 1%) account for approximately 20% of the pedigree-based narrow-sense heritability and common variants (MAF ≥ 1%) for 68%, which together make up the 88% captured by WGS.

  • Resolution of 'still-missing heritability' for many traits: For 15 traits (including 15 quantitative traits with small standard errors), the WGS-based heritability estimates were not significantly different from traditional pedigree-based heritability estimates. This suggests that for these traits, the still-missing heritability is largely accounted for by the variants captured by WGS. On average across all traits, WGS captures 88% of the pedigree-based narrow sense heritability.

  • Partitioning of rare-variant heritability by genomic region: The study demonstrates that coding variants account for 21% and non-coding variants for 79% of the rare-variant WGS-based heritability, emphasizing the substantial role of the non-coding genome even for rare variants.

  • Mapping of specific loci, especially for lipid traits: Through GWAS analyses, the study identified 11,243 common-variant associations (CVAs) and 886 rare-variant associations (RVAs). Notably, for lipid-related traits (e.g., LDL and HDL cholesterol), RVAs collectively explain over one-quarter of their rare-variant heritability, demonstrating that a substantial portion of rare-variant heritability is already mappable.

  • Characterization of RVA genomic distribution and colocalization: RVAs tend to colocalize with CVAs, and RVAs closer to CVAs explain more phenotypic variance. This insight can inform future strategies for GWAS discovery.

    These findings significantly advance our understanding of the genetic architecture of complex traits, reduce uncertainty about the role of non-additive effects, and provide benchmarks for improving polygenic scores and GWAS methodologies.

3. Prerequisite Knowledge & Related Work

3.1. Foundational Concepts

3.1.1. Heritability

Heritability in genetics refers to the proportion of phenotypic variation in a population that is attributable to genetic variation among individuals. It quantifies how much of the differences we see in a trait (like height or disease risk) across people is due to their genes, rather than their environment.

  • Narrow-sense heritability ($h^2$): Specifically refers to the proportion of phenotypic variance explained by additive genetic effects. Additive effects are when the effect of each allele (a specific form of a gene) simply adds up across different genes to influence a trait. This is the most common form of heritability studied because it is directly relevant to how traits respond to selection and to the predictive power of genetic markers.
  • Pedigree-based heritability ($h^2_{PED}$): This is traditionally estimated by observing the resemblance of traits among relatives (e.g., parents and offspring, siblings, cousins) within families or large pedigrees. By comparing the phenotypic similarity of individuals with known degrees of genetic relatedness, statistical models can infer the proportion of variation due to shared genes. These estimates can capture additive, dominance, and epistatic (gene-gene interaction) genetic effects, as well as shared environmental effects, which can inflate estimates of additive genetic variation if not properly accounted for.
  • SNP-based heritability ($h^2_{SNP}$): This is estimated from SNP data (often using genome-wide array data) in a population of ostensibly unrelated individuals. It quantifies the proportion of phenotypic variance explained by the additive effects of the SNPs included in the analysis. This method typically uses statistical approaches like GREML (as implemented in the Genome-wide Complex Trait Analysis software, GCTA) or LD Score Regression. When only common SNPs are used, $h^2_{SNP}$ is usually lower than $h^2_{PED}$, contributing to the missing heritability problem.
  • WGS-based heritability ($h^2_{WGS}$): This is the SNP-based heritability specifically estimated using whole-genome sequencing (WGS) data, which captures a much broader spectrum of genetic variation, including rare variants and short indels, compared to SNP arrays. The goal is to see if including these additional variants can explain more of the pedigree-based heritability.
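
As a compact restatement of the definitions above, narrow-sense heritability and the average share of pedigree-based heritability recovered by WGS (figures from the abstract) can be written as follows. The symbol $\sigma_P^2$ for total phenotypic variance is notation assumed here for illustration and is not taken from the paper.

$ h^2 = \frac{\sigma_A^2}{\sigma_P^2}, \qquad \frac{h^2_{WGS}}{h^2_{PED}} \approx \underbrace{0.20}_{\text{rare}} + \underbrace{0.68}_{\text{common}} = 0.88 $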

3.1.2. Minor Allele Frequency (MAF)

Minor allele frequency (MAF) refers to the frequency of the less common allele (variant form of a gene) at a particular locus (specific position on a chromosome) in a given population.

  • Common variants: Typically defined as SNPs or indels with a MAF greater than 1% or 5%. These are widespread in the population.
  • Rare variants: Defined as SNPs or indels with a MAF less than 1%. These are less common and often have larger effects on traits or diseases.
  • Ultra-rare variants: SNPs or indels with an extremely low MAF, often less than 0.01%.
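
The thresholds above can be illustrated with a minimal sketch. The function names and the example allele count are hypothetical, and the cut-offs are the ones quoted in this section (1% and 0.01%), which other studies may set differently.

```python
def minor_allele_frequency(alt_allele_count: int, n_individuals: int) -> float:
    """Return the MAF at a biallelic site in a sample of diploid individuals."""
    total_alleles = 2 * n_individuals
    alt_freq = alt_allele_count / total_alleles
    # The minor allele is whichever of the two alleles is less frequent.
    return min(alt_freq, 1.0 - alt_freq)


def classify_variant(maf: float) -> str:
    """Label a variant using the thresholds quoted in this section."""
    if maf < 0.0001:   # below 0.01%
        return "ultra-rare"
    if maf < 0.01:     # below 1%
        return "rare"
    return "common"


# Hypothetical example: an allele observed 700 times among 347,630 individuals.
maf = minor_allele_frequency(700, 347_630)
print(f"MAF = {maf:.6f} -> {classify_variant(maf)}")
```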

3.1.3. Complex Traits and Diseases

Complex traits (also called quantitative traits) are characteristics that are influenced by multiple genes (polygenic) and environmental factors, and typically show continuous variation in a population (e.g., height, blood pressure, BMI). Complex diseases (e.g., type 2 diabetes, coronary artery disease) are also influenced by multiple genetic and environmental factors but often have a dichotomous outcome (presence or absence of disease).

3.1.4. Single-Nucleotide Polymorphisms (SNPs) and Indels

  • SNPs are variations in a single DNA base pair (e.g., at a specific position, one individual might have an 'A' while another has a 'G'). They are the most common type of genetic variation.
  • Indels are insertions or deletions of a small number of DNA base pairs (typically 1-50 bp).

3.1.5. Whole-Genome Sequencing (WGS) and Whole-Exome Sequencing (WES)

  • Whole-Genome Sequencing (WGS): A comprehensive genetic test that determines the complete DNA sequence of an organism's entire genome. This includes both coding regions (exons) and non-coding regions (introns, intergenic regions), capturing virtually all types of genetic variation.
  • Whole-Exome Sequencing (WES): A targeted sequencing approach that sequences only the protein-coding regions of the genome (exons), which make up about 1-2% of the total genome. WES is less expensive than WGS but misses variants in non-coding regions.

3.1.6. Genome-Wide Association Studies (GWAS)

GWAS is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. Typically, GWAS focuses on SNPs. When a SNP or indel is found to be more frequent in individuals with a particular trait or disease, it is said to be associated with that trait/disease.

3.1.7. Linkage Disequilibrium (LD)

Linkage Disequilibrium (LD) refers to the non-random association of alleles at different loci (positions on a chromosome). When alleles at two loci are in LD, they tend to be inherited together more often than would be expected by chance. LD is crucial for GWAS because it allows detection of associations at SNPs that are not themselves causal but tag a causal variant with which they are in LD.

3.1.8. Genomic Relationship Matrix (GRM)

A Genomic Relationship Matrix (GRM) quantifies the genetic similarity (relatedness) between every pair of individuals in a sample based on their SNP genotypes across the genome. Unlike pedigree relationships (which are theoretical expectations), GRMs capture actual realized sharing of genetic material. It's a key input for GREML methods.

3.1.9. GREML (Genomic Restricted Maximum Likelihood)

GREML is a statistical method used to estimate the variance components of complex traits, including SNP-based heritability, from genome-wide SNP data and a GRM. It estimates how much of the total phenotypic variance is explained by genetic factors (captured by the GRM) and how much by residual (environmental or uncaptured genetic) factors. The paper uses GREML-LDMS, which partitions the genome into different MAF and LD bins to better model the contribution of variants with different characteristics.

3.1.10. Haseman-Elston (HE) Regression

Haseman-Elston (HE) regression is an older, simpler method for estimating SNP-based heritability by regressing the squared phenotypic difference between pairs of individuals onto their genomic relatedness (from the GRM). It is computationally less intensive than GREML but can be biased by factors like assortative mating.
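
The regression described above can be sketched in a few lines. This is a simplified illustration, not the MPH implementation used in the paper: it assumes a phenotype standardized to variance 1, so that the expected squared difference for a pair is 2 - 2 * h2 * A_ik and the heritability is minus half the regression slope. The function name and the toy pairs are hypothetical.

```python
def he_regression_h2(squared_diffs, relatedness):
    """Estimate h2 from the OLS slope of squared phenotypic differences on relatedness.

    For a phenotype with variance 1, E[(y_i - y_k)^2] = 2 - 2 * h2 * A_ik,
    so h2 = -slope / 2.
    """
    n = len(relatedness)
    mean_a = sum(relatedness) / n
    mean_d = sum(squared_diffs) / n
    sxy = sum((a - mean_a) * (d - mean_d) for a, d in zip(relatedness, squared_diffs))
    sxx = sum((a - mean_a) ** 2 for a in relatedness)
    return -0.5 * sxy / sxx


# Hypothetical noise-free pairs generated with h2 = 0.4, used only as a sanity check.
pair_relatedness = [-0.02, -0.01, 0.0, 0.01, 0.03]
pair_sq_diffs = [2 - 2 * 0.4 * a for a in pair_relatedness]
print(he_regression_h2(pair_sq_diffs, pair_relatedness))
```

With real data the squared differences are very noisy, which is one reason HE estimates tend to have larger standard errors than GREML estimates.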

3.1.11. Liability Scale

For binary traits (like diseases: presence/absence), heritability is often reported on the liability scale. This assumes an underlying continuous liability (predisposition) to the disease, which is normally distributed in the population. Individuals whose liability exceeds a certain threshold develop the disease. Converting heritability estimates from the observed scale (the scale on which the disease is observed as present/absent) to the liability scale provides a more biologically meaningful estimate that is comparable across traits with different prevalences.

3.1.12. Winner's Curse

Winner's curse is a phenomenon in GWAS where the estimated effect sizes of SNPs that achieve genome-wide significance tend to be inflated compared to their true effect sizes. This happens because associations are more likely to be detected if their estimated effect size is larger than the true effect size due to random chance. Correction methods are applied to provide more accurate estimates of effect sizes.
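
A small simulation can make the selection effect concrete. This is a sketch under stated assumptions: the true effect, standard error, and number of draws are hypothetical, and the |z| > 5.45 cut-off is used as a rough stand-in for the conventional genome-wide significance threshold.

```python
import random


def simulate_winners_curse(true_beta=0.02, se=0.01, z_threshold=5.45,
                           n_draws=200_000, seed=0):
    """Return (mean of all estimates, mean of estimates passing the threshold)."""
    rng = random.Random(seed)
    # Each draw is one noisy estimate of the same true effect.
    estimates = [rng.gauss(true_beta, se) for _ in range(n_draws)]
    # Keep only estimates that would be declared significant.
    significant = [b for b in estimates if abs(b / se) > z_threshold]
    mean_all = sum(estimates) / len(estimates)
    mean_sig = sum(significant) / len(significant)
    return mean_all, mean_sig


mean_all, mean_sig = simulate_winners_curse()
print(f"all estimates: {mean_all:.4f}, significant only: {mean_sig:.4f}")
```

The average over all draws stays close to the true effect, while the average over the significant draws is necessarily larger than the threshold times the standard error, which here exceeds the true effect.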

3.1.13. Functional Annotations

Functional annotations assign biological meaning to genomic regions or variants. They categorize variants based on their predicted impact on gene function (e.g., coding, non-coding, missense, loss-of-function, splicing effect), their location relative to genes (e.g., promoter, UTR, intron, intergenic), or evolutionary conservation (conserved regions). These annotations help in understanding which types of variants or genomic regions disproportionately contribute to heritability (heritability enrichment).

3.1.14. Credible Sets

In fine-mapping analyses, a credible set is a set of SNPs within a GWAS locus that is highly likely to contain the true causal SNP(s). A 95% credible set, for example, means there is a 95% probability that the true causal variant is within that set. Smaller credible sets indicate higher fine-mapping resolution.
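
One common way to build such a set, sketched here under the simplifying assumption of a single causal variant per locus, is to rank variants by posterior probability and add them until the target coverage is reached. The variant names and probabilities are hypothetical, and this is not necessarily the fine-mapping procedure used in the paper.

```python
def credible_set(posterior_probs, coverage=0.95):
    """Return the smallest set of top-ranked variants whose probabilities reach `coverage`.

    posterior_probs maps variant id -> posterior probability of being causal,
    assumed to sum to 1 across the locus.
    """
    ranked = sorted(posterior_probs.items(), key=lambda item: item[1], reverse=True)
    chosen, cumulative = [], 0.0
    for variant, prob in ranked:
        chosen.append(variant)
        cumulative += prob
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical locus with four candidate variants.
locus = {"rs_a": 0.60, "rs_b": 0.30, "rs_c": 0.06, "rs_d": 0.04}
print(credible_set(locus))
```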

3.2. Previous Works

The paper contextualizes its work by discussing several key prior research avenues:

  • Initial missing heritability paradox: Early studies showed that pedigree-based heritability often greatly exceeded the heritability explained by common SNPs identified by GWAS (e.g., $h^2_{SNP}$ often 5-49% for common SNPs, while $h^2_{PED}$ could be much higher). This gap spurred the concept of missing heritability.
  • Hiding heritability vs. still-missing heritability: The gap between GWAS-detected associations ($h^2_{GWAS}$) and SNP-based heritability ($h^2_{SNP}$) was termed hiding heritability (e.g., Yengo et al. 2022 showed convergence for height with large sample sizes). The gap between $h^2_{SNP}$ (based on common variants) and $h^2_{PED}$ was termed still-missing heritability (ref. 10).
  • Factors for still-missing heritability: Proposed factors included genetic variation not well-tagged by common SNPs (rare variants, structural variants), shared environmental effects, and non-additive genetic effects (e.g., epistasis, dominance) inflating pedigree estimates.
  • TOPMed program WGS studies: Since 2022, studies using Trans-Omics for Precision Medicine (TOPMed) data started providing WGS-based heritability estimates for traits like height, BMI, smoking, type 2 diabetes, and coronary artery disease (Wainschtein et al. 2022, Jang et al. 2022, Rocheleau et al. 2024). However, these studies were limited by relatively small sample sizes (N ~ 25,000), resulting in large standard errors (~10%) that made firm conclusions about still-missing heritability difficult.
  • UK Biobank WES studies: More recently, whole-exome sequence (WES) data from over 300,000 UK Biobank participants (e.g., Hujoel et al. 2024) provided more precise estimates for the role of rare coding variants, but a significant gap remained for rare non-coding variants (which constitute a much larger portion of the genome).

3.3. Technological Evolution

The field of human quantitative genetics has evolved significantly:

  1. Early 20th century - Pedigree Studies: Focused on family trees to estimate heritability, primarily $h^2_{PED}$.

  2. 2000s - Common SNP Arrays & GWAS: The advent of affordable SNP arrays allowed for GWAS to identify common variants associated with complex traits. This led to the initial missing heritability paradox.

  3. 2010s - SNP-based heritability with unrelated individuals: Methods like GREML (implemented in the GCTA software) enabled estimation of $h^2_{SNP}$ from common SNPs in large cohorts of unrelated individuals, but this still did not fully account for $h^2_{PED}$.

  4. Early 2020s - Whole-Exome Sequencing (WES): Targeted sequencing of coding regions began to shed light on the role of rare coding variants.

  5. Mid 2020s - Whole-Genome Sequencing (WGS): Large-scale WGS initiatives (like TOPMed and now UK Biobank) allow for a more comprehensive assessment of both common and rare variants across the entire genome, including non-coding regions. This paper represents a major step in leveraging this technology.

    This paper's work fits within this timeline by pushing the boundaries of WGS-based heritability estimation, particularly for rare non-coding variants, thanks to its unprecedented sample size.

3.4. Differentiation Analysis

The core innovation of this paper, compared to previous WGS and WES studies, is its scale and precision.

  • Previous WGS studies (e.g., TOPMed): While they utilized WGS data, their sample sizes were often around 25,000 individuals. This was insufficient to precisely estimate the contribution of rare variants (which by definition have low frequencies and thus require very large samples for robust statistical power), leading to high standard errors (~10%) in heritability estimates. This paper uses WGS data from 347,630 individuals, providing estimates with much higher precision (standard errors as low as 0.6-2.7%).

  • WES studies (e.g., UK Biobank WES): These studies had large sample sizes (e.g., >300,000) and could estimate the role of rare coding variants precisely. However, they inherently missed the vast majority of the genome (the non-coding regions), leaving the contribution of rare non-coding variants unexplored. This paper, using WGS, specifically quantifies the role of rare non-coding variants, showing they contribute 79% of the rare-variant WGS-based heritability.

  • Addressing still-missing heritability: By combining large-scale WGS with precise statistical methods, the paper directly addresses the still-missing heritability problem. It demonstrates that for many traits, WGS can nearly fully explain the pedigree-based heritability, which was not conclusively shown by prior WGS studies due to their lower precision.

    In essence, this paper provides the most comprehensive and high-resolution view of the genetic architecture of complex traits to date, particularly regarding the elusive contribution of rare non-coding variants, by overcoming previous sample size and genomic coverage limitations.

4. Methodology

4.1. Principles

The core principle of this study is to meticulously quantify the additive genetic contribution of nearly all accessible genetic variation (common, rare, coding, non-coding, SNPs, and indels) across the human genome to complex traits and diseases. This is achieved by leveraging whole-genome sequencing (WGS) data from a massive cohort (UK Biobank) and advanced statistical genetic methods. The derived WGS-based heritability estimates are then rigorously compared to traditional pedigree-based heritability estimates from the same cohort to ascertain how much of the still-missing heritability can be accounted for. Additionally, the study performs genome-wide association studies (GWAS) to map specific rare-variant associations (RVAs) and assess their contribution to the total rare-variant heritability.

4.2. Core Methodology In-depth (Layer by Layer)

4.2.1. Sample Selection and Quality Control

The study began with 490,542 genomes from the second tranche of WGS data released by the UK Biobank in December 2023.

  1. Ancestry Selection: European ancestry samples were identified using principal component loadings computed from the 1000 Genomes (1KG) Project data. SNP-array data from 488,377 samples were used, and samples within 3 standard deviations of the 1KG reference European ancestry population mean for the first 10 principal components were retained, resulting in 455,516 European ancestry samples. From these, 452,618 samples with both SNP-array and WGS data and consent for data use were selected for GWAS analyses.
  2. WGS Variant Processing:
    • Initial processing involved 136,477 autosomal chunks of raw Binary Variant Call Format (BCF) data.
    • Variant Filtering: Variants with a minor allele count (MAC) < 30, non-'PASS' status, or more than 200 alleles were removed. Multi-allelic variants were split into separate rows.
    • Second QC Step (European Ancestry): After merging chunks into a single file (initially ~130 million variants), a second QC step was applied to the selected European ancestry samples. This involved:
      • Normalization of variants on the GRCh38 reference genome.
      • Removal of variants with genotype missingness > 0.1.
      • Removal of variants deviating from Hardy-Weinberg equilibrium (HWE) with $P < 10^{-8}$.
      • Removal of samples with missingness > 0.05.
    • This rigorous filtering yielded a final set of 40,575,204 single-nucleotide polymorphisms (SNPs) and indels with a MAF > 0.01% for primary analyses.
  3. Unrelated Sample Set: A subset of 347,630 conventionally unrelated individuals (defined as having a genomic relationship coefficient lower than 0.05) was extracted from the 452,618 European ancestry samples. This unrelated set was used for GREML analyses to avoid potential biases from close relatives.
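
The variant-level thresholds listed above can be collected into a single predicate. This is an illustrative sketch only: the `VariantStats` container and its field names are hypothetical, and the actual pipeline operated on BCF files with dedicated tools rather than on Python objects.

```python
from dataclasses import dataclass


@dataclass
class VariantStats:
    """Per-variant summary statistics (hypothetical field names)."""
    mac: int              # minor allele count
    filter_status: str    # FILTER field, e.g. "PASS"
    n_alleles: int        # number of alleles at the site before splitting
    missing_rate: float   # fraction of missing genotypes
    hwe_p: float          # Hardy-Weinberg equilibrium test P value
    maf: float            # minor allele frequency


def passes_qc(v: VariantStats) -> bool:
    """Apply the variant filters described in the text."""
    return (
        v.mac >= 30
        and v.filter_status == "PASS"
        and v.n_alleles <= 200
        and v.missing_rate <= 0.1
        and v.hwe_p >= 1e-8
        and v.maf > 0.0001   # MAF > 0.01% for the primary analyses
    )


good = VariantStats(mac=120, filter_status="PASS", n_alleles=2,
                    missing_rate=0.01, hwe_p=0.3, maf=0.0002)
low_mac = VariantStats(mac=12, filter_status="PASS", n_alleles=2,
                       missing_rate=0.01, hwe_p=0.3, maf=0.0002)
print(passes_qc(good), passes_qc(low_mac))
```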

4.2.2. Genomic Relationship Matrix (GRM) Calculation

GRMs are central to GREML and HE regression for quantifying genetic relatedness between individuals.

  1. Standard GRM for Relatedness: The GRM for related individuals was computed from 583,191 genotyped SNPs with MAF > 0.01. The genomic relationship coefficient ($A_{ik}$) between individuals $i$ and $k$ was calculated using the following estimator (Yang et al., 2011):

    $ A_{ik} = \frac{1}{M} \sum_{j=1}^{M} \frac{(x_{ij} - 2p_j)(x_{kj} - 2p_j)}{2p_j(1 - p_j)} $

    Where:

    • $A_{ik}$: The genomic relationship coefficient between individual $i$ and individual $k$.
    • $M$: The total number of SNPs or variants used to quantify relatedness.
    • $x_{ij}$: The minor allele count (0, 1, or 2) at SNP $j$ for individual $i$.
    • $p_j$: The minor allele frequency (MAF) at SNP $j$.

    This formula essentially measures the covariance of allele counts between individuals $i$ and $k$, normalized by the expected variance under Hardy-Weinberg equilibrium, giving a standardized measure of genetic similarity. A sparse GRM was extracted with non-zero entries for pairs of relatives with $A_{ik} > 0.05$.
  2. Ultra-rare Variant GRM (Secondary Analysis): For a secondary analysis exploring the contribution of ultra-rare variants (MAF < 0.01%), an extra GRM was included. Given that unrelated individuals are unlikely to share ultra-rare variants, this GRM was assumed to be diagonally dominant (its entries primarily reflect each individual's own ultra-rare variants). The diagonal elements ($D_{ii}$) for individual $i$ were calculated as:

    $ D_{ii} = \frac{1}{M} \sum_{k=1}^{K} \frac{N(N - 2k) S_{ik} + k^2 M_k}{k(N - k/2)} $

    Where:

    • $D_{ii}$: The diagonal element of the GRM for individual $i$, representing the contribution of ultra-rare variants to their own genetic variance.
    • $M$: The total number of ultra-rare variants (760,525,073).
    • $K$: The maximum minor allele count considered for ultra-rare variants (implicitly, up to the total number of individuals $N$).
    • $N$: The total number of individuals in the sample.
    • $k$: A specific minor allele count (i.e., the number of times an ultra-rare variant is observed in the sample).
    • $M_k$: The number of ultra-rare variants found in exactly $k$ out of $N$ individuals.
    • $S_{ik}$: The number of ultra-rare variants with minor allele count $k$ that individual $i$ carries.

    This formula essentially sums up the individual's contribution across all ultra-rare variants, weighted by their rarity and the variant's observed minor allele count.
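
Both estimators above can be checked on toy data. The sketch below is a plain-Python illustration, not the software used in the paper. It computes the standard GRM entry by entry, then verifies numerically that the closed-form $D_{ii}$ matches a direct sum of the per-variant diagonal terms, under the assumption that every ultra-rare minor allele is carried in heterozygous state so that a variant with minor allele count $k$ has exactly $k$ carriers. All function names and toy data are hypothetical.

```python
from collections import Counter


def grm(genotypes):
    """Dense GRM from genotypes[i][j] = allele count (0, 1, 2) of individual i at SNP j.

    Assumes every SNP is polymorphic in the sample (0 < p_j < 1).
    """
    n, m = len(genotypes), len(genotypes[0])
    freqs = [sum(row[j] for row in genotypes) / (2 * n) for j in range(m)]
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(i, n):
            total = sum(
                (genotypes[i][j] - 2 * p) * (genotypes[k][j] - 2 * p) / (2 * p * (1 - p))
                for j, p in enumerate(freqs)
            )
            matrix[i][k] = matrix[k][i] = total / m
    return matrix


def diag_direct(i, carrier_sets, n):
    """D_ii as a direct sum of per-variant GRM diagonal terms (heterozygous carriers)."""
    total = 0.0
    for carriers in carrier_sets:
        k = len(carriers)          # minor allele count of this variant
        p = k / (2 * n)            # its allele frequency among 2n chromosomes
        x = 1 if i in carriers else 0
        total += (x - 2 * p) ** 2 / (2 * p * (1 - p))
    return total / len(carrier_sets)


def diag_closed_form(i, carrier_sets, n):
    """D_ii from the counts S_ik and M_k, following the closed-form expression."""
    m_k = Counter(len(c) for c in carrier_sets)
    s_ik = Counter(len(c) for c in carrier_sets if i in c)
    total = sum(
        (n * (n - 2 * k) * s_ik[k] + k * k * m_k[k]) / (k * (n - k / 2))
        for k in m_k
    )
    return total / len(carrier_sets)


# Toy data (hypothetical): 4 individuals x 3 SNPs for the standard GRM.
toy_genotypes = [[0, 1, 2], [1, 1, 0], [2, 0, 1], [1, 2, 1]]
A = grm(toy_genotypes)

# Toy data (hypothetical): 10 individuals; each set lists the carriers of one variant.
toy_carriers = [{0}, {0, 3}, {1, 2}, {4}, {0, 5, 6}]
for i in (0, 1, 9):
    print(i, diag_direct(i, toy_carriers, 10), diag_closed_form(i, toy_carriers, 10))
```

Because allele counts are centered on sample frequencies, each row of the toy GRM sums to zero, which is a convenient sanity check on any implementation.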

4.2.3. Variant Grouping for GREML-LDMS

For GREML-LDMS (a refined GREML method that accounts for differences in genetic architecture across variant types), variants were assigned to groups based on two characteristics:

  1. Minor Allele Frequency (MAF) Bins:
    • 0.01% - 0.1%
    • 0.1% - 1%
    • 1% - 10%
    • 10% - 50%
  2. LD Bins: Within each MAF bin, variants were further assigned to LD bins based on their median LD score statistic. LD score statistics were calculated for each SNP as the sum of squared correlations between its allele counts and those of all nearby SNPs within a 1-Mb window. This partitioning allows the GREML model to estimate separate heritability contributions for variants with different allele frequencies and LD properties.
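
The two grouping steps can be sketched as follows. This is a minimal illustration rather than the GREML-LDMS implementation: how bin boundaries are treated, whether a SNP counts itself in its own LD score, and the reading of the 1-Mb window as a distance of at most 1 Mb on either side are all assumptions made here, and the toy genotypes are hypothetical.

```python
MAF_BIN_EDGES = [0.0001, 0.001, 0.01, 0.1, 0.5]  # 0.01%, 0.1%, 1%, 10%, 50%


def maf_bin(maf):
    """Return the index (0-3) of the MAF bin, or None if outside 0.01%-50%."""
    for idx in range(len(MAF_BIN_EDGES) - 1):
        if MAF_BIN_EDGES[idx] < maf <= MAF_BIN_EDGES[idx + 1]:
            return idx
    return None


def r2(x, y):
    """Squared Pearson correlation between two vectors of allele counts."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)) / n
    var_x = sum((a - mean_x) ** 2 for a in x) / n
    var_y = sum((b - mean_y) ** 2 for b in y) / n
    return cov * cov / (var_x * var_y)


def ld_scores(snp_columns, positions, window_bp=1_000_000):
    """LD score of each SNP: sum of r2 with all SNPs within window_bp (itself included)."""
    scores = []
    for j, pos_j in enumerate(positions):
        score = 0.0
        for l, pos_l in enumerate(positions):
            if abs(pos_l - pos_j) <= window_bp:
                score += r2(snp_columns[j], snp_columns[l])
        scores.append(score)
    return scores


# Hypothetical allele counts for four SNPs across four individuals.
snp_a = [0, 1, 2, 1]
snp_c = [1, 0, 1, 0]
toy_columns = [snp_a, snp_a, snp_c, snp_a]
toy_positions = [100, 200, 300, 5_000_000]
toy_scores = ld_scores(toy_columns, toy_positions)
print(maf_bin(0.005), toy_scores)
```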

4.2.4. Phenotypes and Covariates Quality Control

  1. Phenotype Selection: 41 complex phenotypes were selected based on data availability and clinical relevance, all showing a marginally significant estimate of pedigree-based heritability ($h^2_{PED}$). The final analysis focused on 34 phenotypes with both a significant WGS-based heritability ($h^2_{WGS}$) (two-sided Wald test $P < 0.05/41 \approx 0.001$) and a marginally significant rare-variant heritability estimate ($P < 0.05$).
  2. Phenotype Standardization: Phenotypes were standardized within each sex to have a mean of 0 and a variance of 1. For quantitative traits, samples with phenotypic values above 6 standard deviations were excluded.
  3. Covariates: A comprehensive set of covariates was included to adjust for potential confounding factors:
    • Base Covariates: Sex, year of birth, assessment centers, fasting time at blood sample collection, month of assessment, and prescription drug usage (grouped into categories like statins, diuretics).
    • Geographical Information: Individuals were grouped based on their north and east birth coordinates (UKB fields 129 and 130) using k-means clustering (with 10, 20, 50, or 100 clusters). Individuals with missing birth locations were assigned to a separate cluster.
    • Genetic Principal Components: 30 genotypic principal components were computed for each MAF/LD bin of independent variants (after LD pruning with $R^2 = 0.1$ for MAF > 0.01% and $R^2 = 0.01$ for MAF < 0.01%). These principal components (totaling $8 \times 30 = 240$ PCs) were used to account for population stratification.
    • Dimensionality Reduction: To reduce collinearity and dimensionality, singular-value decomposition (SVD) was applied to the covariate matrix, retaining top singular vectors explaining >99% of total variance. Five sets of covariates were generated and tested to ensure robust heritability estimates.

4.2.5. Heritability Estimation (GREML and Haseman-Elston Regression)

  1. GREML-based Estimates:
    • WGS-based heritability ($h^2_{WGS}$) was estimated using the GREML-LDMS method implemented in MPH v.0.53.2. This involved fitting multiple GRMs corresponding to the different MAF and LD bins (8 GRMs in total, plus coding and non-coding partitions, leading to 24 GRMs for functional enrichment analyses).
    • SNP-based heritability estimates for binary traits were converted to the liability scale using the formula $K(1 - K) / [\phi(\phi^{-1}(K))^2]$, where $K$ is the prevalence of the binary trait in the population, $\phi$ is the probability density function of a standard normal distribution, and $\phi^{-1}$ is the corresponding quantile function. This ensures comparability of heritability estimates for binary traits with varying prevalences.
  2. Haseman-Elston (HE) Regression: HE estimates were also obtained using MPH v.0.53.2 for comparison, by initializing all variance components to 0 (except residual variance) and performing one iteration of minimum norm quadratic unbiased estimation. This approach allows for proper covariate adjustment.
  3. Assortative Mating Adjustment: For traits known to be affected by assortative mating (e.g., height, educational attainment, fluid intelligence), HE regression estimates were adjusted using a method proposed by Border et al. (2022).
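
The liability-scale conversion can be sketched with the Python standard library. The sketch assumes the expression quoted above acts as a multiplicative factor on the observed-scale estimate and that no correction for case-control ascertainment is needed; the function names and example values are hypothetical.

```python
from statistics import NormalDist


def liability_scale_factor(prevalence: float) -> float:
    """Return K(1 - K) / phi(threshold)^2 for population prevalence K."""
    std_normal = NormalDist()
    # Liability threshold implied by the prevalence (quantile function at K).
    threshold = std_normal.inv_cdf(prevalence)
    density = std_normal.pdf(threshold)
    return prevalence * (1.0 - prevalence) / density ** 2


def to_liability_scale(h2_observed: float, prevalence: float) -> float:
    """Rescale an observed-scale heritability estimate to the liability scale."""
    return h2_observed * liability_scale_factor(prevalence)


# Hypothetical example: observed-scale estimate of 0.05 for a trait with 10% prevalence.
print(to_liability_scale(0.05, 0.10))
```

The factor grows as the trait becomes rarer, which is why observed-scale estimates for low-prevalence diseases look small before conversion.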

4.2.6. Pedigree-based Heritability Estimation

Pedigree-based narrow sense heritability ($h^2_{PED}$) was estimated from 171,446 pairs of relatives identified in the UK Biobank (pairs with GRM value > 0.05).

  1. General Model: For most traits, the phenotypic covariance between relatives was modeled as:

    $ \mathrm{cov}(y_i, y_j | X) = \sigma_A^2 \pi_{ij} + \sigma_{NA}^2 \pi_{ij}^2 + \delta_{ij} \sigma_E^2 $

    Where:

    • $\mathrm{cov}(y_i, y_j | X)$: The covariance between the phenotypes of individuals $i$ and $j$, conditional on a set of covariates $X$.
    • $y_i, y_j$: The phenotypes of individuals $i$ and $j$.
    • $\sigma_A^2$: The variance component attributable to additive genetic effects.
    • $\pi_{ij}$: The observed genomic relationship coefficient (GRM value) between individuals $i$ and $j$.
    • $\sigma_{NA}^2$: The variance component attributable to non-additive genetic effects (including dominance, epistasis, and potentially shared environmental effects correlated with $\pi_{ij}$).
    • $\pi_{ij}^2$: The squared genomic relationship coefficient, which is used to model non-additive genetic effects or shared environmental effects.
    • $\delta_{ij}$: An indicator variable that is 1 if $i = j$ (for the same individual) and 0 otherwise.
    • $\sigma_E^2$: The residual variance component (environmental effects and uncaptured genetic variance).

    These parameters were estimated using a maximum-likelihood procedure. Then, $h^2_{PED}$ was calculated as $\widehat{h^2_{PED}} = \widehat{\sigma_A^2} / (\widehat{\sigma_E^2} + \widehat{\sigma_A^2} + \widehat{\sigma_{NA}^2})$.
  2. Assortative Mating (AM) Model: For traits known to be subject to assortative mating (height, educational attainment, fluid intelligence score), a modified model was used: $ \begin{aligned} \mathrm{cov}(y_i, y_j \mid X) &= \sigma_A^2 (0.5)^{d_{ij}} [1 + \theta]^{d_{ij}} + \sigma_E^2 \delta_{ij} \\ &\approx \sigma_A^2 (0.5)^{d_{ij}} + \sigma_A^2 \theta [(0.5)^{d_{ij}} d_{ij}] + \sigma_E^2 \delta_{ij} \\ &= \sigma_A^2 \pi_{ij} + \sigma_{AM}^2 \pi_{ij} \left( \frac{\log(\pi_{ij})}{\log(0.5)} \right) + \sigma_E^2 \delta_{ij} \end{aligned} $ Where:

    • $d_{ij} = \log(\pi_{ij}) / \log(0.5)$: Measures the degree of relatedness between pairs of individuals.
    • $\theta$: Represents the correlation between genetic values of mates in a population undergoing assortative mating.
    • $\sigma_{AM}^2 = \sigma_A^2 \theta$: A variance component specifically accounting for assortative mating effects.

     The formula uses a first-order approximation assuming $\theta \ll 1$.
  3. Explained Heritability Ratio (EHR): The EHR was defined as the ratio of WGS-based heritability to pedigree-based heritability: $\mathrm{EHR} = \widehat{h^2_{WGS}} / \widehat{h^2_{PED}}$.
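
The first-order step in the AM model can be checked numerically. The sketch below compares the exact covariance term $\sigma_A^2 (0.5)^{d_{ij}} [1+\theta]^{d_{ij}}$ with its linearised form for a pair of distinct individuals ($\delta_{ij} = 0$). The function names and the parameter values ($\sigma_A^2 = 1$, $\theta = 0.1$) are hypothetical and chosen only for illustration.

```python
import math


def degree_of_relatedness(pi_ij):
    """d_ij = log(pi_ij) / log(0.5); 1 for first-degree relatives, 2 for second-degree."""
    return math.log(pi_ij) / math.log(0.5)


def cov_am_exact(pi_ij, sigma_a2, theta):
    """Exact covariance between distinct relatives under the AM model."""
    d = degree_of_relatedness(pi_ij)
    return sigma_a2 * (0.5 ** d) * (1 + theta) ** d


def cov_am_first_order(pi_ij, sigma_a2, theta):
    """Linearised form: sigma_A^2 * pi + sigma_AM^2 * pi * d, with sigma_AM^2 = sigma_A^2 * theta."""
    d = degree_of_relatedness(pi_ij)
    sigma_am2 = sigma_a2 * theta
    return sigma_a2 * pi_ij + sigma_am2 * pi_ij * d


for pi_ij in (0.5, 0.25, 0.125):
    exact = cov_am_exact(pi_ij, sigma_a2=1.0, theta=0.1)
    approx = cov_am_first_order(pi_ij, sigma_a2=1.0, theta=0.1)
    print(pi_ij, exact, approx, exact - approx)
```

For first-degree relatives ($d_{ij} = 1$) the two forms coincide; for more distant relatives the gap is of order $\theta^2$, which is why the approximation relies on $\theta \ll 1$.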

4.2.7. Heritability Enrichment Analysis

To assess the relative contribution of coding and non-coding variants, $h^2_{WGS}$ was partitioned.

  1. Variant Classification: Variants were categorized as coding or non-coding based on their functional consequences predicted by the Nirvana pipeline version 3.22.0 and their location within WES-covered loci (defined by the IDT xGen Exome Research Panel v.1.0 plus 100 bp flanking regions).
    • Coding variants within WES loci: (0.71% of all WGS variants with MAF > 0.01%)
    • Non-coding variants within WES loci: (0.29% of all WGS variants with MAF > 0.01%)
    • Variants outside WES loci: (99% of all WGS variants with MAF > 0.01%)
  2. Partitioned GRMs: Each of the eight MAF/LD groups of variants was further split into these three subgroups, resulting in 24 GRMs. GREML analyses were run fitting these 24 GRMs simultaneously with the full set of covariates.
  3. Heritability Enrichment Formula: The heritability enrichment in coding variants ($\widehat{E}(\mathrm{coding})$) was calculated as: $ \widehat{E}(\mathrm{coding}) = \frac{\widehat{h^2_{Coding}} / M_{Coding}}{\widehat{h^2_{WGS}} / M_{WGS}} $ Where:
    • $\widehat{E}(\mathrm{coding})$: The estimated heritability enrichment in coding variants.
    • $\widehat{h^2_{Coding}}$: The estimated contribution to heritability from coding variants.
    • $M_{Coding}$: The number of coding variants in the analysis (approximately 0.71% of the 40,575,204 WGS variants).
    • $\widehat{h^2_{WGS}}$: The total estimated WGS-based heritability.
    • $M_{WGS}$: The total number of WGS variants used in the analysis (40,575,204).

     This ratio compares the per-variant heritability of coding variants to the average per-variant heritability across all WGS variants.
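
The enrichment ratio is simple arithmetic, sketched below. The variant counts follow the figures quoted above (40,575,204 WGS variants, about 0.71% of them coding); the two heritability values are hypothetical placeholders, not estimates from the paper, and `heritability_enrichment` is an illustrative name.

```python
def heritability_enrichment(h2_annot, m_annot, h2_total, m_total):
    """Per-variant heritability in an annotation divided by the genome-wide per-variant average."""
    return (h2_annot / m_annot) / (h2_total / m_total)


M_WGS = 40_575_204               # total WGS variants (from the text)
M_CODING = round(0.0071 * M_WGS)  # about 0.71% of variants are coding (from the text)

# Hypothetical trait: coding variants carry 0.05 of a total h2_WGS of 0.25
print(heritability_enrichment(0.05, M_CODING, 0.25, M_WGS))
```

A value of 1 would mean coding variants contribute exactly in proportion to their number; values above 1 indicate enrichment.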

4.2.8. GWAS Analyses

GWAS analyses were performed for all 34 phenotypes in the larger sample of 452,618 European ancestry individuals.

  1. Association Testing: Regenie was used, fitting all covariates used for heritability estimation (including 100 k-means clusters for birth coordinates). Leave-one-chromosome-out (LOCO) genomic predictors were computed using 500,999 LD-pruned common variants ($R^2 > 0.9$, window size 10 Mb, MAF > 0.05) to account for stratification and cryptic relatedness.
  2. Significance Threshold: A stringent P value threshold of $5 \times 10^{-9}$ was used for genome-wide significance.
  3. Clumping and Joint Analysis:
    • PLINK was used to clump genome-wide significant associations into independent loci (LD $r^2 < 0.01$ between lead SNPs within 1 Mb).
    • A joint analysis was performed by fitting all clumped SNPs simultaneously using Regenie (multivariate linear regression for quantitative traits) or Firth's penalized logistic regression (using the R package logistf for binary traits) to retain truly independent genome-wide significant SNPs.
  4. Variance Explained by GWAS ($h^2_{GWAS}$): The proportion of phenotypic variance explained (on the observed scale) by different sets of associations (CVAs and RVAs) was quantified using: $ \widehat{h^2_{GWAS}} = \sum_{j=1}^{m} 2 p_j (1 - p_j) \widehat{\beta}_{jm} \widehat{\beta}_{jc} $ Where:
    • $\widehat{h^2_{GWAS}}$: The estimated proportion of phenotypic variance explained by GWAS-identified variants.
    • $m$: The number of SNPs in the focal set of associations.
    • $p_j$: The minor allele frequency of SNP $j$.
    • $\widehat{\beta}_{jm}$: The winner's curse corrected estimated marginal effect size of SNP $j$.
    • $\widehat{\beta}_{jc}$: The winner's curse corrected estimated conditional effect size of SNP $j$.

     For binary traits, this was converted to the liability scale using a specific R code provided in ref. 55.
  5. Replication: $\widehat{h^2_{GWAS}}$ for LDL, HDL, and ALK was re-assessed in an independent sample of approximately 67,000 unrelated individuals of European ancestry from the Alliance for Genomic Discovery (AGD) cohort.
  6. Variant Annotation: GWAS-identified variants were annotated using Gencode v.39 (for gene position), IDT xGen Exome Research Panel (for WES locus coverage), dbNSFP (for functional predictions like AlphaMissense, CADD, Polyphen2, Revel, SIFT, PrimateAI3D, SpliceAI, PromoterAI), and Zoonomia phylogenetic score (for conservation).
  7. Fine-mapping: SuSiE (Sum of Single Effects model) was used to fine-map GWAS loci into 95% credible sets.
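
To make the notion of a 95% credible set concrete, the sketch below builds one from a vector of posterior probabilities for a single causal effect: variants are ranked by probability and added until the cumulative probability reaches the coverage level. This only illustrates the definition; it does not reproduce SuSiE's model fitting, and the function name and probabilities are hypothetical.

```python
def credible_set(posterior_probs, coverage=0.95):
    """Return indices of the smallest set of variants whose summed probability reaches `coverage`.

    posterior_probs: probabilities that each variant is the causal one for a single
    effect; assumed to sum to 1.
    """
    ranked = sorted(range(len(posterior_probs)),
                    key=lambda k: posterior_probs[k], reverse=True)
    chosen, cumulative = [], 0.0
    for k in ranked:
        chosen.append(k)
        cumulative += posterior_probs[k]
        if cumulative >= coverage:
            break
    return chosen


# Hypothetical locus with five variants
probs = [0.02, 0.70, 0.20, 0.07, 0.01]
print(credible_set(probs))
```

Smaller credible sets mean the causal variant has been localised more precisely, which is the sense in which fine-mapping resolution is compared between WGS and imputed data.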

4.2.9. GWAS of Imputed SNPs

For comparison, GWAS analyses were also run using imputed variants from two reference panels: HRC+UK10K and TOPMed. Similar quality control thresholds were applied (MAF > 0.01%, genotyping missingness < 0.1, sample missingness < 0.05, imputation quality INFO score > 0.3, HWE $P > 10^{-8}$). Regenie was used on dosage genotypes. Fine-mapping with SuSiE was also performed for imputed datasets.

4.2.10. Association between Variant Density and Structural Variants

  1. Density Calculation: For each CVA and RVA, the density of other CVAs or RVAs (associated with the same trait) was calculated within a 100 kb window, termed CVA-CVA density and RVA-RVA density, respectively.
  2. Structural Variant (SV) Colocalization: GWAS variants were assigned to LD blocks. Publicly available independent associations for tandem repeats (VNTR), array-called copy-number variants (CNV_ARRAY), and WES-called CNVs (CNV_WES) were used to identify structural variants associated with the same traits.
  3. Logistic Regression: Two logistic regression models were fitted (one for common variants, one for rare variants) to assess if high CVA-CVA or RVA-RVA density predicts the presence of a nearby (within 100 kb) trait-associated structural variant.
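
The density calculation in step 1 can be sketched as follows. The code assumes that "within a 100 kb window" means within ±100 kb of the focal variant on the same chromosome, and that positions are unique; both are assumptions, as is the function name `association_density`.

```python
from bisect import bisect_left, bisect_right


def association_density(positions, window=100_000):
    """For each association, count the other associations within +/- `window` bp.

    positions: base-pair positions of associations for one trait on one chromosome.
    """
    ordered = sorted(positions)
    density = {}
    for pos in ordered:
        lo = bisect_left(ordered, pos - window)
        hi = bisect_right(ordered, pos + window)
        density[pos] = hi - lo - 1  # subtract 1 to exclude the focal variant itself
    return density


# Hypothetical positions of rare-variant associations for a single trait
rva_positions = [100_000, 150_000, 400_000, 450_000, 500_000, 560_000]
print(association_density(rva_positions))
```

The resulting per-variant counts would then serve as the predictor in the logistic regression of step 3.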

5. Experimental Setup

5.1. Datasets

5.1.1. UK Biobank (UKB)

  • Source: UK Biobank is a large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants.
  • Scale:
    • WGS Data: 490,542 genomes in the second tranche. Primary analyses focused on 347,630 unrelated individuals of European ancestry for GREML. GWAS analyses used 452,618 European ancestry participants (including relatives).
    • Variants: 40,575,204 autosomal single-nucleotide and short indel variants (MAF > 0.01%) after quality control.
    • Pedigree Data: 171,446 pairs of relatives for pedigree-based heritability estimation.
  • Characteristics: Participants are aged 40-69, with extensive phenotypic data (demographic, health, lifestyle, biochemical, disease diagnoses) and genetic data. The healthy-volunteer bias in UKB participation means disease prevalences may be lower than in the general population.
  • Domain: Comprehensive human health and disease phenotypes, including quantitative traits (e.g., height, BMI, cholesterol levels) and binary traits (e.g., type 2 diabetes, hypertension).
  • Choice Rationale: UKB provides an unparalleled dataset for this research due to its large sample size, extensive phenotyping, and the availability of WGS data, which is critical for studying rare variants.

5.1.2. Alliance for Genomic Discovery (AGD) Cohort

  • Source: Vanderbilt University Medical Center's BioVU biobank, part of the AGD consortium (NashBio, Illumina, and industry partners).
  • Scale: Approximately 67,000 unrelated individuals with European ancestry for replication of specific GWAS results. Also included ~15,690 African ancestry samples for specific RVA analysis.
  • Characteristics: BioVU is an opt-out biobank linking DNA with de-identified medical records.
  • Domain: Primarily for replication of GWAS findings, especially for lipid-related traits and alkaline phosphatase (ALK).
  • Choice Rationale: Provides an independent replication cohort to validate GWAS associations and effect sizes, strengthening the confidence in the identified RVAs.

5.2. Evaluation Metrics

5.2.1. WGS-based Heritability ($\widehat{h^2_{WGS}}$)

  • Conceptual Definition: The proportion of total phenotypic variance in a population that can be attributed to the additive effects of single-nucleotide polymorphisms and indels observed through whole-genome sequencing. It reflects the total genetic variation captured by the WGS data.
  • Mathematical Formula: Not a single formula, but derived from variance component estimation using GREML-LDMS. For binary traits, it is converted to the liability scale using: $ h^2_{\text{liability}} = h^2_{\text{observed}} \times \frac{K(1-K)}{\phi(\phi^{-1}(K))^2} $
  • Symbol Explanation:
    • $h^2_{\text{liability}}$: Heritability on the liability scale.
    • $h^2_{\text{observed}}$: Heritability on the observed scale (as measured in the study population).
    • $K$: Prevalence of the binary trait in the population.
    • $\phi(x)$: Probability density function (PDF) of the standard normal distribution at $x$.
    • $\phi^{-1}(K)$: Quantile function (inverse CDF) of the standard normal distribution at probability $K$.
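
A minimal sketch of the liability-scale conversion, using only the Python standard library. The function name and the example values (observed-scale heritability 0.10, prevalences 0.5 and 0.05) are hypothetical.

```python
from statistics import NormalDist


def liability_scale_h2(h2_observed, prevalence):
    """Multiply observed-scale h2 by K(1-K) / phi(phi^{-1}(K))^2."""
    std_normal = NormalDist()                      # mean 0, s.d. 1
    threshold = std_normal.inv_cdf(prevalence)     # quantile at probability K
    density = std_normal.pdf(threshold)            # normal density at that quantile
    return h2_observed * prevalence * (1 - prevalence) / density ** 2


print(liability_scale_h2(0.10, 0.5))   # common trait
print(liability_scale_h2(0.10, 0.05))  # rarer trait: larger scaling factor
```

The scaling factor grows as the prevalence moves away from 0.5, so the same observed-scale estimate corresponds to a higher liability-scale heritability for a rarer disease.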

5.2.2. Pedigree-based Heritability ($\widehat{h^2_{PED}}$)

  • Conceptual Definition: The proportion of phenotypic variance explained by additive genetic effects derived from comparing trait resemblance among relatives in a pedigree. It serves as a benchmark for the total additive genetic variance.
  • Mathematical Formula: Derived from variance component estimation based on phenotypic covariance between relatives. For the general model: $ \widehat{h^2_{PED}} = \frac{\widehat{\sigma_A^2}}{\widehat{\sigma_E^2} + \widehat{\sigma_A^2} + \widehat{\sigma_{NA}^2}} $
  • Symbol Explanation:
    • $\widehat{h^2_{PED}}$: Estimated pedigree-based heritability.
    • $\widehat{\sigma_A^2}$: Estimated additive genetic variance.
    • $\widehat{\sigma_E^2}$: Estimated residual variance.
    • $\widehat{\sigma_{NA}^2}$: Estimated non-additive genetic variance.

5.2.3. Explained Heritability Ratio (EHR)

  • Conceptual Definition: A metric to quantify how much of the pedigree-based narrow sense heritability is captured by WGS variants. A ratio close to 1 indicates that WGS data largely explains the heritability estimated from families.
  • Mathematical Formula: $ \mathrm{EHR} = \frac{\widehat{h^2_{WGS}}}{\widehat{h^2_{PED}}} $
  • Symbol Explanation:
    • $\mathrm{EHR}$: Explained Heritability Ratio.
    • $\widehat{h^2_{WGS}}$: Estimated WGS-based heritability.
    • $\widehat{h^2_{PED}}$: Estimated pedigree-based heritability.
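
The sketch below computes the EHR for one trait and a two-sided Wald test for the difference between the two estimates, using the BMI row of Table 1 ($\widehat{h^2_{PED}} = 0.392$, s.e. 0.023; $\widehat{h^2_{WGS}} = 0.339$, s.e. 0.009). Treating the two estimates as independent when combining their standard errors is an assumption of this sketch, not something stated in the source, so the result need not match the published P value exactly.

```python
from statistics import NormalDist


def explained_heritability_ratio(h2_wgs, h2_ped):
    """EHR = h2_WGS / h2_PED."""
    return h2_wgs / h2_ped


def wald_p_value(h2_wgs, se_wgs, h2_ped, se_ped):
    """Two-sided Wald test of h2_WGS = h2_PED, assuming independent estimates."""
    z = (h2_wgs - h2_ped) / (se_wgs ** 2 + se_ped ** 2) ** 0.5
    return 2 * (1 - NormalDist().cdf(abs(z)))


# BMI row of Table 1
print(explained_heritability_ratio(0.339, 0.392))
print(wald_p_value(0.339, 0.009, 0.392, 0.023))
```

For BMI the P value obtained this way falls below 0.05, in line with the tabulated value of 0.031, so BMI is not among the traits whose pedigree heritability is fully recovered.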

5.2.4. Heritability Enrichment ($\widehat{E}(\mathrm{annotation})$)

  • Conceptual Definition: Measures how much more heritability is contributed per variant in a specific genomic annotation (e.g., coding region) compared to the average heritability contributed per variant across the entire genome. An enrichment value > 1 suggests the annotation is disproportionately important for the trait.
  • Mathematical Formula: For coding variants: $ \widehat{E}(\mathrm{coding}) = \frac{\widehat{h^2_{Coding}} / M_{Coding}}{\widehat{h^2_{WGS}} / M_{WGS}} $
  • Symbol Explanation:
    • $\widehat{E}(\mathrm{coding})$: Estimated heritability enrichment in coding variants.
    • $\widehat{h^2_{Coding}}$: Estimated heritability attributable to coding variants.
    • $M_{Coding}$: Number of coding variants.
    • $\widehat{h^2_{WGS}}$: Total estimated WGS-based heritability.
    • $M_{WGS}$: Total number of WGS variants.

5.2.5. Proportion of Variance Explained by GWAS ($\widehat{h^2_{GWAS}}$)

  • Conceptual Definition: The cumulative proportion of phenotypic variance explained by the GWAS-identified common-variant associations (CVAs) or rare-variant associations (RVAs). It quantifies the 'mappability' of genetic effects.
  • Mathematical Formula: $ \widehat{h^2_{GWAS}} = \sum_{j=1}^{m} 2 p_j (1 - p_j) \widehat{\beta}_{jm} \widehat{\beta}_{jc} $
  • Symbol Explanation:
    • $\widehat{h^2_{GWAS}}$: Estimated proportion of phenotypic variance explained by GWAS-identified variants.
    • $m$: Number of SNPs in the focal set of associations.
    • $p_j$: Minor allele frequency of SNP $j$.
    • $\widehat{\beta}_{jm}$: Winner's curse corrected estimated marginal effect size of SNP $j$.
    • $\widehat{\beta}_{jc}$: Winner's curse corrected estimated conditional effect size of SNP $j$.
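
The sum above is straightforward to evaluate once the corrected effect sizes are available. In the sketch below, effect sizes are assumed to be in phenotypic standard deviation units per allele so that the result is a proportion of variance; the allele frequencies and effects are hypothetical, and the winner's curse correction itself is not implemented.

```python
def h2_gwas(mafs, beta_marginal, beta_conditional):
    """Sum of 2 * p_j * (1 - p_j) * beta_jm * beta_jc over the focal set of SNPs."""
    return sum(
        2 * p * (1 - p) * b_m * b_c
        for p, b_m, b_c in zip(mafs, beta_marginal, beta_conditional)
    )


# Hypothetical set of three associations (one common, two rare)
mafs = [0.30, 0.005, 0.001]
beta_marginal = [0.05, 0.40, 0.90]
beta_conditional = [0.04, 0.38, 0.90]
print(h2_gwas(mafs, beta_marginal, beta_conditional))
```

Because each term is weighted by $2 p_j (1 - p_j)$, a rare variant needs a much larger effect size than a common one to explain the same share of variance.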

5.2.6. Pearson's Correlation Coefficient ($R$)

  • Conceptual Definition: A measure of the linear correlation between two sets of data. It ranges from -1 to +1, where +1 indicates a perfect positive linear correlation, -1 indicates a perfect negative linear correlation, and 0 indicates no linear correlation.
  • Mathematical Formula: $ R = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} $
  • Symbol Explanation:
    • $R$: Pearson's correlation coefficient.
    • $n$: Number of data points.
    • $x_i, y_i$: Individual data points for variables $X$ and $Y$.
    • $\bar{x}, \bar{y}$: Mean of variable $X$ and $Y$, respectively.
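
The formula translates directly into code. The sketch below is a plain-Python version with toy inputs; the function name `pearson_r` and the data are illustrative only.

```python
def pearson_r(xs, ys):
    """Pearson correlation: covariance sum divided by the root of the product of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5


# Toy example: a perfectly linear increasing relationship
print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))
```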

5.2.7. P-value

  • Conceptual Definition: In hypothesis testing, the P-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. A small P-value (typically < 0.05, or < $5 \times 10^{-9}$ for genome-wide significance) suggests that the observed data are unlikely under the null hypothesis, leading to its rejection.
  • Mathematical Formula: Not a single formula, as it's derived from various statistical tests (e.g., Wald test for heritability significance, F-test for variance explained, Pearson's correlation test).
  • Symbol Explanation: No specific symbols, it's a probability.

5.3. Baselines

The paper compares its WGS-based heritability estimates and GWAS findings against several implicit and explicit baselines:

  • Pedigree-based heritability ($h^2_{PED}$): This serves as the primary benchmark for the total additive genetic variance of a trait. The goal is to see how closely $h^2_{WGS}$ approaches $h^2_{PED}$, thereby addressing the still-missing heritability gap.
  • Previous WGS and WES studies: Earlier studies using WGS (e.g., TOPMed) or WES (e.g., UK Biobank WES) provide context for the advancements in precision and scope. The paper highlights how its larger sample size overcomes the limitations of these prior works.
  • Imputation-based GWAS: The study explicitly compares its WGS-based GWAS results (number of associations, fine-mapping resolution) with those obtained using imputed genotypes from common reference panels like HRC+UK10K and TOPMed. This demonstrates the gain in discovery power and resolution offered by true WGS data, especially for rare variants.
  • LD Score Regression estimates: In supplementary analyses, the paper empirically assesses the limits of LD score regression for estimating heritability, comparing its estimates with those from GREML when including rare variants.

6. Results & Analysis

6.1. Core Results Analysis

6.1.1. WGS Captures Most Pedigree-based Heritability

The study's central finding is that WGS data from 347,630 unrelated individuals of European ancestry in the UK Biobank captures a substantial portion of the pedigree-based narrow sense heritability ($h^2_{PED}$). On average across 34 complex traits and diseases, WGS-based heritability ($h^2_{WGS}$) explained approximately 88% of $h^2_{PED}$ (median 0.87). This indicates that the still-missing heritability (the gap between $h^2_{PED}$ and heritability explained by common SNPs) is largely accounted for by the genetic variants captured by WGS.

For 15 traits (Albumin (ALB), Alkaline phosphatase (ALK), Heel bone mineral density (BMD), Creatinine (CREA), C-reactive protein (CRP), Diastolic blood pressure (DBP), Forced expiratory volume in 1s (FEV1), Haemoglobin concentration (Hb), Hypertension (HypT), LDL cholesterol levels (LDL), Neuroticism score (NEURO), Red blood cell count (RBC), Systolic blood pressure (SBP), Sleep duration (SLP), and Triglycerides levels (TG)), there was no significant difference (two-sided Wald test $P > 0.05$) between the $\widehat{h^2_{WGS}}$ and $\widehat{h^2_{PED}}$ estimates. This suggests that, for these traits, narrow sense heritability is almost fully explained by WGS data. This is a crucial finding, as it implies the missing heritability for these traits is no longer "missing" but rather hidden within the rare and non-coding variants now accessible by WGS.

The average $h^2_{WGS}$ across the 34 traits was 0.284 (s.e. 0.002), ranging from 0.075 (number of children) to 0.709 (height). These estimates for height, BMI (0.339, s.e. 0.009), and smoking status (0.174, s.e. 0.015) were consistent with previous, less precise WGS studies from TOPMed.

6.1.2. Contribution of Rare vs. Common Variants

On average, the 88% of $h^2_{PED}$ captured by WGS decomposes into 20% from rare variants (MAF < 1%) and 68% from common variants (MAF ≥ 1%). This highlights the significant, albeit smaller, contribution of rare variants to overall heritability. The average rare-variant heritability across traits was 0.063 (s.e. 0.002), representing about 22% of the mean $h^2_{WGS}$ (0.063 / 0.284). Educational attainment showed the largest contribution from rare variants (approximately 43% of its $h^2_{WGS}$), while for bone mineral density and LDL cholesterol less than 12% of $h^2_{WGS}$ came from variants with MAF between 0.01% and 1%.

The correlation between the rare-variant and common-variant components of $\widehat{h^2_{WGS}}$ was moderate ($R = 0.55$, $P = 8.5 \times 10^{-4}$), indicating some shared genetic architecture but also trait-specific differences in the relative importance of common vs. rare variants.

The following are the results from [Table 1] of the original paper:

| Phenotype | Acronym | $h^2_{PED}$ | s.e. ($h^2_{PED}$) | $h^2_{WGS}$ | s.e. ($h^2_{WGS}$) | P |
| --- | --- | --- | --- | --- | --- | --- |
| Albumin | ALB | 0.277 | 0.031 | 0.243 | 0.010 | 0.299 |
| Alkaline phosphatase | ALK | 0.435 | 0.026 | 0.420 | 0.009 | 0.572 |
| Alanine aminotransferase | ALT | 0.148 | 0.029 | 0.190 | 0.010 | 0.156 |
| Heel bone mineral density | BMD | 0.375 | 0.035 | 0.396 | 0.014 | 0.591 |
| BMI | BMI | 0.392 | 0.023 | 0.339 | 0.009 | 0.031 |
| Chronic ischaemic heart disease (I25) | CIHD | 0.300 | 0.113 | 0.228 | 0.026 | 0.539 |
| Creatinine | CREA | 0.244 | 0.028 | 0.295 | 0.009 | 0.077 |
| C-reactive protein | CRP | 0.178 | 0.030 | 0.138 | 0.010 | 0.203 |
| Diastolic blood pressure | DBP | 0.171 | 0.029 | 0.211 | 0.010 | 0.191 |
| Dyslipidaemia (E78) | DISLIP | 0.350 | 0.080 | 0.216 | 0.018 | 0.101 |
| Educational qualification | EA | 0.409 | 0.015 | 0.347 | 0.009 | <0.001 |
| Forced expiratory volume in 1s | FEV1 | 0.313 | 0.033 | 0.299 | 0.011 | 0.689 |
| Fluid intelligence score | FI | 0.405 | 0.036 | 0.328 | 0.027 | 0.084 |
| Hand grip strength | GRIP | 0.310 | 0.028 | 0.223 | 0.009 | 0.003 |
| Haemoglobin concentration | Hb | 0.272 | 0.028 | 0.272 | 0.009 | 0.987 |
| HDL cholesterol levels | HDL | 0.541 | 0.029 | 0.398 | 0.009 | <0.001 |
| Standing height | HT | 0.882 | 0.010 | 0.709 | 0.006 | <0.001 |
| Hypertension (I10) | HypT | 0.251 | 0.070 | 0.253 | 0.015 | 0.986 |
| IGF-1 | IGF | 0.405 | 0.028 | 0.354 | 0.009 | 0.083 |
| LDL cholesterol levels | LDL | 0.239 | 0.029 | 0.228 | 0.010 | 0.705 |
| Mean corpuscular volume | MCV | 0.509 | 0.027 | 0.413 | 0.008 | <0.001 |
| Number of children | NC | 0.152 | 0.028 | 0.075 | 0.010 | 0.010 |
| Neuroticism score | NEURO | 0.212 | 0.034 | 0.185 | 0.011 | 0.455 |
| Platelet count | PLAT | 0.554 | 0.027 | 0.457 | 0.008 | <0.001 |
| Red blood cell count | RBC | 0.388 | 0.027 | 0.355 | 0.009 | 0.251 |
| Systolic blood pressure | SBP | 0.188 | 0.029 | 0.217 | 0.010 | 0.333 |
| Sleep duration | SLP | 0.106 | 0.028 | 0.125 | 0.009 | 0.523 |
| Ever smoked | SMK | 0.248 | 0.060 | 0.174 | 0.015 | 0.237 |
| Type 2 diabetes (E11) | T2D | 0.597 | 0.100 | 0.403 | 0.030 | 0.065 |
| Telomere length | TELO | 0.377 | 0.028 | 0.127 | 0.010 | <0.001 |
| Triglycerides levels | TG | 0.323 | 0.029 | 0.287 | 0.009 | 0.240 |
| Vitamin D | VITD | 0.227 | 0.030 | 0.178 | 0.010 | 0.118 |
| White blood cell count | WBC | 0.319 | 0.028 | 0.324 | 0.009 | 0.864 |
| Waist-to-hip ratio | WHR | 0.291 | 0.027 | 0.240 | 0.009 | 0.071 |

The following are the results from [Table 2] of the original paper:

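| Trait | TOPMed data: N | TOPMed data: Estimate (s.e.) | UKB data (this study): N | UKB data (this study): Estimate (s.e.) |
| --- | --- | --- | --- | --- |
| Height | 25,465 | 0.68 (0.10) | 346,828 | 0.709 (0.006) |
| BMI | 25,465 | 0.30 (0.10) | 346,381 | 0.339 (0.008) |
| Smoking initiation | 26,257 | 0.23 (0.10) | 346,215 | 0.174 (0.015) |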

6.1.3. Heritability Enrichment in Coding Variants

Coding variants, which constitute a small fraction of the genome (0.71% of variants with MAF > 0.01% in this study), showed significant heritability enrichment. On average, they accounted for 17.5% of the total $\widehat{h^2_{WGS}}$ across traits (Figure 2a).

  • For rare variants, coding regions accounted for 21.0% of the rare-variant heritability.

  • For common variants, coding regions accounted for 16.9% of the common-variant heritability.

    Relative to their proportion in the genome, this translates to a 36-fold enrichment for common coding variants and a 26-fold enrichment for rare coding variants. This confirms that coding variants disproportionately contribute to heritability. The correlation of heritability enrichment in coding variants between common and rare variants was moderate ($R = 0.56$, $P = 5.6 \times 10^{-4}$), suggesting trait-specific differences in how coding variants contribute across the MAF spectrum. For instance, type 2 diabetes showed significant enrichment for common coding variants but not for rare coding variants, which could be due to lack of power for rare variants in disease or very deleterious rare coding variants being at even lower frequencies.

The following figure (Figure 2 from the original paper) illustrates the relative contribution of coding and non-coding variants to WGS-based heritability:

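Fig. 2 | Relative contribution of coding and non-coding variants to WGS-based heritability. a, This panel represents, across 34 phenotypes, the ratio of proportion of phenotypic variance explained by… Image description: a chart showing the relative contribution of coding and non-coding variants to WGS-based heritability. Panel a shows, across 34 phenotypes, the proportion of phenotypic variance explained by coding versus non-coding variants. Panel b compares heritability enrichment for common and rare coding variants, with error bars, and distinguishes binary from quantitative traits. The correlation is reported as Pearson's correlation coefficient (R) with its P value.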

6.1.4. GWAS Discoveries and Mappability of Rare-Variant Heritability

The GWAS analyses identified a total of 12,129 independent associations ($P < 5 \times 10^{-9}$), comprising 11,243 common-variant associations (CVAs) and 886 rare-variant associations (RVAs).

  • Effect Sizes: After winner's curse correction, RVAs on average explained 0.027% of phenotypic variance, slightly higher than CVAs (0.023%).

  • Mappability:

    • CVAs explained an average of 31% (range 1.9-56%) of the common-variant heritability.
    • RVAs explained an average of 11% (range 0.2-50%) of the rare-variant heritability (Figure 3a).
  • Lipid Traits Highlighted: Lipid-related traits showed a notable enrichment of RVAs (18% of all RVAs were for lipid traits, despite these only being 12% of the traits studied). RVAs for HDL and LDL cholesterol collectively accounted for more than one-third of their estimated rare-variant heritability. This was replicated in the AGD cohort, where RVAs explained approximately 34% of LDL rare-variant heritability and 29% of HDL rare-variant heritability. This demonstrates that a substantial amount of rare-variant heritability is already mappable for certain traits.

  • ALK Trait: Alkaline phosphatase (ALK), a non-lipid trait, was another case in which RVAs (61 in total) explained more than one-third of the rare-variant heritability.

  • Non-WES-covered Loci: Many significant RVAs were detected outside WES-covered loci, including a highly pleiotropic indel (rs754165241) in ASGR1 associated with a 1.43 s.d. increase in ALK levels, explaining ~3% of phenotypic variance. This underscores the value of WGS over WES.

    The following figure (Figure 3 from the original paper) illustrates the characterization of variance explained by trait-associated variants detected in WGS-based GWAS:

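    Fig. 3 | Characterization of variance explained by trait-associated variants detected in WGS-based GWAS. a, Proportion of WGS-based heritability explained by trait-associated variants. Left bars comp… Image description: a chart showing the percentage of phenotypic variance explained by WGS-based GWAS signals. In panel a, the left side shows the proportion of heritability explained by rare-variant associations (RVAs) and the right side the proportion explained by common-variant associations (CVAs). Panel b shows how variance explained by rare variants relates to their distance to the closest common variant. Panel c shows the mean density of CVAs for different window sizes. The figure includes P values and other statistics.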

6.1.5. Genomic Distribution and Colocalization of RVAs

  • Colocalization with CVAs: RVAs showed a significant enrichment near CVAs, with a median DCCVA (distance to closest common variant association) of 27 kb across trait-RVA pairs. The mean density of CVAs within 100 kb of each RVA was 1.8 (Figure 3c). This suggests that RVAs tend to be found in regions already known to harbor common-variant associations.

  • Effect Size and DCCVA: DCCVA was significantly predictive of per-SNP variance explained ($R^2 = 0.007$, $P = 1.85 \times 10^{-2}$), with RVAs closer to CVAs tending to explain more phenotypic variance.

  • Structural Variants: Loci with high RVA-RVA density (at least 2 RVAs within 100 kb) showed a 1.8-fold increased probability of colocalization with a structural variant associated with the same trait. For CVAs, a CVA-CVA density > 2 increased this probability by 1.4-fold. This suggests that complex LD patterns and underlying structural variants contribute to the clustering of both common and rare associations.

    The following figure (Extended Data Fig. 4 from the original paper) illustrates the relationship between estimated effect sizes and allele frequencies for 12,129 trait-associated variants across 34 phenotypes:

    Image description: a scatter plot showing the effects of rare variants on different complex traits and diseases. Binary traits are listed on the left and quantitative traits on the right; the x-axis is minor allele frequency (MAF) on a semi-logarithmic scale and the y-axis is the estimated variant effect ($\beta_{GWAS}$). Red points mark variants that are likely to have an impact, and grey points mark other variants. Several lipid-related genes are labelled.

6.1.6. Sensitivity Analyses and Comparison with Imputation

  • Covariate Robustness: Heritability estimates were generally robust to covariate adjustments, though educational attainment and fluid intelligence were sensitive to fine-scale geographical clusters (Extended Data Fig. 1).
  • Assortative Mating: Assortative mating significantly biased HE regression estimates for height and educational attainment, but AM-adjusted HE estimates and GREML estimates were more consistent (Extended Data Fig. 2).
  • Genetic Correlations: Genetic correlations between phenotypes showed high concordance between estimates derived from common and rare variants (Extended Data Fig. 3).
  • WGS vs. Imputation: WGS detected more independent associations than imputation panels (HRC+UK10K and TOPMed), especially for rare variants (Extended Data Fig. 5a). WGS also provided better fine-mapping resolution, with smaller 95% credible sets compared to imputation, particularly for rare variants (Extended Data Fig. 6). Many WGS-associated variants were missed by imputation (Extended Data Fig. 5b-d), highlighting the unique contribution of WGS.
  • Ultra-rare Variants: Secondary analyses including ultra-rare variants (MAF < 0.01%) showed a modest average increase in $h^2_{WGS}$ (~6%), but a more substantial increase for specific traits like number of children (1.7-fold increase, making its $h^2_{WGS}$ no longer statistically different from $h^2_{PED}$). However, these estimates were less reliable, with some negative heritability estimates indicating potential model misspecification.

6.2. Data Presentation (Tables)

The following are the results from [Table 1] of the original paper:

| Phenotype | Acronym | $h^2_{PED}$ | s.e. ($h^2_{PED}$) | $h^2_{WGS}$ | s.e. ($h^2_{WGS}$) | P |
|---|---|---|---|---|---|---|
| Albumin | ALB | 0.277 | 0.031 | 0.243 | 0.010 | 0.299 |
| Alkaline phosphatase | ALK | 0.435 | 0.026 | 0.420 | 0.009 | 0.572 |
| Alanine aminotransferase | ALT | 0.148 | 0.029 | 0.190 | 0.010 | 0.156 |
| Heel bone mineral density | BMD | 0.375 | 0.035 | 0.396 | 0.014 | 0.591 |
| BMI | BMI | 0.392 | 0.023 | 0.339 | 0.009 | 0.031 |
| Chronic ischaemic heart disease (I25) | CIHD | 0.300 | 0.113 | 0.228 | 0.026 | 0.539 |
| Creatinine | CREA | 0.244 | 0.028 | 0.295 | 0.009 | 0.077 |
| C-reactive protein | CRP | 0.178 | 0.030 | 0.138 | 0.010 | 0.203 |
| Diastolic blood pressure | DBP | 0.171 | 0.029 | 0.211 | 0.010 | 0.191 |
| Dyslipidaemia (E78) | DISLIP | 0.350 | 0.080 | 0.216 | 0.018 | 0.101 |
| Educational qualification | EA | 0.409 | 0.015 | 0.347 | 0.009 | <0.001 |
| Forced expiratory volume in 1s | FEV1 | 0.313 | 0.033 | 0.299 | 0.011 | 0.689 |
| Fluid intelligence score | FI | 0.405 | 0.036 | 0.328 | 0.027 | 0.084 |
| Hand grip strength | GRIP | 0.310 | 0.028 | 0.223 | 0.009 | 0.003 |
| Haemoglobin concentration | Hb | 0.272 | 0.028 | 0.272 | 0.009 | 0.987 |
| HDL cholesterol levels | HDL | 0.541 | 0.029 | 0.398 | 0.009 | <0.001 |
| Standing height | HT | 0.882 | 0.010 | 0.709 | 0.006 | <0.001 |
| Hypertension (I10) | HypT | 0.251 | 0.070 | 0.253 | 0.015 | 0.986 |
| IGF-1 | IGF | 0.405 | 0.028 | 0.354 | 0.009 | 0.083 |
| LDL cholesterol levels | LDL | 0.239 | 0.029 | 0.228 | 0.010 | 0.705 |
| Mean corpuscular volume | MCV | 0.509 | 0.027 | 0.413 | 0.008 | <0.001 |
| Number of children | NC | 0.152 | 0.028 | 0.075 | 0.010 | 0.010 |
| Neuroticism score | NEURO | 0.212 | 0.034 | 0.185 | 0.011 | 0.455 |
| Platelet count | PLAT | 0.554 | 0.027 | 0.457 | 0.008 | <0.001 |
| Red blood cell count | RBC | 0.388 | 0.027 | 0.355 | 0.009 | 0.251 |
| Systolic blood pressure | SBP | 0.188 | 0.029 | 0.217 | 0.010 | 0.333 |
| Sleep duration | SLP | 0.106 | 0.028 | 0.125 | 0.009 | 0.523 |
| Ever smoked | SMK | 0.248 | 0.060 | 0.174 | 0.015 | 0.237 |
| Type 2 diabetes (E11) | T2D | 0.597 | 0.100 | 0.403 | 0.030 | 0.065 |
| Telomere length | TELO | 0.377 | 0.028 | 0.127 | 0.010 | <0.001 |
| Triglycerides levels | TG | 0.323 | 0.029 | 0.287 | 0.009 | 0.240 |
| Vitamin D | VITD | 0.227 | 0.030 | 0.178 | 0.010 | 0.118 |
| White blood cell count | WBC | 0.319 | 0.028 | 0.324 | 0.009 | 0.864 |
| Waist-to-hip ratio | WHR | 0.291 | 0.027 | 0.240 | 0.009 | 0.071 |
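
The P column in Table 1 is numerically consistent with a two-sided Wald test on the difference between the two estimates. The following is a minimal sketch, assuming the pedigree-based and WGS-based estimates are independent and approximately normally distributed. That is an assumption made here for illustration, not a confirmed description of the authors' procedure, and the function name `wald_p_value` is hypothetical.

```python
import math


def wald_p_value(est_a, se_a, est_b, se_b):
    """Two-sided P value for H0: est_a == est_b.

    Assumes the two estimates are independent and normally distributed,
    so the variance of their difference is se_a**2 + se_b**2.
    """
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    # For a standard normal Z, P(|Z| > |z|) equals erfc(|z| / sqrt(2)).
    return math.erfc(abs(z) / math.sqrt(2))


# (h2_PED, s.e., h2_WGS, s.e.) copied from Table 1 for three traits.
table1_rows = {
    "BMI": (0.392, 0.023, 0.339, 0.009),
    "HT": (0.882, 0.010, 0.709, 0.006),
    "Hb": (0.272, 0.028, 0.272, 0.009),
}

for acronym, (h2_ped, se_ped, h2_wgs, se_wgs) in table1_rows.items():
    p = wald_p_value(h2_ped, se_ped, h2_wgs, se_wgs)
    print(f"{acronym}: P = {p:.3g}")
```

With the tabulated inputs, BMI gives a P value of roughly 0.03, in line with the 0.031 reported. Small deviations from the table are expected because the inputs are rounded to three decimals.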

The following are the results from [Table 2] of the original paper:

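| Trait | TOPMed data: N | TOPMed data: Estimate (s.e.) | UKB data (this study): N | UKB data (this study): Estimate (s.e.) |
|---|---|---|---|---|
| Height | 25,465 | 0.68 (0.10) | 346,828 | 0.709 (0.006) |
| BMI | 25,465 | 0.30 (0.10) | 346,381 | 0.339 (0.008) |
| Smoking initiation | 26,257 | 0.23 (0.10) | 346,215 | 0.174 (0.015) |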
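
To make the precision gain in Table 2 concrete, the sketch below compares the ratio of sample sizes with the ratio of standard errors for each trait. It relies on a common rule of thumb, assumed here rather than taken from the paper, that the standard error of a GREML estimate in unrelated individuals shrinks roughly in proportion to 1/N. The variable names are illustrative.

```python
# (TOPMed N, TOPMed s.e., UKB N, UKB s.e.) copied from Table 2.
table2_rows = {
    "Height": (25_465, 0.10, 346_828, 0.006),
    "BMI": (25_465, 0.10, 346_381, 0.008),
    "Smoking initiation": (26_257, 0.10, 346_215, 0.015),
}

ratios = {}
for trait, (n_topmed, se_topmed, n_ukb, se_ukb) in table2_rows.items():
    n_ratio = n_ukb / n_topmed      # how many times larger the UKB sample is
    se_ratio = se_topmed / se_ukb   # how many times smaller the UKB s.e. is
    ratios[trait] = (n_ratio, se_ratio)
    print(f"{trait}: N ratio = {n_ratio:.1f}, s.e. ratio = {se_ratio:.1f}")
```

For height and BMI the two ratios are of similar magnitude, which is what the 1/N rule of thumb would suggest. The smaller gain for smoking initiation may reflect that it is a binary trait, although the table alone does not establish this.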

6.3. Ablation Studies / Parameter Analysis

6.3.1. Covariate Adjustment Sensitivity

The study performed sensitivity analyses by varying the sets of covariates included in GREML estimations (Extended Data Fig. 1). This involved using different combinations of base covariates, genotypic principal components (PCs), and k-means based birthplace clusters.

  • Results: For most traits, heritability estimates were robust to different covariate adjustments, showing minimal changes. This indicates that the primary estimates are not heavily influenced by the specific choice of covariate model.

  • Exceptions: Educational attainment and fluid intelligence score were notable exceptions. Their uncorrected $h^2_{WGS}$ estimates were significantly inflated. This inflation was attributed to fine-scale geographical structure within the UK that was not fully captured by genotypic principal components alone. Adjusting for birthplace clusters was crucial for these traits, highlighting the importance of considering geographical information, especially for behavioral traits linked to migration patterns.

The following figure (Extended Data Fig. 1 from the original paper) illustrates the sensitivity analyses showing the effect of covariate adjustment on WGS-based heritability estimates:

Figure description: a schematic showing how different covariate adjustments affect the heritability estimates for the 34 complex traits and diseases. The x-axis lists the adjustment schemes, the y-axis shows $h^2_{WGS} - h^2_{PED}$, and each colored line represents one trait, tracing how its estimate changes across adjustments.
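
The birthplace-cluster covariates described above can be illustrated with a small sketch. It runs a plain k-means (Lloyd's algorithm) on birthplace coordinates and turns the cluster labels into indicator covariates. The coordinates, the number of clusters, and the function names `kmeans` and `one_hot` are all hypothetical; the paper only states that k-means based birthplace clusters were used, not how they were built.

```python
import math


def kmeans(points, k, iters=50):
    """Plain Lloyd's algorithm; the first k points seed the centroids."""
    centroids = [tuple(p) for p in points[:k]]
    labels = None
    for _ in range(iters):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(
                    sum(axis) / len(members) for axis in zip(*members)
                )
    return labels, centroids


def one_hot(labels, k):
    """Indicator covariates; cluster 0 is dropped as the reference level."""
    return [[1.0 if lab == c else 0.0 for c in range(1, k)] for lab in labels]


# Made-up (east, north) birthplace coordinates forming two distant groups.
birthplaces = [(0.0, 0.0), (10.0, 10.0), (0.5, 0.2),
               (9.5, 10.1), (0.1, 0.4), (10.2, 9.8)]
labels, centroids = kmeans(birthplaces, k=2)
covariates = one_hot(labels, k=2)
print(labels)
print(covariates)
```

Dropping one cluster as the reference level avoids collinearity with the intercept when the indicators are added alongside the base covariates and principal components.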

6.3.2. Assortative Mating (AM) Effects

The paper compared GREML and Haseman-Elston (HE) regression estimates and further adjusted for assortative mating (AM) effects (Extended Data Fig. 2).

  • Initial Discrepancy: HE regression estimates for height (0.862, s.e. 0.01) and educational attainment (0.464, s.e. 0.011) were substantially higher than GREML estimates (0.709, s.e. 0.006 for height; 0.347, s.e. 0.009 for EA).

  • AM Adjustment: These discrepancies are expected because assortative mating (where individuals with similar traits tend to mate) is known to differentially affect these two methods. After assortative mating adjustment (assuming a spousal correlation of 0.2 for height and 0.4 for EA) to convert HE estimates to an expected value under random mating, the HE estimates for height (0.702, s.e. 0.008) and educational attainment (0.353, s.e. 0.007) became highly consistent with the GREML estimates.

  • Conclusion: This analysis demonstrates the importance of accounting for assortative mating when interpreting heritability estimates, particularly for traits known to be influenced by it. It supports the validity of the GREML estimates as representing additive genetic variance under random mating conditions.

The following figure (Extended Data Fig. 2 from the original paper) illustrates the effect of assortative mating (AM) on heritability estimates:

Figure description: a chart comparing heritability estimates for two complex traits (HT and EA) across different estimation methods. It shows each method's estimate with error bars and indicates how the estimates relate to the conventional pedigree-based estimates.

7. Conclusion & Reflections

7.1. Conclusion Summary

This study substantially advances our understanding of the genetic architecture of human complex traits and diseases. By leveraging whole-genome sequencing (WGS) data from nearly 350,000 European ancestry individuals in the UK Biobank, the researchers provided high-precision estimates of WGS-based heritability ($h^2_{WGS}$). A pivotal finding is that WGS data, on average, captures approximately 88% of the pedigree-based narrow sense heritability ($h^2_{PED}$). Crucially, for 15 quantitative traits, $h^2_{WGS}$ was not significantly different from $h^2_{PED}$, which leaves no statistically detectable still-missing heritability for these phenotypes.

The study clarified the roles of different variant classes: of the roughly 88% of $h^2_{PED}$ captured by WGS, rare variants (MAF < 1%) account for approximately 20% and common variants (MAF ≥ 1%) for 68%. Furthermore, non-coding variants were shown to account for a substantial 79% of the rare-variant heritability, highlighting their critical, often overlooked, role. Through extensive GWAS analyses, the study identified numerous common-variant associations (CVAs) and rare-variant associations (RVAs). Notably, for lipid-related traits, over one-quarter of their rare-variant heritability could be mapped to specific loci, demonstrating the mappability of some rare-variant effects even with current sample sizes. The observed colocalization of RVAs and CVAs suggests potential strategies for future GWAS discovery.
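
As a back-of-the-envelope check on these shares, taking the 20% and 68% figures as fractions of pedigree-based heritability (as the abstract states) and assuming the trait-averaged percentages can simply be combined, which is a simplification made here:

$$
\underbrace{0.20}_{\text{rare}} + \underbrace{0.68}_{\text{common}} = 0.88,
\qquad
\underbrace{0.79 \times 0.20}_{\text{rare non-coding}} \approx 0.16 .
$$

Under these assumptions, rare non-coding variants alone would correspond to roughly 16% of $h^2_{PED}$ on average, leaving about 12% of $h^2_{PED}$ not captured by WGS.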

7.2. Limitations & Future Work

The authors acknowledge several limitations:

  • Ancestry Restriction: Analyses were limited to European ancestry individuals due to insufficient sample sizes for other ancestry groups in the UK Biobank. This restricts the generalizability of the findings and highlights the need for heritability studies in diverse populations.

  • MAF Threshold: The primary analyses focused on variants with MAF > 0.01%. While secondary analyses including ultra-rare variants (MAF < 0.01%) showed some additional heritability (e.g., a 1.7-fold increase for number of children), estimates for these variants were less precise and sometimes exhibited negative heritability, indicating potential model misspecification or biases not yet fully understood.

  • Precision for Common Diseases: Rare-variant heritability estimates for many common diseases were not significantly different from zero, reflecting a lack of statistical power in population-based biobank data for these specific conditions.

  • Sex Chromosomes: The study focused on autosomal variants, leaving the contribution of sex chromosomes unexplored. However, previous work suggests their contribution to SNP-based heritability is small (<3%).

  • Genome Build Gaps: The study used the hg38 genome build, which misses approximately 8% of the DNA sequence compared to more recent telomere-to-telomere (T2T) builds. This missing data could contribute to the remaining still-missing heritability. Supplementary analyses indicate that common variants outside hg38 contribute some additional common-variant heritability.

  • Healthy-Volunteer Bias: Pedigree-based heritability estimates for diseases in the UK Biobank might be downwardly biased due to the healthy-volunteer bias in participation.

Future work should focus on:

  • Expanding WGS studies to larger and more diverse ancestry groups.

  • Improving statistical methods for reliably estimating heritability from ultra-rare variants and accurately modeling their effects.

  • Utilizing case-control designs and even larger sample sizes for common diseases to increase precision.

  • Incorporating sex chromosome variants and addressing structural variants more comprehensively.

  • Developing and using newer T2T genome builds to capture currently missing genetic variation.

  • Developing polygenic scores that integrate rare variants and account for interactions between functional annotations and MAF.

  • Utilizing the colocalization of rare-variant and common-variant heritability to improve GWAS discovery for rare non-coding variants, for example, through burden test analyses within GWAS-associated loci.

7.3. Personal Insights & Critique

This paper significantly advances our understanding of missing heritability. The scale of the UK Biobank WGS data yields small standard errors (around 0.01 for most $h^2_{WGS}$ estimates in Table 1), allowing far firmer statements about the contribution of rare variants and non-coding regions, which were previously speculative. The finding that WGS largely closes the heritability gap for many traits is important, redirecting research efforts from simply finding the "missing" variance to thoroughly characterizing the functional mechanisms of already captured genetic variation.

The demonstration that a substantial portion of rare-variant heritability is already mappable for traits like lipid levels is encouraging for precision medicine. It implies that future polygenic scores incorporating rare variants could see improved predictive power, potentially by up to 20%, as suggested by the authors. The observation that RVAs and CVAs tend to colocalize is a practical insight, suggesting that future GWAS for rare variants might be most fruitful by focusing on regions already implicated by common variants, rather than searching blindly across the entire genome.

Critically, while the paper largely resolves the still-missing heritability for many traits, it highlights that some traits (e.g., number of children, telomere length) still show a substantial gap. This indicates that other factors, such as ultra-rare variants, structural variants not well-tagged by SNPs, non-additive genetic effects, or gene-environment interactions, likely play a larger role for these specific traits. The caution regarding ultra-rare variant heritability estimation (due to model misspecification leading to negative heritability) is a vital self-critique, emphasizing that statistical methods need further refinement for these extremely rare variants.

From a broader perspective, the study underscores the persistent bias towards European ancestry in large-scale genetic research. While understandable given data availability, this limits direct applicability to other populations and necessitates similar WGS efforts in diverse cohorts. The empirical assessment of LD score regression for rare variants also provides a valuable technical contribution, guiding methodological development in heritability estimation from summary statistics.

Overall, this paper provides a robust foundation for future genetic research, shifting the focus from "is heritability missing?" to "how can we fully leverage all available genetic information to understand and predict complex traits?". Its methods and conclusions are highly transferable, offering a blueprint for analyzing other large WGS datasets and refining our understanding of genetic architecture across biology.
