Estimation and mapping of the missing heritability of human phenotypes
TL;DR Summary
This study analyzes whole-genome sequencing data from the UK Biobank to quantify the impact of rare non-coding variants on heritability of 34 complex traits. It finds that WGS captures about 88% of narrow-sense heritability and identifies significant loci for lipid traits.
Abstract
Rare coding variants shape inter-individual differences in human phenotypes. However, the contribution of rare non-coding variants to those differences remains poorly characterized. Here we analyse whole-genome sequence (WGS) data from 347,630 individuals with European ancestry in the UK Biobank to quantify the relative contribution of 40 million single-nucleotide and short indel variants to the heritability of 34 complex traits and diseases. On average, we find that WGS captures approximately 88% of the pedigree-based narrow sense heritability, which is derived from 20% rare variants and 68% common variants. We identified 15 traits with no significant difference between WGS-based and pedigree-based heritability estimates. Overall, our study provides high-precision estimates of rare-variant heritability and demonstrates significant mapping of specific loci for lipid traits.
1. Bibliographic Information
1.1. Title
Estimation and mapping of the missing heritability of human phenotypes
1.2. Authors
Pierrick Wainschtein, Yuanxiang Zhang, Jeremy Schwartzentruber, Irfahan Kassam, Julia Sidorenko, Petko P. Fiziev, Huanwei Wang, Jeremy McRae, Richard Border, Noah Zaitlen, Sriram Sankararaman, Michael E. Goddard, Jian Zeng, Peter M. Visscher, Kyle Kai-How Farh & Loic Yengo.
The authors represent a collaborative effort from various institutions, including Illumina Inc., The University of Queensland, and others involved in academic research and biotechnology. Their diverse affiliations suggest expertise spanning genomics, statistical genetics, bioinformatics, and population genetics.
1.3. Journal/Conference
Nature. Nature is one of the world's most prestigious and highly cited multidisciplinary scientific journals, known for publishing significant original research across all fields of science and technology. Publication in Nature indicates that the research is considered to be of exceptional importance and broad interest.
1.4. Publication Year
2025 (Published online: 12 November 2025)
1.5. Abstract
The paper investigates the contribution of rare non-coding variants to inter-individual differences in human phenotypes, a poorly characterized area despite known effects of rare coding variants. Utilizing whole-genome sequence (WGS) data from 347,630 individuals of European ancestry in the UK Biobank, the study quantifies the relative contribution of 40 million single-nucleotide and short indel variants to the heritability of 34 complex traits and diseases. The key findings indicate that WGS captures approximately 88% of the pedigree-based narrow sense heritability, with 20% attributed to rare variants (minor allele frequency, MAF < 1%) and 68% to common variants (MAF ≥ 1%). It further delineates that coding and non-coding variants account for 21% and 79% of the rare-variant WGS-based heritability, respectively. For 15 traits, no significant difference was observed between WGS-based and pedigree-based heritability estimates, suggesting their heritability is largely explained by WGS data. The study also identified 11,243 common-variant associations and 886 rare-variant associations, demonstrating significant mapping of specific loci for lipid traits, where over 25% of rare-variant heritability can be mapped using fewer than 500,000 fully sequenced genomes.
1.6. Original Source Link
/files/papers/6919acc6110b75dcc59ae266/paper.pdf (Published, DOI: 10.1038/s41586-025-09720-6)
2. Executive Summary
2.1. Background & Motivation
The paper addresses the long-standing missing heritability problem in human genetics. While it's known that human traits are heritable and influenced by many DNA variants, the full extent of this genetic contribution has been difficult to quantify. Previous studies using common genetic variants (like single-nucleotide polymorphisms or SNPs with minor allele frequency (MAF) > 1% or 5%) only explained a fraction of the heritability estimated from family studies (pedigree-based heritability). This discrepancy was termed still-missing heritability.
The core challenge is that current methods for estimating SNP-based heritability ($h^2_{\mathrm{SNP}}$) have largely been restricted to common variants due to technological and sample size limitations. This leaves the contribution of rare variants (MAF < 1%) and especially rare non-coding variants (variants outside protein-coding regions) poorly characterized. Understanding the full genetic architecture, including the role of rare variants, is crucial for improving the statistical power of genome-wide association studies (GWAS) and enhancing the prediction accuracy of genetic risk scores for complex traits and diseases.
The paper's innovative entry point is to leverage the unprecedented scale of whole-genome sequencing (WGS) data from the UK Biobank (nearly 350,000 individuals with WGS data) to precisely quantify the contribution of both common and rare variants, including the often-overlooked rare non-coding variants, to trait heritability. This large sample size allows for high-precision estimates, which was a limitation of earlier, smaller WGS studies.
2.2. Main Contributions / Findings
The paper makes several significant contributions to the field of quantitative genetics:
- High-precision estimates of WGS-based heritability ($h^2_{\mathrm{WGS}}$): The study provides accurate estimates of heritability for 34 complex traits and diseases using WGS data, with low standard errors (as low as 0.6% for quantitative traits). This precision was previously unachievable due to smaller WGS sample sizes.
- Quantification of rare variant contribution: On average across traits, WGS captures 88% of the pedigree-based heritability, of which approximately 20 percentage points come from rare variants (MAF < 1%) and 68 percentage points from common variants (MAF ≥ 1%).
- Resolution of still-missing heritability for many traits: For 15 traits (including 15 quantitative traits with small standard errors), the WGS-based heritability estimates were not significantly different from traditional pedigree-based heritability estimates. This suggests that for these traits, the still-missing heritability is largely accounted for by the variants captured by WGS. On average across all traits, WGS captures 88% of the pedigree-based narrow sense heritability.
- Partitioning of rare-variant heritability by genomic region: The study demonstrates that coding variants account for 21% and non-coding variants for 79% of the rare-variant WGS-based heritability, emphasizing the substantial role of the non-coding genome even for rare variants.
- Mapping of specific loci, especially for lipid traits: Through GWAS analyses, the study identified 11,243 common-variant associations (CVAs) and 886 rare-variant associations (RVAs). Notably, for lipid-related traits (e.g., LDL and HDL cholesterol), RVAs collectively explain over one-quarter of their rare-variant heritability, demonstrating that a substantial portion of rare-variant heritability is already mappable.
- Characterization of RVA genomic distribution and colocalization: RVAs tend to colocalize with CVAs, and RVAs closer to CVAs explain more phenotypic variance. This insight can inform future strategies for GWAS discovery.

These findings significantly advance our understanding of the genetic architecture of complex traits, reduce uncertainty about the role of non-additive effects, and provide benchmarks for improving polygenic scores and GWAS methodologies.
3. Prerequisite Knowledge & Related Work
3.1. Foundational Concepts
3.1.1. Heritability
Heritability in genetics refers to the proportion of phenotypic variation in a population that is attributable to genetic variation among individuals. It quantifies how much of the differences we see in a trait (like height or disease risk) across people is due to their genes, rather than their environment.
- Narrow-sense heritability ($h^2$): Specifically refers to the proportion of phenotypic variance explained by additive genetic effects. Additive effects are when the effect of each allele (a specific form of a gene) simply adds up across different genes to influence a trait. This is the most common form of heritability studied because it is directly relevant to how traits respond to selection and to the predictive power of genetic markers.
- Pedigree-based heritability ($h^2_{\mathrm{PED}}$): This is traditionally estimated by observing the resemblance of traits among relatives (e.g., parents and offspring, siblings, cousins) within families or large pedigrees. By comparing the phenotypic similarity of individuals with known degrees of genetic relatedness, statistical models can infer the proportion of variation due to shared genes. These estimates capture additive, dominant, and epistatic (gene-gene interaction) genetic effects, as well as shared environmental effects, which can sometimes inflate additive genetic variation estimates if not properly accounted for.
- SNP-based heritability ($h^2_{\mathrm{SNP}}$): This is estimated from SNP data (often using genome-wide array data) in a population of ostensibly unrelated individuals. It quantifies the proportion of phenotypic variance explained by the additive effects of the SNPs included in the analysis. This method typically uses statistical approaches like Genome-wide Complex Trait Analysis (GCTA) or LD Score Regression. When only common SNPs are used, $h^2_{\mathrm{SNP}}$ is usually lower than $h^2_{\mathrm{PED}}$, contributing to the missing heritability problem.
- WGS-based heritability ($h^2_{\mathrm{WGS}}$): This is the SNP-based heritability specifically estimated using whole-genome sequencing (WGS) data, which captures a much broader spectrum of genetic variation, including rare variants and short indels, compared to SNP arrays. The goal is to see if including these additional variants can explain more of the pedigree-based heritability.
3.1.2. Minor Allele Frequency (MAF)
Minor allele frequency (MAF) refers to the frequency of the less common allele (variant form of a gene) at a particular locus (specific position on a chromosome) in a given population.
- Common variants: Typically defined as SNPs or indels with a MAF greater than 1% or 5%. These are widespread in the population.
- Rare variants: Defined as SNPs or indels with a MAF less than 1%. These are less common and often have larger effects on traits or diseases.
- Ultra-rare variants: SNPs or indels with an extremely low MAF, often less than 0.01%.
3.1.3. Complex Traits and Diseases
Complex traits (also called quantitative traits) are characteristics that are influenced by multiple genes (polygenic) and environmental factors, and typically show continuous variation in a population (e.g., height, blood pressure, BMI). Complex diseases (e.g., type 2 diabetes, coronary artery disease) are also influenced by multiple genetic and environmental factors but often have a dichotomous outcome (presence or absence of disease).
3.1.4. Single-Nucleotide Polymorphisms (SNPs) and Indels
- SNPs are variations in a single DNA base pair (e.g., at a specific position, one individual might have an 'A' while another has a 'G'). They are the most common type of genetic variation.
- Indels are insertions or deletions of a small number of DNA base pairs (typically 1-50 bp).
3.1.5. Whole-Genome Sequencing (WGS) and Whole-Exome Sequencing (WES)
- Whole-Genome Sequencing (WGS): A comprehensive genetic test that determines the complete DNA sequence of an organism's entire genome. This includes both coding regions (exons) and non-coding regions (introns, intergenic regions), capturing virtually all types of genetic variation.
- Whole-Exome Sequencing (WES): A targeted sequencing approach that sequences only the protein-coding regions of the genome (exons), which make up about 1-2% of the total genome. WES is less expensive than WGS but misses variants in non-coding regions.
3.1.6. Genome-Wide Association Studies (GWAS)
GWAS is an observational study of a genome-wide set of genetic variants in different individuals to see if any variant is associated with a trait. Typically, GWAS focuses on SNPs. When a SNP or indel is found to be more frequent in individuals with a particular trait or disease, it is said to be associated with that trait/disease.
3.1.7. Linkage Disequilibrium (LD)
Linkage Disequilibrium (LD) refers to the non-random association of alleles at different loci (positions on a chromosome). When alleles at two loci are in LD, they tend to be inherited together more often than would be expected by chance. LD is crucial for GWAS because it allows associations to be detected at SNPs that are not themselves causal but that tag a nearby causal variant through LD.
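As a small illustration of tagging, LD between two variants is commonly summarized by the squared correlation ($r^2$) of their allele counts. The sketch below uses made-up genotypes for eight individuals; the names `ld_r2`, `snp_causal`, `snp_tag` and `snp_far` are hypothetical and not taken from the paper.

```python
import numpy as np

def ld_r2(geno_a, geno_b):
    """Squared Pearson correlation between two vectors of allele counts (0/1/2)."""
    a = np.asarray(geno_a, dtype=float)
    b = np.asarray(geno_b, dtype=float)
    return np.corrcoef(a, b)[0, 1] ** 2

# Hypothetical allele counts for three SNPs in eight individuals.
snp_causal = [0, 1, 2, 0, 1, 2, 0, 1]
snp_tag = [0, 1, 2, 0, 1, 2, 0, 2]   # differs from snp_causal in one individual
snp_far = [1, 0, 1, 2, 0, 1, 2, 0]   # weakly correlated with snp_causal

r2_tag = ld_r2(snp_causal, snp_tag)  # high: snp_tag is a good proxy
r2_far = ld_r2(snp_causal, snp_far)  # low: snp_far carries little information
```

A SNP with high $r^2$ to the causal variant will show an association signal almost as strong as the causal variant itself, which is why array-based GWAS can detect loci without genotyping the causal allele.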
3.1.8. Genomic Relationship Matrix (GRM)
A Genomic Relationship Matrix (GRM) quantifies the genetic similarity (relatedness) between every pair of individuals in a sample based on their SNP genotypes across the genome. Unlike pedigree relationships (which are theoretical expectations), GRMs capture actual realized sharing of genetic material. It's a key input for GREML methods.
3.1.9. GREML (Genomic Restricted Maximum Likelihood)
GREML is a statistical method used to estimate the variance components of complex traits, including SNP-based heritability, from genome-wide SNP data and a GRM. It estimates how much of the total phenotypic variance is explained by genetic factors (captured by the GRM) and how much by residual (environmental or uncaptured genetic) factors. The paper uses GREML-LDMS, which partitions the genome into different MAF and LD bins to better model the contribution of variants with different characteristics.
3.1.10. Haseman-Elston (HE) Regression
Haseman-Elston (HE) regression is an older, simpler method for estimating SNP-based heritability by regressing the squared phenotypic difference between pairs of individuals onto their genomic relatedness (from the GRM). It is computationally less intensive than GREML but can be biased by factors like assortative mating.
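The idea can be sketched on simulated data. For standardized phenotypes, the expected squared difference between two individuals is $2 - 2h^2 A_{ij}$, so the regression slope on relatedness is $-2h^2$. Everything below (sample size, number of SNPs, the true heritability of 0.5, and the function name `he_regression`) is an illustrative assumption, not the paper's pipeline, and no covariates are adjusted for.

```python
import numpy as np

def he_regression(y, grm):
    """Estimate h2 by regressing squared pairwise phenotype differences on relatedness."""
    y = (y - y.mean()) / y.std()
    iu = np.triu_indices(len(y), k=1)            # each pair once, diagonal excluded
    sq_diff = (y[:, None] - y[None, :])[iu] ** 2
    slope = np.polyfit(grm[iu], sq_diff, 1)[0]   # E[(yi - yj)^2] = 2 - 2 * h2 * A_ij
    return -slope / 2.0

# Simulated toy data (all values hypothetical).
rng = np.random.default_rng(0)
n, m, h2_true = 1000, 200, 0.5
freqs = rng.uniform(0.1, 0.5, size=m)
x = rng.binomial(2, freqs, size=(n, m)).astype(float)
z = (x - 2 * freqs) / np.sqrt(2 * freqs * (1 - freqs))   # standardized genotypes
grm = z @ z.T / m

g = z @ rng.normal(size=m)                               # additive genetic values
g = np.sqrt(h2_true) * (g - g.mean()) / g.std()
e = rng.normal(size=n)                                   # residuals
e = np.sqrt(1 - h2_true) * (e - e.mean()) / e.std()

h2_hat = he_regression(g + e, grm)
```

Because only a simple regression over pairs is needed, the cost grows with the number of pairs rather than with repeated likelihood evaluations, which is the source of the computational advantage over GREML.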
3.1.11. Liability Scale
For binary traits (like diseases: presence/absence), heritability is often reported on the liability scale. This assumes an underlying continuous liability (predisposition) to the disease, which is normally distributed in the population. Individuals whose liability exceeds a certain threshold develop the disease. Converting heritability estimates from the observed scale (the scale on which the disease is observed as present/absent) to the liability scale provides a more biologically meaningful estimate that is comparable across traits with different prevalences.
3.1.12. Winner's Curse
Winner's curse is a phenomenon in GWAS where the estimated effect sizes of SNPs that achieve genome-wide significance tend to be inflated compared to their true effect sizes. This happens because associations are more likely to be detected if their estimated effect size is larger than the true effect size due to random chance. Correction methods are applied to provide more accurate estimates of effect sizes.
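A short simulation makes the selection effect concrete. It assumes a single variant with a hypothetical true effect of 0.02 and standard error of 0.01, re-estimated in many independent replicate studies; only the replicates passing the conventional genome-wide threshold ($P < 5 \times 10^{-8}$) are kept.

```python
import numpy as np
from statistics import NormalDist

rng = np.random.default_rng(42)
true_beta, se = 0.02, 0.01                           # hypothetical effect and standard error
z_threshold = NormalDist().inv_cdf(1 - 5e-8 / 2)     # two-sided P < 5e-8

beta_hat = rng.normal(true_beta, se, size=1_000_000) # replicate effect estimates
significant = beta_hat[np.abs(beta_hat / se) > z_threshold]

mean_all = beta_hat.mean()                 # close to the true effect
mean_significant = significant.mean()      # inflated by conditioning on significance
```

The unconditional mean recovers the true effect, whereas the mean among significant replicates is necessarily larger, because an estimate can only cross the threshold when noise has pushed it upward.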
3.1.13. Functional Annotations
Functional annotations assign biological meaning to genomic regions or variants. They categorize variants based on their predicted impact on gene function (e.g., coding, non-coding, missense, loss-of-function, splicing effect), their location relative to genes (e.g., promoter, UTR, intron, intergenic), or evolutionary conservation (conserved regions). These annotations help in understanding which types of variants or genomic regions disproportionately contribute to heritability (heritability enrichment).
3.1.14. Credible Sets
In fine-mapping analyses, credible sets are a set of SNPs within a GWAS locus that are highly likely to contain the true causal SNP(s). A 95% credible set, for example, means there is a 95% probability that the true causal variant is within that set. Smaller credible sets indicate higher fine-mapping resolution.
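As a minimal sketch, a credible set can be built by ranking variants by posterior inclusion probability (PIP) and adding them until the cumulative probability reaches the target coverage. The example assumes a single causal variant per locus (so PIPs sum to 1); the variant names and PIP values are invented for illustration.

```python
def credible_set(pips, coverage=0.95):
    """Smallest set of variants whose cumulative PIP reaches `coverage`."""
    ranked = sorted(pips.items(), key=lambda kv: kv[1], reverse=True)
    chosen, total = [], 0.0
    for variant, pip in ranked:
        chosen.append(variant)
        total += pip
        if total >= coverage:
            break
    return chosen, total

# Hypothetical PIPs for five variants at one locus.
pips = {"rs_a": 0.62, "rs_b": 0.25, "rs_c": 0.09, "rs_d": 0.03, "rs_e": 0.01}
cs_variants, cs_coverage = credible_set(pips)
```

Here three variants are needed to reach 95% coverage; a locus where one variant had a PIP above 0.95 would yield a credible set of size one.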
3.2. Previous Works
The paper contextualizes its work by discussing several key prior research avenues:
- Initial missing heritability paradox: Early studies showed that pedigree-based heritability often greatly exceeded the heritability explained by common SNPs identified by GWAS (e.g., often 5-49% for common SNPs, while $h^2_{\mathrm{PED}}$ could be much higher). This gap spurred the concept of missing heritability.
- Hiding heritability vs. still-missing heritability: The gap between GWAS-detected associations ($h^2_{\mathrm{GWAS}}$) and SNP-based heritability ($h^2_{\mathrm{SNP}}$) was termed hiding heritability (e.g., Yengo et al. 2022 showed convergence for height with large sample sizes). The gap between $h^2_{\mathrm{SNP}}$ (based on common variants) and $h^2_{\mathrm{PED}}$ was termed still-missing heritability (ref. 10).
- Factors for still-missing heritability: Proposed factors included genetic variation not well-tagged by common SNPs (rare variants, structural variants), shared environmental effects, and non-additive genetic effects (e.g., epistasis, dominance) inflating pedigree estimates.
- TOPMed program WGS studies: Since 2022, studies using Trans-Omics for Precision Medicine (TOPMed) data started providing WGS-based heritability estimates for traits like height, BMI, smoking, type 2 diabetes, and coronary artery disease (Wainschtein et al. 2022, Jang et al. 2022, Rocheleau et al. 2024). However, these studies were limited by relatively small sample sizes (N ~ 25,000), resulting in large standard errors (~10%) that made firm conclusions about still-missing heritability difficult.
- UK Biobank WES studies: More recently, whole-exome sequence (WES) data from over 300,000 UK Biobank participants (e.g., Hujoel et al. 2024) provided more precise estimates for the role of rare coding variants, but a significant gap remained for rare non-coding variants (which constitute a much larger portion of the genome).
3.3. Technological Evolution
The field of human quantitative genetics has evolved significantly:
- Early 20th century - Pedigree Studies: Focused on family trees to estimate heritability, primarily $h^2_{\mathrm{PED}}$.
- 2000s - Common SNP Arrays & GWAS: The advent of affordable SNP arrays allowed for GWAS to identify common variants associated with complex traits. This led to the initial missing heritability paradox.
- 2010s - SNP-based heritability with unrelated individuals: Methods like GCTA (later GREML) enabled estimation of $h^2_{\mathrm{SNP}}$ from common SNPs in large cohorts of unrelated individuals, but this still did not fully account for $h^2_{\mathrm{PED}}$.
- Early 2020s - Whole-Exome Sequencing (WES): Targeted sequencing of coding regions began to shed light on the role of rare coding variants.
- Mid 2020s - Whole-Genome Sequencing (WGS): Large-scale WGS initiatives (like TOPMed and now UK Biobank) allow for a more comprehensive assessment of both common and rare variants across the entire genome, including non-coding regions. This paper represents a major step in leveraging this technology.

This paper's work fits within this timeline by pushing the boundaries of WGS-based heritability estimation, particularly for rare non-coding variants, thanks to its unprecedented sample size.
3.4. Differentiation Analysis
The core innovation of this paper, compared to previous WGS and WES studies, is its scale and precision.
- Previous WGS studies (e.g., TOPMed): While they utilized WGS data, their sample sizes were often around 25,000 individuals. This was insufficient to precisely estimate the contribution of rare variants (which by definition have low frequencies and thus require very large samples for robust statistical power), leading to high standard errors (~10%) in heritability estimates. This paper uses WGS data from 347,630 individuals, providing estimates with much higher precision (standard errors as low as 0.6-2.7%).
- WES studies (e.g., UK Biobank WES): These studies had large sample sizes (e.g., >300,000) and could estimate the role of rare coding variants precisely. However, they inherently missed the vast majority of the genome (the non-coding regions), leaving the contribution of rare non-coding variants unexplored. This paper, using WGS, specifically quantifies the role of rare non-coding variants, showing they contribute 79% of the rare-variant WGS-based heritability.
- Addressing still-missing heritability: By combining large-scale WGS with precise statistical methods, the paper directly addresses the still-missing heritability problem. It demonstrates that for many traits, WGS can nearly fully explain the pedigree-based heritability, which was not conclusively shown by prior WGS studies due to their lower precision.

In essence, this paper provides the most comprehensive and high-resolution view of the genetic architecture of complex traits to date, particularly regarding the elusive contribution of rare non-coding variants, by overcoming previous sample size and genomic coverage limitations.
4. Methodology
4.1. Principles
The core principle of this study is to meticulously quantify the additive genetic contribution of nearly all accessible genetic variation (common, rare, coding, non-coding, SNPs, and indels) across the human genome to complex traits and diseases. This is achieved by leveraging whole-genome sequencing (WGS) data from a massive cohort (UK Biobank) and advanced statistical genetic methods. The derived WGS-based heritability estimates are then rigorously compared to traditional pedigree-based heritability estimates from the same cohort to ascertain how much of the still-missing heritability can be accounted for. Additionally, the study performs genome-wide association studies (GWAS) to map specific rare-variant associations (RVAs) and assess their contribution to the total rare-variant heritability.
4.2. Core Methodology In-depth (Layer by Layer)
4.2.1. Sample Selection and Quality Control
The study began with 490,542 genomes from the second tranche of WGS data released by the UK Biobank in December 2023.
- Ancestry Selection: European ancestry samples were identified using principal component loadings computed from the 1000 Genomes (1KG) Project data. SNP-array data from 488,377 samples were used, and samples within 3 standard deviations of the 1KG reference European ancestry population mean for the first 10 principal components were retained, resulting in 455,516 European ancestry samples. From these, 452,618 samples with both SNP-array and WGS data and consent for data use were selected for GWAS analyses.
- WGS Variant Processing:
  - Initial processing involved 136,477 autosomal chunks of raw Binary Variant Call Format (BCF) data.
  - Variant Filtering: Variants with a minor allele count (MAC) < 30, non-'PASS' status, or more than 200 alleles were removed. Multi-allelic variants were split into separate rows.
  - Second QC Step (European Ancestry): After merging chunks into a single file (initially ~130 million variants), a second QC step was applied to the selected European ancestry samples. This involved:
    - Normalization of variants on the GRCh38 reference genome.
    - Removal of variants with genotype missingness > 0.1.
    - Removal of variants failing a Hardy-Weinberg equilibrium (HWE) P-value threshold.
    - Removal of samples with missingness > 0.05.
  - This filtering yielded a final set of 40,575,204 single-nucleotide polymorphisms (SNPs) and indels with a MAF > 0.01% for primary analyses.
- Unrelated Sample Set: A subset of 347,630 conventionally unrelated individuals (defined as having a genomic relationship coefficient lower than 0.05) was extracted from the 452,618 European ancestry samples. This unrelated set was used for GREML analyses to avoid potential biases from close relatives.
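The per-variant filters above can be sketched as a simple predicate over genotype counts. This is an illustration only: the study worked on BCF files with dedicated tooling, the `VariantCounts` container and function names are hypothetical, the HWE test shown is a basic 1-d.f. chi-square test, and the default `hwe_p_min=1e-6` is a placeholder assumption rather than the threshold used by the authors.

```python
import math
from dataclasses import dataclass

@dataclass
class VariantCounts:
    n_hom_ref: int   # individuals with 0 copies of the alternate allele
    n_het: int       # individuals with 1 copy
    n_hom_alt: int   # individuals with 2 copies
    n_missing: int   # individuals with no genotype call

def hwe_chi2_pvalue(v):
    """P-value of a 1-d.f. chi-square test of Hardy-Weinberg proportions."""
    n = v.n_hom_ref + v.n_het + v.n_hom_alt
    p = (2 * v.n_hom_alt + v.n_het) / (2 * n)
    q = 1 - p
    expected = (n * q * q, 2 * n * p * q, n * p * p)
    observed = (v.n_hom_ref, v.n_het, v.n_hom_alt)
    chi2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected) if e > 0)
    return math.erfc(math.sqrt(chi2 / 2))   # chi-square (1 d.f.) survival function

def passes_qc(v, min_mac=30, max_missing=0.1, hwe_p_min=1e-6):
    """Keep variants with MAC >= 30, missingness <= 0.1 and HWE P >= hwe_p_min.
    hwe_p_min is an assumed placeholder, not the paper's value."""
    n_called = v.n_hom_ref + v.n_het + v.n_hom_alt
    alt_count = 2 * v.n_hom_alt + v.n_het
    mac = min(alt_count, 2 * n_called - alt_count)
    missing_rate = v.n_missing / (n_called + v.n_missing)
    return (mac >= min_mac
            and missing_rate <= max_missing
            and hwe_chi2_pvalue(v) >= hwe_p_min)

# Hypothetical variants illustrating each filter.
ok_variant = VariantCounts(9025, 950, 25, 0)        # in HWE, well called
too_rare = VariantCounts(9990, 10, 0, 0)            # MAC = 10
poorly_called = VariantCounts(8000, 500, 10, 1500)  # ~15% missing
hwe_violation = VariantCounts(9000, 0, 1000, 0)     # no heterozygotes
```

In this toy set only `ok_variant` is retained; each of the other three fails exactly one of the filters.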
4.2.2. Genomic Relationship Matrix (GRM) Calculation
GRMs are central to GREML and HE regression for quantifying genetic relatedness between individuals.
- Standard GRM for Relatedness: The GRM for related individuals was computed from 583,191 genotyped SNPs with MAF > 0.01. The genomic relationship coefficient ($A_{ik}$) between individuals $i$ and $k$ was calculated using the following estimator (Yang et al., 2011):
  $ A_{ik} = \frac{1}{M} \sum_{j=1}^{M} \frac{(x_{ij} - 2p_j)(x_{kj} - 2p_j)}{2p_j(1 - p_j)} $
  Where:
  - $A_{ik}$: The genomic relationship coefficient between individual $i$ and individual $k$.
  - $M$: The total number of SNPs or variants used to quantify relatedness.
  - $x_{ij}$: The minor allele count (0, 1, or 2) at SNP $j$ for individual $i$.
  - $p_j$: The minor allele frequency (MAF) at SNP $j$.

  This formula essentially measures the covariance of allele counts between individuals $i$ and $k$, normalized by the expected variance under Hardy-Weinberg equilibrium, giving a standardized measure of genetic similarity. A sparse GRM was extracted with non-zero entries for pairs of relatives with $A_{ik} > 0.05$.
- Ultra-rare Variant GRM (Secondary Analysis): For a secondary analysis exploring the contribution of ultra-rare variants (MAF < 0.01%), an extra GRM was included. Given that unrelated individuals are unlikely to share ultra-rare variants, this GRM was assumed to be diagonally dominant (primarily affecting only the individual themselves). The diagonal elements ($D_{ii}$) for individual $i$ were calculated as:
  $ D_{ii} = \frac{1}{M} \sum_{k=1}^{K} \frac{N(N - 2k) S_{ik} + k^2 M_k}{k(N - k/2)} $
  Where:
  - $D_{ii}$: The diagonal element of the GRM for individual $i$, representing the contribution of ultra-rare variants to their own genetic variance.
  - $M$: The total number of ultra-rare variants (760,525,073).
  - $K$: The maximum minor allele count considered for ultra-rare variants (set by the MAF < 0.01% cutoff).
  - $N$: The total number of individuals in the sample.
  - $k$: A specific minor allele count (i.e., the number of times an ultra-rare variant is observed in the sample).
  - $M_k$: The number of ultra-rare variants found in exactly $k$ out of $N$ individuals.
  - $S_{ik}$: The number of ultra-rare variants with minor allele count $k$ that individual $i$ carries.

  This formula sums the individual's contribution across all ultra-rare variants, weighted by their rarity and the variant's observed minor allele count.
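The two estimators can be checked against each other on toy data. The sketch below computes the standard GRM and then verifies that the closed-form diagonal built from the carrier counts $S_{ik}$ and $M_k$ matches the diagonal obtained by applying the standard estimator directly, under the assumption that every carrier of an ultra-rare variant is heterozygous and that $p_j = k/(2N)$. Function names, sample size and variant counts are hypothetical.

```python
import numpy as np

def grm(x, p):
    """A = Z Z' / M with Z_ij = (x_ij - 2 p_j) / sqrt(2 p_j (1 - p_j))."""
    z = (x - 2 * p) / np.sqrt(2 * p * (1 - p))
    return z @ z.T / x.shape[1]

def ultra_rare_diagonal(s, m_k, n):
    """Closed-form D_ii.
    s[i, k-1] : number of variants with allele count k carried by individual i (S_ik)
    m_k[k-1]  : number of variants with allele count k (M_k)
    n         : number of individuals (N)
    """
    k = np.arange(1, len(m_k) + 1)
    terms = (n * (n - 2 * k) * s + k**2 * m_k) / (k * (n - k / 2))
    return terms.sum(axis=1) / m_k.sum()

# Toy data: 40 variants for each allele count k = 1, 2, 3 among 50 individuals,
# each variant carried by k distinct heterozygous individuals (an assumption).
rng = np.random.default_rng(1)
n, max_k, per_k = 50, 3, 40
cols, counts = [], []
for k in range(1, max_k + 1):
    for _ in range(per_k):
        col = np.zeros(n)
        col[rng.choice(n, size=k, replace=False)] = 1
        cols.append(col)
        counts.append(k)
x = np.column_stack(cols)
counts = np.array(counts)
p = counts / (2 * n)                       # sample allele frequencies

direct = np.diag(grm(x, p))                # diagonal from the standard estimator
s = np.stack([x[:, counts == k].sum(axis=1) for k in range(1, max_k + 1)], axis=1)
m_k = np.array([(counts == k).sum() for k in range(1, max_k + 1)])
closed = ultra_rare_diagonal(s, m_k, n)    # diagonal from carrier counts only
```

The practical appeal of the closed form is that it needs only per-individual carrier counts by allele-count class, not the full genotype matrix of hundreds of millions of ultra-rare variants.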
4.2.3. Variant Grouping for GREML-LDMS
For GREML-LDMS (a refined GREML method that accounts for differences in genetic architecture across variant types), variants were assigned to groups based on two characteristics:
- Minor Allele Frequency (MAF) Bins:
- 0.01% - 0.1%
- 0.1% - 1%
- 1% - 10%
- 10% - 50%
- LD Bins: Within each MAF bin, variants were further assigned to LD bins based on their median LD score statistic. LD score statistics were calculated for each SNP as the sum of squared correlations between its allele counts and those of all nearby SNPs within a 1-Mb window. This partitioning allows the GREML model to estimate separate heritability contributions for variants with different allele frequencies and LD properties.
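The grouping can be sketched as follows. The code assumes that "within a 1-Mb window" means a base-pair distance of at most 1 Mb on either side, that a SNP's own $r^2 = 1$ is included in its LD score, and that each MAF bin is split into two LD bins at its median score (which would give the eight groups). These details, and the function names, are assumptions for illustration.

```python
import numpy as np

MAF_EDGES = [0.0001, 0.001, 0.01, 0.1, 0.5]   # boundaries of the four MAF bins

def ld_scores(x, pos, window=1_000_000):
    """LD score of SNP j: sum of r^2 with all SNPs within `window` bp (itself included)."""
    r2 = np.corrcoef(x, rowvar=False) ** 2
    near = np.abs(pos[:, None] - pos[None, :]) <= window
    return (r2 * near).sum(axis=1)

def assign_groups(maf, scores):
    """Return (MAF bin 0..3, LD bin 0/1); LD bin 1 means above the MAF bin's median score."""
    maf_bin = np.digitize(maf, MAF_EDGES[1:-1])
    ld_bin = np.zeros(len(maf), dtype=int)
    for b in np.unique(maf_bin):
        idx = maf_bin == b
        ld_bin[idx] = scores[idx] > np.median(scores[idx])
    return maf_bin, ld_bin

# Toy genotypes: six SNPs spaced 5 Mb apart, so no SNP has neighbours within 1 Mb.
rng = np.random.default_rng(7)
x_demo = rng.binomial(2, 0.3, size=(500, 6)).astype(float)
pos_demo = np.arange(6) * 5_000_000
scores_demo = ld_scores(x_demo, pos_demo)     # each score is just the SNP's own r^2 = 1

# Hypothetical MAFs and LD scores for eight variants, two per MAF bin.
maf_demo = np.array([0.0005, 0.005, 0.05, 0.3, 0.0002, 0.002, 0.02, 0.2])
ld_demo = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
maf_bin_demo, ld_bin_demo = assign_groups(maf_demo, ld_demo)
```

Each (MAF bin, LD bin) pair then defines one set of variants from which a separate GRM is built.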
4.2.4. Phenotypes and Covariates Quality Control
- Phenotype Selection: 41 complex phenotypes were selected based on data availability and clinical relevance, all showing a marginally significant estimate of pedigree-based heritability ($h^2_{\mathrm{PED}}$). The final analysis focused on 34 phenotypes with both a significant WGS-based heritability ($h^2_{\mathrm{WGS}}$; two-sided Wald test) and a marginally significant rare-variant heritability estimate.
- Phenotype Standardization: Phenotypes were standardized within each sex to have a mean of 0 and a variance of 1. For quantitative traits, samples with phenotypic values more than 6 standard deviations from the mean were excluded.
- Covariates: A comprehensive set of covariates was included to adjust for potential confounding factors:
  - Base Covariates: Sex, year of birth, assessment centers, fasting time at blood sample collection, month of assessment, and prescription drug usage (grouped into categories like statins, diuretics).
  - Geographical Information: Individuals were grouped based on their north and east birth coordinates (UKB fields 129 and 130) using k-means clustering (with 10, 20, 50, or 100 clusters). Individuals with missing birth locations were assigned to a separate cluster.
  - Genetic Principal Components: 30 genotypic principal components were computed for each set of independent variants (obtained by LD pruning, with separate pruning thresholds for MAF > 0.01% and MAF < 0.01%). These principal components were used to account for population stratification.
  - Dimensionality Reduction: To reduce collinearity and dimensionality, singular-value decomposition (SVD) was applied to the covariate matrix, retaining top singular vectors explaining >99% of total variance. Five sets of covariates were generated and tested to ensure robust heritability estimates.
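The dimensionality-reduction step can be sketched with NumPy. Whether the authors centred or scaled the covariate matrix before the decomposition is not stated, so the centring below is an assumption, as are the function name `reduce_covariates` and the toy matrix with three nearly duplicated columns.

```python
import numpy as np

def reduce_covariates(c, var_threshold=0.99):
    """Leading left singular vectors of the centred covariate matrix that
    together explain more than `var_threshold` of its total variance."""
    c = c - c.mean(axis=0)
    u, s, _ = np.linalg.svd(c, full_matrices=False)
    explained = np.cumsum(s**2) / np.sum(s**2)
    # Index of the first component at which cumulative variance exceeds the threshold.
    k = int(np.searchsorted(explained, var_threshold, side="right")) + 1
    return u[:, :k]

# Toy covariates: three independent columns plus three near-copies of them.
rng = np.random.default_rng(3)
base = rng.normal(size=(200, 3))
covariates = np.hstack([base, base + 1e-3 * rng.normal(size=(200, 3))])
reduced = reduce_covariates(covariates)   # collinear columns collapse to 3 vectors
```

The retained vectors are orthonormal, so they can be used as covariates without the collinearity present in the original columns.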
4.2.5. Heritability Estimation (GREML and Haseman-Elston Regression)
- GREML-based Estimates: WGS-based heritability ($h^2_{\mathrm{WGS}}$) was estimated using the GREML-LDMS method implemented in MPH v.0.53.2. This involved fitting multiple GRMs corresponding to the different MAF and LD bins (8 GRMs in total, plus coding and non-coding partitions, leading to 24 GRMs for functional enrichment analyses). SNP-based heritability estimates for binary traits were converted to the liability scale using the formula $ h^2_{\mathrm{liability}} = h^2_{\mathrm{observed}} \times \frac{K(1 - K)}{\varphi[\Phi^{-1}(K)]^2} $, where $K$ is the prevalence of the binary trait in the population, $\varphi$ is the probability density function of a standard normal distribution, and $\Phi^{-1}$ is its quantile function. This ensures comparability of heritability estimates for binary traits with varying prevalences.
- Haseman-Elston (HE) Regression: HE estimates were also obtained using MPH v.0.53.2 for comparison, by initializing all variance components to 0 (except residual variance) and performing one iteration of minimum norm quadratic unbiased estimation. This approach allows for proper covariate adjustment.
- Assortative Mating Adjustment: For traits known to be affected by assortative mating (e.g., height, educational attainment, fluid intelligence), HE regression estimates were adjusted using a method proposed by Border et al. (2022).
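The liability-scale conversion can be sketched with the Python standard library. The function below assumes the standard transformation for a population sample without case-control ascertainment, $h^2_{\mathrm{liability}} = h^2_{\mathrm{observed}} \, K(1-K) / z^2$, where $z$ is the standard normal density at the liability threshold; the function name and the example inputs are hypothetical.

```python
from statistics import NormalDist

def observed_to_liability(h2_observed, prevalence):
    """Convert observed-scale heritability of a binary trait to the liability scale."""
    std_normal = NormalDist()
    threshold = std_normal.inv_cdf(1 - prevalence)   # liability threshold for being a case
    z = std_normal.pdf(threshold)                    # normal density at the threshold
    return h2_observed * prevalence * (1 - prevalence) / z**2

# Hypothetical inputs: the same observed-scale estimate at two prevalences.
h2_liab_common = observed_to_liability(0.10, 0.50)
h2_liab_rare = observed_to_liability(0.10, 0.05)
```

The scaling factor grows as the trait becomes rarer, which is why observed-scale estimates for low-prevalence diseases look small until converted.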
4.2.6. Pedigree-based Heritability Estimation
Pedigree-based narrow sense heritability ($h^2_{\mathrm{PED}}$) was estimated from 171,446 pairs of relatives identified in the UK Biobank (pairs with GRM value > 0.05).
- General Model: For most traits, the phenotypic covariance between relatives was modeled as:
  $ \mathrm{cov}(y_i, y_j | X) = \sigma_{\mathrm{A}}^2 \pi_{ij} + \sigma_{\mathrm{NA}}^2 \pi_{ij}^2 + \delta_{ij} \sigma_{\mathrm{E}}^2 $
  Where:
  - $\mathrm{cov}(y_i, y_j | X)$: The covariance between the phenotypes of individuals $i$ and $j$, conditional on a set of covariates $X$.
  - $y_i, y_j$: The phenotypes of individuals $i$ and $j$.
  - $\sigma_{\mathrm{A}}^2$: The variance component attributable to additive genetic effects.
  - $\pi_{ij}$: The observed genomic relationship coefficient (GRM value) between individuals $i$ and $j$.
  - $\sigma_{\mathrm{NA}}^2$: The variance component attributable to non-additive genetic effects (including dominance, epistasis, and potentially shared environmental effects correlated with $\pi_{ij}^2$).
  - $\pi_{ij}^2$: The squared genomic relationship coefficient, which is used to model non-additive genetic effects or shared environmental effects.
  - $\delta_{ij}$: An indicator variable that is 1 if $i = j$ (the same individual) and 0 otherwise.
  - $\sigma_{\mathrm{E}}^2$: The residual variance component (environmental effects and uncaptured genetic variance).

  These parameters were estimated using a maximum-likelihood procedure. $h^2_{\mathrm{PED}}$ was then calculated as the share of the total phenotypic variance attributable to $\sigma_{\mathrm{A}}^2$.
- Assortative Mating (AM) Model: For traits known to be subject to assortative mating (height, educational attainment, fluid intelligence score), a modified model was used:
  $ \begin{aligned} \mathrm{cov}(y_i, y_j | X) &= \sigma_{\mathrm{A}}^2 (0.5)^{d_{ij}} [1 + \theta]^{d_{ij}} + \sigma_{\mathrm{E}}^2 \delta_{ij} \\ &\approx \sigma_{\mathrm{A}}^2 (0.5)^{d_{ij}} + \sigma_{\mathrm{A}}^2 \theta [(0.5)^{d_{ij}} d_{ij}] + \sigma_{\mathrm{E}}^2 \delta_{ij} \\ &= \sigma_{\mathrm{A}}^2 \pi_{ij} + \sigma_{\mathrm{AM}}^2 \pi_{ij} \left( \frac{\log(\pi_{ij})}{\log(0.5)} \right) + \sigma_{\mathrm{E}}^2 \delta_{ij} \end{aligned} $
  Where:
  - $d_{ij}$: Measures the degree of relatedness between pairs of individuals, so that $\pi_{ij} = (0.5)^{d_{ij}}$.
  - $\theta$: Represents the correlation between genetic values of mates in a population undergoing assortative mating.
  - $\sigma_{\mathrm{AM}}^2$: A variance component specifically accounting for assortative mating effects, equal to $\sigma_{\mathrm{A}}^2 \theta$.

  The second line uses a first-order approximation in $\theta$, which assumes that $\theta$ is small.
- Explained Heritability Ratio (EHR): The EHR was defined as the ratio of WGS-based heritability to pedigree-based heritability: $\mathrm{EHR} = \widehat{h}^2_{\mathrm{WGS}} / \widehat{h}^2_{\mathrm{PED}}$.
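
The two covariance models above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' estimation code: the function names are hypothetical, and the variance components and mate correlation are made-up values chosen only to show how the terms combine and how close the first-order AM approximation is to the exact expression.

```python
import math


def general_covariance(pi_ij, var_a, var_na, var_e, same_individual=False):
    """General model: var_a * pi + var_na * pi^2 + delta * var_e."""
    delta_ij = 1.0 if same_individual else 0.0
    return var_a * pi_ij + var_na * pi_ij ** 2 + delta_ij * var_e


def pedigree_heritability(var_a, var_na, var_e):
    """Additive variance divided by the sum of all three components."""
    return var_a / (var_a + var_na + var_e)


def am_covariance_exact(pi_ij, var_a, theta):
    """AM model for two distinct relatives: var_a * 0.5^d * (1 + theta)^d."""
    d_ij = math.log(pi_ij) / math.log(0.5)
    return var_a * pi_ij * (1.0 + theta) ** d_ij


def am_covariance_first_order(pi_ij, var_a, theta):
    """First-order form: var_a * pi + var_am * pi * d, with var_am = var_a * theta."""
    d_ij = math.log(pi_ij) / math.log(0.5)
    var_am = var_a * theta
    return var_a * pi_ij + var_am * pi_ij * d_ij


# Hypothetical variance components and mate correlation (illustrative only).
var_a, var_na, var_e, theta = 0.40, 0.10, 0.50, 0.10

h2_ped = pedigree_heritability(var_a, var_na, var_e)
cov_sibs = general_covariance(0.5, var_a, var_na, var_e)

# Second-degree relatives (pi = 0.25, d = 2): the two forms differ by a theta^2 term.
exact = am_covariance_exact(0.25, var_a, theta)
approx = am_covariance_first_order(0.25, var_a, theta)
print(h2_ped, cov_sibs, exact, approx)
```

For first-degree relatives ($d_{ij} = 1$) the exact and first-order forms coincide; the gap grows with $d_{ij}$ and with $\theta$.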
4.2.7. Heritability Enrichment Analysis
To assess the relative contribution of coding and non-coding variants, $\widehat{h}^2_{\mathrm{WGS}}$ was partitioned.
- Variant Classification: Variants were categorized as coding or non-coding based on their functional consequences predicted by the Nirvana pipeline version 3.22.0 and their location within WES-covered loci (defined by the IDT xGen Exome Research Panel v.1.0 plus 100 bp flanking regions):
  - Coding variants within WES loci (0.71% of all WGS variants with MAF > 0.01%)
  - Non-coding variants within WES loci (0.29% of all WGS variants with MAF > 0.01%)
  - Variants outside WES loci (99% of all WGS variants with MAF > 0.01%)
- Partitioned GRMs: Each of the eight MAF/LD groups of variants was further split into these three subgroups, resulting in 24 GRMs. GREML analyses were run fitting these 24 GRMs simultaneously with the full set of covariates.
- Heritability Enrichment Formula: The heritability enrichment in coding variants ($\widehat{E}(\mathrm{coding})$) was calculated as:
$ \widehat{E}(\mathrm{coding}) = \frac{\widehat{h}^2_{\mathrm{Coding}} / M_{\mathrm{Coding}}}{\widehat{h}^2_{\mathrm{WGS}} / M_{\mathrm{WGS}}} $
Where:
  - $\widehat{E}(\mathrm{coding})$: The estimated heritability enrichment in coding variants.
  - $\widehat{h}^2_{\mathrm{Coding}}$: The estimated contribution to heritability from coding variants.
  - $M_{\mathrm{Coding}}$: The number of coding variants in the analysis (approximately 0.71% of the 40,575,204 WGS variants).
  - $\widehat{h}^2_{\mathrm{WGS}}$: The total estimated WGS-based heritability.
  - $M_{\mathrm{WGS}}$: The total number of WGS variants used in the analysis (40,575,204).
This ratio compares the per-variant heritability of coding variants to the average per-variant heritability across all WGS variants.
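
As a rough illustration of the enrichment ratio, the sketch below plugs in the variant counts given above together with the paper's reported average coding share of WGS-based heritability (17.5%). The function name and the value of `h2_wgs` are hypothetical. The result pools all MAF classes, so it is not expected to match enrichment values reported separately for common and rare variants.

```python
def heritability_enrichment(h2_annot, m_annot, h2_total, m_total):
    """Per-variant heritability in an annotation relative to the genome-wide average."""
    return (h2_annot / m_annot) / (h2_total / m_total)


M_WGS = 40_575_204                   # total WGS variants (from the text)
m_coding = round(0.0071 * M_WGS)     # about 0.71% of variants are coding

h2_wgs = 0.30                        # hypothetical trait heritability
h2_coding = 0.175 * h2_wgs           # 17.5% average coding share (from the text)

enrichment = heritability_enrichment(h2_coding, m_coding, h2_wgs, M_WGS)
print(f"pooled coding enrichment: {enrichment:.1f}-fold")
```

Because both heritability terms scale with `h2_wgs`, the ratio depends only on the coding share of heritability and the coding share of variants.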
4.2.8. GWAS Analyses
GWAS analyses were performed for all 34 phenotypes in the larger sample of 452,618 European ancestry individuals.
- Association Testing: Regenie was used, fitting all covariates used for heritability estimation (including 100 k-means clusters for birth coordinates). Leave-one-chromosome-out (LOCO) genomic predictors were computed using 500,999 LD-pruned common variants (window size 10 Mb, MAF > 0.05) to account for stratification and cryptic relatedness.
- Significance Threshold: A stringent P-value threshold was used for genome-wide significance.
- Clumping and Joint Analysis:
  - PLINK was used to clump genome-wide significant associations into independent loci (based on LD between lead SNPs within 1 Mb).
  - A joint analysis was performed by fitting all clumped SNPs simultaneously using Regenie (multivariate linear regression for quantitative traits) or Firth's penalized logistic regression (using the R package logistf for binary traits) to retain truly independent genome-wide significant SNPs.
- Variance Explained by GWAS ($\widehat{h}^2_{\mathrm{GWAS}}$): The proportion of phenotypic variance explained (on the observed scale) by different sets of associations (CVAs and RVAs) was quantified using:
$ \widehat{h}^2_{\mathrm{GWAS}} = \sum_{j=1}^{m} 2 p_j (1 - p_j) \widehat{\beta}_{jm} \widehat{\beta}_{jc} $
Where:
  - $\widehat{h}^2_{\mathrm{GWAS}}$: The estimated proportion of phenotypic variance explained by GWAS-identified variants.
  - $m$: The number of SNPs in the focal set of associations.
  - $p_j$: The minor allele frequency of SNP $j$.
  - $\widehat{\beta}_{jm}$: The winner's-curse-corrected estimated marginal effect size of SNP $j$.
  - $\widehat{\beta}_{jc}$: The winner's-curse-corrected estimated conditional effect size of SNP $j$.
For binary traits, this was converted to the liability scale using a specific R code provided in ref. 55.
- Replication: $\widehat{h}^2_{\mathrm{GWAS}}$ for LDL, HDL, and ALK was re-assessed in an independent sample of approximately 67,000 unrelated individuals of European ancestry from the Alliance for Genomic Discovery (AGD) cohort.
- Variant Annotation: GWAS-identified variants were annotated using Gencode v.39 (for gene position), the IDT xGen Exome Research Panel (for WES locus coverage), dbNSFP (for functional predictions such as AlphaMissense, CADD, PolyPhen2, REVEL, SIFT, PrimateAI3D, SpliceAI, and PromoterAI), and the Zoonomia phylogenetic score (for conservation).
- Fine-mapping: SuSiE (Sum of Single Effects model) was used to fine-map GWAS loci into 95% credible sets.
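
The variance-explained formula above reduces to a single weighted sum. The sketch below is a minimal illustration with made-up inputs: the function name, allele frequencies, and effect sizes are hypothetical, and it assumes effects are expressed in phenotypic standard deviation units so that the sum is directly a proportion of variance.

```python
def h2_gwas(mafs, beta_marginal, beta_conditional):
    """Sum over SNPs of 2 * p * (1 - p) * beta_marginal * beta_conditional."""
    return sum(
        2.0 * p * (1.0 - p) * b_m * b_c
        for p, b_m, b_c in zip(mafs, beta_marginal, beta_conditional)
    )


# Hypothetical associations: one common variant and one rare variant.
mafs = [0.30, 0.005]
beta_marginal = [0.05, 0.40]      # assumed already corrected for winner's curse
beta_conditional = [0.04, 0.38]   # assumed already corrected for winner's curse

explained = h2_gwas(mafs, beta_marginal, beta_conditional)
print(f"variance explained: {100 * explained:.3f}%")
```

The example also shows why a rare variant with a large effect can explain as much variance as a common variant with a small effect: the $2p(1-p)$ term shrinks with allele frequency while the squared-effect term grows.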
4.2.9. GWAS of Imputed SNPs
For comparison, GWAS analyses were also run using imputed variants from two reference panels, one of which was TOPMed. Similar quality control thresholds were applied (MAF > 0.01%, genotyping missingness < 0.1, sample missingness < 0.05, imputation quality INFO score > 0.3, plus a Hardy-Weinberg equilibrium filter). Regenie was run on dosage genotypes, and fine-mapping with SuSiE was also performed for the imputed datasets.
4.2.10. Association between Variant Density and Structural Variants
- Density Calculation: For each CVA and RVA, the density of other CVAs or RVAs (associated with the same trait) was calculated within a 100 kb window, termed CVA-CVA density and RVA-RVA density, respectively.
- Structural Variant (SV) Colocalization: GWAS variants were assigned to LD blocks. Publicly available independent associations for tandem repeats (VNTR), array-called copy-number variants (CNV_ARRAY), and WES-called CNVs (CNV_WES) were used to identify structural variants associated with the same traits.
- Logistic Regression: Two logistic regression models were fitted (one for common variants, one for rare variants) to assess whether high CVA-CVA or RVA-RVA density predicts the presence of a nearby (within 100 kb) trait-associated structural variant.
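
The density calculation described above can be sketched as a sorted-position window count. This is an illustrative reading, not the authors' code: the function name is hypothetical, positions are assumed to be unique and on a single chromosome for a single trait, and "within a 100 kb window" is interpreted as within 100 kb on either side of the focal variant.

```python
from bisect import bisect_left, bisect_right


def neighbour_density(positions, window=100_000):
    """Count, for each variant, the other variants within +/- window base pairs."""
    ordered = sorted(positions)
    density = {}
    for pos in ordered:
        lo = bisect_left(ordered, pos - window)
        hi = bisect_right(ordered, pos + window)
        density[pos] = hi - lo - 1  # exclude the focal variant itself
    return density


# Hypothetical base-pair positions of associations for one trait on one chromosome.
positions = [1_000_000, 1_050_000, 1_090_000, 1_400_000]
density = neighbour_density(positions)
print(density)
```

The resulting per-variant counts are the kind of predictor that could then be entered into a logistic regression on the presence of a nearby structural variant.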
5. Experimental Setup
5.1. Datasets
5.1.1. UK Biobank (UKB)
- Source: UK Biobank is a large-scale biomedical database and research resource containing in-depth genetic and health information from half a million UK participants.
- Scale:
  - WGS Data: 490,542 genomes in the second tranche. Primary analyses focused on 347,630 unrelated individuals of European ancestry for GREML. GWAS analyses used 452,618 European ancestry participants (including relatives).
  - Variants: 40,575,204 autosomal single-nucleotide and short indel variants (MAF > 0.01%) after quality control.
  - Pedigree Data: 171,446 pairs of relatives for pedigree-based heritability estimation.
- Characteristics: Participants are aged 40-69, with extensive phenotypic data (demographic, health, lifestyle, biochemical, disease diagnoses) and genetic data. The healthy-volunteer bias in UKB participation means disease prevalences may be lower than in the general population.
- Domain: Comprehensive human health and disease phenotypes, including quantitative traits (e.g., height, BMI, cholesterol levels) and binary traits (e.g., type 2 diabetes, hypertension).
- Choice Rationale: UKB provides an unparalleled dataset for this research due to its large sample size, extensive phenotyping, and the availability of WGS data, which is critical for studying rare variants.
5.1.2. Alliance for Genomic Discovery (AGD) Cohort
- Source: Vanderbilt University Medical Center's BioVU biobank, part of the AGD consortium (NashBio, Illumina, and industry partners).
- Scale: Approximately 67,000 unrelated individuals with European ancestry for replication of specific GWAS results. It also included ~15,690 African ancestry samples for specific RVA analysis.
- Characteristics: BioVU is an opt-out biobank linking DNA with de-identified medical records.
- Domain: Primarily for replication of GWAS findings, especially for lipid-related traits and alkaline phosphatase (ALK).
- Choice Rationale: Provides an independent replication cohort to validate GWAS associations and effect sizes, strengthening confidence in the identified RVAs.
5.2. Evaluation Metrics
5.2.1. WGS-based Heritability ($h^2_{\mathrm{WGS}}$)
- Conceptual Definition: The proportion of total phenotypic variance in a population that can be attributed to the additive effects of single-nucleotide polymorphisms and indels observed through whole-genome sequencing. It reflects the total genetic variation captured by the WGS data.
- Mathematical Formula: Not a single formula, but derived from variance component estimation using GREML-LDMS. For binary traits, it is converted to the liability scale using:
$ h^2_{\text{liability}} = h^2_{\text{observed}} \times \frac{K(1-K)}{\left[\phi\left(\Phi^{-1}(K)\right)\right]^2} $
- Symbol Explanation:
  - $h^2_{\text{liability}}$: Heritability on the liability scale.
  - $h^2_{\text{observed}}$: Heritability on the observed scale (as measured in the study population).
  - $K$: Prevalence of the binary trait in the population.
  - $\phi(\cdot)$: Probability density function (PDF) of the standard normal distribution, evaluated at $\Phi^{-1}(K)$.
  - $\Phi^{-1}(\cdot)$: Quantile function (inverse CDF) of the standard normal distribution, evaluated at probability $K$.
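
The liability-scale conversion above can be evaluated with the Python standard library alone. This is a minimal sketch of the formula exactly as written here; the function name is hypothetical, and it assumes the sample prevalence equals the population prevalence $K$ (no ascertainment correction).

```python
import math
from statistics import NormalDist


def observed_to_liability(h2_observed, prevalence):
    """Apply h2_obs * K * (1 - K) / phi(Phi^{-1}(K))^2."""
    std_normal = NormalDist()                    # mean 0, s.d. 1
    threshold = std_normal.inv_cdf(prevalence)   # Phi^{-1}(K)
    density = std_normal.pdf(threshold)          # phi at the threshold
    return h2_observed * prevalence * (1.0 - prevalence) / density ** 2


# At K = 0.5 the scaling factor is exactly pi / 2.
h2_liab = observed_to_liability(0.10, 0.5)
print(h2_liab, 0.10 * math.pi / 2)
```

Because the normal density is symmetric, using $\Phi^{-1}(K)$ or $\Phi^{-1}(1-K)$ as the threshold gives the same result, and the scaling factor grows as the trait becomes rarer.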
5.2.2. Pedigree-based Heritability ($h^2_{\mathrm{PED}}$)
- Conceptual Definition: The proportion of phenotypic variance explained by additive genetic effects derived from comparing trait resemblance among relatives in a pedigree. It serves as a benchmark for the total additive genetic variance.
- Mathematical Formula: Derived from variance component estimation based on phenotypic covariance between relatives. For the general model:
$ \widehat{h^2_{PED}} = \frac{\widehat{\sigma_A^2}}{\widehat{\sigma_E^2} + \widehat{\sigma_A^2} + \widehat{\sigma_{NA}^2}} $
- Symbol Explanation:
  - $\widehat{h^2_{PED}}$: Estimated pedigree-based heritability.
  - $\widehat{\sigma_A^2}$: Estimated additive genetic variance.
  - $\widehat{\sigma_E^2}$: Estimated residual variance.
  - $\widehat{\sigma_{NA}^2}$: Estimated non-additive genetic variance.
5.2.3. Explained Heritability Ratio (EHR)
- Conceptual Definition: A metric to quantify how much of the pedigree-based narrow sense heritability is captured by WGS variants. A ratio close to 1 indicates that WGS data largely explains the heritability estimated from families.
- Mathematical Formula:
$ \mathrm{EHR} = \frac{\widehat{h^2_{WGS}}}{\widehat{h^2_{PED}}} $
- Symbol Explanation:
  - $\mathrm{EHR}$: Explained Heritability Ratio.
  - $\widehat{h^2_{WGS}}$: Estimated WGS-based heritability.
  - $\widehat{h^2_{PED}}$: Estimated pedigree-based heritability.
5.2.4. Heritability Enrichment ($\widehat{E}$)
- Conceptual Definition: Measures how much more heritability is contributed per variant in a specific genomic annotation (e.g., coding region) compared to the average heritability contributed per variant across the entire genome. An enrichment value > 1 suggests the annotation is disproportionately important for the trait.
- Mathematical Formula: For coding variants:
$ \widehat{E}(\mathrm{coding}) = \frac{\widehat{h}^2_{\mathrm{Coding}} / M_{\mathrm{Coding}}}{\widehat{h}^2_{\mathrm{WGS}} / M_{\mathrm{WGS}}} $
- Symbol Explanation:
  - $\widehat{E}(\mathrm{coding})$: Estimated heritability enrichment in coding variants.
  - $\widehat{h}^2_{\mathrm{Coding}}$: Estimated heritability attributable to coding variants.
  - $M_{\mathrm{Coding}}$: Number of coding variants.
  - $\widehat{h}^2_{\mathrm{WGS}}$: Total estimated WGS-based heritability.
  - $M_{\mathrm{WGS}}$: Total number of WGS variants.
5.2.5. Proportion of Variance Explained by GWAS ($\widehat{h}^2_{\mathrm{GWAS}}$)
- Conceptual Definition: The cumulative proportion of phenotypic variance explained by the GWAS-identified common-variant associations (CVAs) or rare-variant associations (RVAs). It quantifies the 'mappability' of genetic effects.
- Mathematical Formula:
$ \widehat{h}^2_{\mathrm{GWAS}} = \sum_{j=1}^{m} 2 p_j (1 - p_j) \widehat{\beta}_{jm} \widehat{\beta}_{jc} $
- Symbol Explanation:
  - $\widehat{h}^2_{\mathrm{GWAS}}$: Estimated proportion of phenotypic variance explained by GWAS-identified variants.
  - $m$: Number of SNPs in the focal set of associations.
  - $p_j$: Minor allele frequency of SNP $j$.
  - $\widehat{\beta}_{jm}$: Winner's-curse-corrected estimated marginal effect size of SNP $j$.
  - $\widehat{\beta}_{jc}$: Winner's-curse-corrected estimated conditional effect size of SNP $j$.
5.2.6. Pearson's Correlation Coefficient ($R$)
- Conceptual Definition: A measure of the linear correlation between two sets of data. It ranges from -1 to +1, where +1 indicates a perfect positive linear correlation, -1 indicates a perfect negative linear correlation, and 0 indicates no linear correlation.
- Mathematical Formula:
$ R = \frac{\sum_{i=1}^n (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^n (x_i - \bar{x})^2 \sum_{i=1}^n (y_i - \bar{y})^2}} $
- Symbol Explanation:
  - $R$: Pearson's correlation coefficient.
  - $n$: Number of data points.
  - $x_i, y_i$: Individual data points for variables $x$ and $y$.
  - $\bar{x}, \bar{y}$: Mean of variable $x$ and $y$, respectively.
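
The formula above translates directly into code. The following is a plain-Python sketch (the function name and input values are illustrative, not data from the paper) that computes $R$ term by term as written.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation: co-deviation sum over the root of the product of squared deviations."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    numerator = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    denominator = math.sqrt(
        sum((x - mean_x) ** 2 for x in xs) * sum((y - mean_y) ** 2 for y in ys)
    )
    return numerator / denominator


# Illustrative inputs only.
r_up = pearson_r([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])
r_down = pearson_r([1.0, 2.0, 3.0], [3.0, 2.0, 1.0])
print(r_up, r_down)
```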
5.2.7. P-value
- Conceptual Definition: In hypothesis testing, the P-value is the probability of obtaining test results at least as extreme as the observed results, assuming the null hypothesis is true. A small P-value (typically < 0.05, or below a much more stringent threshold for genome-wide significance) suggests that the observed data are unlikely under the null hypothesis, leading to its rejection.
- Mathematical Formula: Not a single formula, as it is derived from various statistical tests (e.g., Wald test for heritability significance, F-test for variance explained, Pearson's correlation test).
- Symbol Explanation: No specific symbols; it is a probability.
5.3. Baselines
The paper compares its WGS-based heritability estimates and GWAS findings against several implicit and explicit baselines:
- Pedigree-based heritability ($h^2_{\mathrm{PED}}$): This serves as the primary benchmark for the total additive genetic variance of a trait. The goal is to see how closely $h^2_{\mathrm{WGS}}$ approaches $h^2_{\mathrm{PED}}$, thereby addressing the still-missing heritability gap.
- Previous WGS and WES studies: Earlier studies using WGS (e.g., TOPMed) or WES (e.g., UK Biobank WES) provide context for the advancements in precision and scope. The paper highlights how its larger sample size overcomes the limitations of these prior works.
- Imputation-based GWAS: The study explicitly compares its WGS-based GWAS results (number of associations, fine-mapping resolution) with those obtained using imputed genotypes from common reference panels, including TOPMed. This demonstrates the gain in discovery power and resolution offered by true WGS data, especially for rare variants.
- LD Score Regression estimates: In supplementary analyses, the paper empirically assesses the limits of LD score regression for estimating heritability, comparing its estimates with those from GREML when including rare variants.
6. Results & Analysis
6.1. Core Results Analysis
6.1.1. WGS Captures Most Pedigree-based Heritability
The study's central finding is that WGS data from 347,630 unrelated individuals of European ancestry in the UK Biobank captures a substantial portion of the pedigree-based narrow sense heritability ($h^2_{\mathrm{PED}}$). On average across 34 complex traits and diseases, WGS-based heritability ($h^2_{\mathrm{WGS}}$) explained approximately 88% of $h^2_{\mathrm{PED}}$ (median 0.87). This indicates that the still-missing heritability (the gap between $h^2_{\mathrm{PED}}$ and the heritability explained by common SNPs) is largely accounted for by the genetic variants captured by WGS.
For 15 traits, there was no significant difference (two-sided Wald test) between the $h^2_{\mathrm{WGS}}$ and $h^2_{\mathrm{PED}}$ estimates: Albumin (ALB), Alkaline phosphatase (ALK), Heel bone mineral density (BMD), Creatinine (CREA), C-reactive protein (CRP), Diastolic blood pressure (DBP), Forced expiratory volume in 1s (FEV1), Haemoglobin concentration (Hb), Hypertension (HypT), LDL cholesterol levels (LDL), Neuroticism score (NEURO), Red blood cell count (RBC), Systolic blood pressure (SBP), Sleep duration (SLP), and Triglycerides levels (TG). All of these except hypertension are quantitative traits. For these traits, narrow sense heritability is almost fully explained by WGS data, which implies that their missing heritability is no longer "missing" but resides in the rare and non-coding variants now accessible by WGS.
The average $h^2_{\mathrm{WGS}}$ across the 34 traits was 0.284 (s.e. 0.002), ranging from 0.075 (number of children) to 0.709 (height). The estimates for height, BMI (0.339, s.e. 0.009), and smoking status (0.174, s.e. 0.015) were consistent with previous, less precise WGS-based estimates from TOPMed.
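
To make the comparison concrete, the sketch below computes the EHR and a two-sided Wald P-value for three rows of Table 1 of the original paper. The function names are hypothetical, and the Wald statistic assumes the two estimates are independent, which the source does not state explicitly; under that assumption the P-values come out close to those reported in the table.

```python
import math
from statistics import NormalDist


def explained_heritability_ratio(h2_wgs, h2_ped):
    """EHR = WGS-based heritability divided by pedigree-based heritability."""
    return h2_wgs / h2_ped


def wald_p_value(est_a, se_a, est_b, se_b):
    """Two-sided Wald test for a difference, assuming independent estimates."""
    z = (est_a - est_b) / math.sqrt(se_a ** 2 + se_b ** 2)
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))


# Values from Table 1: (h2_PED, s.e., h2_WGS, s.e.)
table1 = {
    "HT": (0.882, 0.010, 0.709, 0.006),
    "BMI": (0.392, 0.023, 0.339, 0.009),
    "Hb": (0.272, 0.028, 0.272, 0.009),
}

results = {
    trait: (explained_heritability_ratio(wgs, ped), wald_p_value(ped, se_ped, wgs, se_wgs))
    for trait, (ped, se_ped, wgs, se_wgs) in table1.items()
}
for trait, (ehr, p_value) in results.items():
    print(f"{trait}: EHR = {ehr:.2f}, Wald P = {p_value:.3g}")
```

Small discrepancies from the tabulated P-values are expected because the inputs here are rounded to three decimals.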
6.1.2. Contribution of Rare vs. Common Variants
On average, the approximately 88% of $h^2_{\mathrm{PED}}$ captured by WGS splits into 20% attributable to rare variants (MAF < 1%) and 68% attributable to common variants (MAF ≥ 1%). Rare variants therefore make a significant, albeit smaller, contribution to overall heritability. The average rare-variant heritability across traits was 0.063 (s.e. 0.002), representing about 22% of the mean $h^2_{\mathrm{WGS}}$. Educational attainment showed the largest contribution from rare variants (approximately 43% of its $h^2_{\mathrm{WGS}}$), while bone mineral density and LDL cholesterol had less than 12% contribution from variants with MAF between 0.01% and 1%.
The correlation between the rare-variant and common-variant components of $h^2_{\mathrm{WGS}}$ was moderate, indicating some shared genetic architecture but also trait-specific differences in the relative importance of common vs. rare variants.
The following are the results from [Table 1] of the original paper:
| Phenotype | Acronym | $h^2_{PED}$ | s.e. ($h^2_{PED}$) | $h^2_{WGS}$ | s.e. ($h^2_{WGS}$) | P |
|---|---|---|---|---|---|---|
| Albumin | ALB | 0.277 | 0.031 | 0.243 | 0.010 | 0.299 |
| Alkaline phosphatase | ALK | 0.435 | 0.026 | 0.420 | 0.009 | 0.572 |
| Alanine aminotransferase | ALT | 0.148 | 0.029 | 0.190 | 0.010 | 0.156 |
| Heel bone mineral density | BMD | 0.375 | 0.035 | 0.396 | 0.014 | 0.591 |
| BMI | BMI | 0.392 | 0.023 | 0.339 | 0.009 | 0.031 |
| Chronic ischaemic heart disease (I25) | CIHD | 0.300 | 0.113 | 0.228 | 0.026 | 0.539 |
| Creatinine | CREA | 0.244 | 0.028 | 0.295 | 0.009 | 0.077 |
| C-reactive protein | CRP | 0.178 | 0.030 | 0.138 | 0.010 | 0.203 |
| Diastolic blood pressure | DBP | 0.171 | 0.029 | 0.211 | 0.010 | 0.191 |
| Dyslipidaemia (E78) | DISLIP | 0.350 | 0.080 | 0.216 | 0.018 | 0.101 |
| Educational qualification | EA | 0.409 | 0.015 | 0.347 | 0.009 | <0.001 |
| Forced expiratory volume in 1s | FEV1 | 0.313 | 0.033 | 0.299 | 0.011 | 0.689 |
| Fluid intelligence score | FI | 0.405 | 0.036 | 0.328 | 0.027 | 0.084 |
| Hand grip strength | GRIP | 0.310 | 0.028 | 0.223 | 0.009 | 0.003 |
| Haemoglobin concentration | Hb | 0.272 | 0.028 | 0.272 | 0.009 | 0.987 |
| HDL cholesterol levels | HDL | 0.541 | 0.029 | 0.398 | 0.009 | <0.001 |
| Standing height | HT | 0.882 | 0.010 | 0.709 | 0.006 | <0.001 |
| Hypertension (I10) | HypT | 0.251 | 0.070 | 0.253 | 0.015 | 0.986 |
| IGF-1 | IGF | 0.405 | 0.028 | 0.354 | 0.009 | 0.083 |
| LDL cholesterol levels | LDL | 0.239 | 0.029 | 0.228 | 0.010 | 0.705 |
| Mean corpuscular volume | MCV | 0.509 | 0.027 | 0.413 | 0.008 | <0.001 |
| Number of children | NC | 0.152 | 0.028 | 0.075 | 0.010 | 0.010 |
| Neuroticism score | NEURO | 0.212 | 0.034 | 0.185 | 0.011 | 0.455 |
| Platelet count | PLAT | 0.554 | 0.027 | 0.457 | 0.008 | <0.001 |
| Red blood cell count | RBC | 0.388 | 0.027 | 0.355 | 0.009 | 0.251 |
| Systolic blood pressure | SBP | 0.188 | 0.029 | 0.217 | 0.010 | 0.333 |
| Sleep duration | SLP | 0.106 | 0.028 | 0.125 | 0.009 | 0.523 |
| Ever smoked | SMK | 0.248 | 0.060 | 0.174 | 0.015 | 0.237 |
| Type 2 diabetes (E11) | T2D | 0.597 | 0.100 | 0.403 | 0.030 | 0.065 |
| Telomere length | TELO | 0.377 | 0.028 | 0.127 | 0.010 | <0.001 |
| Triglycerides levels | TG | 0.323 | 0.029 | 0.287 | 0.009 | 0.240 |
| Vitamin D | VITD | 0.227 | 0.030 | 0.178 | 0.010 | 0.118 |
| White blood cell count | WBC | 0.319 | 0.028 | 0.324 | 0.009 | 0.864 |
| Waist-to-hip ratio | WHR | 0.291 | 0.027 | 0.240 | 0.009 | 0.071 |
The following are the results from [Table 2] of the original paper:
| Trait | TOPMed data: N | TOPMed data: Estimate (s.e.) | UKB data (this study): N | UKB data (this study): Estimate (s.e.) |
|---|---|---|---|---|
| Height | 25,465 | 0.68 (0.10) | 346,828 | 0.709 (0.006) |
| BMI | 25,465 | 0.30 (0.10) | 346,381 | 0.339 (0.008) |
| Smoking initiation | 26,257 | 0.23 (0.10) | 346,215 | 0.174 (0.015) |
6.1.3. Heritability Enrichment in Coding Variants
Coding variants, which constitute a small fraction of the variants analysed (0.71% of variants with MAF > 0.01% in this study), showed significant heritability enrichment. On average, they accounted for 17.5% of the total $h^2_{\mathrm{WGS}}$ across traits (Figure 2a).
- For rare variants, coding regions accounted for 21.0% of the rare-variant heritability.
- For common variants, coding regions accounted for 16.9% of the common-variant heritability.

Relative to their proportion in the genome, this translates to a 36-fold enrichment for common coding variants and a 26-fold enrichment for rare coding variants. This confirms that coding variants disproportionately contribute to heritability. The correlation of heritability enrichment in coding variants between common and rare variants was moderate, suggesting trait-specific differences in how coding variants contribute across the MAF spectrum. For instance, type 2 diabetes showed significant enrichment for common coding variants but not for rare coding variants, which could reflect a lack of power for rare variants in disease, or very deleterious rare coding variants sitting at even lower frequencies.
The following figure (Figure 2 from the original paper) illustrates the relative contribution of coding and non-coding variants to WGS-based heritability:
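The figure is a chart showing the relative contribution of coding and non-coding variants to WGS-based heritability. Panel a shows, for the 34 phenotypes, the proportion of phenotypic variance explained by coding and by non-coding variants. Panel b compares the heritability enrichment of common and rare coding variants, with error bars, and distinguishes binary from quantitative traits. The correlation is reported as Pearson's correlation coefficient (R) with its P value.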
6.1.4. GWAS Discoveries and Mappability of Rare-Variant Heritability
The GWAS analyses identified a total of 12,129 independent associations, comprising 11,243 common-variant associations (CVAs) and 886 rare-variant associations (RVAs).
- Effect Sizes: After winner's curse correction, RVAs on average explained 0.027% of phenotypic variance, slightly higher than CVAs (0.023%).
- Mappability:
  - CVAs explained an average of 31% (range 1.9-56%) of the common-variant heritability.
  - RVAs explained an average of 11% (range 0.2-50%) of the rare-variant heritability (Figure 3a).
- Lipid Traits Highlighted: Lipid-related traits showed a notable enrichment of RVAs (18% of all RVAs were for lipid traits, despite these being only 12% of the traits studied). RVAs for HDL and LDL cholesterol collectively accounted for more than one-third of their estimated rare-variant heritability. This was replicated in the AGD cohort, where RVAs explained approximately 34% of LDL rare-variant heritability and 29% of HDL rare-variant heritability. This demonstrates that a substantial amount of rare-variant heritability is already mappable for certain traits.
- ALK Trait: Alkaline phosphatase (ALK) was a non-lipid trait for which 61 RVAs explained more than one-third of the rare-variant heritability.
- Non-WES-covered Loci: Many significant RVAs were detected outside WES-covered loci, including a highly pleiotropic indel (rs754165241) in ASGR1 associated with a 1.43 s.d. increase in ALK levels, explaining ~3% of phenotypic variance. This underscores the value of WGS over WES.

The following figure (Figure 3 from the original paper) illustrates the characterization of variance explained by trait-associated variants detected in WGS-based GWAS:
The figure is a chart showing the percentage of phenotypic variance explained by WGS-based GWAS signals. The left side shows the proportion of heritability explained by rare-variant associations (RVAs), and the right side shows the proportion explained by common-variant associations (CVAs). Panel b shows how the variance explained by rare variants relates to their distance from the nearest common-variant association, and panel c shows the mean density of CVAs for different window sizes. The figure also reports statistics such as P values.
6.1.5. Genomic Distribution and Colocalization of RVAs
- Colocalization with CVAs: RVAs showed a significant enrichment near CVAs, with a median DCCVA (distance to closest common-variant association) of 27 kb across trait-RVA pairs. The mean density of CVAs within 100 kb of each RVA was 1.8 (Figure 3c). This suggests that RVAs tend to be found in regions already known to harbor common-variant associations.
- Effect Size and DCCVA: DCCVA was significantly predictive of per-SNP variance explained, with RVAs closer to CVAs tending to explain more phenotypic variance.
- Structural Variants: Loci with high RVA-RVA density (at least 2 RVAs within 100 kb) showed a 1.8-fold increased probability of colocalization with a structural variant associated with the same trait. For CVAs, a CVA-CVA density > 2 increased this probability by 1.4-fold. This suggests that complex LD patterns and underlying structural variants contribute to the clustering of both common and rare associations.

The following figure (Extended Data Fig. 4 from the original paper) illustrates the relationship between estimated effect sizes and allele frequencies for 12,129 trait-associated variants across 34 phenotypes:
The figure is a scatter plot showing the effects of rare variants on different complex traits and diseases. Binary traits are listed on the left and quantitative traits on the right. The horizontal axis is minor allele frequency (MAF), shown on a logarithmic scale, and the vertical axis is the variant effect size ($\beta_{\mathrm{GWAS}}$). Red points mark variants likely to have an impact, and grey points mark other variants. Several lipid-related genes are labelled.
6.1.6. Sensitivity Analyses and Comparison with Imputation
- Covariate Robustness: Heritability estimates were generally robust to covariate adjustments, though educational attainment and fluid intelligence were sensitive to fine-scale geographical clusters (Extended Data Fig. 1).
- Assortative Mating: Assortative mating significantly biased HE regression estimates for height and educational attainment, but AM-adjusted HE estimates and GREML estimates were more consistent (Extended Data Fig. 2).
- Genetic Correlations: Genetic correlations between phenotypes showed high concordance between estimates derived from common and rare variants (Extended Data Fig. 3).
- WGS vs. Imputation: WGS detected more independent associations than the imputation panels (including TOPMed), especially for rare variants (Extended Data Fig. 5a). WGS also provided better fine-mapping resolution, with smaller 95% credible sets compared to imputation, particularly for rare variants (Extended Data Fig. 6). Many WGS-associated variants were missed by imputation (Extended Data Fig. 5b-d), highlighting the unique contribution of WGS.
- Ultra-rare Variants: Secondary analyses including ultra-rare variants (MAF < 0.01%) showed a modest average increase in $h^2_{\mathrm{WGS}}$ (~6%), but a more substantial increase for specific traits such as number of children (1.7-fold increase, making its $h^2_{\mathrm{WGS}}$ no longer statistically different from $h^2_{\mathrm{PED}}$). However, these estimates were less reliable, with some negative heritability estimates indicating potential model misspecification.
6.2. Data Presentation (Tables)
Tables 1 and 2 of the original paper are reproduced in Section 6.1.2 above.
6.3. Ablation Studies / Parameter Analysis
6.3.1. Covariate Adjustment Sensitivity
The study performed sensitivity analyses by varying the sets of covariates included in GREML estimations (Extended Data Fig. 1). This involved using different combinations of base covariates, genotypic principal components (PCs), and k-means based birthplace clusters.
- Results: For most traits, heritability estimates were robust to different covariate adjustments, showing minimal changes. This indicates that the primary estimates are not heavily influenced by the specific choice of covariate model.
- Exceptions: Educational attainment and fluid intelligence score were notable exceptions. Their uncorrected estimates were significantly inflated. This inflation was attributed to fine-scale geographical structures within the UK that were not fully captured by genotypic principal components alone. Adjusting for birthplace clusters was crucial for these traits, highlighting the importance of considering geographical information, especially for behavioral traits linked to migration patterns.

The following figure (Extended Data Fig. 1 from the original paper) illustrates the sensitivity analyses showing the effect of covariates adjustment on WGS-based heritability estimates:
The figure is a schematic showing how different covariate adjustments affect heritability estimates for the 34 complex traits and diseases. The x axis shows the different adjustment sets and the y axis shows the heritability estimate. Lines of different colours represent different traits and trace how heritability changes across covariate adjustments.
6.3.2. Assortative Mating (AM) Effects
The paper compared GREML and Haseman-Elston (HE) regression estimates and further adjusted for assortative mating (AM) effects (Extended Data Fig. 2).
- Initial Discrepancy: HE regression estimates for height (0.862, s.e. 0.01) and educational attainment (0.464, s.e. 0.011) were substantially higher than GREML estimates (0.709, s.e. 0.006 for height; 0.347, s.e. 0.009 for EA).
- AM Adjustment: These discrepancies are expected because assortative mating (where individuals with similar traits tend to mate) is known to affect these two methods differently. After assortative mating adjustment (assuming a spousal correlation of 0.2 for height and 0.4 for EA) to convert HE estimates to an expected value under random mating, the HE estimates for height (0.702, s.e. 0.008) and educational attainment (0.353, s.e. 0.007) became highly consistent with the GREML estimates.
- Conclusion: This analysis demonstrates the importance of accounting for assortative mating when interpreting heritability estimates, particularly for traits known to be influenced by it. It supports the validity of the GREML estimates as representing additive genetic variance under random mating conditions.

The following figure (Extended Data Fig. 2 from the original paper) illustrates the effect of assortative mating (AM) on heritability estimates:
The figure is a chart comparing heritability estimates from different methods for two complex traits (HT and EA). It shows each method's estimate with error bars and indicates how the estimates correspond to conventional estimates.
7. Conclusion & Reflections
7.1. Conclusion Summary
This study represents a significant leap forward in understanding the genetic architecture of human complex traits and diseases. By leveraging whole-genome sequencing (WGS) data from nearly 350,000 European ancestry individuals in the UK Biobank, the researchers provided high-precision estimates of WGS-based heritability (). A pivotal finding is that WGS data, on average, captures approximately 88% of the pedigree-based narrow sense heritability (). Crucially, for 15 quantitative traits, was not significantly different from , effectively resolving the still-missing heritability paradox for these phenotypes.
The study clarified the roles of different variant types: rare variants (MAF < 1%) contribute approximately 20% of the pedigree-based heritability, while common variants (MAF ≥ 1%) contribute 68%. Furthermore, non-coding variants were shown to account for 79% of the rare-variant heritability, highlighting their critical, often overlooked, role. Through extensive GWAS analyses, the study identified numerous common-variant associations (CVAs) and rare-variant associations (RVAs). Notably, for lipid-related traits, over one-quarter of the rare-variant heritability could be mapped to specific loci, demonstrating that some rare-variant effects are mappable even with current sample sizes. The observed colocalization of RVAs and CVAs suggests potential strategies for future GWAS discovery.
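The average shares quoted above can be combined in a short worked calculation. The sketch below assumes that the trait-averaged percentages can simply be added and multiplied, which ignores trait-to-trait variation and estimation error, so the derived figures are illustrative only. All variable names are hypothetical.

```python
# Average shares of pedigree-based heritability, as quoted in the text
rare_share = 0.20    # rare variants (MAF < 1%)
common_share = 0.68  # common variants

# Fraction of rare-variant heritability attributed to non-coding variants
noncoding_fraction_of_rare = 0.79

# Total captured by WGS, and the share that remains unexplained
wgs_share = rare_share + common_share
unexplained_share = 1.0 - wgs_share

# Split the rare-variant share into non-coding and remaining parts
# (assumes the averaged fractions can be multiplied directly)
rare_noncoding_share = rare_share * noncoding_fraction_of_rare
rare_other_share = rare_share - rare_noncoding_share

print(f"captured by WGS:       {wgs_share:.1%}")
print(f"still unexplained:     {unexplained_share:.1%}")
print(f"rare non-coding share: {rare_noncoding_share:.1%}")
print(f"other rare share:      {rare_other_share:.1%}")
```

Read this way, rare non-coding variants would account for roughly 16% of pedigree-based heritability on average, and about 12% would remain uncaptured by WGS.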
7.2. Limitations & Future Work
The authors acknowledge several limitations:
- Ancestry Restriction: Analyses were limited to European ancestry individuals due to insufficient sample sizes for other ancestry groups in the UK Biobank. This restricts the generalizability of the findings and highlights the need for heritability studies in diverse populations.
- MAF Threshold: The primary analyses focused on variants with MAF > 0.01%. While secondary analyses including ultra-rare variants (MAF < 0.01%) showed some additional heritability (e.g., a 1.7-fold increase for number of children), estimates for these variants were less precise and sometimes exhibited negative heritability, indicating potential model misspecification or biases not yet fully understood.
- Precision for Common Diseases: Rare-variant heritability estimates for many common diseases were not significantly different from zero, reflecting a lack of statistical power in population-based biobank data for these specific conditions.
- Sex Chromosomes: The study focused on autosomal variants, leaving the contribution of sex chromosomes unexplored. However, previous work suggests their contribution to SNP-based heritability is small (<3%).
- Genome Build Gaps: The study used the hg38 genome build, which misses approximately 8% of the DNA sequence compared to more recent telomere-to-telomere (T2T) builds. This missing data could contribute to the remaining still-missing heritability. Supplementary analyses indicate that common variants outside hg38 contribute some additional common-variant heritability.
- Healthy-Volunteer Bias: Pedigree-based heritability estimates for diseases in the UK Biobank might be downwardly biased due to the healthy-volunteer bias in participation.

Future work should focus on:
- Expanding WGS studies to larger and more diverse ancestry groups.
- Improving statistical methods for reliably estimating heritability from ultra-rare variants and accurately modeling their effects.
- Utilizing case-control designs and even larger sample sizes for common diseases to increase precision.
- Incorporating sex chromosome variants and addressing structural variants more comprehensively.
- Developing and using newer T2T genome builds to capture currently missing genetic variation.
- Developing polygenic scores that integrate rare variants and account for interactions between functional annotations and MAF.
- Utilizing the colocalization of rare-variant and common-variant heritability to improve GWAS discovery for rare non-coding variants, for example, through burden test analyses within GWAS-associated loci.
7.3. Personal Insights & Critique
This paper is a landmark study that significantly advances our understanding of missing heritability. The sheer scale of WGS data from the UK Biobank provides unprecedented precision, allowing for definitive statements about the contribution of rare variants and non-coding regions that were previously speculative. The finding that WGS largely closes the heritability gap for many traits is profoundly impactful, redirecting research efforts from simply finding the "missing" variance to thoroughly characterizing the functional mechanisms of already captured genetic variation.
The demonstration that a substantial portion of rare-variant heritability is already mappable for traits like lipid levels is encouraging for precision medicine. It implies that future polygenic scores incorporating rare variants could see improved predictive power, potentially by up to 20%, as suggested by the authors. The observation that RVAs and CVAs tend to colocalize is a practical insight, suggesting that future GWAS for rare variants might be most fruitful by focusing on regions already implicated by common variants, rather than searching blindly across the entire genome.
Critically, while the paper largely resolves the still-missing heritability for many traits, it highlights that some traits (e.g., number of children, telomere length) still show a substantial gap. This indicates that other factors, such as ultra-rare variants, structural variants not well-tagged by SNPs, non-additive genetic effects, or gene-environment interactions, likely play a larger role for these specific traits. The caution regarding ultra-rare variant heritability estimation (due to model misspecification leading to negative heritability) is a vital self-critique, emphasizing that statistical methods need further refinement for these extremely rare variants.
From a broader perspective, the study underscores the persistent bias towards European ancestry in large-scale genetic research. While understandable given data availability, this limits direct applicability to other populations and necessitates similar WGS efforts in diverse cohorts. The empirical assessment of LD score regression for rare variants also provides a valuable technical contribution, guiding methodological development in heritability estimation from summary statistics.
Overall, this paper provides a robust foundation for future genetic research, shifting the focus from "is heritability missing?" to "how can we fully leverage all available genetic information to understand and predict complex traits?". Its methods and conclusions are highly transferable, offering a blueprint for analyzing other large WGS datasets and refining our understanding of genetic architecture across biology.