Researchers have found that polygenic risk scores, which summarize a person's likelihood of developing diseases like diabetes and cancer, can be reverse-engineered to uncover the underlying genetic data. This vulnerability raises privacy concerns: individuals could potentially be identified through public databases, or their genetic data reconstructed by insurers. The discovery highlights the risks of sharing such scores, even anonymously.
Polygenic risk scores (PRS) aggregate the effects of numerous single-nucleotide polymorphisms (SNPs) in the genome to estimate disease predispositions. Companies like 23andMe and researchers use these scores to outline health risks, and individuals sometimes share them publicly for interpretation advice.
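A minimal sketch of how such a score is computed, assuming the common formulation of a PRS as a weighted sum of allele dosages; the weights and genotypes below are illustrative, not taken from any real model:

```python
# Illustrative PRS calculation: each SNP contributes its allele dosage
# (0, 1 or 2 copies of the risk allele) multiplied by a per-SNP weight.
weights = [0.1234567890123456, -0.0456789012345678, 0.0987654321098765]
genotypes = [2, 0, 1]  # one person's allele dosages at the three SNPs

prs = sum(w * g for w, g in zip(weights, genotypes))
print(prs)
```

The high-precision weights are published with the model; only the genotypes are private, which is why the score itself can leak information about them.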
PRS were traditionally viewed as low-risk for privacy because recovering the inputs from their sum resembles the computationally hard knapsack problem, akin to deducing a phone number from the sum of its digits. The new work shows they can nonetheless be exploited. The key lies in the precise weights, up to 16 digits long, assigned to each SNP's contribution to disease risk, particularly in smaller models.
Gamze Gürsoy at Columbia University in New York explained: “Because the final polygenic risk score is constrained by a finite number of ways you could arrive at that number, and a statistically likely arrangement of the underlying SNPs, it can be deduced with a high degree of accuracy.” Alongside Kirill Nikitin, Gürsoy tested 298 PRS models using 50 or fewer SNPs on genetic data from 2353 individuals. By enumerating the genotypes consistent with each score, filtering out improbable mutations and daisy-chaining attacks across models, they reconstructed genotypes with 94.6 percent accuracy and predicted 2450 SNPs per person.
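The core of the attack can be sketched as a brute-force search, under the simplifying assumption that the attacker knows the model's weights and the exact score; the weights and genotypes here are made up for illustration. With 16-digit weights, only one genotype vector typically reproduces the observed score:

```python
from itertools import product

# Hypothetical reconstruction sketch: enumerate every genotype vector in
# {0, 1, 2}^n and keep those whose weighted sum matches the observed score.
weights = [0.1234567890123456, -0.0456789012345678, 0.0987654321098765,
           0.0192837465564738]
true_genotypes = (1, 2, 0, 2)  # secret values the attacker wants to recover
observed = sum(w * g for w, g in zip(weights, true_genotypes))

matches = [g for g in product((0, 1, 2), repeat=len(weights))
           if abs(sum(w * x for w, x in zip(weights, g)) - observed) < 1e-12]
print(matches)
```

In practice the search space is pruned by discarding genotype combinations that are statistically improbable given known allele frequencies, and SNPs recovered from one model help narrow the search in the next, which is the daisy-chaining the researchers describe.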
Notably, just 27 SNPs sufficed to identify someone in a database of 500,000 samples, with up to 90 percent precision for relatives. Individuals of African and East Asian descent faced higher identification risks because of their underrepresentation in genetic databases. Gürsoy noted that 447 small, high-precision models in a public database are vulnerable.
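To see why so few SNPs suffice, consider that each SNP takes one of three values, so 27 SNPs distinguish roughly 3^27 (about 7.6 trillion) profiles, far more than any database holds. A toy sketch with synthetic, uniformly random genotypes (real allele frequencies are skewed, so real databases need somewhat more SNPs):

```python
import random

random.seed(0)

# Synthetic database of 500,000 genotype profiles over 27 SNPs.
n_snps = 27
database = [tuple(random.choices((0, 1, 2), k=n_snps))
            for _ in range(500_000)]

# A reconstructed 27-SNP fingerprint almost always matches exactly one record.
target = database[123_456]
hits = [i for i, row in enumerate(database) if row == target]
print(hits)
```

With uniform genotypes, the chance that any other record collides with the target is about 500,000 × (1/3)^27, which is vanishingly small.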
“We wanted to point out that the risk is low, but under [some conditions], there might still be some leakage,” Gürsoy said, urging caution in research designs involving vulnerable groups. Ying Wang at Massachusetts General Hospital acknowledged existing data protections and computational limits but recommended treating small models as sensitive in clinical contexts and consent processes.
The findings stem from a preprint on bioRxiv (DOI: 10.64898/2026.02.16.706191).