Hackers and researchers have proven that genetic data can be matched to their owners, even if the databases they were added to were anonymized. This means that any service that depends on the analysis of genetic samples (e.g., those that trace a family tree) could be a source of personally-identifiable data if it were to be hacked.
Furthermore, it seems that people could be traced and named through the analysis of their relatives’ DNA. New research, published in the journal Cell, appears to have confirmed this possibility.
Hacking: Now Genomes Too
The first proof that genetic data could be hacked and linked to their donors’ actual identities came to light in 2013. That year, a computational biology researcher named Yaniv Erlich demonstrated that genetic samples contributed anonymously to databases could be used to ‘reverse engineer’ the names and other personal data of individuals. This could be done completely online, with little more information than the Y-chromosome of the genomes in question and the probability that a male child would carry his father’s surname.
Therefore, Erlich and his team asserted that there was a high probability of pinpointing the names of male donors from the United States.
Erlich and his team, working across institutions such as the Broad Institute, MIT, Massachusetts General Hospital and the International Computer Science Institute in California, published their findings in the journal Science.
Their report showed, for the first time, that information that should ideally remain private could be inferred with high accuracy by exploiting data on both the biological and the cultural conventions of heredity. Furthermore, this could be done using the conventional computational research capacities of the day.
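The idea of surname inference can be illustrated with a minimal sketch. This is not the team’s actual method or data: the surnames, the four Y-chromosome markers, the repeat counts, and the helper names (`GENEALOGY_DB`, `marker_mismatches`, `infer_surname`) are all hypothetical, and a simple count of mismatched markers stands in for whatever matching a real genealogy service performs.

```python
from typing import Dict, Optional, Tuple

# Hypothetical genealogy records: surname -> repeat counts at the same four
# Y-chromosome markers. All names and values are invented for illustration.
GENEALOGY_DB: Dict[str, Tuple[int, ...]] = {
    "Smith": (13, 24, 14, 11),
    "Jones": (14, 23, 15, 10),
    "Baker": (13, 25, 16, 12),
}


def marker_mismatches(a: Tuple[int, ...], b: Tuple[int, ...]) -> int:
    """Count the markers at which two Y-chromosome profiles differ."""
    return sum(1 for x, y in zip(a, b) if x != y)


def infer_surname(
    query: Tuple[int, ...],
    db: Dict[str, Tuple[int, ...]],
    max_mismatches: int = 1,
) -> Optional[str]:
    """Return the surname of the closest profile, or None if none is close enough."""
    best_name, best_dist = None, max_mismatches + 1
    for name, profile in db.items():
        dist = marker_mismatches(query, profile)
        if dist < best_dist:
            best_name, best_dist = name, dist
    return best_name


# An "anonymous" male sample that differs from the Smith profile at one marker.
anonymous_sample = (13, 24, 14, 12)
guess = infer_surname(anonymous_sample, GENEALOGY_DB)
print(guess)  # Smith
```

Because the Y-chromosome and the surname both tend to pass from father to son, a close match in such a table suggests a surname even though the anonymous sample itself carries no name.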
Yaniv Erlich gives a talk in which he explains more about the science of genome hacking. (Credit: Yaniv Erlich-Whitehead Institute/PMWC/Vimeo)
So, who is potentially subject to this new threat to living anonymously?
The answer may be a large proportion of the Earth’s population, which could become even greater in the future. Currently, thousands of people donate their genetic material to large-scale databases, and often even pay for the privilege of doing so. This is due to the strong popularity of “genetic genealogy” and the companies that provide it, such as 23andMe and MyHeritage. Most of them operate online, which makes it even easier to obtain their sample-collection kits and to have the results added to corporate databases.
The trend of heritage-tracing is particularly strong in the United States, and even more so among its Caucasian population. A breach of such an organization could therefore make it possible to identify its customers. Once this has been achieved, it becomes possible to apply the same to a hacked donor’s relatives, even down to their second or third cousins, or vice versa.
This process can be performed, as further research on the subject has shown, by identifying specific markers, or loci, in the genome that are associated with particular traits or other indicators of heritage. These markers are also the part of the genome often used to identify perpetrators in a criminal investigation. The markers are then correlated with location (either of the donor or that of their parents or other relatives) as well as public records of births, deaths, or marriages. Therefore, even distant relatives in these customers’ (mostly European) countries of origin may one day be at risk of genome hacking.
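The correlation step can be pictured as a simple narrowing of candidates. The sketch below is an illustration under stated assumptions, not a description of any real tool: the `PublicRecord` fields, the sample records, and the `narrow_candidates` helper are hypothetical, and it assumes a surname guess, an approximate birth year, and a location are already in hand.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class PublicRecord:
    """A hypothetical entry from a public register of births or marriages."""
    name: str
    surname: str
    birth_year: int
    state: str


# Invented records for illustration only.
RECORDS = [
    PublicRecord("A. Smith", "Smith", 1950, "Utah"),
    PublicRecord("B. Smith", "Smith", 1982, "Utah"),
    PublicRecord("C. Smith", "Smith", 1951, "Ohio"),
    PublicRecord("D. Jones", "Jones", 1950, "Utah"),
]


def narrow_candidates(
    records: List[PublicRecord],
    surname: str,
    birth_year: int,
    state: str,
    year_tolerance: int = 1,
) -> List[PublicRecord]:
    """Keep only the records consistent with every known attribute."""
    return [
        r
        for r in records
        if r.surname == surname
        and r.state == state
        and abs(r.birth_year - birth_year) <= year_tolerance
    ]


candidates = narrow_candidates(RECORDS, "Smith", 1950, "Utah")
print([r.name for r in candidates])  # ['A. Smith']
```

Each attribute on its own matches several people, but the combination leaves very few, which is why a handful of seemingly harmless details can be enough to single someone out.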
A promotional slide from the DNA-testing company, 23andMe. (Source: Public Domain)
Other ethnic groups, particularly in the United States, are also at risk of identification through genetic data. For example, those of African or Hispanic origin are subject to a much higher probability of arrest in the country. Should such individuals be convicted, they are subject to having their genetic data on file in databases at the state or federal level. Therefore, they and their relatives could also be unmasked through the abuse of these practices.
On the other hand, the relevant legal and personal risks to innocent family members are prevented through judicial and regulatory oversight. Similarly, access to genetic databases in general has been restricted in recent years (mainly to research institutions) in response to threats to individuals.
Hacking Without Markers: Is it Possible?
Actually hacking genetic data may be much more difficult than the above suggests. But it can be easier to identify one or more people through a single genome than previously thought. This may have been demonstrated in a paper published in an October 2018 issue of the journal Cell.
This paper, written by Noah Rosenberg of the Department of Biology at Stanford and his team, indicates that relatives of a person in a genetic database can be identified without directly comparing the same markers. Instead, the team used a component of the genome called microsatellites, as well as some common variants.
These microsatellites are short tandem repeats of DNA, which are often found in the genome and passed down from parent to child. They can exist surrounded by variations called single-nucleotide polymorphisms (SNPs), which may also be specific to heritable traits, yet also exist removed from specific genes or other loci that are typically taken as forensic markers.
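A short tandem repeat is easiest to picture as a motif repeated back to back, with the number of copies differing between people. The following sketch counts the longest run of a motif in a DNA string. The sequences, the `GATA` motif, and the `longest_tandem_repeat` helper are made up for illustration; real genotyping works from sequencing or fragment-length data rather than a plain string scan.

```python
def longest_tandem_repeat(sequence: str, motif: str) -> int:
    """Return the largest number of back-to-back copies of motif in sequence."""
    if not motif:
        raise ValueError("motif must be a non-empty string")
    m = len(motif)
    best = 0
    # Try every start position and count how many copies follow consecutively.
    for start in range(len(sequence) - m + 1):
        count = 0
        pos = start
        while sequence[pos:pos + m] == motif:
            count += 1
            pos += m
        best = max(best, count)
    return best


# Two invented alleles at the same locus, differing only in repeat count.
allele_one = "CC" + "GATA" * 5 + "TT"
allele_two = "CC" + "GATA" * 7 + "TT"

print(longest_tandem_repeat(allele_one, "GATA"))  # 5
print(longest_tandem_repeat(allele_two, "GATA"))  # 7
```

The repeat count at each locus is the value a forensic profile records, so a profile amounts to a short list of such numbers.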
The microsatellites in question are those used by the Combined DNA Index System (CODIS), the FBI’s system that pools forensic DNA profiles from official databases across the U.S.
Rosenberg and his colleagues were also able to gather the SNPs that corresponded to each of these microsatellites using CODIS and other sources.
The researchers relied on linkage disequilibrium, the tendency of neighboring variants to be inherited together, to link the SNPs to the microsatellites across multiple datasets. To conduct the analysis, they used the BEAGLE genotype phasing and imputation software together with their own software developed for the purposes of this study.
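Linkage disequilibrium can be made concrete with the standard r² statistic, which measures how strongly the allele at one site predicts the allele at another. The sketch below is only an illustration of that statistic, not the pipeline used in the study: the haplotypes are invented, each site is reduced to a 0/1 code (a SNP allele, and the presence of one particular microsatellite repeat length), and `ld_r_squared` is a hypothetical helper name.

```python
from typing import List, Tuple


def ld_r_squared(haplotypes: List[Tuple[int, int]]) -> float:
    """Compute r^2 between two biallelic sites from (site_a, site_b) haplotypes.

    Each haplotype is a pair of 0/1 codes. r^2 = D^2 / (pA(1-pA) pB(1-pB)),
    where D = pAB - pA * pB.
    """
    n = len(haplotypes)
    p_a = sum(a for a, _ in haplotypes) / n
    p_b = sum(b for _, b in haplotypes) / n
    p_ab = sum(1 for a, b in haplotypes if a == 1 and b == 1) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return 0.0  # one site does not vary, so it carries no information
    return d * d / denom


# SNP allele and repeat length always travel together: perfect association.
linked = [(1, 1), (1, 1), (0, 0), (0, 0)]
# Every combination equally common: the SNP says nothing about the repeat.
unlinked = [(1, 1), (1, 0), (0, 1), (0, 0)]

print(ld_r_squared(linked))    # 1.0
print(ld_r_squared(unlinked))  # 0.0
```

When r² is high, knowing a person’s SNPs makes a good guess at the nearby microsatellite possible, which is what allows records built from different kinds of markers to be matched.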
It was reported that they could accurately detect parent-child relationships in up to 32% of the 872 individual genomes used in the study, as well as up to 36% of the sibling relationships of the same donors.
The data covered over 600,000 SNPs, which could be associated with the 13 microsatellites used in the study. The researchers also noted that the proportion of relationships identified could trend towards 100% as more data become available.
Therefore, it now appears that malicious parties could extract personally identifiable data about the people in a genetic database, even in the absence of the marker data that was once thought crucial to this type of exploit.
Furthermore, the families of hacked donors could now also be at risk through the analysis of microsatellite data alone. These are worrying indications for the integrity and safety of private information in the future.
Top Image: Can genomes be hacked? Yes, says a certain study. (Source: Pixabay)
Genome Hackers Show No One’s DNA Is Anonymous Anymore, 2018, Wired, https://www.wired.com/story/genome-hackers-show-no-ones-dna-is-anonymous-anymore/, (accessed 14 Oct 2018)
Scientists Discover How to Identify People From 'Anonymous' Genomes, 2013, Wired, https://www.wired.com/2013/01/your-genome-could-reveal-your-identity/, (accessed 14 Oct 2018)
M. Gymrek, et al. (2013), ‘Identifying Personal Genomes by Surname Inference’, Science, 339 (6117), pp. 321-324
J. Kim, et al. (2018), ‘Statistical Detection of Relatives Typed with Disjoint Forensic and Biomedical Loci’, Cell
M. D. Edge, et al. (2017), ‘Linkage disequilibrium matches forensic genetic records to disjoint genomic marker sets’, Proc Natl Acad Sci U S A, 114 (22), pp. 5671-5676