Quantifying Privacy Risks for Continuous Trait Data
In the context of life sciences, the rapid biotechnical development leads to the creation of huge amounts of biological data. The use of such data naturally brings concerns on human genetic privacy breaches, which also discourage biological data sharing. Prior studies have investigated the possibility of the privacy issues associated with individuals' trait data. However, there are few studies on quantitatively analyzing the probability of the privacy risk. In this paper, we fill this void by proposing a scheme for systematically breaching genomic privacy, which is centered on quantifying the probability of the privacy risk of continuous trait data. With well-designed synthetic datasets, our theoretical analysis and experiments lead to several important findings, such as: (i) The size of genetic signatures and the sensitivity (true positive rate) significantly affect the accuracy of re-identification attack. (ii) Both the size of genetic signatures and the minor allele frequency have a significant impact on distinguishing true positive and false positive matching between traits and genetic profiles. (iii) The size of the matching quantitative trait locus dataset has a large impact on the confidence of the privacy risk assessment. Validation with a real dataset shows that our findings can effectively estimate the privacy risks of the continuous trait dataset.