On the critical evaluation and confirmation of germline sequence variants identified using massively parallel sequencing


Kubiritova, Z.a,b, Gyuraszova, M.b,c, Nagyova, E.b,d, Hyblova, M.b,e, Harsanyova, M.b,e, Budis, J.e,f,g, Hekel, R.b,e,g, Gazdarica, J.b,e,g, Duris, F.e,g, Kadasi, L.a,b, Szemes, T.b,e,h, Radvanszky, J.a,e

a Institute for Clinical and Translational Research, Biomedical Research Center, Slovak Academy of Sciences, Bratislava, Slovakia
b Department of Molecular Biology, Faculty of Natural Sciences, Comenius University, Bratislava, Slovakia
c Institute of Molecular Biomedicine, Faculty of Medicine, Comenius University, Bratislava, Slovakia
d Department of Cardiology, Division Heart & Lungs, UMC Utrecht, University of Utrecht, the Netherlands
e Geneton Ltd., Bratislava, Slovakia
f Department of Computer Science, Faculty of Mathematics, Physics and Informatics, Comenius University, Bratislava, Slovakia
g Slovak Centre of Scientific and Technical Information, Bratislava, Slovakia
h Comenius University Science Park, Bratislava, Slovakia


Although massively parallel sequencing (MPS) is becoming common practice in both research and routine clinical care, whether identified DNA variants must be confirmed using alternative methods is still a topic of debate. When evaluating variants directly from MPS data, read depth statistics, together with specialized genotype quality scores, are therefore of high relevance. Here we report the results of our validation study, performed in two different ways: 1) confirmation of MPS-identified variants using Sanger sequencing; and 2) simultaneous Sanger and MPS analysis of exons of selected genes. Detailed examination of false-positive and false-negative findings revealed typical error sources connected to low read depth/coverage, an incomplete reference genome, indel realignment problems, and microsatellite-associated amplification errors leading to base miscalling. However, all of these error types were identifiable by thorough manual revision of the aligned reads, based on specific patterns in the distribution of variants and their corresponding reads. Moreover, our results show that both basic quantitative metrics (such as total read counts, alternative allele read counts and allelic balance) and specific genotype quality scores depend on the bioinformatics pipeline used, which stresses the need to establish specific thresholds for these metrics in each laboratory and for each pipeline independently.
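
As an illustration of how the quantitative metrics named above might be combined, the following minimal Python sketch flags variant calls for confirmation by an alternative method. It is not the pipeline used in this study. It assumes that allelic balance is taken as the fraction of reads supporting the alternative allele (one common definition), and the names `VariantCall` and `needs_confirmation`, as well as all threshold values, are hypothetical placeholders that each laboratory would need to calibrate for its own pipeline.

```python
from dataclasses import dataclass


@dataclass
class VariantCall:
    total_reads: int         # total read depth at the variant site
    alt_reads: int           # reads supporting the alternative allele
    genotype_quality: float  # pipeline-specific genotype quality score


def allelic_balance(call: VariantCall) -> float:
    """Fraction of reads supporting the alternative allele (assumed definition)."""
    if call.total_reads == 0:
        return 0.0
    return call.alt_reads / call.total_reads


def needs_confirmation(
    call: VariantCall,
    min_depth: int = 20,        # hypothetical threshold
    min_alt_reads: int = 5,     # hypothetical threshold
    min_balance: float = 0.25,  # hypothetical threshold
    min_gq: float = 30.0,       # hypothetical threshold
) -> bool:
    """Return True if any metric falls below its laboratory-specific threshold."""
    return (
        call.total_reads < min_depth
        or call.alt_reads < min_alt_reads
        or allelic_balance(call) < min_balance
        or call.genotype_quality < min_gq
    )


if __name__ == "__main__":
    well_supported = VariantCall(total_reads=100, alt_reads=48, genotype_quality=99.0)
    low_depth = VariantCall(total_reads=12, alt_reads=6, genotype_quality=99.0)
    print(needs_confirmation(well_supported))
    print(needs_confirmation(low_depth))
```

Keeping the thresholds as function parameters rather than constants reflects the point made above: the same call may pass under one pipeline's calibrated values and fail under another's.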