Quantitative criteria for improving performance of buccal DNA for high-throughput genetic analysis

Background DNA from buccal brush samples is being used for high-throughput analyses in a variety of applications, but the impact of sample type on genotyping success and downstream statistical analysis remains unclear. The objective of the current study was to determine laboratory predictors of genotyping failure among buccal DNA samples, and to evaluate the successfully genotyped results with respect to analytic quality control metrics. Sample and genotyping characteristics were compared between buccal and blood samples collected in the population-based Genetic and Environmental Risk Factors for Hemorrhagic Stroke (GERFHS) study (https://gerfhs.phs.wfubmc.edu/public/index.cfm). Results Seven-hundred eight (708) buccal and 142 blood DNA samples were analyzed for laboratory-based and analysis metrics. Overall genotyping failure rates were not statistically different between buccal (11.3%) and blood (7.0%, p = 0.18) samples; however, both the Contrast Quality Control (cQC) rate and the dynamic model (DM) call rates were lower among buccal DNA samples (p < 0.0001). The ratio of double-stranded to total DNA (ds/total ratio) in the buccal samples was the only laboratory characteristic predicting sample success (p < 0.0001). A threshold of at least 34% ds/total DNA provided specificity of 98.7% with a 90.5% negative predictive value for eliminating probable failures. After genotyping, median sample call rates (99.1% vs. 99.4%, p < 0.0001) and heterozygosity rates (25.6% vs. 25.7%, p = 0.006) were lower for buccal versus blood DNA samples, respectively, but absolute differences were small. Minor allele frequency differences from HapMap were smaller for buccal than blood samples, and both sample types demonstrated tight genotyping clusters, even for rare alleles. Conclusions We identified a buccal sample characteristic, a ratio of ds/total DNA <34%, which distinguished buccal DNA samples likely to fail high-throughput genotyping. Applying this threshold, the quality of final genotyping resulting from buccal samples is somewhat lower, but compares favorably to blood. Caution is warranted if cases and controls have different sample types, but buccal samples provide comparable results to blood samples in large-scale genotyping analyses.


Background
Genetic studies involving high-throughput methods, such as genome-wide association studies or sequencing, are increasingly important and often rely on large cohorts of participants or patients. In the context of large genetic epidemiologic studies, DNA derived from buccal cells has the advantage of being readily mailed to remote regions whereas a blood draw frequently requires an in-person contact, and buccal collection may result in higher subject participation [1]. While newer options, such as saliva collection, offer highquality non-invasive methods for collecting DNA [2,3], many older epidemiologic studies, as well as a variety of other applications such as forensics, rely on buccal DNA samples for genetic analysis [4,5]. The primary concern about buccal samples is that they may be more likely to fail genotyping, either due to low yield or lower or more variable quality of DNA. Several studies have explored the quantity and quality of DNA extracted from buccal samples, with mixed conclusions about their utility [3,4,[6][7][8][9][10].
Sample and processing metrics to predict buccal DNA sample failure and improve genotyping quality are therefore essential. Additionally, no studies to our knowledge have examined the impact of sample type on estimates of allele frequency or other metrics likely to directly impact statistical analysis and interpretation of highthroughput genetic studies. Our group has previously explored the yield and utility of buccal DNA samples in the context of high-throughput SNP genotyping [11], and has shown that buccal DNA provide reasonable yields and concordance with paired blood samples. Other studies have also shown success with buccalderived DNA in this context [12,13]. However, our previous study was limited in scope and used genotype call rates and concordance rates as the only quality control metrics.
The objective of the current study, therefore, is to determine methods for predicting which buccal DNA samples are likely to provide unacceptable genotyping results, and to evaluate the performance of successful buccal genotyping with respect to factors important for correct statistical analysis and interpretation of the resulting data. We report that buccal DNA samples with <34% double-stranded DNA (dsDNA) to total DNA were 18 times more likely to fail than samples with ≥34% dsDNA. Also, we report that buccal samples exhibited lower call rates than blood samples; but samples that passed QC had high quality genotype clustering and exhibited minor allele frequency estimates comparable to HapMap. Taken together, these results indicate some loss of statistical power but minimal statistical bias when using these samples in high-throughput genetic studies.

Results
Comparisons of subject and sample characteristics by sample type are presented in Table 1. Buccal samples and blood samples did not differ with respect to the subject's age (p = 0.07), gender (p = 0.9) or stroke status (p = 0.17), as expected given the matched study design and balanced sample plating design. The sample types also did not differ on subjects' cigarette smoking status (p = 1.0), cigar smoking (p = 1.0) or frequency of alcohol use (p = 0.3; data not shown). Average daily intake of caffeinated beverages likewise did not differ by sample type (coffee: p = 0.6, tea: p = 0.08; soda: p = 0.98).
Quality control metrics differed somewhat between blood and buccal samples (Table 1). Overall failure rates did not differ significantly (11.3% buccal, 7.0% blood, p = 0.18), and the proportion of samples that met any specific failure threshold also did not significantly differ by sample type (cQC failure, p = 0.06; DM call rate failure, p = 0.13; gender mismatch, p = 0.4). However, buccal samples exhibited lower average cQC rates (p < 0.0001) and DM call rates (p < 0.0001) than blood samples. Interestingly, successful buccal samples had lower cQC rates (p < 0.0001) and DM call rates (p < 0.0001) than successful blood samples. Similarly, failed buccal samples also had lower DM call rates than failed blood samples (p < 0.01), which suggests an overall downward shift in the distribution of these quality metrics among buccal samples.
Phase I: Predicting pre-genotyping QC failure in Buccal samples Considering all sample quality and quantity variables; time between sample collection and genotyping; and relevant patient lifestyle factors (Table 2), lower ds/total DNA ratio was significantly (p < 0.0001) associated with buccal sample failure, while higher total DNA concentration was marginally associated (p = 0.05). Failed buccal samples were also collected more recently than those that did not fail (p < 0.0001). Because higher total DNA concentration helps define ds/total DNA ratio but was less strongly associated, we limited further analysis of buccal sample failure to ds/total DNA ratio and time since sample collection. From logistic regression analyses, each 10% absolute increase in ds/total DNA ratio (e.g., from 20% to 30%) is associated with a 40% lower odds of sample failure (OR = 0.61, 95% CI = 0.52-0.70, p < 0.0001), with a significant area under the ROC curve of 0.74 (95% confidence interval: 0.67-0.80, p < 0.0001). Importantly, the probability of sample failure increased significantly with very low ds/total ratios ( Figure 1). Years since sample collection was significant when added to this logistic model (OR = 0.86, 95% CI = 0.77-0.95, p = 0.004), but did not materially alter the findings for ds/total DNA ratio (OR = 0.62, 95% CI = 0.55-0.72, p < 0.0001; ROC AUC = 0.75). Using this analysis, we established a threshold of ds/total ratio ≥ 0.34 during the study to determine which samples should be processed, which balanced sensitivity for failure of 20.5% with specificity for failure of 98.7% in the first 3 plates run. Overall, samples with ds/total ratio ≥ 0.34 experienced a 9.5% failure rate (63 of 662), compared with a 65% failure rate (15 of 23) among those that did not meet the ds/total DNA threshold (OR = 17.8, 95% CI: 7.3-43.7, p < 0.0001; Table 3). By contrast, the higher "optimal" threshold ds/total ratio of ≥0.58 (70.5% sensitivity and 69.7% specificity) would have reduced the QC failure rate to 5.5%, but adoption of this threshold would have resulted in the elimination of 34% of our buccal samples, 77% (180/233) of which would ultimately have passed QC.

Phase II: Buccal versus blood genotype quality
Among samples that passed initial QC, sample call rates were modestly but significantly lower for buccal com-  Figure 3B].
Of the 906,600 SNPs on the Affymetrix 6.0 chip, 755,837 (83%) had a MAF <40%. Overall, MAF for both buccal and blood samples were similar to estimates from CEU HapMap standards. Figure 4A shows the correspondence between buccal and blood sample deviations from CEU HapMap MAF. Overall, median MAF did not differ from HapMap for both blood and buccal samples (e.g., median absolute difference = 0), but blood samples had higher variability (IQR = −2% to +3% in blood, -1% to +2% in buccal). Figure 4B provides the deviation of our samples from the CEU MAF by sample type and MAF category. For all MAF categories, the median deviation from the CEU was between 0 and 1%, with interquartile ranges extending as far as 5% different from CEU. Typically, blood samples demonstrated slightly greater differences from CEU than buccal samples. Despite IQRs that included no difference, all comparisons of MAF overall or by MAF group were significant at p < 0.0001 due to very large numbers of loci included in estimates (>17,000 SNPs included in each MAF group). Visual review of cluster plots did not reveal any systematic difference between sample types for any MAF category ( Figure 5).

Discussion
The results of this study demonstrate that applying a minimum threshold for the ratio of double-stranded to total DNA quantity can improve the likelihood of successful genotyping for buccal DNA samples. Furthermore, applying this minimal level of quality control results in buccal genotyping results that are only slightly lower than the gold standard of blood DNA. This study also suggests specific additional considerations that may further improve results from buccal samples. Overall, we have shown that buccal samples not only can be successfully genotyped, but also provide high quality analytic data suitable for large-scale or high-throughput genetic studies.
The major finding of this study is that it is important to monitor ds/total DNA ratio to identify buccal samples at high likelihood of failure. In this study, low ds/total DNA ratio was a significant independent predictor of pre-genotyping failure, assuming acceptable 260/280 and 260/230 ratios. In this study, buccal samples drawn more recently were also more likely to fail, contrary to expectations, but time since sample collection did not alter the effects of ds/total DNA ratio in our analysis. Reasons for this finding are unclear, but may be driven statistically by a gap of over a year in which recruitment was dormant (e.g., artificially making the differences between the medians larger), as study protocols and staffing were consistent throughout the study. These results, however, suggest that longer cryogenic storage times do not negatively impact buccal DNA quality. The threshold of 34% ds/total DNA was selected to provide good discriminative value and low failure rate, as samples not meeting Figure 1 Probability of sample failure by ds/total DNA ratio. Predicted probability of sample pre-genotyping failure from logistic regression analysis. Line represents predicted probability and shaded area is 95% confidence limits. In this large study, buccal samples had similar overall pre-genotyping failure rate, but lower cQC and lower DM call rates than blood samples, suggesting a shift toward lower overall quality. The similarity in overall failure rate may be somewhat optimistic, as buccal sample selection in the last 70% of this study was intentionally shifted toward those most likely to pass QC (e.g., ≥34% ds/total DNA); extracted samples below this threshold were not included in genotyping runs after the 3 rd plate. The current analysis suggests that buccal samples may require higher QC thresholds to ensure that cQC and DM quality metrics are comparable to blood samples, particularly for the DM call rate, which was lower for both passed and failed buccal samples compared with the blood samples. There is precedent for subjecting different "populations" of samples or loci to different thresholds to improve performance, which has shown promise in previous situations [14]. However, establishing higher thresholds would result in the rejection of more buccal samples in the QC phase, limiting sample size.
Birdseed genotyping results from passed samples were comparable for buccal and blood samples. Although all results were statistically better among blood samples, the magnitude of the differences are frequently nominal. This is consistent with a previous analysis of paired buccal and blood samples in this study that revealed similar call rates and 98.8% concordance of genotyping calls between buccal and blood samples, indicating high fidelity of genotyping across sample type [11]. Median call rates were high for both sample types (>99%) and deviations from HapMap minor allele frequencies for Caucasian populations were similarly minor. Analysis of MAF deviations showed similar patterns, even for rare SNPs that are frequently ignored in genetic studies for concern over accurate genotype calling. However, all MAFs estimated for this study deviated positively from HapMap estimates, which may be due to regional differences in population compared with the CEU HapMap. The overestimation of MAF may also suggest the presence of missing call bias where no-calls are more likely in one genotype than another [14], although this effect appears similar for buccal and blood samples. Visual inspection of cluster plots revealed no obvious issues, with distinct clusters and minimal no-calls for both sample types, even for SNPs of low MAF. Furthermore, the median test revealed no batch effects on estimated MAF by locus across plates (data not shown). However, the average raw intensities for buccal samples were significantly lower than the blood samples (data not shown), which suggests that buccal samples and blood samples should be clustered separately for accurate genotyping.
These similarities between buccal and blood samples suggest that both sample types can be used in highthroughput genetic studies with only minor caveats. However, with the increasing use of "out of study" controls in large-scale genetic studies, these minor differences may be magnified. In particular, when cases and controls are derived from different sample types (e.g., cases are buccal DNA, while controls are blood DNA), differences in sample failure rates, call rates and MAF estimation between buccal and blood samples may result in spurious false-positive associations with disease status [14]. It is thus imperative that sample types match between cases and controls, or are at least represent a balance of sample types in both groups.
There are several important strengths of this study. A large number of samples were evaluated in the context of an active genome-wide association study, and processed by the same laboratory personnel. Plates were constructed to incorporate both cases and matched controls of the same sample type on the same plate, and collection of buccal samples was conducted by trained study nurses using standardized protocols. Despite these significant strengths, this study also has certain limitations. The present analysis does not include paired buccal and blood samples from the same individuals, so we cannot determine concordance in genotyping between sample types. However, our previous study in the same population suggested buccal DNA provide reasonable concordance with paired blood samples [11]. Also, identification of the ds/total DNA ratio threshold was conducted in the course of the genotyping, which changed the decision about which samples would be genotyped in the remainder of study. Thus, it was not possible to fully evaluate whether the 34% threshold is optimal, or whether a different threshold would have been better if all samples had been run blinded to ds/total DNA ratio.

Conclusions
Overall, buccal samples did not demonstrate significantly worse QC failure rates than blood samples, although some metrics show worse performance. Following stringent QC, the quality of final genotyping resulting from buccal samples is somewhat lower, but compares favorably to blood. There is no evidence for differential loss of information or bias with respect to genotype calling from buccal vs. blood DNA samples. We recommend using caution, but not doubt, when conducting highthroughput genetic studies with buccal samples.

Methods
The Genetic and Environmental Risk Factors for Hemorrhagic Stroke (GERFHS) study is a large casecontrol study of hemorrhagic stroke in the Cincinnati, Ohio region. To maximize participation and representativeness of the cohort, buccal cytobrush collection was performed on the majority of subjects. Buccal brushes for genetic analysis were collected between 1997 and 2005, blood samples were collected for some subjects from 2000 to 2005, and blood was collected on all subjects after 2005. Each study participant included in this analysis had either a buccal sample or a blood sample genotyped, but not both. The GERFHS study employed a matched case-control design, with matching on age (±5 years), sex, and race; controls were also selected to have the same sample type (buccal or blood DNA samples) as their matched case whenever possible. All individuals included in this analysis were non-Hispanic white.
Buccal brushes were collected on each participant using CYTO-PAK Cytosoft Brushes (Medical Packaging Corp., Camarillo, CA). Study research nurses were trained to collect buccal cytobrush samples in a standardized fashion, as previously described [11]. Blood samples were drawn by hospital nursing personnel using two 10 ml purple-top tubes with EDTA solution from subjects during their hospital stay. Controls were recruited from the community using random-digit dialing. This study protocol was approved by the Institutional Review Boards of the University of Cincinnati and all participating hospitals, and all subjects provided written informed consent for genetic testing.

DNA extraction and genotyping
Both blood and buccal cell DNA were extracted using PureGene DNA extraction kits specific to sample type (Gentra Systems, Inc. Minneapolis, MN), according to manufacturer directions. Briefly, the extraction for buccal DNA was as follows: buccal brushes were cut and placed into a microfuge tube containing cell lysis solution and proteinase K, and incubated overnight at 55°C. After cooling, protein precipitation solution was added, vortexed and centrifuged. The supernatant was transferred to a tube containing isopropanol and glycogen solution, incubated at room temperature, and centrifuged. The DNA pellet was washed with 70% ethanol and air dried. TE buffer was added and incubated at 1 hour at 65°C prior to storage at −20°C.
The total concentration, 260/280, 260/230 ratios of genomic DNA for all samples were measured using a spectrophotometer (NanoDrop ND-1000, NanoDrop Technologies). Finally, double-stranded DNA (dsDNA) concentration was measured using the Quant-it dsDNA BR assay kit and Qubit fluorometer (Invitrogen). All DNA samples were normalized to 50 ng/μl dsDNA in reduced EDTA TE buffer. Samples that did not meet 260/280 ratio of at least 1.7 and 260/230 ratio of at least 1.0 were not included in genotyping.
Cases and their matched controls were arrayed on the same plates, and their genotypes called at the same time. Genotyping was performed on the Affymetrix GeneChip Scanner 3000 platform using Human SNP Array 6.0. The recommended protocol as described in the Affymetrix manual was followed. Five μl (250 ng) of dsDNA was digested with Sty I and ligated to Sty I adapters using T4 DNA ligase. Another 5 μl (250 ng) of dsDNA was digested with NspI and ligated to Nsp I adapters using T4 DNA ligase. Two digested samples were then PCR amplified individually using TITANIUM DNA amplification kit (Clontech) on an ABI9700 machine. PCR products were pooled and purified using the Agencourt AMPure magnetic beads (Beckman Coulter) and 96-well filter plate (E&K Scientific) followed by fragmentation and labeling. Samples were then injected into cartridges, hybridized, washed, and stained. Mapping array images were obtained using the GeneChip Scanner 3000 with GCOS software. Image files were uploaded to the Cincinnati Children's Hospital Medical Center (CCHMC) Genotyping Data Repository in 10 batches of~96 samples per batch. Samples were determined to be initial QC failures if they failed to meet the default Birdseed (v. 1.12.0) QC thresholds of Contrast Quality Control (cQC) > 0.4, Dynamic Model (DM) call rate >83%, or were gender mismatches based on actual versus inferred gender. A subset of samples that failed initial QC were re-extracted and/or rehybridized in an attempt to recover the sample; however, this rarely resulted in sample recovery. Samples that failed initial QC were excluded from full genotyping with Birdseed clustering. Birdseed genotyping was conducted using standard settings. Genotyping results were evaluated for evidence of batch effects using the median test, and results were not different by batch.

Analysis design
All data analysis was conducted using SAS v.9.2 (SAS Institute, Cary, NC). Descriptive characteristics between participants with buccal versus blood samples were compared using parametric t-tests, non-parametric Wilcoxon rank sum, χ 2 analyses, or Fisher's Exact tests, as appropriate to the distribution of the variable.
The statistical evaluation of DNA performance was partitioned into two phases. The first phase was designed to evaluate the role of time since sample collection; relevant patient lifestyle (e.g., smoking or drinking alcohol or caffeinated beverages); and quantitative laboratory-based sample metrics for the ability to discriminate buccal DNA samples likely to fail pregenotype calling quality control (QC) screening. Identification of independent predictors of sample failure was conducted in real-time (during the course of active genotyping) using data from the first three batches of samples (n = 270 buccal samples). Thresholds were established and then applied to the remaining samples run after that point. After the conclusion of all genotyping, analyses were conducted to compare QC metrics between buccal and blood samples and to calculate failure rates based on the previously established thresholds. This analysis sample set included results from both successful and failed buccal and blood samples, limited to one result per unique individual (n = 850) to avoid inflation of sample failure due to failed repeat hybridizations for some samples.
For this first phase of analysis, the dependent variable was success or failure of buccal DNA to pass pregenotyping QC metrics. Independent variables for the first phase included the time between sample collection and genotyping; relevant patient lifestyle variables (current cigarette smoking, current cigar smoking, frequency of alcohol use, average daily intake of caffeinated coffee, tea or soda); and the quantity (total DNA concentration, dsDNA concentration, ds/total DNA ratio) and quality (260/280 and 260/230 ratios) of extracted DNA. Each variable was examined for outliers, and variables with excessive skewness or kurtosis (<−1 or >1) were natural log transformed to improve normality; both total DNA and dsDNA concentration were thus analyzed in log-transformed units. Because of differences in measurement method and sensitivity between the Nanodrop and Qubit DNA quantitation techniques, ds/total DNA ratios occasionally exceeded 1.0. Buccal DNA sample characteristics between samples that passed initial QC versus those that failed were compared using Wilcoxon rank sum statistics (for continuous variables) or Fisher's Exact Test (for categorical variables). Logistic regression was used to construct receiver operator characteristic (ROC) curves and establish thresholds of significant variables to distinguish initial QC success from failure.
The second phase of analysis was designed to test the performance of buccal versus blood DNA samples that passed initial QC. Metrics for this analysis included sample call rate, minor allele frequency (MAF) comparisons between buccal and blood samples in reference to Caucasian (CEU) HapMap samples, and average heterozygosity rates. Heterozygosity rates were calculated separately by sex across all chromosomes. Call rates and heterozygosity rates between buccal and blood samples were compared using Wilcoxon Signed Rank tests.
Low MAF can influence the performance of the clustering algorithms for calling genotypes. Absolute differences between buccal or blood MAF and the CEU reference were analyzed using Wilcoxon Signed Rank tests. In addition, we explored the effect of specific MAF categories (MAF < 0.5%, 0.5%-1.5%, 1.5%-2%, 2%-5%, 5%-10%, 10%-15% and 15%-40%) on the deviation of buccal or blood sample MAF from the CEU HapMap standard. MAF categories were established based on the CEU HapMap reference, with MAF >40% excluded from analysis to minimize the likelihood that the minor allele in HapMap would be the major allele in our population.

Competing interests
None of the authors reports any competing interests relative to the work presented in this manuscript.
Authors' contributions JGW and LJM conceived of and designed the analysis, assessed genotyping quality control, conducted and oversaw statistical analysis and drafted the manuscript. LD conducted statistical analysis. WMB, TDH and CDL provided input to the study design and analysis, and provided critical review of the manuscript. CJM provided data management and phenotype identification. MH collected buccal swab samples and coordinated subject recruitment. GS, SRI and HC conducted all genotyping and initial quality control processing. RD oversaw the genotyping process and DW participated in the design of the sub-study and provided critical review of the manuscript. All authors read and approved the final manuscript.