Whole-genome association studies on alcoholism comparing different phenotypes using single-nucleotide polymorphisms and microsatellites

Alcoholism is a complex disease. As with other common diseases, genetic variants underlying alcoholism have been elusive, possibly because of the small effect of each individual susceptibility variant, gene × environment and gene × gene interactions, and complications in phenotype definition. We conducted two association tests, the family-based association test (FBAT) and the backward haplotype transmission association (BHTA) test, on the Collaborative Study of the Genetics of Alcoholism (COGA) data provided by Genetic Analysis Workshop (GAW) 14. Efron's local false discovery rate method was applied to control the proportion of false discoveries. For FBAT, we compared the results based on different types of genetic markers (single-nucleotide polymorphisms (SNPs) versus microsatellites) and different phenotype definitions (clinical diagnoses versus electrophysiological phenotypes). For SNPs, significant associations were found only with clinical diagnoses. In contrast, for microsatellites, significant associations were found only with electrophysiological phenotypes. In addition, we obtained BHTA association results for SNPs and microsatellites using COGA diagnosis as the phenotype. In this case, the results for SNPs and microsatellites are more consistent. Compared to FBAT, more significant markers are detected with BHTA.


Background
Alcoholism is a serious public health problem. Genetic variants underlying alcoholism have been difficult to identify for many reasons, including issues with diagnoses, disease heterogeneity, and gene × gene and gene × environment interactions. These factors make it very challenging for human geneticists to identify genes associated with alcoholism susceptibility.
Recently, great efforts have been devoted to conducting genome-wide analysis on a large number of families to map genes for alcoholism. For example, the Collaborative Study of the Genetics of Alcoholism (COGA) collected 1,614 family members, including alcoholic people and their relatives. For each individual, a total of 15,840 single-nucleotide polymorphism (SNP) markers from Affymetrix and Illumina and 328 microsatellite markers have been genotyped. Both COGA diagnosis and DSM-IV diagnosis are used to define each person's phenotype. In addition, the electrophysiological phenotypes are tested by the Visual Oddball experiment with event-related potential (ERP) records and the Eyes Closed Resting electroencephalogram (EEG) experiment. The associations between alcoholism and ERP and EEG have been reported in several published papers [1,2].
In this paper we perform family-based association tests (FBAT) [3] based on SNPs and microsatellites, using both clinical diagnosis phenotypes and electrophysiological phenotypes, to identify genetic variants associated with alcoholism in the COGA dataset. In order to consider possible gene × gene interactions, we also perform backward haplotype transmission association (BHTA) tests [4] based on SNPs and microsatellites using COGA diagnosis phenotype.

FBATs for different phenotypes and different markers
The original transmission disequilibrium test (TDT) was proposed to test genetic linkage in the presence of association between a candidate marker and disease phenotype by comparing, among heterozygous parents, the total number of a specific allele transmitted to the affected offspring with what would be expected under the null hypothesis [5]. Laird and colleagues have extended the original TDT to a comprehensive association analysis approach called FBAT [3], which is implemented in the FBAT program [6]. Conditioning on the sufficient statistics for any nuisance parameters, the expected allele distributions are obtained under the null hypothesis of no association. This method avoids confounding due to model misspecification and admixture or population stratification. In this paper, we use FBAT to test association and linkage between genetic markers and phenotypes in the COGA dataset. The phenotypes analyzed include COGA diagnosis, DSM-IV diagnosis, and ERP electrophysiological phenotypes. The genetic markers analyzed include SNPs and microsatellites.
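As a rough illustration of the comparison described above, the original TDT for a biallelic marker can be sketched as a McNemar-type statistic on transmissions from heterozygous parents. The function and argument names are illustrative and are not taken from the FBAT program; the sketch assumes complete trios with one affected offspring each.

```python
import math


def tdt_statistic(transmitted: int, untransmitted: int) -> tuple[float, float]:
    """Return (chi-square, p-value) for the classic biallelic TDT.

    transmitted:   times heterozygous parents passed the allele of interest
                   to an affected offspring
    untransmitted: times heterozygous parents passed the other allele instead
    """
    informative = transmitted + untransmitted
    if informative == 0:
        raise ValueError("no heterozygous parents, so the test is undefined")
    chi_square = (transmitted - untransmitted) ** 2 / informative
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi_square / 2.0))
    return chi_square, p_value
```

Under the null hypothesis each heterozygous parent transmits either allele with probability 1/2, so the two counts are expected to be equal and a large statistic indicates excess transmission of one allele.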
FBAT was performed for every SNP marker (15,406) and microsatellite marker (315) except those on the X chromosome; these markers were tested individually. All of the family members in COGA were included in the study. Individuals who never drink alcohol or who have some symptoms but do not meet the diagnostic criteria were considered as having unknown disease phenotype. According to t-tests between purely unaffected and affected unrelated persons, the ttdt1 and ttdt4 channels in the ERP dataset have p-values less than 0.1. ttdt1 corresponds to electrodes placed on the scalp location FP1, which is the far frontal left side channel, and ttdt4 corresponds to electrodes placed on the scalp location PZ, which is the parietal midline channel. These two measures are used as quantitative traits in FBAT. The offset values µ for COGA diagnosis and DSM-IV diagnosis are set to 0, and the offset values µ for the electrophysiological phenotypes are set to the sample means. Here, µ is a nuisance parameter, and misspecification of µ will not bias the test (different values of µ for COGA diagnosis and DSM-IV diagnosis (0.2 and 0.5) have been tested and similar results are obtained). Additive models are used for the genotype coding.
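To make the role of the offset µ and the additive coding concrete, the following minimal sketch computes an FBAT-style statistic for independent trios with both parental genotypes observed. It is a simplified illustration, not the FBAT program: the `Trio` type and the function name are hypothetical, genotypes are coded as the count (0, 1, or 2) of the allele of interest, and the general conditioning on sufficient statistics is reduced to conditioning on the parental genotypes.

```python
import math
from dataclasses import dataclass


@dataclass
class Trio:
    father: int   # copies of the tested allele (0, 1, or 2)
    mother: int
    child: int
    trait: float  # 1/0 for a diagnosis, or a quantitative measurement


def fbat_z(trios: list[Trio], offset: float) -> float:
    """Z statistic for additive coding, assuming independent complete trios."""
    score = 0.0
    variance = 0.0
    for t in trios:
        weight = t.trait - offset               # T = Y - mu
        expected = (t.father + t.mother) / 2.0  # E[X | parents] under the null
        # Each heterozygous parent adds Mendelian variance 1/4;
        # homozygous parents add none.
        mendelian_var = 0.25 * ((t.father == 1) + (t.mother == 1))
        score += weight * (t.child - expected)
        variance += weight ** 2 * mendelian_var
    if variance == 0.0:
        raise ValueError("no informative trios")
    return score / math.sqrt(variance)
```

With `offset = 0` and a 0/1 diagnosis, unaffected offspring receive zero weight, so only affected offspring contribute; in this restricted setting the squared statistic coincides with the TDT chi-square. A nonzero offset lets unaffected offspring contribute with negative weight.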
Efron's local false discovery rate method [7] was applied to the FBAT results to identify significant markers after multiple comparison adjustments. This method is implemented in the R package "locfdr" [8]. Let $z$ be the test statistics or the transformed p-values ($z = \Phi^{-1}(p)$, where $\Phi$ denotes the standard normal cumulative distribution function). Let $f(z)$ be the density function of $z$. We assume $f(z) = p_0 f_0(z) + p_1 f_1(z)$, where $f_0(z)$ is the density function for non-significant markers and $f_1(z)$ is the density function for significant markers. The natural spline method is applied to estimate $f(z)$. $f_0(z)$ is either the theoretical null distribution (the standard normal distribution) or the empirical null distribution, a normal distribution with mean and variance estimated from the central part of the $f(z)$ fit. The local false discovery rate is defined by $f_0(z)/f(z)$, which is based on densities. Benjamini and Hochberg's false discovery rate [9] corresponds to the "tail-area" version of the local false discovery rate. The false discovery rate at $z$ can be written as the weighted average of the local false discovery rates at $z_i$ (for $z_i$ from $z$ to $\infty$). Therefore, when we use a local false discovery rate of 0.1 as our criterion, the corresponding false discovery rate should be less than 0.1. For SNPs, we used the test statistics as $z$ because their distribution is approximately $N(0,1)$, and chose $f_0(z)$ as the theoretical null. We used the full range of $z$ to estimate $f(z)$, with 5 degrees of freedom for the splines and 60 breaks for the histogram counts. For microsatellites, we used the transformed p-values as $z$ because the distribution of the test statistics is not approximately $N(0,1)$, and chose $f_0(z)$ as the estimated empirical null. We again used the full range of $z$ to estimate $f(z)$, with 5 degrees of freedom for the splines and 60 breaks for the histogram counts. Markers with a local false discovery rate <0.1 were included in the summary results.
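The two steps in this argument, transforming p-values to z-scores and averaging local false discovery rates over a tail, can be sketched as follows. This is only an illustration and not the locfdr implementation: the function names are hypothetical, the densities are taken as given rather than estimated by spline fitting, and the tail average is the simple empirical mean over observed markers.

```python
from statistics import NormalDist


def p_to_z(p_value: float) -> float:
    """Transform a p-value to a z-score via z = Phi^{-1}(p)."""
    return NormalDist().inv_cdf(p_value)


def local_fdr(f0_density: float, mixture_density: float) -> float:
    """Local false discovery rate f0(z) / f(z), capped at 1."""
    return min(1.0, f0_density / mixture_density)


def tail_area_fdr(
    z_scores: list[float], local_fdrs: list[float], cutoff: float
) -> float:
    """Empirical tail-area Fdr: mean local fdr over markers with z >= cutoff."""
    tail = [fdr for z, fdr in zip(z_scores, local_fdrs) if z >= cutoff]
    if not tail:
        raise ValueError("no markers at or beyond the cutoff")
    return sum(tail) / len(tail)
```

If the local fdr decreases toward the tail, every marker beyond a cutoff chosen at local fdr = 0.1 has a local fdr of at most 0.1, so their average cannot exceed 0.1.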

BHTA approach for different markers
Another extension of the original TDT is the BHTA algorithm [4]. In BHTA, the inferred haplotypes are treated as alleles in TDT. The haplotypes transmitted to the affected offspring are compared with the expected haplotype distribution among all the offspring, where haplotype has a generalized definition in this procedure [4]. For BHTA, a small number of markers are randomly selected each time to construct a candidate haplotype. A backward selection algorithm is then used to screen out unimportant markers one by one until only the important markers associated with the trait remain. The sampling is repeated many times and the markers returned most often are considered as the associated markers. BHTA may take the interactions between markers into account because it considers haplotype information, and BHTA is computationally efficient for a whole-genome scan study. In this paper, we use BHTA to identify markers associated with disease phenotype for the COGA dataset accounting for both joint and marginal effects.
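The sampling-and-screening loop described above can be sketched roughly as follows. This is a structural illustration only and not the authors' BHTA implementation: `gain_if_removed` is a hypothetical callable standing in for the haplotype-based importance measure of [4], which is not reproduced here, and the stopping rule (stop when no removal yields a positive gain) is an assumption.

```python
import random
from collections import Counter
from typing import Callable, Sequence


def backward_screen(
    subset: list[str],
    gain_if_removed: Callable[[list[str], str], float],
) -> list[str]:
    """Drop the least useful marker repeatedly until no removal helps."""
    kept = list(subset)
    while kept:
        # Marker whose removal improves the (hypothetical) score the most.
        worst = max(kept, key=lambda m: gain_if_removed(kept, m))
        if gain_if_removed(kept, worst) <= 0:
            break  # every remaining marker is judged important
        kept.remove(worst)
    return kept


def return_frequencies(
    markers: Sequence[str],
    gain_if_removed: Callable[[list[str], str], float],
    subset_size: int,
    n_repeats: int,
    seed: int = 0,
) -> Counter:
    """Count how often each marker survives screening across random subsets."""
    rng = random.Random(seed)
    counts: Counter = Counter()
    for _ in range(n_repeats):
        sample = rng.sample(list(markers), subset_size)
        counts.update(backward_screen(sample, gain_if_removed))
    return counts
```

Markers with high counts in the returned tally correspond to the "markers returned most often" in the text; the real importance measure would be computed from transmitted and untransmitted haplotypes in the trios.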
The imputation of missing genotypes and the inference of haplotypes given multilocus unphased genotypes were performed according to the procedure described in Lo and Zheng [10]. There are 266 trios with an affected child in the study. Families with more than one affected child were partitioned into multiple trios; this extension is validated by Lo and Zheng [4]. Microsatellites were dichotomized according to their repeat numbers, with the probability of "allele 0" as close to 0.5 as possible. Based on COGA diagnosis, for the 15,406 SNPs, we sampled 30 markers each time and repeated the sampling 200,000 times. For the 315 microsatellites, we sampled 30 markers each time and repeated the sampling 20,000 times. For each sampling, the haplotype information based on the 30 markers was considered and the unimportant markers were deleted. The returned frequency for each marker was recorded.
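One way to carry out the dichotomization described above is to pick a repeat-number cutoff that splits the observed alleles as evenly as possible. The text does not spell out the exact rule, so the threshold approach and the names below are assumptions made for illustration.

```python
from collections import Counter


def choose_cutoff(repeat_numbers: list[int]) -> int:
    """Return the cutoff c for which P(repeat number <= c) is closest to 0.5."""
    counts = Counter(repeat_numbers)
    total = len(repeat_numbers)
    best_cutoff, best_gap = None, float("inf")
    cumulative = 0
    for allele in sorted(counts):
        cumulative += counts[allele]
        gap = abs(cumulative / total - 0.5)
        if gap < best_gap:  # ties keep the smaller cutoff
            best_cutoff, best_gap = allele, gap
    if best_cutoff is None:
        raise ValueError("no alleles supplied")
    return best_cutoff


def dichotomize(repeat_numbers: list[int], cutoff: int) -> list[int]:
    """Code alleles at or below the cutoff as 0 and longer alleles as 1."""
    return [0 if r <= cutoff else 1 for r in repeat_numbers]
```

For example, with observed repeat numbers 10, 10, 11, 12, 12, 12, 13, 14, the cutoff 11 gives "allele 0" a frequency of 3/8, the closest achievable value to 0.5.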
The local false discovery rate (fdr) method [7] was applied to the returned frequencies to separate the significant markers and the non-significant markers. We used the returned frequencies as $z$ and chose $f_0(z)$ as the estimated empirical null. The full range of $z$ was used to estimate $f(z)$, with 5 degrees of freedom for the splines and 60 breaks for the histogram counts. Local fdr = 0.1 was chosen as the selection criterion, which corresponds to a returned frequency of 310 for SNPs and 908 for microsatellites.

FBAT results
A total of 6 SNPs were found to be associated with COGA diagnosis at local fdr = 0.1. They are located on chromosomes 3, 9, 13, 16, and 20. Four SNPs were associated with DSM-IV diagnosis at fdr = 0.1. They are located on chromosomes 1, 6, 9, and 11. SNP tsc0124879 on chromosome 9 is common for these two clinical diagnoses. For ERP, no significant SNP was detected at fdr = 0.1 for either the ttdt1 or ttdt4 channel. For microsatellites, D16S3253 on chromosome 16 was found to be associated with ttdt1 channel at fdr = 0.1. No significant microsatellites were detected at fdr = 0.1 for either COGA diagnosis or DSM-IV diagnosis. The above results are summarized in Table 1.

BHTA results
BHTA was applied only to COGA diagnosis in this study. For SNPs, using local fdr = 0.1 as the criterion, which corresponds to a returned frequency of 310, 23 SNPs were found to be significant with respect to the COGA diagnosis. Among these 23 SNPs, 3 are on chromosome 9, 3 on chromosome 13, 2 each on chromosomes 1, 5, 6, and 14, and 1 each on chromosomes 3, 4, 7, 8, 10, 15, 16, 18, and 20. SNP tsc0271621 on chromosome 13 was found to be significant based on both FBAT and BHTA. These results are summarized in Table 2. For microsatellites, using local fdr = 0.1 as the criterion, which corresponds to a returned frequency of 908, GATA175H06 on chromosome 9 and D2S2370 on chromosome 2 are significant.

Discussion
We have obtained FBAT results for different phenotypes using SNPs and microsatellites. The results for COGA diagnosis and DSM-IV diagnosis are similar: 27 of the top 50 markers are shared between these two diagnoses (data not shown). However, the results for clinical diagnoses differ from those for electrophysiological phenotypes. For the two clinical diagnoses, 6 and 4 significant SNPs, respectively, were found at fdr = 0.1, with no significant microsatellites. Among the significant SNPs, SNP tsc0124879 on chromosome 9 is common to the two clinical diagnoses. For the ERP channel ttdt1, one significant microsatellite (D16S3253) was found at fdr = 0.1, with no significant SNPs. Because the SNP scan has a higher resolution than the microsatellite scan, and therefore better coverage in terms of linkage disequilibrium, we would expect to identify more significant SNPs than microsatellites in this study. However, the underlying reasons for the different results for the clinical phenotypes and electrophysiological phenotypes are unclear. One possible reason may be that the electrophysiological phenotypes are associated with disturbed cognitive processing, which involves not only alcoholism but also other psychiatric behaviors.

Conclusion
In this study, we compared the use of different phenotypes (clinical phenotypes and electrophysiological phenotypes) and different types of genetic markers (SNPs and microsatellites) to identify genetic variants underlying alcoholism in the framework of family-based association tests. Significant SNPs were found for clinical phenotypes and a significant microsatellite was found for ERP phenotypes. There is little overlap between the significant regions identified with the two types of markers. Compared to FBAT, we detected more significant SNPs using BHTA. For BHTA, the microsatellite results are consistent with the SNP results according to their close genetic positions (within 3 cM). Both FBAT and BHTA reveal that SNP tsc0271621 is significant.

Authors' contributions
LC participated in the design of the study, performed the analysis, and drafted the manuscript. NL, SW, and CO participated in the design of the study. NJC participated in the programming. HZ supervised the study, participated in its design, and helped to draft the manuscript. All authors read and approved the final manuscript.