We have demonstrated that disease-related intermediate traits can be used to identify novel disease risk genes. Specifically, we used LDL-cholesterol traits to fine-map a linkage peak on chromosome 5 from the GENECARD study of early-onset CAD, and integrated these results with association and linkage analyses of cardiovascular disease. Using this approach, we have identified four candidate genes (EBF1, PPP2R2B, PRELID2, and SPOCK1) that may represent novel cardiovascular disease risk genes acting through LDL cholesterol pathways.
Although no genome scans for CAD or MI have reported linkage to this region, several potentially related phenotypes have been mapped to the 5q31 locus, including inflammatory or autoimmune conditions (celiac disease, asthma, Graves' disease, psoriasis, and Crohn disease) as well as cardiac and vascular phenotypes (cardiomyopathy, intracranial aneurysm, infantile hemangioma, and systolic and diastolic blood pressures [35, 36]). Several GWAS and meta-analyses have been published for lipid-related traits [37–40]. However, only one study has reported a significant association for any gene on chromosome 5q31-33: T-cell immunoglobulin and mucin domain containing 4 (TIMD4), which is 1.7 Mb centromeric to EBF1 and 9.9 Mb telomeric to PPP2R2B. Nominally significant results were obtained for SNPs within TIMD4 and its ligand (HAVCR2), which is 152 kb telomeric to TIMD4 (Additional Files 2 Table S1, 6 Table S1, and 8 Table S1).
Using the NHGRI database of publicly available GWAS results, we looked for reports of significant associations for SNPs within the 1-lod interval on chromosome 5. There are a few references to cardiovascular disease or disease-related traits, but none of the reported SNPs overlap with the significant SNPs in our results. The closest overlap comes from two studies of hypertension/systolic blood pressure [41, 42], each of which reported one significant association with a SNP in EBF1; unfortunately, neither of those two SNPs was examined in our study.
The gene EBF1 is involved in hematopoiesis and immunity. Interestingly, studies of knockout mice have identified a role for EBF1 in metabolism, a cardiovascular disease-related phenotype. The null mice described by Fretz et al. have a unique metabolic syndrome characterized by lipodystrophy, hypotriglyceridemia, and hypoglycemia, together with an increased metabolic rate and decreased leptin levels. The mouse lipodystrophy is characterized by an increase in yellow adipose tissue in bone marrow and a marked decrease in white adipose tissue (by as much as 90%) relative to wild-type controls. These findings are consistent with EBF1's regulation of adipocyte progenitors [44, 45]. In our study, SNPs in EBF1 were significantly associated with LDL cholesterol traits and the CAD endpoints, with the exception of two-point linkage. In addition, EBF1 variation was associated with leptin levels in our sample, although the results for individual SNPs were inconsistent with their associations with lipids and CAD endpoints. This may suggest that EBF1 has a similar role in regulating adiposity and lipid metabolism in humans, and that variants in the gene may represent good candidate polymorphisms for cardiovascular disease and dyslipidemia.
Of the other candidate genes identified, SPOCK1 has been associated with age at menarche in genome-wide association studies [46, 47]. SPOCK1 encodes a proteoglycan that functions as a protease inhibitor; although initially identified in testes, it is expressed in many human tissues, including blood. SNPs within SPOCK1 were tested for sex-specific effects in our sample via a stratified analysis; no significant sex effects were observed (data not shown). PPP2R2B encodes a brain-specific regulatory subunit of a protein phosphatase and is the causal locus for a Mendelian disease, a form of spinocerebellar ataxia (SCA12, OMIM# 604326). Little is known about the function of PRELID2 other than that it contains a 'prel-like domain,' from which its name is derived.
Using a strategy of parallel analyses, we have identified four novel candidate genes for cardiovascular disease. The ability to reduce the list of potential candidates within the linkage region on chromosome 5q31-33 from a few hundred to only four is proof of principle that this strategy may be a useful tool for analyzing complex traits. In addition, had we relied upon CAD endpoint analyses alone, we would have obtained less significant associations overall and would have prioritized a different set of candidate genes. One of the major strengths of our study is the detailed phenotype information available for both the GENECARD cohort and the CATHGEN biorepository. The rigorous inclusion criteria and case definitions used in GENECARD and CATHGEN have led to objective measures for CAD endpoints, a phenotype that would otherwise have a subjective definition. In this particular study, direct sequencing, rather than SNP genotyping across multiple samples, would not have been appropriate: if we were to re-sequence a subset of the sample, it is not clear which individuals would be selected for such sequencing, particularly for the continuous quantitative traits we examined. Finally, our sample is of mixed ethnicity (Caucasian and African-American), which would necessitate a separate re-sequencing effort for each ethnicity.
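The gene-level filtering described above can be illustrated with a minimal sketch. It assumes each analysis yields a list of (gene, SNP, p-value) records and keeps genes that have at least one nominally significant SNP in a chosen number of analyses. The function name, the analysis labels, the gene and SNP identifiers, the nominal cutoff of 0.05, and all p-values are hypothetical placeholders, not results or settings from this study.

```python
from collections import defaultdict


def prioritize_genes(results, alpha=0.05, min_analyses=2):
    """Return genes with a nominally significant SNP in >= min_analyses analyses.

    `results` maps an analysis name to a list of (gene, snp_id, p_value) tuples.
    Consistency is judged per gene, so different SNPs may carry the signal
    in different analyses.
    """
    analyses_hit = defaultdict(set)
    for analysis, rows in results.items():
        for gene, _snp_id, p_value in rows:
            if p_value < alpha:
                analyses_hit[gene].add(analysis)
    return sorted(g for g, hit in analyses_hit.items() if len(hit) >= min_analyses)


# Placeholder data for illustration only (not study results).
example_results = {
    "ldl_association": [("GENE_A", "snp1", 0.003), ("GENE_B", "snp4", 0.20), ("GENE_C", "snp6", 0.04)],
    "cad_association": [("GENE_A", "snp2", 0.01), ("GENE_B", "snp4", 0.03), ("GENE_C", "snp7", 0.60)],
    "cad_linkage": [("GENE_A", "snp1", 0.04), ("GENE_B", "snp5", 0.45), ("GENE_C", "snp6", 0.02)],
}

print(prioritize_genes(example_results, min_analyses=3))  # genes significant in every analysis
print(prioritize_genes(example_results, min_analyses=2))  # a more permissive filter
```

Lowering `min_analyses` mirrors the situation in which a gene is retained even though one analysis (for example, two-point linkage) does not support it.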
This study has some limitations. First, all four genes may be CAD susceptibility genes that act independently (as the LD patterns in our samples suggest). However, it is possible that SNPs within our sample are in LD with causal SNPs at another locus, in which case the association results from these four genes may not be independent. Second, we have focused on the consistency of the results at the gene level (i.e., which genes have SNPs that are significant in multiple analyses). However, the same SNPs are not significant across those analyses and, where the same SNP is significant, the magnitude of that significance is not similar between analyses. Thus, we cannot begin to identify individual SNPs within a candidate gene that are likely to be driving the results via direct or indirect biological action. This can be explained, in part, by the fact that the phenotypes, while correlated, are not perfectly correlated. It is therefore expected that the p-values for associations with different phenotypes will differ, which could cause certain associations to fall outside the nominal p-value cutoff. Finally, the results were not interpreted in the context of correction for multiple comparisons. There are two main difficulties with applying such a correction (e.g., a Bonferroni correction, which would be overly conservative) to these results. First, the phenotypes examined are correlated, so the analyses conducted using more than one phenotype within the same sample are not independent. Second, if we look at the results sequentially, with one analysis conducted after another, then the prior probability that a given SNP in a gene of interest will be significant in subsequent analyses is non-negligible. We are not relying upon the magnitude of any given p-value to identify a single gene in the region as the most likely to explain the original evidence for linkage. Rather, we are suggesting that a set of genes be examined as likely candidate susceptibility loci for cardiovascular disease that is mediated by lipid levels.
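To make the point about conservativeness concrete, the following minimal sketch compares the family-wise error rate (FWER) achieved by a Bonferroni-corrected threshold under two limiting assumptions: fully independent tests and perfectly correlated tests. The function names, the overall alpha of 0.05, and the choice of 10 tests are illustrative assumptions and do not correspond to the number of analyses in this study.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    if n_tests < 1:
        raise ValueError("n_tests must be at least 1")
    return alpha / n_tests


def fwer_independent(per_test_alpha, n_tests):
    """Probability of at least one false positive across independent tests."""
    return 1.0 - (1.0 - per_test_alpha) ** n_tests


def fwer_fully_correlated(per_test_alpha):
    """With perfectly correlated tests, all tests reject together,
    so the family-wise rate equals the per-test rate."""
    return per_test_alpha


alpha, n_tests = 0.05, 10  # illustrative values only
threshold = bonferroni_threshold(alpha, n_tests)

print("per-test threshold:", threshold)
print("FWER if tests are independent:", fwer_independent(threshold, n_tests))
print("FWER if tests are perfectly correlated:", fwer_fully_correlated(threshold))
```

Under independence the corrected threshold holds the FWER just below the target alpha, whereas under perfect correlation the realized FWER is only alpha divided by the number of tests. Correlated phenotypes fall between these two extremes, which is why a Bonferroni correction is stricter than needed in that setting.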
Several methodologies are available to identify which gene or genes among the four we have selected contain variation that influences cardiovascular disease through LDL cholesterol pathways. Re-sequencing studies could be conducted in our sample, either in the entire population or in individuals with extreme trait values (e.g., very high or very low LDL-C levels, or very early onset cardiovascular disease). Such data would capture all of the variation present in those samples rather than relying upon common variants identified in a different, although ethnically similar, sample (i.e., CEPH Caucasians). In addition, those genes for which the biological function is known (e.g., EBF1) could have their level of activity or functionality directly assessed in genotyped samples. Such an approach can identify subsets of variation that appear to have functional consequences. However, because of LD among the variants, such results can still be ambiguous, in which case promoter and gene constructs can be created and assayed in the laboratory, allowing one to query the functional consequences of individual variants.