A large-scale survey of genetic copy number variations among Han Chinese residing in Taiwan
© Lin et al. 2008
Received: 04 June 2008
Accepted: 24 December 2008
Published: 24 December 2008
Copy number variations (CNVs) have recently been recognized as important structural variations in the human genome. CNVs can affect gene expression and thus may contribute to phenotypic differences. The copy number inferring tool (CNIT) is an effective hidden Markov model-based algorithm for estimating allele-specific copy number and predicting chromosomal alterations from single nucleotide polymorphism microarrays. The CNIT algorithm, which was constructed using data from 270 HapMap multi-ethnic individuals, was applied to identify CNVs from 300 unrelated Han Chinese individuals in Taiwan.
Using stringent selection criteria, 230 regions with variable copy numbers were identified in the Han Chinese population; 133 (57.83%) had been reported previously, and 64 displayed a CNV allele frequency greater than 1%. The average size of the CNV regions was 322 kb (ranging from 1.48 kb to 5.68 Mb), and together they covered 2.47% of the human genome. A total of 196 of the CNV regions were simple deletions and 27 were simple amplifications. There were 449 genes and 5 microRNAs within these CNV regions; some of these genes are known to be associated with diseases.
The identified CNVs are characteristic of the Han Chinese population and should be considered when genetic studies are conducted. The CNV distribution in the human genome is still poorly characterized, and there is much diversity among different ethnic populations.
The human genome contains many DNA sequence variations, including single nucleotide polymorphisms (SNPs), short nucleotide insertions or deletions, tandem repeat sequences, and transposable elements. Recent human genome studies have revealed that copy number variations (CNVs) are more common than previously thought. Some CNVs are associated with gene expression levels and thus may contribute to phenotypic differences [2–7]. A CNV is a DNA segment > 1 kb that varies within the population through deletion and/or amplification. Although SNPs are regarded as the main source of phenotypic differences among humans, CNVs also have a large impact on differential gene expression.
Microarray-based approaches, so-called "molecular karyotyping", have been used to detect subtle chromosomal structure variations on a genome-wide scale. SNP microarrays are high-resolution genotyping tools that can simultaneously detect copy number (CN) alterations and loss-of-heterozygosity [9, 10]. The copy number inferring tool (CNIT), an algorithm recently developed for use with Affymetrix GeneChips, is based on a hidden Markov model (HMM) and efficiently predicts regions with subtle CN changes. CNIT had higher accuracy and lower variation in CN estimation than other programs, including the copy number analysis tool and the copy number analysis with regression and tree approach. In this study, 100 K GeneChip intensity data from 270 HapMap multi-ethnic individuals were used to determine the parameters of the CNIT algorithm. Intensity data from 300 normal unrelated individuals were then used to predict CNV regions using CNIT.
Stringent selection criteria were used to classify true and false CN-altered predictions: CN-altered regions found in at least two individuals were classed as CNV regions. A total of 230 copy number-variable regions (CNVRs) were identified in the sample population, 64 of which (27.83%) had a CNV allele frequency ≥ 1%. Of these 230 CNVRs, 133 (57.83%) had previously been reported in the genomic variant database (http://projects.tcag.ca/variation/). The CNVRs ranged from 1.48 kb to 5.68 Mb (mean = 322 kb), and contained 449 genes and 5 microRNAs (miRNAs). Sixty-three CNVRs (27.39%) were associated with segmental duplications (SDs), which are known to induce non-allelic homologous recombination and produce structural variations.
Among the 300 individuals, DNA samples were extracted from peripheral blood for 76 and from cell lines for 224. Subtle chromosomal abnormalities are known to arise during cell culture. Although all these cell lines were freshly cultured, some chromosomal abnormalities might still have been generated during culture. To exclude this possibility, only those CN-altered regions that were identified in more than two individuals were classed as CNVRs. Of the 549 CN-altered segments, 230 were classified as CNVRs in the Han Chinese population (Additional File 1). Among these 230 CNVRs, 133 (57.83%) had been reported previously and 97 (42.17%) were unique to this study. Ten of these novel CNVRs (10%) were validated using qPCR. Of the 230 CNVRs, 196 (85.22%) had simple deletion alleles, 27 (11.74%) had simple amplification alleles, and 7 (3.04%) comprised both deletion and amplification alleles. These CNVRs ranged from 1.48 kb to 5.68 Mb (mean = 322 kb) and covered a total of 2.47% of the human genome. As shown in Figure 3, these CNVRs are widely dispersed throughout the genome, with the exception of chromosomes 20 and X. Moreover, 64 of the 230 CNVRs had a CNV allele frequency greater than 1%. Of the CNVs in this study, 27% were associated with SDs, similar to previous results (24%).
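The recurrence filter and the reported proportions can be checked with a short sketch. The Python below is illustrative only: the function name, the toy calls, and the region labels are hypothetical and are not part of CNIT. The counts (230, 196, 27, 7, 133, 97, 64) and the size figures are taken from the text; the recurrence threshold is left as a parameter, and the genome-size check assumes that the CNVRs do not overlap.

```python
from collections import defaultdict


def recurrent_regions(calls, min_individuals=2):
    """Keep regions observed in at least `min_individuals` distinct individuals.

    `calls` is an iterable of (individual_id, region_id) pairs.
    """
    carriers = defaultdict(set)
    for individual, region in calls:
        carriers[region].add(individual)
    return {r for r, ids in carriers.items() if len(ids) >= min_individuals}


# Hypothetical toy calls: regB is seen twice, but only in one individual.
calls = [
    ("S1", "regA"), ("S2", "regA"),
    ("S3", "regB"), ("S3", "regB"),
    ("S4", "regC"), ("S5", "regC"), ("S6", "regC"),
]
kept = recurrent_regions(calls)


def pct(part, total):
    """Percentage rounded to two decimals, as reported in the text."""
    return round(100 * part / total, 2)


TOTAL_CNVRS = 230
breakdown = {"deletion": 196, "amplification": 27, "both": 7}
shares = {kind: pct(n, TOTAL_CNVRS) for kind, n in breakdown.items()}
known_share = pct(133, TOTAL_CNVRS)
novel_share = pct(97, TOTAL_CNVRS)
frequent_share = pct(64, TOTAL_CNVRS)

# 230 regions x 322 kb mean size, covering 2.47% of the genome,
# implies a genome size of roughly 3 Gb if the regions do not overlap.
total_mb = TOTAL_CNVRS * 0.322
implied_genome_gb = total_mb / 0.0247 / 1000

print(kept, shares, known_share, novel_share, round(implied_genome_gb, 2))
```

Raising `min_individuals` to 3 gives the stricter "more than two individuals" reading of the criterion.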
There were 449 genes located within 81 of the 230 CNVRs, and most of these genes (56.35%) had simple amplifications (Additional File 2). The 449 genes identified are involved in many biological functions, including ion transport, metabolism, and cell-surface functions. Some of the identified genes are involved in potassium and/or sodium transport, such as the KCNT1, KCNJ8, SLC2A6 and SLC16A3 genes, and some are known to be involved in metabolism, such as the AMY genes that catalyze the first step in digestion of dietary starch and glycogen. The copy number of the AMY1 gene is positively correlated with salivary amylase protein levels, which might result in differences in the digestion of starchy foods among individuals.
MicroRNAs in the copy number-variable regions (CNVRs)
# of observations a
Target Gene (P value) c
Loss and Gain
SNCG (9.71 × 10⁻⁶)
NELF (1 × 10⁻⁶)
PLXDC2 (7.1 × 10⁻⁶)
XYLT1 (1.5 × 10⁻⁷)
BDKRB2 (6.7 × 10⁻⁷), HOXB13 (7.1 × 10⁻⁶)
Genes in the copy number-variable regions (CNVRs) are associated with disease and disease susceptibility
# of observations b
Gene symbol c
Hyperprolinemia, type II
Erythrocytosis, familial, 3
Chondrodysplasia punctata, rhizomelic, type 2
Endplate acetylcholinesterase deficiency
Bernard-Soulier syndrome, type C
Retinitis pigmentosa 4
Platelet glycoprotein IV deficiency
Loss and Gain
Thrombotic thrombocytopenic purpura, familial
Maturity-onset diabetes of the young, type VIII
Ehlers-Danlos syndrome, type I
Dopamine beta-hydroxylase deficiency
Leigh syndrome, due to COX deficiency
Myasthenic syndrome, congenital, associated with episodic apnea
Cerebrooculofacioskeletal syndrome 1
Gyrate atrophy of choroid and retina with ornithinemia
Episodic ataxia/myokymia syndrome
Cardiomyopathy, dilated, 1O
Glycogen storage disease, type 0
Lactate dehydrogenase-B deficiency
Spastic ataxia, Charlevoix-Saguenay type
Muscular dystrophy, limb-girdle, type 2C
Charcot-Marie-Tooth disease, type 1C
Deafness, autosomal dominant 20/26
Alveolar soft-part sarcoma
Glycogen storage disease II
Sanfilippo syndrome, type A
Dermatitis, atopic, 4
Arrhythmogenic right ventricular dysplasia, familial, 11
Keratosis palmoplantaris striata I
Arrhythmogenic right ventricular dysplasia, familial, 10
Hypotrichosis, localized, autosomal recessive
Congenital disorder of glycosylation, type Ig
Megalencephalic leukoencephalopathy with subcortical cysts
In this report, a newly developed algorithm, CNIT, was used to identify CNVRs in a normal Han Chinese population. A key assumption in constructing the CNIT algorithm is that the CN value of each SNP in the reference group is two; however, this is not always realistic. Because CNV data were available for the HapMap individuals, SNPs with known CN aberrations were excluded before the CNIT algorithm was constructed. When these CNVs in the reference group were excluded, the identification rate of known CNVs in the test group increased from 23.30% to 25.46%. Owing to the stringent criteria in this study (that is, the parameters of the HMM method and the selection criteria), the false-negative rate (FNR) might be high, resulting in an underestimation of the CNV allele frequency. Nevertheless, these stringent criteria substantially reduced the false-positive rate (FPR) and yielded reliable CNV results. Higher resolution microarrays or other technological approaches are needed to further address these structural variations in detail.
Most of the CNVRs (85.22%) identified in this study were simple deletions; only a few were simple amplifications (11.74%). This imbalance might be due to the emission probability (EP) of the HMM method in the CNIT algorithm. The intensity distribution in a cell line with three copies of the X chromosome was examined using 100 K GeneChip data and the CNIT algorithm. Intensity differences between samples containing gene amplifications (CN = 3) and those containing normal CNs (CN = 2) were less distinct than the differences between samples containing hemizygous deletions (CN = 1) and normal CNs (CN estimations for one, two and three copies of the X chromosome were 1.38 ± 0.42, 2.08 ± 0.47 and 3.03 ± 1.49, respectively). To eliminate false-positive findings, a stringent EP was used for amplifications (the EP for amplifications was the reverse of the EP for deletions), and as a result, some CNVs with amplifications might have been lost in this CNV survey.
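The weaker contrast for amplifications can be made concrete with a simple separation index computed from the reported X-chromosome CN estimates. This is a sketch, not part of CNIT: it assumes the "±" values are standard deviations and uses a standardized mean difference (difference in means divided by the root mean square of the two standard deviations) as a rough, assumed measure of how distinguishable two copy-number classes are.

```python
import math

# Reported CN estimates (mean, spread) for 1, 2 and 3 copies of the X chromosome.
CN_ESTIMATES = {1: (1.38, 0.42), 2: (2.08, 0.47), 3: (3.03, 1.49)}


def separation(copies_a, copies_b):
    """Standardized distance between two copy-number classes."""
    mean_a, sd_a = CN_ESTIMATES[copies_a]
    mean_b, sd_b = CN_ESTIMATES[copies_b]
    pooled_sd = math.sqrt((sd_a ** 2 + sd_b ** 2) / 2)
    return abs(mean_a - mean_b) / pooled_sd


loss_vs_normal = separation(1, 2)   # hemizygous deletion vs normal
gain_vs_normal = separation(3, 2)   # amplification vs normal

print(f"deletion vs normal:      {loss_vs_normal:.2f}")
print(f"amplification vs normal: {gain_vs_normal:.2f}")
```

Under these assumptions the deletion class sits about 1.57 pooled standard deviations from the normal class, whereas the amplification class sits only about 0.86 away, which is consistent with amplifications being harder to call reliably.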
In this large-scale survey of CNVs in a Han Chinese population in Taiwan, 97 CNVRs (42.17%) were unique; other CNVRs have previously been reported in Asian populations. In the current genomic variant database, most CNVs are rare (allele frequency < 3%). Fewer than 200 CNVRs had an allele frequency of > 5% and these CNVRs were detected using different microarray platforms such as Affymetrix 500 K GeneChips and CGH arrays. The probe distribution of the 100 K GeneChip data was quite different from the 500 K GeneChip or CGH array data. Therefore, it was difficult to precisely compare the results across different platforms. The reliability of the CNVRs found in this study was supported by stringent selection criteria, agreement with previous independent studies, qPCR analysis and CGH experiments.
Some CN-variable genes identified in this study are known to be correlated with diseases, but individuals in the current study were healthy. Disease models and environmental factors are critical to disease etiology, perhaps explaining why the individuals with gene CN changes were healthy. The well-developed CNIT algorithm for 100 K GeneChip data can be used in future studies to identify subtle chromosome abnormalities (http://www.csjfann.ibms.sinica.edu.tw/EAG/Program/CNIT/CNIT.htm).
CNVs have recently been recognized as an important structural variation in the human genome. In this study, 42.17% of the identified CNVRs were novel, indicating that the CNV loci of the human genome are still not fully understood and that there is much diversity among different ethnic populations. Most CNV allele frequencies were low, and only two CNVRs had a frequency greater than 10%. This observation is consistent with results reported previously by Jakobsson et al. The CNVs reported in this study are characteristic of Han Chinese populations and should be considered when genetic studies are conducted.
Data from 376 unrelated individuals were randomly collected from the Taiwanese Cell and Genome Bank; the raw microarray intensity data have been described in previous studies [11, 17]. Individual genotyping was performed by the National Genotyping Center (Academia Sinica, Taipei, Taiwan) using the Affymetrix 100 K GeneChip Human Mapping set (Affymetrix, Santa Clara, CA, USA) according to the manufacturer's instructions. The SNP call rate of the samples examined in this study was 98.04 ± 0.8%.
In addition, the intensity data from 100 K GeneChip microarrays of 270 individuals were downloaded from the HapMap project website (http://www.hapmap.org/); the CNV data for these individuals were collected previously. These individuals included 30 Caucasian trios, 30 African trios and 90 unrelated Asian individuals. The HapMap reference group, which included 210 multi-ethnic unrelated individuals, was used to determine the parameters of CNIT for SNP CN estimation. Forty offspring were used as a training group for the HMM method, and 20 additional offspring were used as a test group to evaluate the FPR and FNR.
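The partition of the HapMap samples follows from the trio design, as the short check below illustrates. The variable names are hypothetical; the only assumption beyond the text is that the 210 unrelated reference individuals are the trio parents plus the unrelated Asian individuals, with the 60 trio offspring split between the training and test groups.

```python
# Trio counts and unrelated individuals as stated in the text.
trios = {"Caucasian": 30, "African": 30}
unrelated_asian = 90

total_individuals = 3 * sum(trios.values()) + unrelated_asian  # two parents + one child per trio
offspring = sum(trios.values())                                # one child per trio
reference_group = total_individuals - offspring                # parents + unrelated Asians

training_group, test_group = 40, 20

print(total_individuals, reference_group, offspring)
```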
The CNIT algorithm showed good performance in CN estimation of SNPs and DNA segments, and well-characterized HapMap samples were used to determine the parameters for CNIT. In both single-point and multipoint CN estimations, results obtained from CNIT showed greater accuracy and reduced FPR and FNR compared with those obtained from other programs. When the smoothing procedure was not used, CNIT could detect small CN-altered regions that were missed with other programs. SNPs located in the reported CNV regions of the corresponding individuals were removed, such that the CN for most of the autosomal SNPs from the reference, training and test groups was two. The 100 K GeneChip consists of two chips, each with > 50,000 SNPs (58,960 for the XbaI chip and 57,244 for the HindIII chip). The reference group was used to process probe selection, estimate the coefficient of preferential amplification/hybridization, and construct a CN distribution for each SNP. After probe selection, 94,587 SNPs (mean intermarker distance = 30.9 kb) were used to represent CN dosage. Finally, EP and transition probabilities of the HMM method were determined using the training and test groups, and were used to identify true CN-altered regions and eliminate false-positive CN inference by considering contiguous SNPs.
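How an HMM uses contiguous SNPs to suppress isolated false-positive calls can be sketched with a small Viterbi decoder. This is a generic three-state illustration, not the CNIT implementation: the Gaussian emission model, the shared standard deviation, the transition probability `stay`, and the prior are all hypothetical values chosen for the example, whereas CNIT estimates its emission and transition probabilities from the training and test groups.

```python
import math

STATES = (1, 2, 3)  # copy-number states: deletion, normal, amplification


def log_gauss(x, mu, sd):
    """Log density of a normal distribution (assumed emission model)."""
    return -0.5 * ((x - mu) / sd) ** 2 - math.log(sd * math.sqrt(2 * math.pi))


def viterbi(cn_estimates, sd=0.45, stay=0.95, prior_normal=0.98):
    """Most likely copy-number state path for per-SNP CN estimates."""
    log_stay = math.log(stay)
    log_move = math.log((1 - stay) / (len(STATES) - 1))

    def log_trans(prev, cur):
        return log_stay if prev == cur else log_move

    # Initialisation: the normal state is strongly favoured a priori.
    score = {
        s: math.log(prior_normal if s == 2 else (1 - prior_normal) / 2)
        + log_gauss(cn_estimates[0], s, sd)
        for s in STATES
    }
    backpointers = []

    # Recursion: best predecessor for each state at each SNP.
    for x in cn_estimates[1:]:
        new_score, pointers = {}, {}
        for s in STATES:
            best_prev = max(STATES, key=lambda p: score[p] + log_trans(p, s))
            pointers[s] = best_prev
            new_score[s] = score[best_prev] + log_trans(best_prev, s) + log_gauss(x, s, sd)
        score = new_score
        backpointers.append(pointers)

    # Traceback from the best final state.
    state = max(STATES, key=lambda s: score[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]


# Hypothetical per-SNP CN estimates: a run of five low values and one isolated dip.
observed = [2.1, 1.9, 2.0, 1.1, 0.9, 1.2, 1.0, 0.8, 2.0, 2.1, 1.3, 2.0, 1.9]
path = viterbi(observed)
print(path)
```

With these assumed parameters the five contiguous low values are decoded as a deletion, while the single low value at 1.3 stays in the normal state, because one deviant SNP does not outweigh the cost of two state changes.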
Primer Express Software version 3.0 (Applied Biosystems, Foster City, CA, USA) was used to design PCR primers for the selected target CNVs. Quantitative PCR experiments were performed using the ABI PRISM 7900 Sequence Detector (Applied Biosystems). PCR reactions were performed using the Power SYBR-Green PCR reagent kit (Applied Biosystems) and each reaction mix (25 μl total) contained 2.5 ng genomic DNA for each CNV. qPCR comprised initial denaturation at 94°C for 3 minutes, followed by 40 cycles of denaturation at 94°C for 15 seconds and combined annealing and extension at 60°C for 60 seconds. The fluorescence signal was detected in real-time during the qPCR procedure. The primer pair for the long interspersed nuclear elements 1 (LINE1) sequence was used for normalization. The mean estimated CN was calculated from triplicate PCR reactions for each individual.
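One common way to turn LINE1-normalized qPCR readings into a copy number is the comparative Ct (ΔΔCt) calculation, sketched below. The text does not state which formula was used, so this is an assumption, as are the roughly 100% amplification efficiency, the use of a two-copy calibrator sample, the pairing of target and LINE1 replicates, and every Ct value shown.

```python
from statistics import mean


def estimate_copy_number(target_ct, line1_ct, calibrator_delta_ct, calibrator_copies=2):
    """Mean copy number over replicate reactions using the comparative Ct method.

    target_ct / line1_ct: Ct values of the CNV amplicon and the LINE1 control,
    paired by replicate. calibrator_delta_ct: (target Ct - LINE1 Ct) measured
    in a sample assumed to carry `calibrator_copies` copies.
    """
    estimates = []
    for t_ct, l_ct in zip(target_ct, line1_ct):
        delta_delta_ct = (t_ct - l_ct) - calibrator_delta_ct
        estimates.append(calibrator_copies * 2 ** (-delta_delta_ct))
    return mean(estimates)


# Hypothetical triplicate: the target amplifies one cycle later than in the
# two-copy calibrator, which corresponds to half the template, i.e. one copy.
cn = estimate_copy_number(
    target_ct=[26.0, 26.0, 26.0],
    line1_ct=[15.0, 15.0, 15.0],
    calibrator_delta_ct=10.0,
)
print(cn)
```

Averaging the per-replicate estimates, rather than averaging Ct values first, mirrors the statement that the mean estimated CN was calculated from triplicate reactions.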
The Agilent 244 K CGH array (Agilent Technologies, Palo Alto, CA, USA) contains 236,000 coding and non-coding sequences at a resolution of 6.4 kb. Amplified DNA (2 μg) was used for Cy5/Cy3 labeling according to the manufacturer's protocol. After array hybridization and scanning, CN-altered regions were identified using CGH analytics 3.4 software (Agilent Technologies) with a threshold z-score of 2.5.
After CN estimation for each SNP, CN-altered regions were predicted from the single-point results using the HMM package (http://www.cfar.umd.edu/~kanungo/software/software.html). The means of the SNP P values and CN estimates represent the statistical significance and CN of state-changed segments, respectively. The above data analyses were performed using SAS/STAT version 8 software (SAS Institute, Cary, NC, USA).
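The segment-level summary described above can be sketched as follows: contiguous SNPs assigned to the same non-normal state are grouped, and each group is reported with its mean CN and mean P value. The function name, the dictionary layout, and the example values are hypothetical; the original analysis was carried out in SAS with the cited HMM package.

```python
from itertools import groupby
from statistics import mean


def summarize_segments(states, cn_estimates, p_values, normal_state=2):
    """Summarize each run of contiguous SNPs that left the normal state."""
    segments = []
    start = 0
    for state, run in groupby(states):
        length = len(list(run))
        if state != normal_state:
            window = slice(start, start + length)
            segments.append({
                "first_snp": start,
                "last_snp": start + length - 1,
                "state": state,
                "mean_cn": mean(cn_estimates[window]),
                "mean_p": mean(p_values[window]),
            })
        start += length
    return segments


# Hypothetical decoded states with per-SNP CN estimates and P values.
states = [2, 2, 1, 1, 1, 2, 3, 3]
cn_estimates = [2.0, 2.1, 1.0, 1.2, 0.8, 1.9, 3.0, 3.2]
p_values = [0.5, 0.6, 0.01, 0.02, 0.03, 0.4, 0.04, 0.02]

segments = summarize_segments(states, cn_estimates, p_values)
for segment in segments:
    print(segment)
```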
CGH: comparative genomic hybridization
CNIT: copy number inferring tool
CNV: copy number variation
CNVR: copy number-variable region
ECYT3: erythrocytosis type 3
EGLN1: egl nine homolog 1
HMM: hidden Markov model
LINE1: long interspersed nuclear elements 1
SNP: single nucleotide polymorphism
We thank the National Genotyping Center and National Clinical Core, Academia Sinica, Taiwan, for providing DNA samples and genotyping. This project was supported by the National Science Council grant of Taiwan (NSC 95-2314-B-001-013) and the Institute of Biomedical Sciences, Academia Sinica, Taiwan. We also thank the two anonymous reviewers for their constructive suggestions, which have greatly improved this manuscript.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.