Genetic evidence supports linguistic affinity of Mlabri - a hunter-gatherer group in Thailand
© Xu et al. 2010
Received: 16 February 2010
Accepted: 19 March 2010
Published: 19 March 2010
The Mlabri are a group of nomadic hunter-gatherers inhabiting the rural highlands of Thailand. Little is known about the origins of the Mlabri and linguistic evidence suggests that the present-day Mlabri language most likely arose from Tin, a Khmuic language in the Austro-Asiatic language family. This study aims to examine whether the genetic affinity of the Mlabri is consistent with this linguistic relationship, and to further explore the origins of this enigmatic population.
We conducted a genome-wide analysis of genetic variation using more than fifty thousand single nucleotide polymorphisms (SNPs) typed in thirteen population samples from Thailand, including the Mlabri, Htin and neighboring populations of the Northern Highlands, speaking Austro-Asiatic, Tai-Kadai and Hmong-Mien languages. The Mlabri population showed higher LD and lower haplotype diversity when compared with its neighboring populations. Both model-free and Bayesian model-based clustering analyses indicated a close genetic relationship between the Mlabri and the Htin, a group speaking a Tin language.
Our results strongly suggested that the Mlabri share more recent common ancestry with the Htin. We thus provided, to our knowledge, the first genetic evidence that supports the linguistic affinity of Mlabri, and this association between linguistic and genetic classifications could reflect the same past population processes.
The Mlabri are a hill tribe in northern Thailand, inhabiting a dispersed area along the border with Laos [1, 2]. Today, they are a small population of nomadic hunter-gatherers, unusual in a region of almost entirely agricultural economies. The modern population size is estimated at around 300 individuals, with some estimates being as low as 100. The name Mlabri is a Thai/Lao alteration of the word Mrabri, which appears to derive from a Khmuic term for "people of the forest" - in Khmu, mra means "person" and bri "forest". They are also known locally as Phi Tong Luang or "spirits of the yellow leaves", apparently because they abandon their shelters when the leaves begin to turn yellow with the onset of the dry season.
Little is known about the origins of the Mlabri, and most evidence comes from linguistic studies. The Mlabri language is classified as a Khmuic language, a subgroup of the Mon-Khmer languages in the Austro-Asiatic language family. The available linguistic evidence suggests that the present-day Mlabri language most likely arose from Tin, a Khmuic language [2, 6]. However, so far there is no genetic evidence supporting this idea. A recent study suggested that the Mlabri were founded recently from an agricultural group, thus representing a typical example of cultural reversion. This work, although very interesting, was criticized for not including any of the populations neighboring the Mlabri, such as the Htin, Hmong, and northern Thai. As a result, these authors were unable to demonstrate any similarities in the genetic and linguistic affinity of the Mlabri, and so made little comment on the possible source population(s) from which the Mlabri originated.
In this study, we analyzed population samples from throughout northern Thailand, including the Mlabri as well as several neighboring groups, including the Htin, Hmong, Yao, and other populations speaking Austro-Asiatic and Tai-Kadai languages. Four HapMap population samples, representing Altaic, Sino-Tibetan, Indo-European and Niger-Congo language speakers, were also included in this study. We conducted a genome-wide analysis on these samples using 50K SNPs, to investigate the genetic affinity of the Mlabri, examine the concordance of genetic and linguistic affinities, and further explore probable origin(s) of this enigmatic hunter-gatherer group.
Since this is the first genome-wide genetic study of this enigmatic population, we calculated several population genetic parameters, including SNP diversity, haplotype diversity and linkage disequilibrium (LD).
The genetic characteristics obtained from the above analyses, such as significantly increased LD and extremely reduced haplotype diversity, are consistent with the view from a previous study that the Mlabri were recently founded from a very small number of individuals. The available linguistic evidence suggests that the present-day Mlabri language arose from a Khmuic language, most likely Tin [2, 6, 7]. To search for the group that gave rise to the founders of the Mlabri and to examine whether the genetic affinity is consistent with the linguistic affinity, we further investigated the genetic relationship between the Mlabri and other populations. The rationale is that the group with the closest genetic relationship to the Mlabri, if that relationship is also consistent with the linguistic relationship, is most likely the genetic and linguistic founder source.
However, the Htin showed signs of admixture in both STRUCTURE and frappe analyses (Figure 6A, B). This raised the concern that the close relationship between the Mlabri and the Htin might be confounded by immigrants from other populations, given that about half of the components of the Htin are also found in both Austro-Asiatic and Tai-Kadai populations at K > 4 in the STRUCTURE results (Figure 6A). We thus further investigated this potential confounding effect by reconstructing the phylogenetic relationships of the clusters inferred from STRUCTURE and frappe (referred to as the "component tree"). The rationale is that the component tree, given the statistical independence of the components, should reveal an evolutionary history that is less perturbed by recent gene flow and admixture than is a population phylogeny. At K = 8, both STRUCTURE and frappe identified a cluster predominant in the Htin, with each of the other seven clusters easily associated with a predominant linguistic or ethnic group. We therefore refer to the eight clusters (or components) by their representative linguistic or ethnic group as follows: Altaic/Sino-Tibetan, Hmong-Mien, Tai-Kadai, Austro-Asiatic, Mlabri, Htin, European and African. The component tree was reconstructed based on allele frequencies in each cluster inferred from the STRUCTURE analysis (Figure 7B). We found that the Mlabri-specific and Htin-specific components clustered tightly on the tree (supported by 100% of bootstrap replicates), strongly indicating once again that the Mlabri share a more recent ancestry with the Htin than with any other group in our sample.
In this study, we analyzed genome-wide SNP data on the Mlabri, as well as several neighboring populations and HapMap population samples. The Mlabri population shows several substantial differences from the other populations: significantly increased LD, extremely reduced haplotype diversity and small effective population size (29), all of which are consistent with the view that the Mlabri were recently founded from a very small number of individuals of an agricultural group but subsequently adopted their current hunting-gathering lifestyle, as proposed by a recent study based primarily on mtDNA and Y chromosome data. An alternative scenario could also explain these genetic characteristics: the Mlabri are an ancient hunter-gatherer group that has maintained its hunting-gathering lifestyle from the very beginning but experienced a severe bottleneck event in its history. However, the results from the clustering analyses do not favor this scenario. If the Mlabri were an ancient hunter-gatherer group, we would expect them to fall outside the clade of all Asian populations and close to the root of the Asian clade. In fact, the Mlabri fall inside the Asian clade, with the Austro-Asiatic group outside, on both the population tree (Figure 7A) and the component trees (Figure 7B, C), where no signal of admixture was found to have disturbed the tree topology.
Both model-free and model-based clustering analyses strongly suggest that the Mlabri share a degree of common ancestry with the Htin, a group speaking a Tin language. In this case -- as is the general rule in many human populations -- the genetic affinity of these populations is consistent with their linguistic affinity. This result, to our knowledge, is the first genetic evidence supporting the linguistic affinity of the Mlabri and Tin languages. Cavalli-Sforza and colleagues showed an apparent congruence between linguistic phyla and genetic clusters, and they proposed that this congruence indicates "considerable parallelism between genetic and linguistic evolution". Subsequent studies using diverse scales and methodologies have found variable degrees of association between linguistic and genetic classifications [17–22]. Some typical examples of exceptions are populations with language replacement [23–26] or recent admixture between divergent populations [27, 28]. However, human genetic and linguistic diversity have been proposed to be generally correlated, either through a direct link, whereby linguistic and genetic affiliations reflect the same past population processes, or an indirect one, where the evolution of the two types of diversity is independent but conditioned by the same geographic factors.
Hunting and gathering was presumably the subsistence strategy employed by human societies for more than two million years, until the end of the Mesolithic period. Contemporary hunter-gatherer groups are often thought to serve as models of an ancient lifestyle that was typical of human populations prior to the development of agriculture. However, there has been complex interaction between hunter-gatherers and non-hunter-gatherers for millennia. There are contemporary hunter-gatherer peoples who, after contact with other societies, continue their ways of life with very little external influence. There are also contemporary groups, usually identified as hunter-gatherers, that do not have a continuous history of hunting and gathering; in many cases their ancestors were agriculturalists and/or pastoralists who were pushed into marginal areas as a result of migrations, economic exploitation, and/or violent conflict. Our current data are not sufficient to distinguish the two scenarios, but if cultural reversion occurred in the history of the Mlabri, the Htin are most likely the source population from which the Mlabri genetically originated. The Htin samples in this study speak Mal, which represents only one of the two varieties (Mal and Prai) of the Tin language [31, 32]. It may be possible to determine which variety the Mlabri language originated from by comparing the genetic relationships between the Mlabri and populations speaking the two Tin varieties. Such evidence is indirect, however, and would be meaningful only under the assumptions that the genetic origin of the Mlabri was not earlier than the divergence of the two language varieties and that there was no language replacement.
In summary, our results strongly suggested that the Mlabri share more recent common ancestry with the Htin, a group speaking a Tin language. This result, to our knowledge, is the first genetic evidence supporting the linguistic affinity of the Mlabri and Tin languages. We proposed that, if cultural reversion occurred in the history of the Mlabri, the Htin are most likely the source population from which the Mlabri genetically originated.
Information of population samples.
Genotype data for 13 population samples from Thailand, generated using the Affymetrix Genechip Human Mapping 50K Xba array, were obtained from the Pan-Asian SNP Initiative. Detailed information about data filtration and data quality control has been described elsewhere. Genotypes of 60 YRI, 60 CEU, 45 CHB and 44 JPT samples were obtained from the International HapMap Project [34–36] (HapMap public release #23a, 2008-04-01). Most of the analyses in this study used the markers that were genotyped in both the PanAsia project and the HapMap project, comprising 55,561 autosomal SNPs shared by the 13 Thailand population samples and the 4 HapMap population samples.
Haplotypes of the 22 autosomes were inferred for each individual from its genotypes with fastPHASE version 1.2. "Population labels" were applied during the model fitting procedure to enhance accuracy. The number of haplotype clusters was set to 20, the number of random starts of the EM algorithm (-T) was set to 20, and the number of iterations of the EM algorithm (-C) was set to 50. This analysis was used to generate a "best guess" estimate of the true underlying patterns of haplotype structure. We ran fastPHASE on the 55,561 SNPs shared by the 17 populations, and only unrelated individuals were included.
Heterozygosity for each SNP (HSe) was calculated based on allele frequencies.
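The text does not give the estimator used for HSe. As a minimal sketch, assuming the usual expected heterozygosity (one minus the sum of squared allele frequencies) with Nei's small-sample correction n/(n - 1), the per-SNP calculation could look like the following. The function name and the `unbiased` switch are illustrative, not taken from the study.

```python
def expected_heterozygosity(allele_counts, unbiased=True):
    """Expected heterozygosity at one locus from allele counts.

    ASSUMPTION: the paper only says HSe is "based on allele frequencies";
    the formula 1 - sum(p_i^2) and the n/(n-1) correction are a common
    choice, not necessarily the one the authors used.
    """
    n = sum(allele_counts)  # number of sampled chromosomes
    if n < 2:
        raise ValueError("need at least two sampled chromosomes")
    h = 1.0 - sum((c / n) ** 2 for c in allele_counts)
    return h * n / (n - 1) if unbiased else h


# Example: a biallelic SNP typed on 36 chromosomes, 18 copies of each allele.
h_biased = expected_heterozygosity([18, 18], unbiased=False)
h_unbiased = expected_heterozygosity([18, 18])
```

Averaging this quantity over all SNPs gives one diversity value per population.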
To calculate heterozygosity for haplotypes (HHe), the genome was divided into 500-kb regions, with each region having roughly 14 SNPs. HHe was calculated for each region using haplotype frequencies. Considering the substantial variation in recombination rate across the human genome [39, 40], we adopted a sliding-window strategy in which the window was moved 100 kb at a time. For each population, HHe was averaged over all windows.
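The sliding-window averaging can be sketched as below. This is an illustration under stated assumptions: phased haplotypes are given as equal-length strings of allele codes for one chromosome, windows start at the first SNP position, windows without SNPs are skipped, and the same n/(n - 1) correction is applied. None of these details are specified in the text, and all function names are hypothetical.

```python
from collections import Counter


def haplotype_heterozygosity(haps):
    """Unbiased 1 - sum(f_i^2) over the distinct haplotypes in one window."""
    n = len(haps)
    freqs = [c / n for c in Counter(haps).values()]
    return (1.0 - sum(f * f for f in freqs)) * n / (n - 1)


def window_snp_indices(positions, size=500_000, step=100_000):
    """Yield SNP index lists for windows of `size` bp moved `step` bp at a time.

    ASSUMPTION: the first window starts at the first SNP position.
    """
    start = positions[0]
    while start <= positions[-1]:
        idx = [i for i, p in enumerate(positions) if start <= p < start + size]
        if idx:  # skip windows that contain no SNP
            yield idx
        start += step


def mean_window_hhe(haplotypes, positions, size=500_000, step=100_000):
    """Average haplotype heterozygosity over all sliding windows."""
    values = []
    for idx in window_snp_indices(positions, size, step):
        local = [tuple(h[i] for i in idx) for h in haplotypes]
        values.append(haplotype_heterozygosity(local))
    return sum(values) / len(values)
```

The linear scan inside `window_snp_indices` keeps the sketch short; for genome-scale data a two-pointer sweep over sorted positions would avoid rescanning every SNP for each window.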
The number of haplotypes was obtained by counting the number of haplotypes for a given window size, i.e. 500-kb or 1-Mb, respectively, for each population. The same sliding-window scheme as mentioned before was employed. Since this measurement could be affected by sample size, we sampled 36 chromosomes (equal to the sample size of Mlabri) without replacement in each population. Note that Mlabri has the smallest sample size in all the populations studied. For a population with sample size larger than 36 chromosomes, the sampling was repeated 100 times for each segment and the average of the number of haplotypes of all replications was taken as the number of haplotypes.
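The resampling step for one window can be sketched as follows. The defaults of 36 chromosomes and 100 replicates come from the text; the function name, the fixed random seed and the input representation (a list of hashable haplotypes for a single window) are assumptions made for illustration.

```python
import random


def mean_haplotype_count(window_haps, n_sample=36, n_reps=100, seed=0):
    """Average number of distinct haplotypes among `n_sample` chromosomes.

    Chromosomes are drawn without replacement; a population with exactly
    `n_sample` chromosomes is counted once, without resampling.
    """
    if len(window_haps) < n_sample:
        raise ValueError("population has fewer chromosomes than n_sample")
    if len(window_haps) == n_sample:
        return float(len(set(window_haps)))
    rng = random.Random(seed)  # fixed seed only to make the sketch reproducible
    total = 0
    for _ in range(n_reps):
        total += len(set(rng.sample(window_haps, n_sample)))
    return total / n_reps
```

Fixing the subsample at the size of the smallest sample makes the haplotype counts comparable across populations, since a larger sample would otherwise show more distinct haplotypes by chance alone.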
The cumulative proportion for a given number of haplotypes was obtained by estimating the proportion of sliding windows across the genome carrying that number of haplotypes or fewer.
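As a small illustration of this definition, the sketch below takes one haplotype count per window and returns the cumulative proportion for each threshold k = 1, ..., max_k. The function name and the toy input are hypothetical.

```python
def cumulative_proportion(window_counts, max_k):
    """Proportion of windows with at most k haplotypes, for k = 1..max_k."""
    total = len(window_counts)
    return [
        sum(1 for c in window_counts if c <= k) / total
        for k in range(1, max_k + 1)
    ]


# Toy example with four windows carrying 1, 2, 2 and 5 haplotypes.
curve = cumulative_proportion([1, 2, 2, 5], max_k=5)
```

A population with reduced haplotype diversity reaches a proportion of 1 at a smaller k than its neighbors.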
Linkage disequilibrium (LD) between SNPs was measured using r² following Hill and Weir and calculated from haplotype data.
Principal component analysis (PCA) was performed at individual level using EIGENSOFT version 2.0.
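For phased biallelic data, the standard two-locus definition is r² = D² / (p_A(1 - p_A) p_B(1 - p_B)), with D = p_AB - p_A p_B. The sketch below implements that textbook definition; it is an assumption that this matches the exact estimator used in the study, and the 0/1 allele coding and function name are illustrative.

```python
def r_squared(haplotypes, i, j):
    """r^2 between SNPs i and j from phased haplotypes coded as 0/1.

    Returns None when either SNP is monomorphic (r^2 is undefined).
    """
    n = len(haplotypes)
    p_a = sum(h[i] for h in haplotypes) / n          # frequency of allele 1 at SNP i
    p_b = sum(h[j] for h in haplotypes) / n          # frequency of allele 1 at SNP j
    p_ab = sum(h[i] * h[j] for h in haplotypes) / n  # frequency of the 1-1 haplotype
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        return None
    d = p_ab - p_a * p_b                             # disequilibrium coefficient D
    return d * d / denom


# Two SNPs in complete LD, then two SNPs in linkage equilibrium.
complete = r_squared([(0, 0), (0, 0), (1, 1), (1, 1)], 0, 1)
independent = r_squared([(0, 0), (0, 1), (1, 0), (1, 1)], 0, 1)
```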
We used an allele sharing distance (ASD) [9, 43] as a measure of genetic distance between individuals and a 454 × 454 inter-individual genetic distance matrix was generated according to genotypes of 55,561 autosomal SNPs.
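One common convention for ASD codes each genotype as the number of copies (0, 1 or 2) of a reference allele and scores each SNP by the number of alleles not shared, scaled to lie between 0 and 1. The sketch below follows that convention as an assumption; the cited definition [9, 43] may differ in scaling, and the handling of missing genotypes (coded `None` and skipped) is also illustrative.

```python
def allele_sharing_distance(g1, g2):
    """ASD between two individuals from genotypes coded 0/1/2 (None = missing).

    Per SNP the distance is |g1 - g2| / 2: 0 if both alleles are shared,
    0.5 if one is shared, 1 if none is shared. SNPs with a missing
    genotype in either individual are skipped.
    """
    total, n = 0, 0
    for a, b in zip(g1, g2):
        if a is None or b is None:
            continue
        total += abs(a - b)
        n += 1
    if n == 0:
        raise ValueError("no SNPs genotyped in both individuals")
    return total / (2 * n)


def asd_matrix(genotypes):
    """Symmetric inter-individual distance matrix with a zero diagonal."""
    n = len(genotypes)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            d = allele_sharing_distance(genotypes[i], genotypes[j])
            matrix[i][j] = matrix[j][i] = d
    return matrix
```

Applied to 454 individuals, `asd_matrix` would return a 454 × 454 nested list, the shape of input a neighbor-joining program expects.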
The tree of individuals was reconstructed from the ASD matrix using the Neighbor-Joining algorithm with the Molecular Evolutionary Genetics Analysis software package (MEGA version 4.0). Trees of populations as well as components were reconstructed using the maximum likelihood method with the CONTML program in the PHYLIP package.
Ancestry of each person was inferred using a Bayesian cluster analysis as implemented in the STRUCTURE program [12, 46]. We ran STRUCTURE from K = 2 to K = 18 and repeated each run 10 times for each K. All STRUCTURE runs used 20,000 iterations after a burn-in of length 30,000, with the admixture model and assuming that allele frequencies were correlated.
The program frappe implements a maximum likelihood method to infer the genetic ancestry of each individual. As in the STRUCTURE analysis, this analysis considers each person's genome as having originated from K ancestral, but unobserved, populations whose contributions are described by K coefficients that sum to 1 for each individual. The program was run for 10,000 iterations from K = 2 to K = 18 and repeated 10 times for each K.
We thank Dr. Mark Stoneking for his helpful discussion. This work was supported by grants from the National Outstanding Youth Science Foundation of China (30625016), National Science Foundation of China (30890034, 30971577), and 863 Program (2007AA02Z312). LJ was also supported by Shanghai Leading Academic Discipline Project (B111) and the Center for Evolutionary Biology. SX was also supported by Science and Technology Commission of Shanghai Municipality (09ZR1436400) and the Knowledge Innovation Program of Shanghai Institutes for Biological Sciences, Chinese Academy of Sciences (2008KIP311). SX gratefully acknowledges the support of SA-SIBS Scholarship Program and K.C. Wong Education Foundation, Hong Kong. The participants of the HUGO Pan-Asian SNP Consortium are arranged by surname alphabetically in the following.
Mahmood Ameen Abdulla,1 Ikhlak Ahmed,2 Anunchai Assawamakin,3,4 Jong Bhak,5 Samir K. Brahmachari,2 Gayvelline C. Calacal,6 Amit Chaurasia,2 Chien-Hsiun Chen,7 Jieming Chen,8 Yuan-Tsong Chen,7 Jiayou Chu,9 Eva Maria C. Cutiongco-de la Paz,10 Maria Corazon A. De Ungria,6 Frederick C. Delfin,6 Juli Edo,1 Suthat Fuchareon,3 Ho Ghang,5 Takashi Gojobori,11,12 Junsong Han,13 Sheng-Feng Ho,7 Boon Peng Hoh,14 Wei Huang,15 Hidetoshi Inoko,16 Pankaj Jha,2 Timothy A. Jinam,1 Li Jin,17,38 Jongsun Jung,18 Daoroong Kangwanpong,19 Jatupol Kampuansai,19 Giulia C. Kennedy,20,21 Preeti Khurana,22 Hyung-Lae Kim,18 Kwangjoong Kim,18 Sangsoo Kim,23 Woo-Yeon Kim,5 Kuchan Kimm,24 Ryosuke Kimura,25 Tomohiro Koike,11 Supasak Kulawonganunchai,4 Vikrant Kumar,8 Poh San Lai,26,27 Jong-Young Lee,18 Sunghoon Lee,5 Edison T. Liu,8 Partha P. Majumder,28 Kiran Kumar Mandapati,22 Sangkot Marzuki,29 Wayne Mitchell,30,31 Mitali Mukerji,2 Kenji Naritomi,32 Chumpol Ngamphiw,4 Norio Niikawa,40 Nao Nishida,25 Bermseok Oh,18 Sangho Oh,5 Jun Ohashi,25 Akira Oka,16 Rick Ong,8 Carmencita D. Padilla,10 Prasit Palittapongarnpim,33 Henry B. Perdigon,6 Maude Elvira Phipps,1,34 Eileen Png,8 Yoshiyuki Sakaki,35 Jazelyn M. Salvador,6 Yuliana Sandraling,29 Vinod Scaria,2 Mark Seielstad,8 Mohd Ros Sidek,14 Amit Sinha,2 Metawee Srikummool,19 Herawati Sudoyo,29 Sumio Sugano,37 Helena Suryadi,29 Yoshiyuki Suzuki,11 Kristina A. Tabbada,6 Adrian Tan,8 Katsushi Tokunaga,25 Sissades Tongsima,4 Lilian P. Villamor,6 Eric Wang,20,21 Ying Wang,15 Haifeng Wang,15 Jer-Yuarn Wu,7 Huasheng Xiao,13 Shuhua Xu,38 Jin Ok Yang,5 Yin Yao Shugart,39 Hyang-Sook Yoo,5 Wentao Yuan,15 Guoping Zhao,15 Bin Alwi Zilfalil,14 Indian Genome Variation Consortium2
1Department of Molecular Medicine, Faculty of Medicine, and the Department of Anthropology, Faculty of Arts and Social Sciences, University of Malaya, Kuala Lumpur, 50603, Malaysia. 2Institute of Genomics and Integrative Biology, Council for Scientific and Industrial Research, Mall Road, Delhi 110007, India. 3Mahidol University, Salaya Campus, 25/25 M. 3, Puttamonthon 4 Road, Puttamonthon, Nakornpathom 73170, Thailand. 4Biostatistics and Informatics Laboratory, Genome Institute, National Center for Genetic Engineering and Biotechnology, Thailand Science Park, Pathumtani 12120, Thailand. 5Korean BioInformation Center (KOBIC), Korea Research Institute of Bioscience and Biotechnology (KRIBB), 111 Gwahangno, Yuseong-gu, Deajeon 305-806, Korea. 6DNA Analysis Laboratory, Natural Sciences Research Institute, University of the Philippines, Diliman, Quezon City 1101, Philippines. 7Institute of Biomedical Sciences, Academia Sinica, 128 Sec 2 Academia Road Nangang, Taipei City 115, Taiwan. 8Genome Institute of Singapore, 60 Biopolis Street 02-01, 138672, Singapore. 9Institute of Medical Biology, Chinese Academy of Medical Science, Kunming, China. 10Institute of Human Genetics, National Institutes of Health, University of the Philippines Manila, 625 Pedro Gil Street, Ermita Manila 1000, Philippines. 11Center for Information Biology and DNA Data Bank of Japan, National Institute of Genetics, Research Organization of Information and Systems, 1111 Yata, Mishima, Shizuoka 411-8540, Japan. 12Biomedicinal Information Research Center, National Institute of Advanced Industrial Science and Technology, 2-42 Aomi, Koto-ku, Tokyo 135-0064, Japan. 13National Engineering Center for Biochip at Shanghai, 151 Li Bing Road, Shanghai 201203, China. 14Human Genome Center, School of Medical Sciences, Universiti Sains Malaysia, 16150 Kubang Kerian, Kelantan, Malaysia. 
15MOST-Shanghai Laboratory of Disease and Health Genomics, Chinese National Human Genome Center Shanghai, 250 Bi Bo Road, Shanghai 201203, China. 16Department of Molecular Life Science Division of Molecular Medical Science and Molecular Medicine, Tokai University School of Medicine, 143 Shimokasuya, Isehara-A Kanagawa-Pref A259-1193, Japan. 17State Key Laboratory of Genetic Engineering and MOE Key Laboratory of Contemporary Anthropology, School of Life Sciences, Fudan University, 220 Handan Road, Shanghai 200433, China. 18Korea National Institute of Health, 194, Tongil-Lo, Eunpyung-Gu, Seoul, 122-701, Korea. 19Department of Biology, Faculty of Science, Chiang Mai University, 239 Huay Kaew Road, Chiang Mai 50202, Thailand. 20Genomics Collaborations, Affymetrix, 3420 Central Expressway, Santa Clara, CA 95051, USA. 21Veracyte, 7000 Shoreline Court, Suite 250, South San Francisco, CA 94080, USA. 22The Centre for Genomic Applications (an IGIB-IMM Collaboration), 254 Ground Floor, Phase III Okhla Industrial Estate, New Delhi 110020, India. 23Soongsil University, Sangdo-5-dong 1-1, Dongjak-gu, Seoul 156-743, Korea. 24Eulji University College of Medicine, 143-5 Yong-du-dong Jung-gu, Dae-jeon City 301-832, Korea. 25Department of Human Genetics, Graduate School of Medicine, University of Tokyo, 7-3-1 Hongo, Bunkyo-ku, Tokyo 113-0033, Japan. 26Department of Paediatrics, Yong Loo Lin School of Medicine, National University of Singapore, National University Hospital, 5 Lower Kent Ridge Road, 119074, Singapore. 27Population Genetics Lab, Defence Medical and Environmental Research Institute, DSO National Laboratories, 27 Medical Drive, 117510, Singapore. 28Indian Statistical Institute (Kolkata) 203 Barrackpore Trunk Road, Kolkata 700108, India. 29Eijkman Institute for Molecular Biology, Jl. Diponegoro 69, Jakarta 10430, Indonesia. 30Informatics Experimental Therapeutic Centre, 31 Biopolis Way, 03-01 Nanos, 138669, Singapore. 
31Division of Information Sciences, School of Computer Engineering, Nanyang Technological University, 50 Nanyang Avenue, 639798, Singapore. 32Department of Medical Genetics, University of the Ryukyus Faculty of Medicine, Nishihara, 207 Uehara, Okinawa 903-0215, Japan. 33National Science and Technology Development Agency, 111 Thailand Science Park, Pathumtani 12120, Thailand. 34Monash University (Sunway Campus), Jalan Lagoon Selatan, 46150 Bandar Sunway, Selangor, Malaysia. 35RIKEN Genomic Sciences Center, W502, 1-7-22 Suehiro-cho, Tsurumi-ku, Yokohama 230-0045, Japan. 36Department of Biochemistry, University of Hong Kong, 3/F Laboratory Block, Faculty of Medicine Building, 21 Sasson Road, Pokfulam, Hong Kong. 37Laboratory of Functional Genomics, Department of Medical Genome Sciences Graduate School of Frontier Sciences, University of Tokyo (Shirokanedai Laboratory), 4-6-1 Shirokanedai, Minato-ku, Tokyo 108-8639, Japan. 38Chinese Academy of Sciences-Max Planck Society Partner Institute for Computational Biology, Shanghai Institutes of Biological Sciences, Chinese Academy of Sciences, 320 Yueyang Rd., Shanghai 200031, China. 39Genomic Research Branch, National Institute of Mental Health, National Institutes of Health, 6001 Executive Boulevard, Bethesda, MD 20892 USA. 40Research Institute of Personalized Health Sciences, Health Sciences University of Hokkaido, Tobetsu 061-0293, Japan.
This article is published under license to BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.