Private haplotypes can reveal local adaptation

Background:
Genome-wide scans for regions that show deviating patterns of genetic variation have become a common approach for finding genes targeted by selection. Several genomic patterns have been used for this purpose, including deviations in haplotype homozygosity, frequency spectra and genetic differentiation between populations.

Results:
We describe a novel approach based on the Maximum Frequency of Private Haplotypes (MFPH) to search for signals of recent population-specific selection. The MFPH statistic is straightforward to compute for phased SNP and sequence data. Using both simulated and empirical data, we show that MFPH can be a powerful statistic for detecting recent population-specific selection, that it performs at the same level as other commonly used summary statistics (e.g. F_ST, iHS and XP-EHH), and that MFPH in some cases captures signals of selection that are missed by other statistics. For instance, in the Maasai, MFPH reveals a strong signal of selection in a region containing the genes DOCK3, MAPKAPK3 and CISH, where the other investigated statistics fail to pick up a clear signal. This region has been suggested to affect height in many populations based on phenotype-genotype association studies. It has specifically been suggested to be targeted by selection in Pygmy groups, which are at the opposite end of the human height spectrum from the Maasai.

Conclusions:
From the analysis of both simulated and publicly available empirical data, we show that MFPH is a summary statistic that can provide further insight into population-specific adaptation.

where T_1 and T_2 are described in Weir (1996). In the case of "F_ST-SNP", m = number of SNPs in the window and v = 2, while in "F_ST-haplotypes", m = 1 (one window) and v = number of distinct haplotype alleles observed in the window.

Methods:
We scrambled the known phase of the simulated data by binomial random sampling of the SNP alleles of each individual. We then phased the data using fastPHASE (Scheet and Stephens 2006) with the following command:

    fastPHASE_Linux -T10 -u(labels.txt) -o(output) file.inp

Phasing was performed for each subpopulation separately, for data simulated both with and without selection, and for different values of ρ. We counted the cases where the window containing the inserted mutation (under selection) had an MFPH value greater than the mean of the neutral cases plus 2 standard deviations.
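The two computational steps described above (hiding the known phase, then counting windows that exceed the neutral threshold) can be sketched as follows. This is a minimal illustration, not the code used in the study: the function names are hypothetical, the phase is hidden by swapping the two alleles at each site with probability 0.5, and the sample standard deviation is assumed because the text does not say which estimator was used.

```python
import random
import statistics


def scramble_phase(hap_a, hap_b, rng, p_swap=0.5):
    """Hide the known phase of one individual.

    At every site the two alleles are swapped with probability p_swap
    (assumed to be 0.5). Swapping a homozygous site changes nothing,
    so genotypes are always preserved.
    """
    out_a, out_b = [], []
    for a, b in zip(hap_a, hap_b):
        if rng.random() < p_swap:
            a, b = b, a
        out_a.append(a)
        out_b.append(b)
    return out_a, out_b


def count_significant(selected_mfph, neutral_mfph, n_sd=2.0):
    """Count selection replicates with MFPH above neutral mean + n_sd * SD.

    Assumption: the sample standard deviation (n - 1 denominator) is used.
    """
    threshold = statistics.mean(neutral_mfph) + n_sd * statistics.stdev(neutral_mfph)
    return sum(1 for value in selected_mfph if value > threshold)


if __name__ == "__main__":
    rng = random.Random(1)
    print(scramble_phase([0, 1, 0, 1], [1, 1, 0, 0], rng))
    print(count_significant([0.9, 0.2, 0.5], [0.1, 0.2, 0.3]))
```

In this sketch `selected_mfph` and `neutral_mfph` each hold one MFPH value per simulation replicate, taken from the window containing the inserted mutation.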

Results:
In the simulations without selection (the neutral case), we observe around 5% false positives whether we use the known phase or the statistically phased data (Table S3). For the selection case, we see a substantial overlap of significant MFPH values between the rephased data and the known-phase data. We note that there are a few more significant cases in the rephased data than in the known-phase data. This may be due to phasing the data within subpopulations. We also note that increasing the recombination rate has essentially no effect on the difference in the number of significant cases between the known-phase and the statistically phased data.
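The overlap between the two data sets can be tabulated per replicate, as in the following minimal sketch. It assumes one significance call per simulation replicate for each data set, in the same replicate order; the function name and dictionary keys are illustrative only.

```python
def overlap_counts(known_sig, rephased_sig):
    """Cross-tabulate per-replicate significance calls from the two data sets."""
    if len(known_sig) != len(rephased_sig):
        raise ValueError("both lists must cover the same replicates")
    counts = {"both": 0, "known_phase_only": 0, "rephased_only": 0, "neither": 0}
    for known, rephased in zip(known_sig, rephased_sig):
        if known and rephased:
            counts["both"] += 1
        elif known:
            counts["known_phase_only"] += 1
        elif rephased:
            counts["rephased_only"] += 1
        else:
            counts["neither"] += 1
    return counts


if __name__ == "__main__":
    print(overlap_counts([True, True, False, False], [True, False, True, False]))
```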

Methods:
We used the recombination maps available from 1000 Genomes (based on the 2.5M SNP data in CEU and MKK). The CEU and MKK maps show very similar patterns of recombination. However, to investigate whether a difference in recombination rate between the focal population and the reference population would lead to inflated MFPH values, we calculated the correlation between MFPH (computed for 415 kbp windows) and the difference in genetic distance (in cM) between CEU and MKK (for the same 415 kbp windows). A complication when trying to disentangle such an effect from the already noted negative correlation between recombination rate and MFPH is whether a larger difference in recombination rate between the two populations is expected in high-recombination regions. If so, the difference should be scaled by dividing by the mean recombination rate in the two populations; if not, the unscaled difference should be studied. Hence, for each bp-defined window we calculated the genetic distance in cM based on the genetic map from i) the focal population and ii) the reference population. Referring to the former as d_f and the latter as d_r, we calculated the correlation of MFPH with both d_f - d_r and (d_f - d_r)/(0.5(d_f + d_r)). This computation was performed with CEU as the focal population as well as with MKK as the focal population. The results are shown in Tables S4 and S5. The choice of d_f - d_r or (d_f - d_r)/(0.5(d_f + d_r)) has little effect, and the correlation between MFPH and the difference in genetic distance is small (with either CEU or MKK as the focal population).
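The two difference measures and their correlation with MFPH can be sketched as follows. This is an illustration under stated assumptions rather than the analysis code: the text does not specify the type of correlation, so Pearson's coefficient is assumed here, windows with zero mean genetic distance are given a scaled difference of 0, and all function and variable names are hypothetical.

```python
import math


def distance_differences(d_focal, d_ref):
    """Return the raw (d_f - d_r) and scaled (d_f - d_r)/(0.5(d_f + d_r)) differences."""
    raw, scaled = [], []
    for d_f, d_r in zip(d_focal, d_ref):
        diff = d_f - d_r
        mean = 0.5 * (d_f + d_r)
        raw.append(diff)
        # Assumption: a window with zero mean genetic distance gets a scaled value of 0.
        scaled.append(diff / mean if mean > 0 else 0.0)
    return raw, scaled


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


if __name__ == "__main__":
    # Toy values for three windows, for illustration only.
    mfph = [0.10, 0.25, 0.15]
    raw, scaled = distance_differences([1.0, 2.0, 3.0], [1.0, 1.0, 1.0])
    print(pearson(mfph, raw), pearson(mfph, scaled))
```

Each list holds one value per 415 kbp window, in the same window order as the MFPH values.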

Supplementary Figures & Tables:
Figure S1: Allele trajectories for the deterministic mutation at site 50,001 in population 3. The parameter settings were ρ=0.001, m=1, θ=0.001, t_m=100, N=500. A: Simulations without selection (G=0). B: Simulations with selection (G=150). Note that there is no selection for the deterministic mutation in populations 1 and 2.

Figure S2: Distribution of MFPH values in neutral simulations. "no mutation": MFPH values for 10,000 simulations in population 3 with no mutation occurring at t_m; "population 3": MFPH values for 100 simulations in population 3 conditioning on a mutation occurring at t_m; "population 1": MFPH values in population 1 for the same simulations as for "population 3". The window considered corresponds to 5 kb from position 47.5 kb; the mutation position is 50,001 bp. ρ=0.001, m=1, θ=0.001, t_m=100, t_s=50, N=500 and G=0.

Figure S3: Influence of parameters on MFPH calculated with 1 kb and 10 kb window sizes with default parameters G=150 (G=0 for neutral simulations), ρ=0.001, m=1, θ=0.001, t_s=50, t_m=100, N=500. Mean MFPH over 100 simulations was calculated on a window containing site 50,001. A to D: window size = 1 kb. E to H: window size = 10 kb.

Figure S4: Influence of parameters on Tajima's D and Fay and Wu's H computed on simulated data. Mean values over 100 simulations were computed for a 5 kb window surrounding site 50,001 with default parameters G=150 (G=0 for neutral simulations), ρ=0.001, m=1, θ=0.001, t_s=50, t_m=100, N=500. Red lines represent simulations with selection and blue lines simulations without selection. A to D: Fay and Wu's H. E to H: Tajima's D.

Figure S5: MFPH on HapMap III phased data for chromosomes 1-22. Odd-numbered chromosomes are represented by light blue and even-numbered chromosomes by dark blue. The size of the window was set to 200 SNPs with a 1 SNP step length between windows. The genomic positions of the genes LCT, EDAR (both on chromosome 2) and ADH1B (on chromosome 4) are indicated. The star marks the region of the DOCK3 gene. MFPH is calculated in one population (with sample size 34 chromosomes or 17 individuals) compared to the merged two others (34 individuals).

Figure S6: MFPH on HapMap III phased data for chromosomes 1-22. Odd-numbered chromosomes are represented by light blue and even-numbered chromosomes by dark blue. The size of the window was set to 415 kbp with a 1 SNP step length between windows. The genomic positions of the genes LCT, EDAR (both on chromosome 2) and ADH1B (on chromosome 4) are indicated. The star marks the region of the DOCK3 gene. MFPH is calculated in one population (with sample size 34 chromosomes or 17 individuals) compared to the merged two others (34 individuals).

Figure S13: MFPH on HapMap III phased data for chromosome 2 taking all populations into account. A-C is replicated from figure 5 in the main text for comparison. Window size is 200 SNPs with a 1 SNP step length between windows. The genomic positions of the genes LCT and EDAR are indicated. In D-F one focal population is compared to all HapMap III populations (instead of only two groups as in A-C). Note that the sample size is lower (9 individuals) in D-F. The selection signal around the LCT gene is lost in CEU and the selection signal around the EDAR gene is lost in the JPT+CHB populations.

Figure S14: The top MFPH windows with XP-EHH, iHS and F_ST-SNP after excluding chromosome 2 in MKK. For XP-EHH, the reference population was the combined CEU and JPT+CHB. F_ST values were calculated for each SNP using all three populations and with negative values set to zero. Transparent points are values less than two standard deviations from the genome-wide mean. Non-transparent points are values that are more than two standard deviations from the mean. Horizontal dotted bars give the genome-wide mean value of the statistics (standard deviation in dotted vertical bars) while continuous bars show the local mean. Horizontal bars in yellow show the gene positions. A) Region around 51 Mb on chromosome 3. B) Region around 101 Mb on chromosome 3.

Figure S15: The top MFPH windows with XP-EHH, iHS and F_ST-SNP after excluding chromosome 2 in CEU. For XP-EHH, the reference population was the combined MKK and JPT+CHB. F_ST values were calculated for each SNP using all three populations and with negative values set to zero. Transparent points are values less than two standard deviations from the genome-wide mean. Non-transparent points are values that are more than two standard deviations from the mean. Horizontal dotted bars give the genome-wide mean value of the statistics (standard deviation in dotted vertical bars) while continuous bars show the local mean. Horizontal bars in yellow show the gene positions.

Figure S16: Visual genotypes of SNPs in the 51 Mb region of chromosome 3 covering the MAPKAPK3, CISH and DOCK3 genes in three pygmy populations (Baka, Bakola, Bedzan; Lachance et al., 2012) and in MKK (Drmanac et al., 2010). The region contains a total of 1619 SNPs; we show here SNPs with a minimum allele frequency of 30% (312 SNPs). Each line represents a SNP and each column an individual. 1: 50.6 Mb to 51 Mb. 2: 51 Mb to 51.4 Mb. MKK_4 is different from the 3 other MKK individuals and Bedzan_1 is different from the 4 other pygmy individuals. The latter is likely to be a consequence of the Bedzan population being the most admixed pygmy population (Verdu et al., 2009).
Individual references in the databases (Assembly Pipeline version 1.10 and CGA Tools 1.4):

Table S2: Frequency of the most frequent allele defined in a window of size 5 kb surrounding site 50,001 in population 3. The first ("Total") to the fifth ("f(non-private)") columns refer to simulations (among 100 simulations) for which the value of MFPH was at least two standard deviations higher than the mean (mean and standard deviation were calculated on neutral simulations with G=0, ρ=0.001, m=1, θ=0.001, t_s=50, t_m=100, N=500). The sixth to the tenth columns refer to simulations where the value of MFPH was less than two standard deviations higher than the mean. "Total" is the total number of simulations. "Private" is the number of simulations (among the "total") where the most frequent haplotype only existed in population 3; "f(private)" is the mean (population) frequency of these haplotypes. "Non-private" is the number of simulations (among the "total") where the most frequent haplotype in population 3 existed in at least one of populations 1 and 2, and "f(non-private)" is the mean frequency of these haplotypes in population 3. Note that "unique" is used instead of "private" since, in the main text, the latter was defined relative to a sample while "unique" is relative to the whole (sub)population.

Table S3: Number of simulated cases out of 100 that show a significant MFPH value (greater than the mean of the neutral case plus 2 SD). "Rephased only" refers to the number of cases that were significant only in the rephased data; "Known phase only" refers to the number of cases that were significant only in the original known-phase data.