### Similarity Measures

Let
and
be the marginal similarities of the *i*
^{th} and *j*
^{th} subjects at genes *G*
_{1} and *G*
_{2}, respectively. They can be obtained based on unphased multi-marker genotypes or statistically inferred haplotypes, and they can be scaled from 0 to 1. Here we list some commonly used similarity measures, which can be traced back to [15, 16].

#### A. Diplotype perspective

A.1. Similarity measure based on identity-by-state (IBS) allele sharing (referred to as 'IBS'):

where *L* is the number of loci considered in *G*
_{
k
};
and
are respectively the genotypes of the *i*
^{th} and *j*
^{th} subjects at the *l*th locus in *G*
_{
k
};
is the number of alleles shared in common for the *i*
^{th} and *j*
^{th} subjects at the *l*th locus in *G*
_{
k
}, which has possible values of 0, 1, and 2.

A.2. Similarity measure based on IBS inversely weighted by genotype frequencies (referred to as 'W-IBS'):

where
, and
is the frequency of genotype
. The implication of this weight is that subjects sharing rare alleles may have more similar genomes than do subjects sharing common alleles.

#### Joint Similarity Regarding Two Genes

A similarity measure accounting for the joint association of genes

*G*
_{1} and

*G*
_{2} for the

*i*
^{th} and

*j*
^{th} subjects is

where
ranges from 0 to 1, too. The joint similarity (
) will be high if both of the two marginal similarities (
and
) are high. That is, with respect to the two genes, the *i*
^{th} and *j*
^{th} subjects will be regarded as 'similar' if they are similar in both genes.

#### B. Haplotype perspective

B.1. Similarity based on the counting measure for haplotypes (referred to as 'COUNT'):

Let

*h*
_{
i
} and

*h*
_{
j
} be the

*i*
^{th} and

*j*
^{th} categories of haplotypes in a gene,

and

are the alleles at the

*l*th locus on

*h*
_{
i
} and

*h*
_{
j
}, respectively. The similarity based on the counting measure for haplotypes is

where
is 1 if the alleles at the *l*th locus match for the *i*
^{th} and *j*
^{th} haplotypes.

B.2. Similarity based on the matching measure for haplotypes (referred to as 'MATCH'):

Let

*h*
_{
i
} and

*h*
_{
j
} be the

*i*
^{th} and

*j*
^{th} categories of haplotypes in a gene, then the similarity based on the matching measure for haplotypes is

where *s*(*h*
_{
i
}, *h*
_{
j
}) is 1 only when *all* alleles match for the *i*
^{th} and *j*
^{th} haplotypes, otherwise *s*(*h*
_{
i
}, *h*
_{
j
}) is 0.

#### Joint Similarity Regarding Two Genes

Let

be the

*u*
^{th} possible diplotype (i.e., the pair of haplotypes a subject possesses) in

*G*
_{
k
} of the

*i*
^{th} subject, where

, and where

is the number of possible diplotypes in

*G*
_{
k
} for the

*i*
^{th} subject.

is the posterior probability that the

*i*
^{th} subject has the

*u*
^{th} possible diplotype in

*G*
_{
k
}, given the unphased genotypes (

).

can be inferred by the expectation-maximization (EM) algorithm [

17]. Then a similarity measure accounting for the joint association of genes

*G*
_{1} and

*G*
_{2} for the

*i*
^{th} and

*j*
^{th} subjects is

where
and
can be obtained based on the counting measure or the matching measure.
ranges from 0 to 1, too.

### Simulation Study

Simulation studies were conducted to evaluate the performance of our method. We extended the simulation scheme of Li et al. [18] to two chromosomal regions. In each region, 4,000 haplotypes across 300 kb were generated using the coalescent-based program ms [19]. The effective population size was set at 10,000, the recombination rate per base pair (bp) per generation was set at 10^{-9}, and 300 SNPs were simulated in each region. For the human genome, recombination occurs at an average rate of about 10^{-8} per bp per generation [20]. Our recombination rate, 10^{-9} per bp per generation, is the low end of the recombination rates in the human genome [18], representing a stronger linkage disequilibrium (LD). We chose this rate because multi-marker approaches are primarily designed for strong-LD regions. In each chromosomal region, 2,000 diplotypes were generated by randomly pairing the 4,000 haplotypes. Then the 2,000 diplotypes of the first region were randomly paired with the 2,000 diplotypes of the second region, to form 2,000 subjects. In this way, we generated 300 datasets.

We then considered nine disease models listed in Additional file 1. Additional file 1 lists the causal allele frequencies, the penetrance values of two-locus genotypes, and the marginal penetrance values of one-locus genotypes, for all disease models. Model 0 was used to evaluate Type-I error rates, while the other eight models were used to evaluate powers. Models 1-6 exhibit interactions in the absence of main effects when genotypes conform to Hardy-Weinberg equilibrium. We used these six disease models because they further challenged the ability of our method to discover the joint associations (or 'interactions' in this situation) of gene pairs. Models 7 and 8 exhibit both interactions and main effects. Model 7 is the *jointly dominant-dominant model*, which requires at least one copy of the disease allele from both loci to be affected [21, 22]. Model 8 has the same penetrance table with Model 3, but has different causal allele frequencies. We deliberately let the causal allele frequency of one locus be smaller than that of another locus.

For each dataset, we first randomly selected two SNPs (each from among 300 SNPs in a region) with similar MAFs to those of the causal SNPs (the tolerable difference was set to be 0.02), pretending them as the two causal SNPs. We then used the *H-clust* method [23, 24] to choose tag SNPs with a subset formed by 200 subjects randomly drawn from the pool of 2,000 subjects. Tag SNPs were chosen with quality (MAF > 0.1) and correlation (the cut-off value for finding clusters was set to be 0.85). In each repetition, cases and controls were sampled with replacement from the pool of 2,000 subjects, where case/control status was assigned according to the genotypes of the two causal SNPs. After generating the phenotypes, the genotypes of the causal SNPs were removed from our datasets. Each chromosomal region was formed by eight SNPs - four to the left and four to the right of every causal SNP.

We evaluated the performance of our method with the matching measure ('MATCH') and the counting measure ('COUNT') of haplotypes. We also used two genotype similarity measures: 'IBS' and 'W-IBS'. We compared these with the *HapForest* approach [13]. *HapForest* is based on a tree structure, and is naturally suitable for analyzing interactions. Following the instructions of *HapForest*, we first invoked *SNPHAP* [25] to estimate the haplotype frequencies for each individual. Then *HapForest* was used to identify haplotypes and haplotype-haplotype interactions in association with the disease. This method suggests potential epistasis among significant haplotypes. For *HapForest*, a rejection of null hypothesis was defined as the identification of at least one significant haplotype from any of the two chromosomal regions.

The Pearson's *χ*
^{2} test was also performed for comparison, in which the joint haplotype distributions of the two chromosomal regions were compared between cases and controls. Rather than using the asymptotic *χ*
^{2} distribution, we randomly assigned the disease status in each permutation and determined the *P* value of observed *χ*
^{2} statistics. To calculate haplotype similarities from unphased multi-marker genotypes, we first inferred haplotype phases by the EM algorithm, using the function of 'haplo.em' in the 'haplo.stats' package [17]. The obtained posteriors were then treated as weights, and all possible haplotype pairs were considered with their probabilities (see equation (2)). All the haplotypes with frequencies less than 0.01 are considered to be rare haplotypes. To avoid possible genotyping errors, we follow Sha et al. [26] to merge each rare haplotype with its most similar common haplotype (see the modified EM algorithm proposed by Sha et al. [26]). For example, Haplotype A (1-1-1-2-1-1-1-1) is considered to be a rare haplotype because its frequency is less than 0.01. Haplotypes C (1-1-1-1-1-1-1-1) and F (1-1-1-2-2-1-1-1) are the most similar haplotypes to Haplotype A (both with a similarity of 0.875 by using the counting measure), and their haplotype frequencies are 0.2 and 0.1, respectively. We merge Haplotype A with Haplotype C, the most similar haplotype with the highest frequency. We then update the haplotype data by replacing Haplotype A with Haplotype C.

We also compared our methods with the tests for SNP × SNP epistasis by using case-control data or case-only data (with the --fast-epistasis command implemented by PLINK-1.07) [27]), hereafter referred to as 'CS-CN' and 'CS', respectively. In our simulation, each chromosomal region was formed by eight SNPs, and there were 64 tests for SNP × SNP epistasis. We recorded the minimum *P* value (*P*
_{min}) from among all the 64 *P* values, and then adjusted this *P*
_{min} on the basis of Sidak correction [28], with an effective number of tests, *M*
_{
eff
}. That is, we adjusted the minimum *P* value (*P*
_{min}) by
.

We then evaluated the validity and power of the eight tests with the 300 datasets. For each dataset, we recorded the *P* values of 50 repetitions (so there were 15,000 *P* values in total); in each repetition, *P* values were obtained with 1,000 permutations. Given a significance level, the type I error rate (if under Model 0) or power (if under Models 1-8) was the proportion of the number of *P* values smaller than the significance level to the total number of *P* values.

For CS-CN and CS, the *P* value used was
. The effective number of tests (*M*
_{
eff
}) was estimated by the eigenvalue-based approach [29, 30]. For each subject, we had 8+8 genotype coding values (0, 1, or 2), and 64 pair-wise products of genotype coding values, one from a SNP in region 1 and another from a SNP in region 2. Based on *n* subjects, we obtained a 64 × 64 correlation matrix for these 64 pair-wise products of genotype coding values. Then the eigenvalues of this correlation matrix were calculated to estimate the effective number of tests (see [29, 30], or see [31] for a nice review).