Modeling the effect of a genetic factor for a complex trait in a simulated population

Genetic Analysis Workshop 14 simulated data were analyzed with MASC (marker association segregation chi-squares), in which we implemented a bootstrap procedure to provide variation intervals for the parameter estimates. We model here the effect of a genetic factor, S, for Kofendrerd Personality Disorder in the region of the marker C03R0281 for the Aipotu population. The goodness of fit of several two-allele, single-locus genetic models was tested. The data are not compatible with a direct effect of any of the single-nucleotide polymorphisms (SNPs) typed in the region (SNPs 16, 17, 18, and 19 of packet 153). We therefore conclude that the functional polymorphism has not been typed and is in linkage disequilibrium with the four studied SNPs. We obtained very large variation intervals for both the disease allele frequency and the degree of dominance. The uncertainty of the model parameters can be explained, first, by the method used, which models marginal effects whereas the disease is due to complex interactions; second, by the presence of different sub-criteria used for the diagnosis that are not determined by S in the same way; and third, by the fact that the segregation of the disease in the families was not taken into account. However, we could not find any model that could explain the familial segregation of the trait, namely the higher proportion of affected parents than affected sibs.


Background
The aim of this work is to study and model the marginal effect of one susceptibility locus involved in the determinism of Kofendrerd Personality Disorder (KPD) in the Aipotu population. The presence of a susceptibility locus closely linked to the marker C03R0281 was shown by the existence of both strong association and genetic linkage between this marker and the trait [1,2]. Before modeling the marginal effect of this factor, we searched for the replicate that best represented this effect. The modeling (estimation of the allele frequency and marginal penetrances) is carried out with the MASC (marker association segregation chi-squares) method [3], using the information provided by the marker C03R0281, denoted M hereafter. The variation intervals for the parameter estimates are obtained through a bootstrap procedure that we incorporated in the MASC program.

Selection of the best replicate
We want to select the replicates that best represent the distributions in the region of marker M (C03R0281) in terms of both association and linkage. Estimating the parameters and their variation intervals in such a sample is then equivalent to evaluating them in the whole set of replicates. In the pooled sample set (10,000 families), we consider one index case (an affected individual) per family and that individual's genotype for the marker M. For each index case, we also consider the identity-by-descent (IBD) sharing for M with one randomly chosen affected sib (families of the Aipotu population were selected as having at least two sibs affected with KPD). In order to have reliable IBD sharing for each sib pair, we ordered one SNP packet (153) surrounding marker M. The IBD sharing is obtained by maximum likelihood estimation using the information provided by the whole set of markers in the M region on chromosome 3. Figure 1 gives the genotype distribution of the 10,000 index cases for the marker M. The two alleles of M, M1 and M2, have frequencies of 0.56 and 0.44, respectively. For each marker genotype, the index cases are classified according to their IBD sharing with one affected sib (stratified IBD distribution).
Because we are looking for the model that best explains these four independent distributions (one genotype distribution and three stratified IBD distributions), we first determined the replicates that best reflect these distributions. We computed the distance of each of the 100 replicates to the pooled sample as a chi-square statistic, equal to the sum of the four independent chi-squares obtained by comparing the distributions observed in the replicate with those in the pooled sample.
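The distance between a replicate and the pooled sample can be sketched as follows. This is a minimal illustration, not the authors' program: it assumes that each of the four chi-squares is a Pearson goodness-of-fit statistic with the pooled proportions as reference (the text does not say which form of comparison was used). The function names, the dictionary keys, and all counts are hypothetical and serve only as an example.

```python
from typing import Dict, Sequence


def chi_square(observed: Sequence[float], reference: Sequence[float]) -> float:
    """Pearson chi-square of observed counts against reference proportions.

    Assumption: the reference (pooled) counts are rescaled to the size of
    the observed sample to obtain the expected counts.
    """
    n_obs = sum(observed)
    n_ref = sum(reference)
    total = 0.0
    for obs, ref in zip(observed, reference):
        expected = n_obs * ref / n_ref
        if expected > 0:
            total += (obs - expected) ** 2 / expected
    return total


def distance_to_pooled(
    replicate: Dict[str, Sequence[float]],
    pooled: Dict[str, Sequence[float]],
) -> float:
    """Sum the chi-squares over the four distributions (genotype + 3 IBD)."""
    return sum(chi_square(replicate[key], pooled[key]) for key in pooled)


# Hypothetical counts, invented for illustration only.
pooled_example = {
    "genotype": [3600, 4800, 1600],    # M1M1, M1M2, M2M2
    "ibd_M1M1": [540, 1800, 1260],     # IBD = 0, 1, 2
    "ibd_M1M2": [960, 2400, 1440],
    "ibd_M2M2": [400, 800, 400],
}
replicate_example = {
    "genotype": [40, 45, 15],
    "ibd_M1M1": [5, 20, 15],
    "ibd_M1M2": [10, 22, 13],
    "ibd_M2M2": [4, 7, 4],
}
example_distance = distance_to_pooled(replicate_example, pooled_example)
```

Ranking the 100 replicates by this distance and keeping the smallest values would give a list of best-representing replicates of the kind reported in Table 1.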

Modeling the genetic effect
We modeled the effect of the susceptibility factor by the MASC method. For a given genetic model, the MASC method computes the four expected distributions described in Figure 1 and the previously described chi-square statistic (the sum of four chi-squares between the observed and expected distributions). The chi-square is minimized over the parameters left free to vary. The fit of the model to the observed data is then tested (8 df minus the number of parameters free to vary). The parameters of the genetic model are the penetrances of each genotype and the coupling between the marker alleles and the susceptibility factor alleles. The expected distributions are computed conditional on the index cases having at least one affected sib. They depend on the frequency of the marker alleles in the general population. The marker allele frequencies may be assumed to be already known (situation 1), to be obtained through a control sample (situation 2), or to be obtained through the parental alleles that have not been transmitted to the affected cases used for the family ascertainment (situation 3).
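The count of 8 degrees of freedom can be written out as a short worked equation. It assumes that each of the four distributions has three categories (three genotypes for a biallelic marker, and IBD = 0, 1, or 2 within each genotype) and that k denotes the number of parameters left free to vary; the symbol k is introduced here for illustration only.

```latex
\chi^2_{\mathrm{total}}
  = \chi^2_{\mathrm{genotype}}
  + \sum_{g \in \{M_1M_1,\; M_1M_2,\; M_2M_2\}} \chi^2_{\mathrm{IBD}\mid g},
\qquad
\mathrm{df}
  = \underbrace{(3-1)}_{\text{genotype distribution}}
  + \underbrace{3\,(3-1)}_{\text{stratified IBD distributions}}
  - k
  = 8 - k .
```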

Computing the intervals of variation for the parameter estimates
We implemented a bootstrap procedure for calculating the variation intervals of the parameter estimates in the MASC program. The uncertainty on the parameters is due to the sampling of families and of controls, when the marker allele frequencies are inferred from a control sample. For each bootstrapped family set (1,000 replicates), we estimated the parameters considering the three possibilities described above for the marker allele frequencies.
In situation 1, there is no uncertainty induced by the marker allele frequencies. In situation 2, the bootstrap procedure is applied to both the family sample and a sample of 50 controls randomly drawn among the 100 control samples. In situation 3, the bootstrap is only applied to the family sample. For the three situations, we obtain the distribution of the parameter estimates and provide the 95% intervals. Table 1 gives the ten replicates that best represent the distributions in the marker M area (the smallest chi-squares). Replicate 97 was chosen for the following analyses.
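The bootstrap procedure described in this section can be sketched as follows. This is a minimal illustration under stated assumptions, not the MASC implementation: it assumes that families are resampled with replacement and that the 95% interval is read off the empirical percentiles of the bootstrap estimates (the text does not specify how the interval limits are derived). The function `bootstrap_interval`, its parameters, and the toy estimator are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple, TypeVar

Family = TypeVar("Family")


def bootstrap_interval(
    families: Sequence[Family],
    estimate: Callable[[List[Family]], float],
    n_boot: int = 1000,
    level: float = 0.95,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for a parameter estimated on families.

    `estimate` stands in for the model-fitting step (hypothetical here);
    it receives a resampled family set and returns one parameter estimate.
    """
    rng = random.Random(seed)
    n = len(families)
    estimates = []
    for _ in range(n_boot):
        # Draw n families with replacement from the original sample.
        resample = [families[rng.randrange(n)] for _ in range(n)]
        estimates.append(estimate(resample))
    estimates.sort()
    alpha = (1.0 - level) / 2.0
    lower_index = math.floor(alpha * (n_boot - 1))
    upper_index = math.ceil((1.0 - alpha) * (n_boot - 1))
    return estimates[lower_index], estimates[upper_index]


# Toy usage: each "family" is reduced to a 0/1 indicator and the
# estimator is a simple mean. Both are invented for illustration.
toy_families = [0, 1] * 50
toy_interval = bootstrap_interval(
    toy_families, lambda sample: sum(sample) / len(sample), n_boot=500, seed=1
)
```

Under situation 2, the same resampling would also be applied to the control sample before each re-estimation, so that the interval reflects the uncertainty in the marker allele frequencies as well.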

Modeling of the genetic effects
Figure 1. Distribution of the marker C03R0281 for the 10,000 index cases of the Aipotu population. The first level shows the genotype distribution of the 10,000 index cases for the marker M. The second level shows, for each marker genotype, the index cases classified according to their IBD sharing with one affected sib (stratified IBD distribution).
We tested a biallelic susceptibility locus model (S1, S2) and estimated four parameters: two relative penetrances, λ1 and λ2, and two coupling probabilities, c11 and c12, where λ1 = P(affected | S1S2) / P(affected | S1S1), λ2 = P(affected | S2S2) / P(affected | S1S1), c11 = P(S1 | M1), and c12 = P(S1 | M2).
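The two coupling probabilities tie the susceptibility locus to the marker, and a short sketch shows what they imply. By the law of total probability, the frequency of S1 is c11 P(M1) + c12 P(M2), and the linkage disequilibrium coefficient between M1 and S1 is P(M1) P(M2) (c11 - c12). The function names and the numerical values of c11 and c12 below are hypothetical; the marker frequency 0.56 is the M1 frequency quoted in the text and is used purely for illustration.

```python
def s1_frequency(c11: float, c12: float, p_m1: float) -> float:
    """Frequency of allele S1 from the coupling probabilities.

    c11 = P(S1 | M1), c12 = P(S1 | M2), p_m1 = P(M1); P(M2) = 1 - p_m1.
    """
    return c11 * p_m1 + c12 * (1.0 - p_m1)


def ld_coefficient(c11: float, c12: float, p_m1: float) -> float:
    """Linkage disequilibrium D = P(M1, S1) - P(M1) * P(S1).

    Algebraically this reduces to P(M1) * P(M2) * (c11 - c12), so D is
    zero exactly when S1 is equally frequent on M1 and M2 haplotypes.
    """
    return p_m1 * (1.0 - p_m1) * (c11 - c12)


# Hypothetical coupling values, chosen only to illustrate the formulas.
example_c11, example_c12, example_p_m1 = 0.90, 0.70, 0.56
example_s1 = s1_frequency(example_c11, example_c12, example_p_m1)
example_d = ld_coefficient(example_c11, example_c12, example_p_m1)
```

When c11 equals c12 the marker carries no information on the susceptibility alleles, which is why the coupling probabilities have to be estimated jointly with the relative penetrances.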
However, many biallelic models are not rejected. We give several models compatible with the observations made on M in replicate 97 (Table 2). To discriminate between these models, we looked for those that also fit the observations on the closely linked markers. Among these markers, three in packet 153 were also associated with KPD: SNPs 16, 17, and 18 (results shown in Table 3). The direct effect of SNP 16, 17, or 18 was also rejected and these three SNPs were not significantly in linkage disequilibrium (LD) with each other or with marker M (SNP 19). We therefore concluded that the functional polymorphism has not been typed and is in LD with the four studied SNPs. Table 4 gives the 95% variation intervals of the disease allele frequency q. The intervals are given with and without uncertainty on the allele frequencies of marker M. Because the results are very close whether a sample of 50 controls or the untransmitted alleles of the 100 families are used, only results for the latter situation are given in Table 4. In order to show the effect of the family sample size on the variation intervals, we also give the results obtained on the family samples resulting from pooling the two best replicates (97 and 63; Table 1) and from pooling the five best replicates (97, 63, 19, 48, and 56; Table 1). The estimate of the susceptibility allele frequency decreases as the number of families increases (0.24 for one replicate; 0.20 for five replicates). However, the uncertainty on the disease allele frequency estimate is very large. The variation interval narrows as the sample size increases, mainly through a decrease of its upper limit. Except for the largest sample size (500 families), the size of the variation interval does not depend on whether the uncertainty on the marker allele frequencies is taken into account.

Before knowing the simulation model
Table 4. 95% variation interval of the disease allele frequency q when the marker allele frequencies are known ("no uncertainty") and when they are estimated from the untransmitted parental alleles ("uncertainty").
We have modeled one susceptibility locus S for KPD using the diagnosis criteria of the Aipotu population. It is very likely that the different sub-criteria used for this diagnosis are not determined by S in the same way. The distribution of the sub-phenotypes in all the affected and all the unaffected individuals, as well as the IBD distribution between affected sibs in the pooled set of 10,000 families, is given in Table 5. All affected individuals display the sub-phenotype combination (e + f + h) compared to only 4‰ of unaffected individuals. This means that the diagnostic criteria in Aipotu are equivalent to having sub-phenotypes e, f, and h simultaneously. There is no distortion in the IBD distribution of sub-phenotypes i, j, and l. In addition, i and j have the same frequency in affected and unaffected individuals. The sub-phenotype k shows the strongest IBD distribution distortion (0.1, 0.49, 0.41 for IBD = 0, 1, 2, respectively).
The observed distributions do not provide any information on the dominance parameter of the disease locus: the 95% variation interval of λ1 ranges from 0 to 1. Information can, however, be improved by taking into account the familial recurrence risk for KPD. The proportion of affected parents is 0.2 and the proportion of affected sibs of index cases is 0.1, after excluding the two sibs through whom the families were ascertained (Table 6). The recurrence risk is thus twice as high in parents as in sibs, which cannot be explained by any genetic model unless the penetrances differ between generations.

After knowing the simulation model
To validate our bootstrap procedure, we checked whether the true parameters used for the simulation were included in our variation intervals. The value of the disease allele frequency used in the simulation is 0.15. This value is included in the variation intervals for the three sample sizes we used (100, 200, and 500 families). The larger the sample size, the closer the estimate is to the true value.
Note that the true value of the dominance parameter cannot be inferred from the provided answers without extensive work. Indeed, the KPD phenotype is a mixture of different phenotypes, each one corresponding to different models of interaction between D2 and another susceptibility locus. Because there is no generation effect in the simulation, we still cannot explain the greater risk for parents than for sibs.