A statistical measure for the skewness of X chromosome inactivation based on family trios

Xu, Si-Qi; Zhang, Yu; Wang, Peng; Liu, Wei; Wu, Xian-Bo; Zhou, Ji-Yuan

doi:10.1186/s12863-018-0694-8

Methodology article
Open access
Published: 05 December 2018

A statistical measure for the skewness of X chromosome inactivation based on family trios

Si-Qi Xu¹,
Yu Zhang¹,
Peng Wang¹,
Wei Liu¹,
Xian-Bo Wu² &
…
Ji-Yuan Zhou ORCID: orcid.org/0000-0003-0866-4402¹

BMC Genetics volume 19, Article number: 109 (2018) Cite this article

2253 Accesses
5 Citations
Metrics details

Abstract

Background

X chromosome inactivation (XCI) is an important gene regulation mechanism in females to equalize the expression levels of X chromosome between two sexes. Generally, one of two X chromosomes in females is randomly chosen to be inactivated. Nonrandom XCI (XCI skewing) is also observed in females, which has been reported to play an important role in many X-linked diseases. However, there is no statistical measure available for the degree of the XCI skewing based on family data in population genetics.

Results

In this article, we propose a statistical approach to measure the degree of the XCI skewing based on family trios, which is represented by a ratio of two genotypic relative risks in females. The point estimate of the ratio is obtained from the maximum likelihood estimates of two genotypic relative risks. When parental genotypes are missing in some family trios, the expectation-conditional-maximization algorithm is adopted to obtain the corresponding maximum likelihood estimates. Further, the confidence interval of the ratio is derived based on the likelihood ratio test. Simulation results show that the likelihood-based confidence interval has an accurate coverage probability under the situations considered. Also, we apply our proposed method to the rheumatoid arthritis data from USA for its practical use, and find out that a locus, rs2238907, may undergo the XCI skewing against the at-risk allele. But this needs to be further confirmed by molecular genetics.

Conclusions

The proposed statistical measure for the skewness of XCI is applicable to complete family trio data or family trio data with some paternal genotypes missing. The likelihood-based confidence interval has an accurate coverage probability under the situations considered. Therefore, our proposed statistical measure is generally recommended in practice for discovering the potential loci which undergo the XCI skewing.

Background

Many human diseases are associated with genes on X chromosome, such as asthma, autoimmune diseases, cancers, some neurological and psychiatric diseases [1,2,3,4,5]. Most of these X-linked diseases often exhibit sex-specific patterns of susceptibility due to the difference in the number of copies of X chromosome between two sexes. Females have two copies of X chromosome whereas there is only one copy in males. To equalize the expression levels of X chromosome between sexes, dosage compensation is achieved by an important gene regulation mechanism in mammalian females, X chromosome inactivation (XCI), which results in expression silencing of one of two X chromosomes in females [6]. Up to 75% genes on X chromosome are subject to XCI, while there are about 15% escaping from inactivation and expressed from both X chromosomes, and the remaining 10% show variable inactivation patterns in different human cell lines [7].

During the process of XCI, one of two X chromosomes in females is chosen to be inactivated in a random way. This means that roughly 50% of cells in females have the paternal X chromosome expressed, while the others express the maternal one. Although random XCI occurs commonly, the XCI skewing also takes place in females, which is defined as the phenomenon of nonrandom inactivation that one of X chromosomes is selected to be silenced with a probability deviating from 50% [8]. Generally, the skewness of XCI is caused by a second selection mechanism. When the mutation on X chromosome affects the survival and proliferation of cells, the amount of cells carrying an active mutant X chromosome will become larger or smaller than that of cells with an active wild-type X chromosome, which thus leads to the skewness of XCI [9, 10]. Negative selection, where the mutation gives a growth disadvantage to cells, frequently happens in female carriers with X-linked diseases, such as mental retardation disorders, Wiskott-Aldrich syndrome and X-linked severe combined immunodeficiency [11,12,13]. On the other hand, when the mutation provides a growth advantage to cells, positive selection occurs and can result in some diseases, such as adrenoleukodystrophy and breast cancer [14, 15].

In genetic association studies on X chromosome, Clayton [16] first took XCI into consideration. Due to XCI, the genotypic effect of homozygous females can be treated the same as that of hemizygous males. Therefore, the genotypic scores are given to be 0, 1 and 2 corresponding to three genotypes at a diallelic locus on X chromosome in females, and 0 and 2 corresponding to two genotypes in males. However, Wang et al. [17] pointed out that such coding strategy only considers the situation of random XCI, but ignores the skewness of XCI and escape from XCI. To account for all possible situations of XCI, Wang et al. suggested that the genotypic score for the heterozygous females, denoted by γ, can be any possible values between 0 and 2. Under XCI, the value of γ reflects the degree of inactivation skewing, with γ/2 representing the proportion of cells having the mutant allele active. As such, γ = 1 stands for random XCI, while γ between 1 and 2 indicates the XCI skewing toward the mutant allele and γ between 0 and 1 denotes the XCI skewing against the mutant allele. For example, γ = 0.5 means that the skewness of XCI is against the mutant allele with 25% cells expressing the mutant allele and the other 75% cells expressing the normal allele. On the other hand, in molecular genetics, the XCI skewing pattern can be identified by assays taking advantage of differential methylation between the active and inactive X chromosomes or mRNA transcription in cells [18,19,20]. However, since the XCI pattern always varies among cell lines [21, 22], these assays, which usually use cells from only a few tissues to investigate the XCI patterns, cannot present the status of the whole body [10]. Further, there is no statistical measure available for detecting the XCI skewing pattern in population genetics as yet.

Therefore, in this article, we give the expression of γ for family trios with both parents and one affected daughter in the presence of association between the disease and genotypes. In fact, γ is a function of two genotypic relative risks (GRRs) in females. In addition, we obtain the point estimate of γ from the maximum likelihood estimates (MLEs) of the GRRs. When parental genotypes are missing in some family trios, the expectation-conditional-maximization (ECM) algorithm [23] is used to obtain the corresponding MLEs. Further, the confidence interval (CI) of γ is derived based on a likelihood ratio test (LRT). Finally, simulation study is conducted to investigate the performance of our proposed method. The simulation results show that the proposed likelihood-based CI has an accurate coverage probability under the situations considered. For practical use, we apply our proposed method to the rheumatoid arthritis (RA) data from USA.

Methods

Notations

Consider an X-linked diallelic locus with mutant allele A and normal allele a. Let p_m and q_m = 1 − p_m denote the allele frequencies of A and a in males, respectively. Suppose that p_f is the allele frequency of A and ρ is the inbreeding coefficient in females. Then, the frequencies of genotypes aa, Aa and AA in females are respectively g₀ = (1 − p_f)² + ρp_f(1 − p_f), g₁ = 2(1 − ρ)p_f(1 − p_f) and $ {g}_2={p}_f^2+\rho {p}_f\left(1-{p}_f\right) $. Note that Hardy-Weinberg equilibrium holds in the population under study when ρ = 0 and p_m = p_f. Let f₀, f₁ and f₂ respectively represent the penetrances in females with genotypes aa, Aa and AA. The GRRs in females are defined as λ₁ = f₁/f₀ and λ₂ = f₂/f₀.

Relationship between penetrances and XCI skewing in females

Let the genotypic scores be 0, γ and 2 corresponding to females with genotypes aa, Aa and AA, respectively, where γ ∈ [0, 2] represents the XCI skewing pattern. We assume that a generalized genetic model holds [24, 25], which is defined as f₀ ≤ f₁ ≤ f₂ (i.e., 1 ≤ λ₁ ≤ λ₂) with at least one inequality being strict, in the presence of association between the disease and genotypes in females. If f₁ is unknown, then it can be expressed as a function of γ, denoted by f₁(γ). To derive the expression of f₁(γ), let $ {f}_1^{\prime}\left(\gamma \right) $ be the first order derivative of f₁(γ) with respect to γ. As such, f₁(γ) can be approximated by a first order Taylor expansion around γ = 1 as follows,

$$ {f}_1\left(\gamma \right)\approx {f}_1(1)+{f}_1^{\prime }(1)\left(\gamma -1\right). $$

(1)

On the other hand, when the XCI skewing is completely against A, we have γ=0 and f₁ = f₀. So, from Eq. (1), $ {f}_0={f}_1(0)\approx {f}_1(1)-{f}_1^{\prime }(1) $. Similarly, when the XCI skewing is completely toward A, we have γ=2 and f₁ = f₂. Then $ {f}_2={f}_1(2)\approx {f}_1(1)+{f}_1^{\prime }(1) $. Hence, f₁(1) = (f₂ + f₀)/2 and $ {f}_1^{\prime }(1)=\left({f}_2-{f}_0\right)/2 $. Therefore, Eq. (1) turns to be

$$ {f}_1\left(\gamma \right)\approx {f}_0+\frac{\gamma \left({f}_2-{f}_0\right)}{2}. $$

(2)

From Eq. (2), we notice that f₁ is around the midpoint between f₀ and f₂ under random XCI (i.e., γ=1). Actually, Eq. (2) means that the penetrance for heterozygous females is approximately linear in the genotypic score γ around γ=1.

Further, if f₁ is known, then we can obtain γ from Eq. (2) as follows, which is a function of f₀, f₁ and f₂, or λ₁ and λ₂,

$$ \gamma \approx \frac{2\left({f}_1-{f}_0\right)}{f_2-{f}_0}=\frac{2\left({\lambda}_1-1\right)}{\lambda_1-1}. $$

Note that the value of γ attains its maximum (γ = 2) when λ₁ = λ₂ ≠ 1, and γ = 0 when λ₁ = 1 and λ₂ ≠ 1. Assume that $ {\widehat{\lambda}}_1 $ and $ {\widehat{\lambda}}_2 $ are the MLEs of λ₁ and λ₂, respectively. Then, the point estimate of γ, $ \widehat{\gamma} $, can be obtained by $ 2\left({\widehat{\lambda}}_1-1\right)/\left({\widehat{\lambda}}_2-1\right) $.

MLEs of λ ₁ and λ ₂ using family trios without missing genotypes

Here we only include the trios with affected daughters in the analysis. Male offspring are not investigated because they are not informative of λ₁ and λ₂. Firstly, we consider complete family trios, each with both typed parents and an affected typed daughter. Let F, M and C represent the numbers of allele A in father, mother and daughter, respectively, and D denote that the daughter is affected. Eight possible types of FMC (i.e., 000, 010, 011, 021, 101, 111, 112 and 122) together with the corresponding probabilities P(FM), P(C| FM) and P(FMC| D) are shown in Table 1. Let Ω be the set of the eight possible types of FMC listed in Table 1 and then P(FMC| D) is derived as

$$ P\left( FM C\left|D\right.\right)=\frac{P(FM)P\left(C\left| FM\right.\right)P\left(D\left|C\right.\right)}{\sum_{F^{\prime }{M}^{\prime }{C}^{\prime}\in \Omega}P\left({F}^{\prime }{M}^{\prime}\right)P\left({C}^{\prime}\left|{F}^{\prime }{M}^{\prime}\right.\right)P\left(D\left|{C}^{\prime}\right.\right)}. $$

(3)

Table 1 Eight types of possible family trios and the corresponding probabilities

Full size table

Equation (3) holds when the disease status of a daughter is only related to her own genotype, and P(C| FM) is determined by Mendelian transmission, which is equal to 0.5 for heterozygous mother and 1 otherwise. Assume that P(FM) = P(F)P(M), and divide the numerator and denominator of Eq. (3) by f₀. Then, P(FMC| D) for each trio type can be written as the last column in Table 1, where R = q_mg₀ + 0.5q_mg₁(1 + λ₁) + q_mg₂λ₁ + p_mg₀λ₁ + 0.5p_mg₁(λ₁ + λ₂) + p_mg₂λ₂. The detailed derivation of P(FMC| D) in Table 1 is given in Additional file 1: Appendix A.

Since we find that it is more convenient to directly estimate g₀ and g₁ rather than ρ and p_f, we let θ= (p_m, g₀, g₁, λ₁, λ₂)^T be the parameter vector of interest. As such, the log-likelihood function of the observed data conditional on the daughter being affected is given by

$$ \ln L\left(\theta \right)=\sum \limits_{FMC\in \Omega}{n}_{FMC}\ln P\left( FMC\left|D\right.\right), $$

where n_FMC is the number of the family trios of type FMC. To obtain the MLE of θ, numerical methods, such as Newton-Raphson algorithm (by using “maxLik” package in R software [26]) and the ECM algorithm introduced later, are applied. We choose the initial values of p_m, g₀, g₁, λ₁ and λ₂ as follows: $ {\widehat{p}}_m^{(0)}=\#\left(F=1\right)/\#\left(F\in \left\{0,1\right\}\right), $ $ {\widehat{g}}_0^{(0)}=\#\left(M=0\right)/\#\left(M\in \left\{0,1,2\right\}\right) $, $ {\widehat{g}}_1^{(0)}=\#\left(M=1\right)/\#\left(M\in \left\{0,1,2\right\}\right), $ $ {\widehat{\lambda}}_1^{(0)}={n}_{011}/{n}_{010} $ and $ {\widehat{\lambda}}_2^{(0)}=\left({n}_{011}{n}_{112}\right)/\left({n}_{010}{n}_{111}\right) $, where # denotes the counting measure. The details about the choice of these initial values are shown in Additional file 1: Appendix B.

MLEs of λ ₁ and λ ₂ using family trios with parental genotypes missing

It is common that parental genotypes are missing in some family trios. For trios with paternal or maternal genotype missing, we call them “mother-daughter pairs” or “father-daughter pairs”, denoted by MC and FC, respectively. Thus, MC takes possible genotypes from Ω_MC={00, 01, 10, 11, 12, 21, 22}, and FC takes possible genotypes from Ω_FC={00, 01, 11, 12}. As for trios with both parental genotypes missing, we refer to them as “single daughters”. The probabilities of the mother-daughter pair MC, father-daughter pair FC and single daughter C given the daughter being affected are respectively$ P\left( MC|D\right)={\sum}_{F\in \left\{0,1\right\}}P\left( FMC|D\right) $, $ P\left( FC|D\right)={\sum}_{M\in \left\{0,1,2\right\}}P\left( FMC|D\right) $ and $ P\left(C|D\right)={\sum}_{F\in \left\{0,1\right\}}{\sum}_{M\in \left\{0,1,2\right\}}P\left( FMC|D\right) $, where P(FMC| D) are given in Table 1.

Suppose that we collect n_FMC family trios of type FMC, n_{1m, MC} mother-daughter pairs of type MC, n_{1f, FC} father-daughter pairs of type FC and n_{0, C} single daughters of type C, where the subscripts 1 m, 1f and 0 respectively mean that each trio has only a mother, only a father and no parents. Then, the log-likelihood function of the observed data is

$$ {\displaystyle \begin{array}{l}\ln L\left(\theta \right)=\sum \limits_{FMC\in \Omega}{n}_{FMC}\ln P\left( FMC\left|D\right.\right)+\sum \limits_{MC\in {\Omega}_{MC}}{n}_{1m, MC}\ln P\left( MC\left|D\right.\right)\\ {}+\sum \limits_{FC\in {\Omega}_{FC}}{n}_{1f, FC}\ln P\left( FC\left|D\right.\right)+\sum \limits_{C\in \left\{0,1,2\right\}}{n}_{0,C}\ln P\left(C\left|D\right.\right).\end{array}} $$

(4)

Let N₂, N_1m, N_1f and N₀ respectively be the numbers of family trios, mother-daughter pairs, father-daughter pairs and single daughters. Then, $ {N}_2={\sum}_{FMC\in \Omega}{n}_{FMC} $, $ {N}_{1m}={\sum}_{MC\in {\Omega}_{MC}}{n}_{1m, MC},{N}_{1f}={\sum}_{FC\in {\Omega}_{FC}}{n}_{1f, FC} $, $ {N}_0={\sum}_{C\in \left\{0,1,2\right\}}{n}_{0,C} $ and the total sample size N = N₂ + N_1m + N_1f + N₀.

Since it is not so easy to obtain the MLE of θ from the above observed log-likelihood function (4), the ECM algorithm will be employed. Assume that $ {n}_{1m, MC}={\sum}_{F\in \left\{0,1\right\}}{z}_{1m, FMC} $, $ {n}_{1f, FC}={\sum}_{M\in \left\{0,1,2\right\}}{z}_{1f, FMC} $ and n_{0, C}= $ {\sum}_{F\in \left\{0,1\right\}}{\sum}_{M\in \left\{0,1,2\right\}}{z}_{0, FMC} $, where z_{1m, FMC}, z_{1f, FMC} and z_{0, FMC} are the unobserved numbers of trios FMC for mother-daughter pairs MC, father-daughter pairs FC and single daughters C, respectively (see Additional file 1: Tables S1-S3). Then, the log-likelihood function for the complete data (n_FMC, z_{1m, FMC}, z_{1f, FMC}, z_{0, FMC}) can be written as

$$ \ln {L}_C\left(\theta \right)=\sum \limits_{FMC\in \Omega}\left({n}_{FMC}+{z}_{1m, FMC}+{z}_{1f, FMC}+{z}_{0, FMC}\right)\ln P\left( FMC|D\operatorname{}\right). $$

The following ECM algorithm contains one E-step and five CM-steps at each iteration. In the E-step at iteration (k + 1), we obtain the conditional expectation of lnL_C(θ) with respect to the conditional distributions of z_{1m, FMC}, z_{1f, FMC} and z_{0, FMC} given n_{1m, MC}, n_{1f, FC} and n_{0, C}, respectively. z_{1m, FMC} ∣ n_{1m, MC}, z_{1f, FMC} ∣ n_{1f, FC} and z_{0, FMC} ∣ n_{0, C} follow the binomial distributions with respective success probabilities P(F| MC, D), P(M| FC, D) and P(FM| C, D). Thus, the Q function is given by

$$ Q\left(\theta |{\widehat{\theta}}^{(k)}\right)=\sum \limits_{FMC\in \Omega}\left[{n}_{FMC}+{E}_{{\widehat{\theta}}^{(k)}}\Big({z}_{1m, FMC}|{n}_{1m, MC}\right)+{E}_{{\widehat{\theta}}^{(k)}}\left({z}_{1f, FMC}\left|{n}_{1f, FC}\right)+{E}_{{\widehat{\theta}}^{(k)}}\left({z}_{0, FMC}|{n}_{0,C}\right)\right]\ \ln P\left( FMC|D\right), $$

(5)

$ \mathrm{where}\kern0.5em {\hat{\theta}}^{(k)}={\left({\hat{p}}_m^{(k)},{\hat{g}}_0^{(k)},{\hat{g}}_1^{(k)},{\hat{\lambda}}_1^{(k)},{\hat{\lambda}}_2^{(k)}\right)}^T $ is the MLE of θ at iteration k,

$$ {E}_{{\widehat{\theta}}^{(k)}}\left({z}_{1m, FMC}|{n}_{1m, MC}\right)={n}_{1m, MC}P\left(F| MC,D;{\widehat{\theta}}^{(k)}\right)={n}_{1m, MC}\frac{P\left( FMC|D;{\widehat{\theta}}^{(k)}\right)}{\sum \limits_{F^{\prime}\in \left\{0,1\right\}}P\left({F}^{\prime } MC|D;{\widehat{\theta}}^{(k)}\right)}, $$

$$ {E}_{{\widehat{\theta}}^{(k)}}\left({z}_{1f, FMC}|{n}_{1f, FC}\right)={n}_{1f, FC}P\left(M| FC,D;{\widehat{\theta}}^{(k)}\right)={n}_{1f, FC}\frac{P\left( FMC|D;{\widehat{\theta}}^{(k)}\right)}{\sum \limits_{M^{\prime}\in \left\{0,1,2\right\}}P\left(F{M}^{\prime }C|D;{\widehat{\theta}}^{(k)}\right)} $$

and

$$ {E}_{{\widehat{\theta}}^{(k)}}\Big({z}_{0, FMC}\left|{n}_{0,C}\right)={n}_{0,C}P\left( FM|C,D;{\widehat{\theta}}^{(k)}\right)={n}_{0,C}\frac{P\left( FM C|D;{\widehat{\theta}}^{(k)}\right)}{\sum \limits_{F^{\prime}\in \left\{0,1\right\}}\sum \limits_{M^{\prime}\in \left\{0,1,2\right\}}P\left({F}^{\prime }{M}^{\prime }C|D;{\widehat{\theta}}^{(k)}\right)}. $$

In the CM-steps, the Q function is maximized with respect to each of components of θ in turn, with the others fixed at their previous values. The MLE of θ at iteration (k + 1) are given in Additional file 1: Appendix B. The initial value of θ is obtained only based on N₂ complete family trios when N₂ ≠ 0 (see Additional file 1: Appendix B). However, when N₂ = 0, we estimate the initial values of λ₁ and λ₂ by replacing unknown n₀₁₀, n₀₁₁, n₁₁₁ and n₁₁₂ values in $ {\widehat{\lambda}}_1^{(0)}={n}_{011}/{n}_{010} $ and $ {\widehat{\lambda}}_2^{(0)}=\left({n}_{011}{n}_{112}\right)/\left({n}_{010}{n}_{111}\right) $ by their respective conditional expectations (see Additional file 1: Tables S1-S3). For example, n₀₁₁ is replaced by

$$ E\left({z}_{1m,011}|{n}_{1m,11}\right)+E\left({z}_{1f,011}|{n}_{1f,01}\right)+E\left({z}_{\mathrm{0,011}}|{n}_{0,1}\right) $$

$$ ={n}_{1m,11}{\widehat{q}}_m^{(0)}+{n}_{1f,01}\bullet \frac{0.5{\widehat{g}}_1^{(0)}}{0.5{\widehat{g}}_1^{(0)}+{\widehat{g}}_2^{(0)}}+{n}_{0,1}\bullet \frac{0.5{\widehat{q}}_m^{(0)}{\widehat{g}}_1^{(0)}}{{\widehat{p}}_m^{(0)}{\widehat{g}}_0^{(0)}+0.5{\widehat{g}}_1^{(0)}+{\widehat{q}}_m^{(0)}{\widehat{g}}_2^{(0)}}, $$

where $ {\widehat{p}}_m^{(0)} $, $ {\widehat{q}}_m^{(0)}=1-{\widehat{p}}_m^{(0)} $, $ {\widehat{g}}_0^{(0)} $, $ {\widehat{g}}_1^{(0)} $ and $ {\widehat{g}}_2^{(0)}=1-{\widehat{g}}_0^{(0)}-{\widehat{g}}_1^{(0)} $are the initial values of p_m, q_m, g₀, g₁ and g₂, respectively. The details about the choice of these initial values are shown in Additional file 1: Appendix B. Given the initial value of θ, the steps mentioned above continue until the convergence criterion is satisfied. For example, the absolute differences between the estimates of the parameters at two consecutive iterations are all less than 10⁻⁷. In addition, note that the ECM algorithm still works when there are no missing genotypes in all the family trios. However, it contains only the CM steps in this situation and can be regarded as a special case of the cyclic coordinate ascent method, which is simple and stable [23].

Confidence interval of γ based on likelihood method

To obtain the CI, we first construct a LRT for testing the null hypothesis H₀: γ = γ₀ as follows,

$$ \mathrm{LRT}=2\ln \frac{L\left(\widehat{\theta}\right)}{L\left({\overset{\sim }{\theta}}_0\right)}, $$

where $ \widehat{\theta}= $ ($ {\widehat{p}}_m $, $ {\widehat{g}}_0 $, $ {\widehat{g}}_1 $, $ {\widehat{\lambda}}_1 $, $ {\widehat{\lambda}}_2 $)^T is the MLE of θ under H₁. Let θ₀= (p_m, g₀, g₁, λ₂)^T be the parameter vector under H₀ with γ₀ = 2(λ₁ − 1)/(λ₂ − 1) (i.e., λ₁ = γ₀(λ₂ − 1)/2 + 1), and then $ {\overset{\sim }{\theta}}_0= $ ($ {\overset{\sim }{p}}_m $, $ {\overset{\sim }{g}}_0 $, $ {\overset{\sim }{g}}_1 $, $ {\overset{\sim }{\lambda}}_2 $)^T denotes the MLE of θ₀. The choice of the initial value of θ₀ and the solution of $ {\overset{\sim }{\theta}}_0 $ using family trios with missing parental genotypes is given in Additional file 1: Appendix B. The LRT asymptotically follows a chi-square distribution with the degree of freedom being one (i.e., $ {\chi}_1^2 $).

At the significance level α, the 100(1 − α)% confidence interval of γ based on the LRT is

$$ \left\{\gamma :\mathrm{LRT}\left(\gamma \right)<\kern0.5em {\chi}_{1-\alpha, 1}^2\right\}, $$

and the confidence limits are the values that satisfy

$$ \mathrm{LRT}\left(\gamma \right)={\chi}_{1-\alpha, 1}^2. $$

(6)

Note that there is no closed-form solution of Eq. (6). Thus, numerical method is applied, such as functions from “rootSolve” package in R software [26]. Let γ_L and γ_U be two unequal roots of Eq. (6) with γ_L < γ_U. Generally, the 100(1 − α)% CI of γ would be (γ_L, γ_U). However, since the true value of γ is bounded in [0, 2], the original estimated CI of γ will be truncated by [0, 2] if necessary. As such, the ultimate CI of γ is (γ_L, γ_U)∩[0, 2], which is easier to be interpreted than the origin CI.

Discontinuity problem of confidence interval of γ

Note that γ is a ratio, so like other ratio estimates [27], we find that the proposed CI may consist of two disjoint intervals, such as [0, 0.03)∪(0.59, 2]. In this article, this kind of CIs is referred to as “discontinuous CI” for convenience. Let’s take a close look at this discontinuity problem by the following example. Consider the situation of (n₀₀₀, n₀₁₀, n₀₁₁, n₀₂₁, n₁₀₁, n₁₁₁, n₁₁₂, n₁₂₂)=(191, 89, 112, 54, 114, 59, 62, 19). Then, $ \widehat{\gamma} $ is 1.92 and two roots of Eq. (6) are 0.03 and 0.59, respectively. If the CI is set to be (0.03, 0.59) normally, to our surprise, $ \widehat{\gamma} $ is located outside this CI. When testing the null hypothesis H₀: γ = γ₀, we find that γ₀ taking values between 0.03 and 0.59 is rejected by the LRT. This means that the interval (0.03, 0.59) is actually a rejection region of the corresponding LRT rather than an acceptance region. Hence, the CI of γ turns to be (−∞, 0.03)∪(0.59,+∞), and [0, 0.03)∪(0.59, 2] after being truncated. The discontinuous CI may occur when the denominator of the ratio is close to zero (i.e., λ₂ is close to 1 in this article) [28]. In fact, when λ₂ = 1, we assume that we cannot obtain information on the XCI skewing pattern according to the CI of γ. This is because our proposed method measures XCI skewness in the presence of association between the disease and genotypes (i.e., λ₂ ≠ 1). On the other hand, although these discontinuous CIs are considered to be uninformative and are difficult to be interpreted, there is no satisfactory “objective” method for dealing with this problem well [27].

Simulation settings

To assess the performance of the proposed method, we conduct the following simulation study. The sample size N is taken to be 700, which is close to that of RA data (757 pedigrees) [29]. We consider six different combinations of (N₂, N_1m, N_1f, N₀), which are referred to as six “missing patterns” (MP1–MP6) for convenience and are shown in Table 2. When the missing pattern changes from MP1 to MP4, the number of case-parents trios N₂ decreases and the number of single daughters N₀ increases with N_1m = N_1f. For MP5 and MP6, the number of mother-daughter pairs N_1m is different from that of father-daughter pairs N_1f with N₂ = N₀. In addition, (p_m, p_f) is set to be (0.30, 0.30), (0.25, 0.30), (0.30, 0.25), (0.20, 0.20), (0.15, 0.20) and (0.20, 0.15), and we assume that ρ=0 and 0.05, and λ₂=1.5 and 2. λ₁ is calculated from λ₁ = γ(λ₂ − 1)/2 + 1, where γ varies from 0 to 2 in increments of 0.5. Given p_m, p_f, ρ, λ₂ and γ, N₂ case-parents trios are randomly generated from a multinomial distribution with probabilities P(FMC| D) shown in Table 1. Similarly, N_1m mother-daughter pairs, N_1f father-daughter pairs and N₀ single daughters are randomly drawn from the multinomial distributions with probabilities $ P\left( MC|D\right)={\sum}_{F\in \left\{0,1\right\}}P\left( FMC|D\right) $, $ P\left( FC|D\right)={\sum}_{M\in \left\{0,1,2\right\}}P\left( FMC|D\right) $ and $ P\left(C|D\right)={\sum}_{F\in \left\{0,1\right\}}{\sum}_{M\in \left\{0,1,2\right\}}P\left( FMC|D\right) $, respectively. The simulations are based on k=10,000 replicates and 5% significance level.

Table 2 Six simulation settings for different combinations of (N₂, N_1m, N_1f, N₀) with total sample size N being fixed at 700

Full size table

We assess the statistical properties of the CI by the following indexes. Let the coverage probability (CP) be the proportion that the CI contains the true value of γ among k replicates. Note that under H₀: γ = γ₀, the estimated type І error rate of the LRT is 1−CP. ML and MR denote the left tail error and the right tail error (missing the true value of γ), respectively, with ML $ =\#\left[\left(\gamma <{\gamma}_L\right)\cap \left({\gamma}_L\le \widehat{\gamma}\le {\gamma}_U\right)\right]/k $ and MR $ =\#\left[\left(\gamma >{\gamma}_U\right)\cap \left({\gamma}_L\le \widehat{\gamma}\le {\gamma}_U\right)\right]/k $. Further, we use ML/(ML + MR) to measure the balance of ML and MR, which will be close to 0.5 when the balance is achieved. Notice that we do not consider those discontinuous CIs when calculating ML and MR, since we cannot distinguish between the left side and the right side of the discontinuous CIs. As such, we use DP$ =1-\#\left({\gamma}_L\le \widehat{\gamma}\le {\gamma}_U\right)/k $ to represent the proportion of the discontinuous CIs among k replicates. In addition, note that the distribution of a ratio is not necessarily symmetric [30, 31], and the median can be always used to estimate the central tendency of a skewed distribution better than the mean [32]. So, we give the median of the point estimates of γ over k replicates under each simulation scenario. Further, for simulating the power of the LRT, we fix γ₀ at 0, 1 and 2. Finally, we also compare the simulation results under MP1 (consisting of only 700 family trios with both parents) based on the ECM algorithm with those based on the Newton-Raphson algorithm. It is found that the results of the two algorithms are almost consistent with each other (data not shown for brevity). Therefore, we only give the simulation results on the basis of the ECM algorithm in the following section.

Results

Simulation results

Table 3 lists the estimated CP, ML/(ML + MR) and DP of the likelihood-based CI of γ against MP and γ with ρ=0, λ₂=1.5, and (p_m, p_f) being (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25). It is shown in the table that the CP is around 95% under the situations considered. On the other hand, we find that ML/(ML + MR) and DP appear not to be greatly affected by (p_m, p_f). However, the value of γ has strong effect on ML/(ML + MR) and DP. When γ takes values on the boundary (i.e., 0 and 2), ML/(ML + MR) always stays close to 1 and 0, respectively, which indicates extreme imbalance of two tail errors. DP increases as γ gets close to the boundary. Also, the missing pattern has great influence on both ML/(ML + MR) and DP. When the missing pattern varies from MP1 to MP4, where the number of case-parents trios decreases and that of single daughters increases, ML/(ML + MR) becomes more and more far away from 0.5 and DP sharply increases. We also find that under MP5, where the number of mother-daughter pairs is larger than that of father-daughter pairs, ML/(ML + MR) is a little closer to 0.5 and DP becomes smaller compared to those under MP6.

Table 3 Statistical properties of likelihood-based confidence interval of γ against missing pattern (MP) and γ with ρ=0, λ₂=1.5, and (p_m, p_f) being (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25)^a

Full size table

Table 4 shows the corresponding statistical properties of the CI of γ with ρ=0, λ₂=2, and (p_m, p_f) being (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25). As expected, the CP is still controlled well, the ML/(ML + MR) is closer to 0.5 and DP is lower with larger λ₂. We also investigate the effect of ρ=0.05, and the corresponding results are similar to those of ρ=0 (see Additional file 1: Tables S4 and S5). On the other hand, when (p_m, p_f) is set to be (0.20, 0.20), (0.15, 0.20) and (0.20, 0.15), the results are similar to those when (p_m, p_f) being taken as (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25) (see Additional file 1: Tables S6–S9). In addition, the median of the point estimates of γ among k replicates under each simulation scenario is shown in Additional file 1: Figures S1 and S2. From Additional file 1: Figure S1, we can see that the median of $ \widehat{\gamma} $ gets more far away from the true value of γ as the missing pattern varies from MP1 to MP4, and it is always slightly closer to γ under MP5 than that under MP6. The increase of the value of λ₂ also improves the accuracy of the median of $ \widehat{\gamma} $, while the values of ρ, p_m and p_f seem to have no great influence on the median of $ \widehat{\gamma} $.

Table 4 Statistical properties of likelihood-based confidence interval of γ against missing pattern (MP) and γ with ρ=0, λ₂=2, and (p_m, p_f) being (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25)^a

Full size table

The simulated powers of the LRT for testing H₀ : γ = γ₀ with (p_m, p_f) = (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25) are given in Figs. 1, 2, 3, 4. Fig. 1 shows the simulated powers of the LRT against γ under MP1–MP4 with ρ= 0 and λ₂=1.5. From the first row to the third row of the panels in Fig. 1, (p_m, p_f) is set to be (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25), respectively. From the first column to the third column, γ₀ is fixed at 0, 1 and 2, respectively. It is found that the power increases as the value of ∣γ − γ₀∣ gets larger. For example, in Fig. 1(a), when testing for H₀: γ = γ₀ with γ₀= 0 (XCI skewing completely against mutant allele), the power under γ=1.5 (75% cells express mutant allele) is greater than that under γ= 0.5 (25% cells express mutant allele). On the other hand, when the missing pattern changes from MP1 to MP4, the loss in power is always substantial. Also, we compare the corresponding powers under MP5 and MP6 in Fig. 2. The power under MP5 is always higher than that under MP6 when γ ≠ γ₀, which implies that the mother-daughter pairs contain more information on the skewness of XCI than the father-daughter pairs. This is not surprising because when the father’s genotype is missing in a trio, it can be inferred according to the available mother’s and daughter’s genotypes, except for the mother-daughter pair of type MC = 11, whereas we cannot infer the missing mother’s genotypes from any father-daughter pairs. In addition, Figs. 3 and 4 give the simulated powers of the LRT under ρ=0 and λ₂=2 for MP1–MP4 and MP5–MP6, respectively. It is shown that the increase of λ₂ leads to the growth in power (Fig. 3 vs. Fig. 1, Fig. 4 vs. Fig. 2). We also simulate the powers under ρ=0.05 and (p_m, p_f) = (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25), which are similar to those under ρ=0 (see Additional file 1: Figures S3–S6). Finally, the simulated powers under (p_m, p_f) = (0.20, 0.20), (0.15, 0.20) and (0.20, 0.15) are given in Additional file 1: Figures S7–S14, which are always lower than those under (p_m, p_f) = (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25), respectively.

Application to RA data

Rheumatoid arthritis (RA) is an autoimmune disease, which has been reported to be associated with the skewness of XCI [33]. To investigate the XCI skewing patterns at the X-linked loci associated with RA, we apply our proposed method to the data from North American Rheumatoid Arthritis Consortium [29], which is made available from Genetic Analysis Workshop 15 [34]. The dataset includes 757 pedigrees and 293 single nucleotide polymorphism (SNP) markers on X chromosome. In this application, one nuclear family with a typed affected daughter is selected randomly from each pedigree. As such, a total of 703 nuclear families are included, which contains 64 case-parents trios, 179 mother-daughter pairs, 37 father-daughter pairs and 423 single daughters.

Since our proposed method is applicable in the presence of association, we ultimately measure the degree of XCI skewing at five SNPs which have been found to be associated with RA by the XMCPDT method at the significance level of 1% [35]. Notice that the XMCPDT method is conducted based on 246 pedigrees from the RA dataset. We identify the at-risk allele by the value of $ {\widehat{\lambda}}_2 $, and denote the estimates of the frequencies of the at-risk allele in males and females obtained from the ECM algorithm by $ {\widehat{p}}_m $ and $ {\widehat{p}}_f $, respectively. Table 5 lists the p-value of XMCPDT, the values of ($ {\widehat{p}}_m $, $ {\widehat{p}}_f $), $ {\widehat{\lambda}}_2 $ and $ \widehat{\gamma} $, and 95% CI of γ based on the proposed likelihood method for each of five SNPs. From Table 5, we find that there are three SNPs (rs916685, rs1043034 and rs2005463) with the 95% CIs containing the value of γ=1, which indicates the random XCI. On the other hand, the XCI skewing at rs2238907 is found with $ \widehat{\gamma}= $0.35 and the 95% CI being [0, 0.79), which suggests that the skewness of XCI is against the at-risk allele with 17.5% (0.35/2) cells in heterozygous females having the at-risk allele active, while the other 82.5% cells keeping the normal allele active. However, the 95% CI of γ at rs1264064 is [0, 2], providing no information on the XCI skewing pattern. In addition, we evaluate $ \widehat{\gamma} $’s and the 95% CIs of γ at the rest 288 SNPs in the RA dataset, and find that there are 21 SNPs with nonrandom XCI pattern. But note that, if we assume that all of 293 SNPs except for rs2238907 are under H₀: γ = 1, then the corresponding false positive rate would be 0.0719 (21/292), which is still below the upper bound $ 0.05+1.96\kern0.5em \times \kern0.5em \sqrt{0.05\kern0.5em \times \kern0.5em \left(1-0.05\right)/292}=\mathrm{0.0750.} $ Besides, association between these 21 SNPs and RA has not been found by XMCPDT at the 1% significance level, so we should draw conclusions with this caution.

Table 5 Application of proposed method to RA dataset with p-values of XMCPDT less than 1% significance level

Full size table

Discussion

In this article, we propose a statistical measure γ of the degree of the XCI skewing for family trio data, which can be represented as a ratio of two GRRs in females in the presence of association between the disease and genotypes. Further, we obtain the point estimate of γ, which is constructed by the MLEs of two GRRs in females. When there are missing parental genotypes in some family trios, the ECM algorithm is used to estimate the two GRRs. The CI of γ is derived from the likelihood method by inverting the LRT. We conduct the simulation study under various simulation settings, including six missing patterns of families, six groups of allele frequencies, two different values of inbreeding coefficient in females, two different values of λ₂ and five different values of γ. The simulation results show that the proposed likelihood-based CI of γ has an accurate CP under the situations considered, while ML/(ML + MR) and DP of the CI of γ and the median of estimates of γ are influenced by the values of γ, λ₂ and the missing pattern. Similarly, the simulated power of LRT is affected by the values of ∣γ − γ₀∣, λ₂, (p_m, p_f) and the missing pattern. Finally, we apply our proposed method to the RA data from USA and find out a locus, rs2238907, which may undergo the XCI skewing against the at-risk allele.

Many X-linked diseases are always associated with XCI skewing in females. Our proposed statistical measure γ provides information on the potential loci subject to XCI skewing, thus it is helpful to uncover the pathogenesis of X-linked diseases. However, most of the statistical studies on X chromosome today focus mainly on the association tests [17, 24, 25, 36,37,38], so there are no other statistical methods available to measure the skewness of XCI. On the other hand, although the XCI skewing pattern can also be detected by differential methylation between the active and inactive X chromosomes or mRNA transcription in cells, our proposed statistical method takes use of population data to measure the skewness of XCI. Thus, it can reflect the average level of the XCI skewing in female population.

There are some issues in our proposed method. First of all, the original CI is truncated by [0, 2] to enhance the interpretability of the CI. However, when the whole original CI lies outside [0, 2], the CI ultimately obtained after truncation would be empty. Although it is hard to interpret this kind of CI containing no values, the simulation results show that when γ takes values on the boundary (i.e., 0 and 2), these empty CIs seldom occur. For example, when γ = 0, two tail errors (ML and MR) are extremely imbalance with ML/(ML + MR) being 1 or close to 1. This means that there are no or very few original CIs whose upper limit is below 0. Likewise, ML/(ML + MR) is 0 or close to 0 when γ = 2, which implies that there are no or very few original CIs whose lower limit is beyond 2. On the other hand, the proposed likelihood method has its own drawback in deriving the CI of a ratio like any other ratio estimation methods. We find that the likelihood-based CI of γ may consist of two disjoint intervals, such as [0, 0.03) ∪ (0.59, 2], and it is also difficult for us to interpret. For example, if $ \widehat{\gamma}=1.92 $ and the CI of γ is [0, 0.03) ∪ (0.59, 2], then the corresponding LRT would reject the null hypothesis of γ₀ = 0.5, and accept that of γ₀ = 0.01. It is hard to explain that the LRT rejects a γ₀ being close to $ \widehat{\gamma} $, while accepts one being far away from $ \widehat{\gamma} $. Although this kind of CI is undesirable, it is also inevitable and can be regarded as a hint of λ₂ being close to 1 [39]. In addition, it should be noted that the ECM algorithm is not applicable when all the family trios are “single daughters”, since the MLE of θ may not be uniquely specified under this situation (the details see Additional file 1: Appendix C). However, if the other family trios were collected, then the single daughters can make contribution to the MLE of θ in the ECM algorithm together with these trio data (the details see Additional file 1: Appendix D). Finally, we assume that the genotypes’ frequencies in males and females (p_m, g₀ and g₁) are unknown and estimate them together with λ₁ and λ₂ in our simulation study and real data application. Alternatively, if we can obtain information on the allele frequencies from the online databases, such as the Allele Frequency Net Database [40] and the UCSC Genome Browser Database [41], then it is unnecessary to re-estimate p_m, g₀ and g₁, which will reduce the number of parameters so that the ECM algorithm runs faster.

Note that the ECM algorithm can converge to a local maximum of the log-likelihood function instead of a global maximum [23]. To investigate this, we randomly choose 1000 initial values of θ (θ₀) from the parameter space and regard the MLE of θ (θ₀) with the maximum log-likelihood among 1000 $ \ln L\left(\widehat{\theta}\right) $'s ($ \ln L\left({\overset{\sim }{\theta}}_0\right) $'s) as the global MLE of θ (θ₀). We conduct a simulation study under the simulation settings with ρ=0, λ₂= 1.5 and (p_m, p_f) = (0.30, 0.30), and the details see Additional file 1: Appendix E. The simulation results (see Additional file 1: Tables S10 and S11) show that the values of $ \widehat{\theta} $ and $ \ln L\left(\widehat{\theta}\right) $ ($ {\overset{\sim }{\theta}}_0 $ and $ \ln L\left({\overset{\sim }{\theta}}_0\right) $) based on one initial value estimated by the method described in Additional file 1: Appendix B are very close to those based on 1000 initial values under all the simulated situations when N₂ (the number of complete family trios) is not too small, such as MP1 and MP2, which may indicate that the ECM algorithm based on the estimated initial value converges towards the global maximum. As for MP3-MP6, except that $ {\overset{\sim }{\theta}}_0 $ with (γ₀, γ) = (1, 2) under MP5 and MP6, $ {\overset{\sim }{\theta}}_0 $ with (γ₀, γ) = (1, 1) and (1, 2) under MP3, and $ {\overset{\sim }{\theta}}_0 $ with (γ₀, γ) = (1, 1), (1, 1.5) and (1, 2) under MP4 may converge to a local maximum, all the other $ \widehat{\theta} $ and $ {\tilde{\theta}}_0 $ results converge to the global maximum. Further, for these seven cases, we try and randomly select ten groups of initial values of θ₀ from the parameter space and regard $ {\tilde{\theta}}_0 $ with the maximum log-likelihood among ten $ \ln L\left({\tilde{\theta}}_0\right) $'s as the final MLE of θ₀. We find that $ {\tilde{\theta}}_0 $’s based on ten and 1000 initial values are very close to each other under all the seven simulated situations (see Additional file 1: Table S11). So, if N₂ is zero or too small, we recommend using multiple initial values (such as ten) for obtaining the global MLE of θ₀. On the other hand, family trios with both parents are always fortunately collected in the family-based studies in practice.

In future studies, we will extend our proposed method to incorporate covariates by using nuclear families with affected and unaffected offspring. Furthermore, to facilitate the interpretability of the CI of γ, we will utilize the prior information, such as the order of the GRRs in females and the information of the presence of association.

Conclusions

The proposed statistical measure for the skewness of XCI is applicable for complete family trio data or family trio data with some paternal genotypes missing. The likelihood-based CI has an accurate CP under the situations considered. Therefore, our proposed statistical measure is generally recommended in practice for discovering the potential loci which undergo the XCI skewing.

Abbreviations

CI:: Confidence interval
CP:: Coverage probability
DP:: Proportion of the discontinuous CIs
ECM:: Expectation-conditional- maximization
GRR:: Genotypic relative risk
LRT:: Likelihood ratio test
ML:: Left tail error
MLE:: Maximum likelihood estimate
MP:: Missing patterns
MR:: Right tail error
RA:: Rheumatoid arthritis
SNP:: Single nucleotide polymorphism
XCI:: X chromosome inactivation
XMCPDT:: Monte Carlo pedigree disequilibrium test on X chromosome

References

Postma DS. Gender differences in asthma development and progression. Gend Med 2007;4 Suppl B: S133–46.
Article PubMed Google Scholar
Invernizzi P, Pasini S, Selmi C, Gershwin ME, Podda M. Female predominance and X chromosome defects in autoimmune diseases. J Autoimmun. 2009;33:12–6.
Article CAS PubMed Google Scholar
Dorak MT, Karpuzoglu E. Gender differences in cancer susceptibility: an inadequately addressed issue. Front Genet. 2012;3:268.
Article PubMed PubMed Central Google Scholar
Qureshi IA, Mehler MF. Genetic and epigenetic underpinnings of sex differences in the brain and in neurological and psychiatric disease susceptibility. Prog Brain Res. 2010;186:77–95.
Article CAS PubMed PubMed Central Google Scholar
Skuse DH. Imprinting, the X-chromosome, and the male brain: explaining sex differences in the liability to autism. Pediatr Res. 2000;47:9–16.
Article CAS PubMed Google Scholar
Chow JC, Yen Z, Ziesche SM, Brown CJ. Silencing of the mammalian X chromosome. Annu Rev Genomics Hum Genet. 2005;6:69–92.
Article CAS PubMed Google Scholar
Carrel L, Willard HF. X-inactivation profile reveals extensive variability in X-linked gene expression in females. Nature. 2005;434:400–4.
Article CAS PubMed Google Scholar
Minks J, Robinson WP, Brown CJ. A skewed view of X chromosome inactivation. J Clin Invest. 2008;118:20–3.
Article CAS PubMed Google Scholar
Van den Veyver IB. Skewed X inactivation in X-linked disorders. Semin Reprod Med. 2001;19:183–91.
Article CAS PubMed Google Scholar
Brown CJ. Skewed X-chromosome inactivation: cause or consequence? J Natl Cancer Inst. 1999;91:304–5.
Article CAS PubMed Google Scholar
Plenge RM, Stevenson RA, Lubs HA, Schwartz CE, Willard HF. Skewed X-chromosome is a common feature of X-linked mental retardation disorders. Am J Hum Genet. 2002;71:168–73.
Article CAS PubMed PubMed Central Google Scholar
Prchal JT, Carroll AJ, Prchal JF, Crist WM, Skalka HW, Gealy WJ, Harley J, Malluh A. Wiskott-Aldrich syndrome: cellular impairments and their implication for carrier detection. Blood. 1980;56:1048–54.
CAS PubMed Google Scholar
Conley ME, Lavoie A, Briggs C, Brown P, Guerra C, Puck JM. Nonrandom X chromosome inactivation in B cells from carriers of X chromosome-linked severe combined immunodeficiency. Proc Natl Acad Sci U S A. 1988;85:3090–4.
Article CAS PubMed PubMed Central Google Scholar
Salsano E, Tabano S, Sirchia SM, Colapietro P, Castellotti B, Gellera C, Rimoldi M, Pensato V, Mariotti C, Pareyson D, Miozzo M, Uziel G. Preferential expression of mutant ABCD1 allele is common in adrenoleukodystrophy female carriers but unrelated to clinical symptoms. Orphanet J Rare Dis. 2012;7:10.
Article PubMed PubMed Central Google Scholar
Kristiansen M, Langerød A, Knudsen GP, Weber BL, Børresen-Dale AL, ørstavik KH. High frequency of skewed X inactivation in young breast cancer patients. J Med Genet. 2002;39:30–3.
Article CAS PubMed PubMed Central Google Scholar
Clayton D. Testing for association on the X chromosome. Biostatistics. 2008;9:593–600.
Article PubMed PubMed Central Google Scholar
Wang J, Yu R, Shete S. X-chromosome genetic association test accounting for X-inactivation, skewed X-inactivation, and escape from X-inactivation. Genet Epidemiol. 2014;38:483–93.
Article CAS PubMed PubMed Central Google Scholar
Kubota T. A new assay for the analysis of X-chromosome inactivation in carriers with an X-linked disease. Brain Dev. 2001;23(Suppl 1):S177–81.
Article PubMed Google Scholar
Busque L, Zhu J, DeHart D, Griffith B, Willman C, Carroll R, Black PM, Gilliland DG. An expression based clonality assay at the human androgen receptor locus (HUMARA) on chromosome X. Nucleic Acids Res. 1994;22:697–8.
Article CAS PubMed PubMed Central Google Scholar
Szelinger S, Malenica I, Corneveaux JJ, Siniard AL, Kurdoglu AA, Ramsey KM, Schrauwen I, Trent JM, Narayanan V, Huentelman MJ, Craig DW. Characterization of X chromosome inactivation using integrated analysis of whole-exome and mRNA sequencing. PLoS One. 2014;9:e113036.
Article CAS PubMed PubMed Central Google Scholar
Carrel L, Willard HF. Heterogeneous gene expression from the inactive X chromosome: an X-linked gene that escapes X inactivation in some human cell lines but is inactivated in others. Proc Natl Acad Sci U S A. 1999;96:7364–9.
Article CAS PubMed PubMed Central Google Scholar
Cotton AM, Lam L, Affleck JG, Wilson IM, Peñaherrera MS, McFadden DE, Kobor MS, Lam WL, Robinson WP, Brown CJ. Chromosome-wide DNA methylation analysis predicts human tissue-specific X inactivation. Hum Genet. 2011;130:187–201.
Article CAS PubMed PubMed Central Google Scholar
Meng XL, Rubin DB. Maximum likelihood estimation via the ECM algorithm: a general framework. Biometrika. 1993;80:267–78.
Article Google Scholar
Chen Z, Ng HKT, Li J, Liu Q, Huang H. Detecting associated single-nucleotide polymorphisms on the X chromosome in case control genome-wide association studies. Stat Methods Med Res. 2017;26:567–82.
Article PubMed Google Scholar
Wang P, Xu SQ, Wang BQ, Fung WK, Zhou JY. A robust and powerful test for case–control genetic association study on X chromosome. Stat Methods Med Res. 2018. https://doi.org/10.1177/0962280218799532.
Team RC. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing. 2013. http://www.r-project.org. 2018.
Liseo B. Bayesian and conditional frequentist analyses of the Fieller’s problem. A critical review Metron 2003;LXI:133–50.
Hwang JTG. Fieller’s problems and resampling techniques. Stat Sinica. 1995;5:161–71.
Google Scholar
Amos CI, Chen WV, Remmers E, Siminovitch KA, Seldin MF, Criswell LA, Lee AT, John S, Shephard ND, Worthington J, Cornelis F, Plenge RM, Begovich AB, Dyer TD, Kastner DL, Gregersen PK. Data for genetic analysis workshop (GAW) 15 problem 2, genetic causes of rheumatoid arthritis and associated traits. BMC Proc. 2007;1(Suppl 1):S3.
Article PubMed PubMed Central Google Scholar
Marsaglia G. Ratios of normal variables and ratios of sums of uniform variables. J Am Stat Assoc. 1965;60:193–204.
Article Google Scholar
Diaz-Francés E, Rubio FJ. On the existence of a normal approximation to the distribution of the ratio of two independent normal random variables. Stat Pap. 2013;54:309–23.
Article Google Scholar
Wilcox RR, Keselman HJ. Modern robust data analysis methods: measures of central tendency. Psychol Methods. 2003;8:254–74.
Article PubMed Google Scholar
Chabchoub G, Uz E, Maalej A, Mustafa CA, Rebai A, Mnif M, Bahloul Z, Farid NR, Ozcelik T, Ayadi H. Analysis of skewed X-chromosome inactivation in females with rheumatoid arthritis and autoimmune thyroid diseases. Arthritis Res Ther. 2009;11:R106.
Article CAS PubMed PubMed Central Google Scholar
Genetic Analysis Workshop. 1982. https://www.gaworkshop.org. Accessed 3 Oct 2018.
Zou QL, You XP, Li JL, Fung WK, Zhou JY. A powerful parent-of-origin effects test for qualitative traits on X chromosome in general pedigrees. BMC Bioinformatics. 2018;19:8.
Article PubMed PubMed Central Google Scholar
Hickey PF, Bahlo M. X chromosome association testing in genome wide association studies. Genet Epidemiol. 2011;35:664–70.
Article PubMed Google Scholar
Ma L, Hoffman G, Keinan A. X-inactivation informs variance-based testing for X-linked association of a quantitative trait. BMC Genomics. 2015;16:241.
Article PubMed PubMed Central Google Scholar
Choi S, Lee S, Qiao D, Hardin M, Cho MH, Silverman EK, Park T, Won S. FARVATX: family-based rare variant association test for X-linked genes. Genet Epidemiol. 2016;40:475–85.
Article PubMed PubMed Central Google Scholar
Gleser LJ, Hwang JT. The nonexistence of 100(1-α)% confidence sets of finite expected diameter in errors-in-variables and related models. Ann Stat. 1987;15:1351–62.
Article Google Scholar
Allele Frequency Net Database. 2015. http://www.allelefrequencies.net/. Accessed 3 Oct 2018.
UCSC Genome browser database. 2000. http://genome.ucsc.edu/. Accessed 3 Oct 2018.

Download references

Acknowledgments

The authors thank the editor and three reviewers for helpful comments that greatly improve the presentation of the article.

Funding

This study was supported by the National Natural Science Foundation of China grants 81,773,544, 81,373,098 and 81,573,207 and the National and Guangzhou University Students’ Innovation and Enterprise Training Project of China grant 201,612,121,017. The authors thank the Genetic Analysis Workshops for providing the RA data, which were supported by the National Institutes of Health grant R01 GM031575. The RA data were gathered with the support of grants from the National Institutes of Health grants N01-AR-2-2263 and R01-AR-44422, and the National Arthritis Foundation.

Availability of data and materials

The dataset supporting the conclusions of this article is from North American Rheumatoid Arthritis Consortium, which is made available from Genetic Analysis Workshop 15 (http://www.gaworkshop.org/).

Author information

Authors and Affiliations

State Key Laboratory of Organ Failure Research, Ministry of Education and Guangdong Provincial Key Laboratory of Tropical Disease Research, Department of Biostatistics, School of Public Health, Southern Medical University, Guangzhou, China
Si-Qi Xu, Yu Zhang, Peng Wang, Wei Liu & Ji-Yuan Zhou
Guangdong Provincial Key Laboratory of Tropical Disease Research, Department of Epidemiology, School of Public Health, Southern Medical University, Guangzhou, China
Xian-Bo Wu

Authors

Si-Qi Xu
View author publications
You can also search for this author in PubMed Google Scholar
Yu Zhang
View author publications
You can also search for this author in PubMed Google Scholar
Peng Wang
View author publications
You can also search for this author in PubMed Google Scholar
Wei Liu
View author publications
You can also search for this author in PubMed Google Scholar
Xian-Bo Wu
View author publications
You can also search for this author in PubMed Google Scholar
Ji-Yuan Zhou
View author publications
You can also search for this author in PubMed Google Scholar

Contributions

SQX, YZ, PW, WL, XBW and JYZ all contributed to the study design, analytical preparation and the writing of the manuscript. SQX and YZ performed the simulation studies. PW, WL, XBW and JYZ analyzed the data and revised the manuscript. All authors have read and approved the final manuscript.

Corresponding authors

Correspondence to Xian-Bo Wu or Ji-Yuan Zhou.

Ethics declarations

Ethics approval and consent to participate

Not applicable.

Consent for publication

Not applicable.

Competing interests

The authors declare that they have no competing interests.

Publisher’s Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Additional file

Additional file 1:

Appendix A. Derivation of P(FMC| D) in Table 1. Appendix B. Choice of initial value of θ (θ₀) and MLE of θ (θ₀) using family trios with missing parental genotypes. Appendix C. Inapplicability of ECM algorithm when using only single daughters. Appendix D. Contribution of single daughters to estimate of θ in ECM algorithm. Appendix E. Effect of different initial values of θ (θ₀) on ECM algorithm. Tables S1–S3. The conditional probabilities and conditional expectations for seven types of possible mother-daughter pairs, four types of possible father-daughter pairs and three types of possible single daughters, respectively. Tables S4–S5. Statistical properties of likelihood-based confidence interval of γ against missing pattern (MP) and γ with ρ=0.05, (p_m, p_f) = (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25), and λ₂=1.5 and 2, respectively. Tables S6–S9. Statistical properties of likelihood-based confidence interval of γ against missing pattern (MP) and γ with (p_m, p_f)= (0.20, 0.20), (0.15, 0.20) and (0.20, 0.15), ρ=0 and 0.05, and λ₂=1.5 and 2, respectively. Table S10. Averages of absolute differences of each element of $ \widehat{\theta} $ and $ \ln L\left(\widehat{\theta}\right) $ between ECM₁ and ECM₁₀₀₀ with ρ=0, λ₂= 1.5 and (p_m, p_f) = (0.30, 0.30) under MP1-MP6. Table S11. Averages of absolute differences of each element of $ {\tilde{\theta}}_0 $ and $ \ln L\left({\tilde{\theta}}_0\right) $ between ECM₁/ECM₁₀ and ECM₁₀₀₀ with ρ=0, λ₂= 1.5 and (p_m, p_f) = (0.30, 0.30) under MP1-MP6. Figures S1–S2. Medians of point estimates of γ against MP for different p_m, p_f and λ₂ values with ρ=0 and 0.05, respectively. Figures S3–S6. Estimated powers of LRT against γ with ρ=0.05 and (p_m, p_f) being (0.30, 0.30), (0.25, 0.30) and (0.30, 0.25) under MP1–MP4 and MP5–MP6, and λ₂=1.5 and 2, respectively. Figures S7–S14. Estimated powers of LRT against γ with (p_m, p_f) being (0.20, 0.20), (0.15, 0.20) and (0.20, 0.15) under MP1–MP4 and MP5–MP6, ρ=0 and 0.05, and λ₂=1.5 and 2, respectively. (PDF 2328 kb)

Rights and permissions

Open Access This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Reprints and permissions

About this article

Cite this article

Xu, SQ., Zhang, Y., Wang, P. et al. A statistical measure for the skewness of X chromosome inactivation based on family trios. BMC Genet 19, 109 (2018). https://doi.org/10.1186/s12863-018-0694-8

Download citation

Received: 12 June 2018
Accepted: 08 November 2018
Published: 05 December 2018
DOI: https://doi.org/10.1186/s12863-018-0694-8

A statistical measure for the skewness of X chromosome inactivation based on family trios

Abstract

Background

Results

Conclusions

Background

Methods

Notations

Relationship between penetrances and XCI skewing in females

MLEs of λ 1 and λ 2 using family trios without missing genotypes

MLEs of λ 1 and λ 2 using family trios with parental genotypes missing

Confidence interval of γ based on likelihood method

Discontinuity problem of confidence interval of γ

Simulation settings

Results

Simulation results

Application to RA data

Discussion

Conclusions

Abbreviations

References

Acknowledgments

Funding

Availability of data and materials

Author information

Authors and Affiliations

Contributions

Corresponding authors

Ethics declarations

Ethics approval and consent to participate

Consent for publication

Competing interests

Publisher’s Note

Additional file

Additional file 1:

Rights and permissions

About this article

Cite this article

Share this article

Keywords

BMC Genomic Data

Contact us

MLEs of λ ₁ and λ ₂ using family trios without missing genotypes

MLEs of λ ₁ and λ ₂ using family trios with parental genotypes missing