Genetic Analysis Workshop 13: simulated longitudinal data on families for a system of oligogenic traits.

The Genetic Analysis Workshop 13 simulated data aimed to mimic the major features of the real Framingham Heart Study data that formed Problem 1, but under a known inheritance model and with 100 replicates, so as to allow evaluation of the statistical properties of various methods. The pedigrees used were the 330 real pedigree structures (comprising 4692 individuals) with some minor changes to protect confidentiality. Fifty trait genes and 399 microsatellite markers were simulated by gene dropping on 22 autosomal chromosomes. Assuming random ascertainment of families, a system of eight longitudinal quantitative traits (designed to be similar to those in the real data) was generated with a wide range of heritabilities, including some pleiotropic and interactive effects. Genes could affect either the baseline level or the rate of change of the phenotype. Hypertension diagnosis and treatment were simulated with treatment availability, compliance, and efficacy depending on calendar year. Nongenetic traits of smoking and alcohol were generated as covariates for other traits. Death was simulated as a hazard rate depending upon age, sex, smoking, cholesterol, and systolic blood pressure. After the complete data were simulated, missing data indicators were generated based on logistic models fitted to the real data, involving the subject's history of previous missing values, together with that of their spouses, parents, siblings, and offspring, as well as marital status, only-child indicators, current value at certain simulated traits, and the data collection pattern on the cohort into which each subject was ascertained.


Background
Our goal in simulating data for Genetic Analysis Workshop 13 (GAW13) was to provide a data set with the basic features of the real data [1], a set of families from the Framingham Heart Study (FHS) [2], but under a known "true" inheritance model. The Framingham study has a number of unique features, but those we focused on replicating in our simulated set were the longitudinal collection over many years of several related traits on a large set of pedigrees and the availability of a complete genome screen with microsatellite markers. There has been a rapidly growing statistical literature on the analysis of dependent data, including longitudinal data, but seldom have genetic analyses addressed simultaneously the complexities of dependencies both within individuals over time and between individuals within pedigrees. Longitudinal data pose additional challenges with potentially informative missingness. This simulated set allows studies of false-positive rates and power for methods that might be applicable to the real data. It was our intention to encourage comparisons between results from the real and simulated sets, in the hope that some groups would find both sets useful in developing new methods.
To facilitate the use of both real and simulated data together, the simulated data set contains variables with the same names and in the same format as the real data. As with the real data, the simulated data contains measures of height (HT), weight (WT), high density lipoprotein (HDL), total cholesterol (CHOL), triglycerides (TG), glucose (GLUC), systolic blood pressure (SBP), hypertension diagnosis and treatment (T), cigarettes smoked per day (SMK), and quantity of alcohol consumed per week (DRINK). These variables were simulated longitudinally on two cohorts drawn from 330 pedigrees containing 4692 individuals, with data collection on each cohort starting about 30 years apart. The first cohort was examined 21 times at 2-year intervals, while the second was examined 5 times with an 8-year interval between the first two exams and 4-year intervals between subsequent exams. A missing data pattern was simulated to mimic that seen in the real data. To avoid any potential confusion with the real data, the placement of some individuals within some pedigrees was changed and all the sexes were randomized.
Underlying the phenotype simulation, we simulated 449 genetic loci on 22 autosomal chromosomes via random gene drop. These included 399 microsatellite markers and 50 trait loci. We used a sex-specific map -another first for a GAW simulation -and the allele frequencies of the markers provided for the Framingham Heart Study data. The trait loci were randomly placed, but some chromosomes were excluded from having loci placed on them, so false-positive rates could be assessed. The 50 trait genes fed into a complex model ( Figure 1), with some genes affecting the "baseline" trait value, and others affecting change in the trait over time. Some genes directly affect only one trait; others affect several. Some effects of these trait loci are large and easy to detect, some are smaller and more difficult to detect, and some are so small we expect them to be impossible to detect in a single replicate. We included genes of miniscule effect both to add a degree of realism to the simulation and in the hope that our expectation will be proven wrong.
Despite the complexity in this model, we are under no illusion that we met the impossible goal of exactly modelling the unknown biological mechanisms underlying these traits. Our primary concern was to provide a data set with a variety of different types of effects. These effects are designed to help us understand what types of genetic effects can and cannot be detected in a real data set like that collected in the Framingham study. In deciding what effects to include, we gave consideration to the correlations in the real data and the advice of biologists. We placed some limits on the types of models we considered: for reasons of limited human and computer time, we excluded any models that contained feedback loops, which almost certainly exist between some of these traits in reality: it is possible that increasing cholesterol levels contributes to weight gain, which in turn contributes to higher cholesterol levels, but this type of interaction was not included. Similarly, we only allowed the genetic effects to interact with each other and the environmental effects additively or multiplicatively: we felt it was more important to focus our time on the longitudinal aspects of the model. This simulated data set is designed to aid in the testing of methods, not to illustrate the workings of the human body.
Although our simulation is far less complex than reality, there is still much complexity in the model. We have interacting environmental variables: smoking is affected by parental smoking and birth year, while drinking is affected by smoking and birth year. Both the criteria for hypertension treatment and the treatment itself change with time. There are both direct sex effects and sex-moderated effects. The 50 loci we included represent nearly an order of magnitude more than have been included in previous GAW simulations. Of the eight simulated traits that were directly affected by trait loci, weight had the smallest number directly affecting it, with three, and glucose the most, with 16. Some trait loci had direct effect on two or three traits. When indirect effects are considered as well, the picture is even more complex: all 50 trait loci have an effect, direct or indirect, on hypertension diagnosis and DBP. We expect that this data set contains far more complexity than can be detected, even if all 100 simulated replicates of the data are analyzed simultaneously. In this respect, we believe that this simulated set does mirror reality.

Methods
For this simulation, we broke the process into several parts. First, we constructed pedigrees based on those in the Framingham study, along with simulated birth years. This was followed by simulation of genotypes via random gene drop. These genotypes and the pedigrees were then fed into a program to generate the longitudinal data. Finally, a missing data program was run to determine which variable values were observed, including whether each individual was a member of Cohort 1, Cohort 2, or neither.

Pedigree structures
The pedigrees and individual birth years were the same for all 100 replicates of the simulated data. We took the pedigrees provided by the Framingham Heart Study for GAW and randomized the sexes. In an effort to further obscure any connection between the real and simulated pedigrees, in a few of the larger pedigrees, we deleted an individual from one sibship and added an individual to another sibship within the same pedigree, thus maintaining the same sample size. The sample contained 4692 individuals in 330 families containing 7 to 84 people. Two of the families contained marriage loops. Because the ascertainment and missing data part of the simulation contained a random component, the number of individuals genotyped and phenotyped differed slightly between replicates: in the first replicate, which was the one GAW participants using only a single replicate were advised to use, 1720 individuals were genotyped and 2860 had some phenotype information available. We approximated the ages in the real data by assuming the first visit for the "original" cohort occurred in 1948 and the first visit for the "offspring" cohort occurred in 1972. We used these approximate ages to obtain distributions for age differences between mother and first child for sibships of different sizes, between set of sibs for sibships of different sizes, and for husband-wife age differences. We discarded some outliers from these distributions and then used them to randomly assign ages to the pedigrees by selecting a reference individual for each pedigree in a sibship without offspring, assigning that individual a birth year between 1937 and 1942 (uniform distribution), and spanning the pedigree by drawing randomly, with replacement, from the age difference distributions. If all individuals in a pedigree were born after 1952, or at least one individual was born after 1935 in a sibship without offspring, then random age assignment was repeated until obtaining one Diagram of relationships between simulated traits and genes Figure 1 Diagram of relationships between simulated traits and genes. Arrows indicate causal relationships between traits. Most correlations are positive, but a "-" indicates a negative correlation. An "*" and trait name next to an arrow indicates that the relationship is mediated by the named trait.
that met the requirements. Four pedigrees failed to have ages assigned after 500 iterations of the process, so we took one of the random age assignments for each of these four pedigrees and modified it so that it passed the rejec-  tion criteria. It was because of the complexity we encountered in this age assignment process and the limited time available that we decided to use the same ages for all 100 replicates.

Marker and trait genotypes
The genotype simulation was done with the Genedrop program in the PANGAEA: Morgan package http:www.stat.washington.edu/thompson/Genepi/pan gaea.shtml. The simulated markers are based on those available in the real data, which was genotyped using the Marshfield mapping panel 8A. We used the sex-specific Haldane map distances obtained by converting the maker positions found at the Marshfield web site http:// www.marshfieldclinic.org/research/genetics/ from Kosambi to Haldane centimorgans. The allele frequencies used in the simulation were those provided to us by the Framingham Heart Study group. We assumed Hardy-Weinberg and linkage equilibrium between all loci, both trait and marker. The 50 simulated trait genes were randomly placed on the odd-numbered chromosomes (Table  1), leaving the even-numbered chromosomes available for false-positive studies. Initially, we simulated the trait genes with 20 equally frequent alleles to give us flexibility in assigning those genes to different traits with different allele frequencies. These 20 alleles were reduced to two, three, or four alleles when we generated the trait values.
To generate the longitudinal data, we simulated data points for each individual in every year, from the time they were 20 until they reach 100. In all the trait models we used here, we restricted genes to be one of two types: "baseline" genes have a constant absolute effect on the trait value over time, while "slope" genes have a changing effect over time. In terms of percentage variance contribution to the trait, baseline genes will tend to have a decreas-ing effect with time, while slope genes will have an increasing effect. The numerical effects of each genotype at each locus are given in Table 2. Note that the variance given for each locus is the absolute variance for that locus rather than the proportional variance contribution: we include these numbers as a guide for judging the relative importance of each locus within a trait. Although the trait model itself is complex (Figure 1), it has a flow to it, and we cover the simulation of the trait values in that order.
The genotype simulation did not allow for any pedigree or genotyping errors and treated the map function as correctly specified. Likewise in what follows, no errors were simulated in the phenotype data, other than missing data.

Phenotypes
Height Height (HT) is a "simple oligogenic" trait, in that all effects are additive and there is no time dependence other than random noise. The two sexes s = (m, f) were given different means and variances: we used the values from the real data: µ HT (m) = 68.13 in, µ HT (f) = 62.57 in, σ HT (m) = 3.02 in, σ HT (f) = 2.70 in. Ten loci contribute to this trait, G b1 ,...,G b10 , with contributions to the sex-specific variance of 40%, 20%, 10%, 5%, 2%, 2%, 2%, 1%, 1%, and 1%. In addition, a random environmental effect (constant over age) r HT contributes 14% of the sex-specific variance and an age-specific random effect at age a ("measurement error"), ε HT (a), contributes 2% (both normally distributed). The formula used was: where g bi is the effect of each individual's genotype at locus G bi (Table 2).

Weight
We chose to make weight (WT) strongly dependent on height via a BMI-like relationship, i.e., proportional to height in meters squared. Weight was simulated in pounds and height in inches, so some unit conversion was required. Since a substantial number of studies have reported localizations of genes influencing BMI in real data, we wanted to provide a simulated data set in which analysis of the simulated BMI could reasonably be expected to localize some of the simulated trait loci. All the loci that contribute to variation in height contribute indirectly to variation in weight through the height variable. In addition, one locus contributes to baseline weight, G b11 , and two contribute to a logarithmic change in weight with age, G s1 and G s2 .

Smoking and drinking
Both smoking and drinking were simulated as "environmental" variables (i.e., none of the simulated genes had a direct effect on either variable). We used patterns observed in the real data and from Johnson and Gerstein [3] as guides in simulating these variables. We found that the frequencies from Johnson and Gerstein did not produce the smoking and drinking rates observed in the real data, so we followed the trends in that paper while increasing the rates. We assumed that children of smokers were twice as likely to smoke as children of non-smokers and that smokers were 10% more likely to drink than non-smokers, to produce a sex by cohort probability table ( Table 3). The probability of smoking depends on parental smoking. The probability of being a drinker depends on whether an individual is a smoker. We sampled from real-sample sex-specific distributions of drinks and cigarettes to get the simulated quantities. For drinkers, this sampled value was modified by height, so the quantity consumed has a genetic component of variation through height, but whether or not someone drinks does not. We chose to make the quantity consumed dependent on height, based on observing such a correlation in the real data. The correlations with BMI and weight were not as strong. Thus: where d r is the drink quantity (in grams per day), sampled from the sex-specific distribution, HT is height (without the measurement error term), and µ HT (s) is the sex-specific mean height. Quantities for both smoking (SMK) and drinking were fixed over time, but smokers have a probability (a -20)/1000 of quitting each year, so SMK(a) denotes the current number of cigarettes smoked per day. We also calculated pack-years (PY), assuming all smokers started at age 18.

Lipids
High density lipoprotein (HDL), triglycerides (TG), and glucose (GLUC) have eight baseline genes in common, G b12 ,...,G b19 , and TG and GLUC have four slope genes, G s3 ,...,G s6 , in common as well. In addition, three baseline genes (G b20 ,...,G b22 ) only contribute directly to HDL, three (G b23 ,...,G b25 ) only contribute to TG, and four (G b26 ,...,G b29 ) only contribute to GLUC. HDL and TG have direct sex effects, and, in addition, the effects of G s2 ,...,G s6 on TG are mediated by sex via a menopause-like effect in women. All three were simulated to correspond to the measurements in units of mg/dl found in the real data.
Weight has an effect on TG and GLUC, drinking on HDL and TG, and smoking on HDL. HDL and GLUC also have a direct effect on TG. All these effects were consistent with correlations seen in the real data. The equations used were: where µ HDL (s) = 41 mg/dl for men and 53 mg/dl for women, σ HDL (s) = 11 mg/dl for men and 12 mg/dl for women, is the baseline DRINK effect on HDL, where WT(a) is weight with no error, is the baseline SMK(a) effect on HDL, and r HDL and ε HDL (a) are independent fixed (mean 0, variance 0.1) and age-varying normal deviates, respectively. The genetic effects on HDL, g Hbi , are given in Table 2. While HDL has no slope term, the presence of weight in the drinking term and people quitting smoking will cause some age dependence.
Both GLUC and TG have exponential slope terms, but TG also has a "menopause effect" in the slope term.
where is the glucose baseline term, with g Gbi the effect of the genotype at locus G bi on GLUC (Table 2), WT(a) the weight without error, µ WT,s the sex-specific mean weight (176.03 pounds for men, 144.12 for women), and r GLUC,1 ~ N(0, √0.1), is the glucose slope term with g Gsi the effect of the genotype at locus G si on GLUC and r GLUC,2 ~ N(0, √0.003). The glucose slope term can be positive or negative and is small in absolute value.
Triglycerides had baseline levels depending on DRINK, HDL, GLUC, and WT, together with 11 genes, and a slope that depended upon four genes, modified in females by a complicated function of age and age at menopause, which was also random: where µ(s) = 75 mg/dl or 65 mg/dl and σ(s) = 25 mg/dl or 20 mg/dl for males or females, respectively, g Tbi is the effect of the genotype at locus G bi on TG, with µ WT (s) = 160 pounds for males or 135 pounds for females and ε TG (a) ~ N(0, √0.1), and S TG (G Tsi ,a) is a sexspecific slope term as follows: Both systolic and diastolic blood pressures were used in determining hypertension diagnosis and treatment. However, only SBP was provided with the real data and the simulated replicates, so, consequently, we kept the model for DBP simpler. Hypertension treatment and diagnosis were done separately. Hypertension was diagnosed if SBP > 140 mm Hg or DBP > 90 mm Hg for 2 years in a row, and once the diagnosis was made, it stuck. Hypertension thresholds remained constant, but both the thresholds for treatment and the efficacy of the treatments available changed with time. Before 1960 no treatment was available. From 1960 through 1975, there is a 50% chance (prescription + compliance) of drug treatment if DBP rises above 95. After the initial year of eligibility, there is a 90% chance that an individual remains in the same treatment class. This first treatment lowers SBP by 10 + N(0, √2) and DBP by 7 + N(0, √1.5). Between 1976 and 1985, this drug treatment remains available and, in addition, we add an "exercise" treatment which those with DBP > 90 have a 50% chance of getting and a 90% chance of remaining in the same treatment class after the initial year of eligibility. The "exercise" treatment lowers SBP by 5 + N(0,1) and DBP by 3 + N(0, √0.5). In 1986 and after, the first drug treatment is replaced by a treatment with a 75% chance of being administered to those with SBP > 150 or DBP > 95. After the initial year of eligibility, there is a 95% chance of remaining in the treated class, and those untreated have a 50% chance of switching to the treated class. This treatment lowers SBP by 20 + N(0, √2) and DBP by 12 + N(0, √1.5).

Death
Each person had an exponentially increasing probability of dying each year: where α = 5 × 10 -7 , PY is pack-years, and R(s) is a sex-specific risk factor of 200 if male and 230 if female. However, we placed the restriction that no person could die before their youngest child was born. If an individual was simulated with an age of death before the birth of their youngest child, we increased the age of death to their age when their youngest child was born.

Missing data simulation
The missing data pattern in the simulated data set was simulated to resemble the missing data pattern in the real data. In the real data, each visit had a planned subset of the measurements that were to be taken. For each visit in the simulated data, only the planned measurements for that visit are included. A visit is considered to be complete if all the planned measurements were taken.
In the real data, for each visit a subject may have none, some, or all of the planned measurements missing. Only visits with all the planned measurements missing were used to determine the missing data patterns for the simulated data sets. In the simulated data set, a visit is either entirely missing or is complete.
The missing data patterns were summarized into three variables, H ij as predictors of missingness M ij for subject i on visit j: an indicator of the previous visit being missing; an indicator of the visit two time periods ago being missing; and the proportion of possible visits that are missing. The pattern of missing data for an individual includes all three of these variables. The pattern of missing data for a spouse includes the indicator variables for the last two visits. For parent, and sibling history, only the percentage of missing observations is used.
For each cohort, a logistic model was used to predict the probability of a visit being missing given the subject's missing data pattern, the missing data pattern of first degree relatives, the missing data pattern of the spouse, the measurements at the last nonmissing visit, marital status, being an only child, and visit number. Observations after time of death were not included in the model because these are obviously missing.
For the initial cohort, the previous missing data pattern of the subject, , where denotes the proportion of missing values up to and including visit j, and corresponding values for the subject's spouse (sp), and the subject's siblings (sib) were predictors. Marital status (MS ij ), being an only child (OC i ), cholesterol (CHOL ij ), weight (WT ij ), and visit number (j) were also predictors. The following logistic models were  Table 4): For the offspring cohort, the missing data pattern of the subject, the subject's siblings, and the subject's parents (M pa = ( , )), together with indicators NA pa for the availability of parents in the data set, are predictors. Being an only child, cholesterol, weight, and visit number were also predictors. The following models were created for the offspring cohort based on the real data: The models were used to assign missing data patterns as follows: A subject in the original cohort was selected at random from a family and assigned a missing status for Time 3 based on Model 1. (No observations were missing for the first two visits.) The subject's first degree relatives in the original cohort and spouses were then assigned missing values given the already assigned data using Models 2-4. This continues until all subjects in the original cohort were assigned a missing status for Visit 3. The process is repeated for all other visits. Members of the offspring cohort were then assigned missing status. One of the subjects of a sibship was selected at random and assigned a missing value for Time 2 using Model 5. The other siblings were then assigned missing values based on the values of the already assigned siblings (Model 6). This was done for all sibships in the offspring cohort. The process was repeated for all the other times.
Genetic data were collected after a certain time point. Any subject alive at the time when the genetic data were collected had a probability of having genetic data based on how many visits were not missing during the period genetic data was collected. This is similar to what is seen in the real data. In the simulated data set, a subject that had no visits during the time of genetic data collection has no marker data. A subject that has one visit has an 82% chance of having genetic data, and a subject that has more than one visit has an 85% chance of having genetic data.

Validation
The data were tested at several stages. As the raw trait data were generated, we computed means and variances for each trait and compared the simulated values to those in the real data. We did many graphical comparisons of the simulated data and the real FHS data, particularly with trait vs. age plots. We compared correlation matrices in the simulated data to those in the real data that we were attempting to mimic. We ran linkage detection programs to confirm that we could detect some of the trait genes we simulated. Unfortunately, given the very tight time constraints, we were not able to check every part of the simulation to the degree we would have liked.
Despite our best attempts to check all aspects of the simulation, one serious bug escaped our notice and was detected by Dr. Martin Tobin, affecting the part of the missing data simulation concerning hypertension treatment. This invalidated the data for the purpose of addressing the important question of how to deal with this confounder, but did not affect the genetics of the problem at all. A corrected version of the entire data set was provided to all GAW participants within a month of discovery

Discussion
In creating this simulated data set, we intended to provide a data set that would facilitate the development of statistical methods for analyzing longitudinal data. This creation involved making a series of compromises -choices about how to model reality and what effects to include given the limited time available. We learned much in simulating this set, and, if we were to start over now and had the same amount of time available, we would make a number of choices slightly differently. Overall, however, we are pleased with the results of our simulation, and we think that having the simulation tightly linked to the real data set greatly increases the value of the simulation.
The simulators and simulation oversight committee agreed that it was of primary importance that the simulation reflect the longitudinal nature of the real data and include some informative missing data patterns. These areas have not been extensively studied in the context of genetic analysis and seem ripe with potential. Many traits have values that change as the body ages, and the contribution of such traits to complex genetic diseases is likely to be important. The missing data patterns included familial missing data patterns and measured covariate effects, neither of which included direct genetic influences. In addition, death also induces an informative missing data pattern with a number of indirect genetic influences on this pattern via SBP and cholesterol. Direct genetic influences and hypertension treatment influences on missing data patterns are something we would have liked to include, but did not simply for lack of time: each additional pattern must be carefully balanced to ensure that enough data is not missing.
The trait model used in this simulation was among the most complex ever used in a GAW simulation, but it still falls short of reality and we had many ideas on improving the model that we were unable to pursue. We considered generating the longitudinal data under a stochastic process, rather than the deterministic model with normally distributed random variables that we used. Unfortunately, we felt that writing a program to do this would take more time than we had. We started out thinking in terms of fairly simple models with genes influencing "baseline" or "slope." There are many other ways to model the effects of trait loci over time when one considers interactions and higher-order effects. Fifty trait loci on eight traits were not enough for all the effects that did occur to us. We would have liked to have many other types of effects, and had genes that influenced smoking, drinking, and play a direct role in treatment effect and death. However, there was some risk that we would lose sight of the longitudinal data focus if we included too many different types of effects.
Both the simulation and the genetic analysis of longitudinal data include many challenges. No simulation can perfectly match the complexity of real data, but we think that this simulation provides enough complexity to enable GAW participants to examine many of these challenges. We hope the results of analyses of this set provide answers to some of these challenges.