Heuristic exploitation of genetic structure in marker-assisted gene pyramiding problems

Background Over the last decade genetic marker-based plant breeding strategies have gained increasing attention because genotyping technologies are no longer limiting. Now the challenge is to optimally use genetic markers in practical breeding schemes. For simple traits such as some disease resistances it is possible to target a fixed multi-locus allele configuration at a small number of causal or linked loci. Efficiently obtaining this genetic ideotype from a given set of parental genotypes is known as the marker-assisted gene pyramiding problem. Previous methods either imposed strong restrictions or used black box integer programming solutions, while this paper explores the power of an explicit heuristic approach that exploits the underlying genetic structure to prune the search space. Results Gene Stacker is introduced as a novel approach to marker-assisted gene pyramiding, combining an explicit directed acyclic graph model with a pruned generation algorithm inspired by a simple exhaustive search. Both exact and heuristic pruning criteria are applied to reduce the number of generated schedules. It is shown that this approach can effectively be used to obtain good solutions for stacking problems of varying complexity. For more complex problems, the heuristics allow to obtain valuable approximations. For smaller problems, fewer heuristics can be applied, resulting in an interesting quality-runtime tradeoff. Gene Stacker is competitive with previous methods and often finds better and/or additional solutions within reasonable time, because of the powerful heuristics. Conclusions The proposed approach was confirmed to be feasible in combination with heuristics to cope with realistic, complex stacking problems. The inherent flexibility of this approach allows to easily address important breeding constraints so that the obtained schedules can be widely used in practice without major modifications. In addition, the ideas applied for Gene Stacker can be incorporated in and extended for a plant breeding context that e.g. also addresses complex quantitative traits or conservation of genetic background. Gene Stacker is freely available as open source software at http://genestacker.ugent.be. The website also provides documentation and examples of how to use Gene Stacker. Electronic supplementary material The online version of this article (doi:10.1186/s12863-014-0154-z) contains supplementary material, which is available to authorized users.


Recombination rates
When crossing two genotypes P and Q a number of possible genotypes can occur among the offspring due to recombination of alleles, each with a certain probability. This section describes how to compute these probabilities based on the recombination rates r i,p,q that indicate the expected frequency of crossovers between the pth and qth loci of the ith chromosome and which are inferred from the genetic map.
First, take the ith chromosome P i of genotype P and any haplotype H i with the same number of loci as P i . Suppose that P i contains l heterozygous loci with ordered indices s = (ν 1 , . . . , ν l ). Then, the probability P r[P i → H i ] that haplotype H i is produced from chromosome P i is computed as follows [1]: • If H i contains at least one allele which does not occur at the respective locus in P i then P r[P i → H i ] = 0, i.e. it is impossible that H i will be produced from P i .
• Else, if s is empty (all loci are homozygous), P r[P i → H i ] = 1.
• Else in case of a crossover between loci ν j and ν j+1 1 − r i,νj ,νj+1 otherwise where there has been a crossover between loci ν j and ν j+1 if The factor of 1/2 is introduced because every sequence of crossovers defines two complementary haplotypes which are inherited with equal probability. Now, take any chromosome G i with the same number of loci as the ith chromosomes P i and Q i of both parents P and Q. The probability P r[P i , Q i → G i ] that chromosomes P i and Q i will produce haplotypes which together form a chromosome G i is computed as follows: The second case accounts for the fact that the haplotypes might swap their originating parents. Gene Stacker explicitly models multiple chromosomes: the probability P r[P, Q → G] of obtaining the entire phase-known genotype G from crossing the phase-known parents P and Q, with k chromosomes, is computed by multiplying the independent chromosome probabilities: Note that this will account for up to 2 k identical phase-known genotypes G depending on how many haplotype pairs might swap their originating parents.

Joint population sizes
Sometimes several different genotypes or multiple occurrences of the same genotype are simultaneously targeted among offspring grown from a shared seed lot. In such case it is possible to compute a joint population size, i.e. the number of offspring that needs to be generated so that at least the number of desired occurrences of each targeted genotype are expected to be obtained. The same individual and overall success rates can still be guaranteed, but the computed joint population size is often much smaller than the sum of the population sizes required to obtain each targeted genotype individually. This procedure may therefore significantly reduce the total population size.
Suppose that m distinct phase-known genotypes G 1 , . . . , G m are targeted among the offspring, where f 1 , . . . , f m occurrences of the respective targets are desired, with f i ≥ 1, ∀i = 1, . . . , m and m i=1 f i = f . The joint population size N is then calculated as follows: 1. Compute the population sizes N i required to obtain each target G i individually using formula (1) from the main article.
2. Set N = max{N i |i = 1, . . . , m} and compute the joint success probability (see below) 3. If P < (γ ) f , adjust N using binary search in the interval are stated below. Taking the maximum in step 2 ensures a success rate of at least γ for every target individually. Subsequently, step 3 further increases this initial estimate of the joint population size to guarantee a joint success rate of at least (γ ) f so that this joint trial with f targets still contributes a factor of (γ ) f to the overall success rate -just as if f independent Bernoulli trials would have been performed. Such independent trials would require a population size of m i=1 (f i · N i ) to obtain all targets, which is the upper bound of I. By considering a joint trial, the total population size can be significantly reduced while both the same individual and overall success rates are still guaranteed. Experiments showed that in practice step 3 rarely has to be executed, leading to a significant reduction in population size with almost no computational overhead.
The joint success probability P r[|G 1 | ≥ f 1 & · · · & |G m | ≥ f m ; N ] of obtaining each targeted phase-known genotype G i at least f i times, when growing N plants in total, is computed as:  Figure S1: Alternatives of the same schedule, obtained from a different alignment of generations, which are equally good in terms of the considered objectives (number of generations, total population size and overall linkage phase ambiguity). Both alternatives are retained and will be considered for further extension.
where p Gi is the probability of observing G i among the offspring. These formulas follow from the multinomial probability distribution. The joint success probability is computed through its complement as the f i 's are expected to be small -often they are all equal to 1 -while N is expected to be much larger. Also, m is expected to be relatively small. Therefore, a direct computation would require to sum over significantly more terms, compared to this computation through the complement.  Figure S2: This figure shows two alternative alignments of generations in a crossing schedule, where the left alignment is preferred over the right alignment since it has a lower total population size, the same number of generations and the same overall linkage phase ambiguity. By performing the crossing that creates seed lot S3 in the first generation (left), one parent plant can be reused, while this plant has to be regrown if the crossing is postponed to the second generation (right). The left alignment is retained, but the right alignment is (greedily) discarded.
the same schedule, obtained from a different alignment of the generations of the smaller schedules from which the larger schedule was created. Both alignments are retained and will be considered for further extension. In contrast, Figure S2 shows two alternative alignments of generations in a crossing schedule where one alignment is greedily discarded (right) because it is dominated by the other alignment (left) in terms of the objectives (number of generations, total population size and overall linkage phase ambiguity). Figure S3 provides a detailed outline of the Gene Stacker algorithm. The input consists of a set of parental genotypes G, the desired ideotype I and a genetic map M. As output, an approximated Pareto frontier F is produced. The queue Q contains those schedules that still have to be extended and the algorithm iteratively dequeues partial schedules C from Q to create larger schedules C new function GeneStacker(G, I, M) Q ← [ ] queue containing schedules to be extended

Detailed algorithm
Pareto frontier for all parental genotypes P ∈ G do add minimal schedule growing P to Q add minimal schedules to queue end for while Q not empty do C ← dequeue element from Q schedule to be extended (≥ 1 alternatives compute seed lot obtained by selfing for all genotypes G ∈ S do consider each genotype in S as next target register new schedule (all alternatives) end for for all C ∈ P do combine with previous schedules remove non Pareto optimal alignments S ← Cross(final plant from C, final plant from C , M) compute seed lot obtained by crossing for all genotypes G ∈ S do consider each genotype in S as next target register new schedule (all alternatives) end for end for Add C to P add to list of already extended schedules end while return F return final Pareto frontier approximation end function resolve any depleted seed lots if ideotype I obtained and constraints satisfied then queue schedule for further extension end if end function Figure S3: Detailed outline of the Gene Stacker algorithm. The input consists of a set of parental genotypes G, the ideotype I and the genetic map M. Crossing schedules are iteratively extended to approximate the Pareto frontier of solutions with minimum number of generations, total population size and overall linkage phase ambiguity. by (a) selfing the final plant of C; and (b) combining C with each previously extended schedule C through a crossing of the final plants of both schedules. More precisely, every element C ∈ Q consists of a series of a ≥ 1 alternatives C[0], . . . , C[a − 1] of the same schedule. These alternatives arise because there are several ways to align or interleave the generations of two smaller schedules C and C when combining them into a larger schedule C new : each generation of C new either contains one single generation from C or C , or consists of the alignment of two generations; one from each of the smaller schedules (see previous section for examples).
Whenever a crossing or selfing is performed to extend a schedule, the corresponding seed lot S is constructed by (a) inferring all possible haplotypes that can be produced from each chromosome of both parents; (b) creating all pairwise combinations, per chromosome, of haplotypes produced by both parents; and (c) making all combinations of the obtained chromosomes. This yields the set of possible offspring. During generation, the corresponding probabilities and linkage phase ambiguities are computed.
After selfing the final plant of a partial schedule C, Gene Stacker considers each genotype G in the constructed seed lot S to be fixed as a possible next target. For each genotype G, the alternatives of a larger schedule C new are created by attaching G to each alternative C[i] of C. For each alternative of every newly created schedule C new it is checked whether there are any depleted seed lots, i.e. seed lots from which more seeds are taken than the amount provided by the performed crossing(s). In such case, Gene Stacker indicates that the crossing should be performed multiple times to provide additional seeds. For this, it may be necessary to have several duplicates of the crossed genotypes, taking into account the number of crossings that can be performed with a single plant. This affects the population sizes and may introduce new depleted seed lots; therefore, this process is repeated until all depleted seed lots have been resolved.
If the ideotype I is obtained, each alternative C new [i] for which all constraints are satisfied is registered in the Pareto frontier F: if C new [i] is not dominated by any other solution currently contained in F, it is added to F and all schedules dominated by C new [i] are removed from F. If the ideotype is not yet obtained, C new is added to the queue Q for further extension, where some alternatives C new [i] may be discarded by one of the applied heuristics or if it is predicted that all extensions will violate the constraints and/or will be dominated by an already obtained solution (pruning). Note that in the actual implementation, some (heuristic) pruning criteria are checked at other points through the execution to enable early pruning (e.g. before combining two specific partial schedules or before attaching a specific genotype G as the next target).
Gene Stacker continues until Q is empty; termination is guaranteed because of a required constraint on the number of generations. Note that the algorithm is not entirely exact as it greedily discards non Pareto optimal alignments of partial crossing schedules. However, the impact of this greedy approach on the solution quality is expected to be very small; it mainly prevents the introduction of most likely redundant generations in C new when generations of the combined schedules can easily be aligned without violating any constraints, and favors those alignments with the highest amount of reuse, resulting in the lowest total cost.

Heuristic presets
Gene Stacker provides five presets that each combine a set of well-chosen heuristics, offering convenient tradeoffs between solution quality and execution time. The specific heuristics included in each preset are listed in Table S1. In the default setting some less restrictive heuristics are applied compared to those enabled when switching to presets faster and fastest. On the other hand, preset better drops some heuristics and preset best does not apply any (optional) heuristics at all. Presets better, default and faster perform two runs as they apply one of the dual run heuristics H3s1 or H3s2, while preset fastest applies H3 in a single run.
6 Results for the first constructed example Figure S4 shows an overview of all reported solutions for the first constructed example, when running Gene Stacker in default mode with an overall success rate of γ = 0.95 and a maximum of 4 generations, 10% overall linkage phase ambiguity, 4 crossings per plant, 5000 plants per generation, and 2500 seeds obtained from each crossing. Three schedules are non-ambiguous, while the remaining two schedules have a small linkage phase ambiguity of 8.28% which in turn yields a (slightly) lower total population size. The approximated Pareto frontier clearly reflects the tradeoffs between the three objectives: minimizing the total population size, number of generations and overall linkage phase ambiguity.

Specification of real stacking problems
This section gives a full description of all discussed problems from cotton, tomato and rice.

Cotton
One example from cotton is considered. The problem consists of 6 parental genotypes and a heterozygous ideotype with 11 loci spread across 5 chromosomes:   Figure S4: Overview of all reported solutions for the first constructed example, when running Gene Stacker in default mode with an overall success rate of γ = 0.95 and a maximum of 4 generations, 10% overall linkage phase ambiguity, 4 crossings per plant, 5000 plants per generation, and 2500 seeds obtained from each crossing.

Tomato
Both considered stacking problems from tomato consist of the same 4 parental genotypes with 8 loci spread across 6 chromosomes: The genetic map states the following distances between subsequent loci on the same chromosome: • 2nd chromosome: 4 cM • 6th chromosome: 10cM The first problem (Tomato-1) has a homozygous ideotype

Rice
The two considered problems from rice have the same 8 parental genotypes with 10 loci spread across 6 chromosomes:

Solutions for real stacking problem from cotton
Figures S5 up to S8 show all solutions reported for the cotton example when applying preset fastest with a maximum of 5 generations. Running this preset took 2 hours and 15 minutes to complete; all other presets ran out of memory.
As the number of seeds obtained from a crossing is set to 250 some crossings are performed multiple times to provide a sufficient amount of seeds so that all targeted genotypes can be obtained among the offspring. Furthermore, a cotton plant can only be crossed twice (or selfed once); therefore, for some genotypes several duplicates are targeted to be able to perform all crossings. Population  Figure S5: Three generation solution for the cotton example. Some crossings are performed 6 up to 12 times to provide a sufficient amount of seeds. For most genotypes occurring in the schedule, several duplicates are targeted to be able to perform all crossings.
sizes are computed in such way that at least the required number of instances is expected among the offspring, for each targeted genotype (see section 2).
When restricting the number of generations to 4 instead of 5, preset faster reports a different solution with four generations ( Figure S9) that has a lower total population size compared to the respective schedule found by preset fastest, before being interrupted when the time limit of 24 hours is exceeded.  Figure S6: Four generation solution for the cotton example. Some crossings are performed twice to obtain a sufficient amount of seeds. For two genotypes occurring in the schedule, two duplicates are targeted to be able to perform all crossings.
Overall LPA: 0% # Plants: 1210   Figure S7: Five generation solution for the cotton example with zero linkage phase ambiguity.
Only the final crossing is performed twice to provide a sufficient amount of seeds. For two genotypes occurring in the schedule, two duplicates are targeted to be able to perform all crossings.  Figure S8: Five generation solution for the cotton example with an overall linkage phase ambiguity of 3.14%. In return, a reduction in the total population size is obtained compared to the reported non-ambiguous schedule with five generations. No crossings are performed multiple times in this schedule as a single crossing always provides enough seeds to obtain all targeted genotypes.  Figure S9: Additional four generation solution for the cotton example, reported by preset faster when restricting the number of generations to 4 instead of 5. This schedule has a lower total population size compared to the solution with four generations that is found by preset fastest.