Modeling the Inheritance of Ancestral DNA

A discrete, stochastic model, written in Python

I’m always impressed by a model that reproduces a real-life phenomenon but is so simple that its creation seems inevitable. That’s how I felt when I first saw the standing ovation and flocking birds models. I’ve developed a model that reproduces the process of receiving variable amounts of DNA from ancestors, and it could hardly be simpler. It relies on three rules, the latter two of which increase the variability in shared DNA between relatives:

A consequence of these rules is that two relatives can expect to share with each other about half of what they each have from an ancestor. For example, siblings, who each share 50% of a parent’s DNA, should expect about 25 of those percentage points to overlap, i.e. ~25% of my mom’s genes are found in both me and my brother.

This model makes the simplifying assumption that men and women pass DNA to their children in the same manner. In reality, DNA from fathers recombines at lower rates than DNA from mothers, so the variability in shared DNA could be greater through fathers, especially along successively all-male lines. Ignoring the sex of the parent makes the model vastly simpler to implement while still providing useful approximations, especially for lines that are a fairly even mix of males and females.

A person inherits an average of 25% of a grandparent’s DNA, but this number varies by a few percentage points. Why is that? For the purposes of this model, a person inherits exactly 50% of a parent’s DNA. That’s because, for each of your 22 autosomal chromosomes, each parent gives you one of their two homologues. (This ignores the sex chromosomes X and Y and the fact that mitochondrial DNA comes only from the mother. The X chromosome is inherited equally from both parents only in females, who get one homologue from each parent; in males, all of the X chromosome is from the mother and all of the Y chromosome is from the father. In everyone, all mitochondrial DNA is from the mother.) Your mom gives you one homologue of your chromosome #1, and that homologue is typically a combination of both of her chromosome #1 homologues, one from each of her parents. This process is called recombination.

One of your chromosome #1 homologues is from your mother, and it carries exactly half of your mother’s chromosome #1 DNA. But whichever homologue or combination of homologues she gave you, it was assembled somewhat randomly, with an equal probability of drawing DNA from each of your maternal grandparents. Usually your homologue will be a combination of genes from the two applicable grandparents. On average, then, each of a mother’s chromosomal homologues is passed to a child carrying about half of what she has from each grandparent, so her 50% becomes the grandchild’s 25%.

I should note that it isn’t unlikely that you would inherit a full chromosomal homologue from a grandparent, thereby inheriting only one grandparent’s DNA on that homologue. The frequencies of these non-recombination events are shown in a table here. An average of 55 autosomal segments are passed from a parent to a child, which works out to only one or two recombination events per chromosome, on average. For the purposes of this model, all that really matters is that you get a random half of your mother’s two homologues, and that your maternal homologue for a particular chromosome will therefore be about half from each of your maternal grandparents.
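
As an illustrative aside (this is not the segment model described later), recombination on a single chromosome can be sketched in a few lines of Python. Representing the chromosome as a unit interval and drawing the crossover count from a Poisson distribution are my simplifying assumptions, not details from the post:

```python
import math
import random

def poisson(lam):
    """Sample a Poisson count (Knuth's method; fine for small lam)."""
    limit = math.exp(-lam)
    k, p = 0, 1.0
    while p > limit:
        k += 1
        p *= random.random()
    return k - 1

def maternal_homologue_fraction(mean_crossovers=1.5):
    """Fraction of one passed homologue that comes from grandparent A.

    The mother's two homologues are modeled as the unit interval [0, 1];
    each crossover point flips which homologue is being copied.
    """
    cuts = sorted(random.random() for _ in range(poisson(mean_crossovers)))
    bounds = [0.0] + cuts + [1.0]
    copying_a = random.random() < 0.5   # which homologue the copy starts on
    frac_a = 0.0
    for left, right in zip(bounds, bounds[1:]):
        if copying_a:
            frac_a += right - left
        copying_a = not copying_a
    return frac_a

random.seed(1)
fracs = [maternal_homologue_fraction() for _ in range(20000)]
print(round(sum(fracs) / len(fracs), 2))  # averages ~0.5 across many chromosomes
```

Any single chromosome can be 0% or 100% from one grandparent (the non-recombination case above), but across many chromosomes the average settles near one half.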

Using the property of halved DNA per generation, you have an average of 25% of a grandparent’s DNA and 12.5% of a great-grandparent’s DNA. However, we would expect the variability to grow as the generations go further back. The model results will show how much variability increases per generation.
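
The halving-per-generation rule is just repeated multiplication by one half:

```python
# Expected fraction of an ancestor's DNA, g generations back: 0.5 ** g.
for g, label in enumerate(("parent", "grandparent", "great-grandparent"), start=1):
    print(f"{label}: {100 * 0.5 ** g:.1f}%")  # 50.0%, 25.0%, 12.5%
```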

What if you’ve had your DNA tested, you’re trying to find an ancestor on your maternal grandmother’s side, and you get the idea that you just don’t have enough of her DNA to find people with matching DNA? After all, you probably have only about 25% of her DNA, so there could easily be about four times more DNA matches out there than the ones you’re seeing. If you can’t get your grandma’s DNA tested, the next best thing is to have as many of her descendants as possible take DNA tests, and it would be even better if those people were her children. If you get your mother’s DNA tested, then you’re searching with her 50% of your grandmother’s DNA rather than your own 25%. The next best testers, after your grandmother’s children, would be the first cousins who are related to you through your grandmother.

Here are some percentages of your ancestors’ DNA that we would expect to reproduce:

These percentages all rely on the same principle that follows from the three rules of the model: the overlap between two children of a given parent will be about half of what each received. For example, if a son receives 50% of his mother’s DNA, and a daughter receives another 50%, chances are that these are neither the same 50% nor a completely different 50%; the overlap is far more likely to fall in between. On average, each child will have a distinct 25% and a shared 25% of their mother’s DNA.
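
That expected overlap is just the product of the two shares, assuming each descendant inherits an independent random subset of the ancestor’s genome (the function name here is mine):

```python
def expected_overlap(share_a, share_b):
    """Expected fraction of an ancestor's genome carried by BOTH
    descendants, assuming independent random inheritance."""
    return share_a * share_b

print(expected_overlap(0.5, 0.5))    # two siblings vs. their mother: 0.25
print(expected_overlap(0.25, 0.25))  # two first cousins vs. a shared grandparent: 0.0625
```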

Here are two more scenarios:

As you can see, these numbers are getting less intuitive and harder to calculate in your head. The model I’ve developed predicts these numbers and far more complicated ones. But, more importantly, it predicts the variability that you’d expect to see.

Before the model could be considered valid, it had to reproduce, in a calibration step, the real-life variance in shared DNA that one would expect between people who share a particular ancestor. Those numbers don’t appear to be available. The most useful number would be the average variance added per generation, but that value certainly won’t be found as such: the variance in shared DNA is zero when DNA passes from parent to child, as it’s only introduced in the next generation. The only value I could find is the standard deviation of shared DNA between siblings: 0.036, from a published study that analyzed 4,401 sibling pairs. Luckily, it was very easy to use the model to simulate two children and then compare their shared DNA. Running the model many times and computing the standard deviation of shared DNA across runs gives a value to compare against, and it works just as well as the standard deviation of the percentage of a grandparent’s DNA would. If a model simulates the transfer of 50% of DNA from a parent to each child, uses a method that reproduces the real variance due to recombination, and results in a standard deviation of 0.036 in shared DNA between siblings, then it is valid as a method of simulating the variance due to recombination. It can then be applied recursively over multiple generations.

How did I do it?

I started with a model wherein each parent has 100 labeled segments of DNA (each distinct from the other segments and from those of the other parent) available to pass to a child. It’s then easy to pass a random 50 of those from each parent to a child. Doing this for two children, taking the intersection of their segments, counting the shared segments, and dividing by the total number of segments each child has (100) gives the fraction that the siblings share. The resulting standard deviation was surprisingly close to the target.
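
A minimal sketch of that first 100-segment model (the function names and trial count here are my own, not the original code):

```python
import random
import statistics

def sibling_shared_fraction(n_segments=100):
    """One trial: each parent has n_segments labeled segments and passes
    a random half of them to each of two children; return the fraction
    of the children's segments that the siblings share."""
    mom = range(n_segments)                  # mother's labels 0..99
    dad = range(n_segments, 2 * n_segments)  # father's labels, all distinct
    half = n_segments // 2

    def child():
        return set(random.sample(mom, half)) | set(random.sample(dad, half))

    a, b = child(), child()
    return len(a & b) / n_segments           # each child holds n_segments total

random.seed(0)
shared = [sibling_shared_fraction() for _ in range(20000)]
print(round(statistics.mean(shared), 3))   # ~0.5
print(round(statistics.stdev(shared), 3))  # ~0.036, close to the target
```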

I then got off track a bit. There’s a very informative blog called The Coop Lab that uses Poisson distributions to solve these kinds of problems. Posts there convey that there are an average of 33 recombination events per generation. If 22 chromosomes are broken up an average of 33 times, the result is an average of 55 segments passed per generation. So 55 segments seemed like a better number for my model than 100. Unfortunately, the resulting standard deviation between siblings (~0.048) was much higher than the target. The reason is that, in real life, a different set of 55 segments is passed to each child, whereas in my model different combinations of the same 55 segments were used each time. Using only 55 fixed segments therefore underestimates the number of different ways that segments can be passed. (A lower number of segments will always result in a greater standard deviation.)

The next method I tried was to use a very large number of segments. I soon realized that there would be hardly any standard deviation at all if too many segments were used. As the number of segments grows, the rule of 50% inheritance with 50% overlap overcomes any possible variance: each child would get exactly 50% of a parent’s DNA, and any two children’s halves would overlap by almost exactly half.
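
The shrinking variance can be seen in closed form (my own derivation, not from the post): with n fixed segments per parent, the number of maternal segments two siblings share is hypergeometric (both children draw n/2 from the same n), and adding the independent paternal count gives a shared-fraction variance of 1/(8(n − 1)):

```python
import math

def shared_sd(n_segments):
    """Std dev of the sibling shared fraction with n fixed segments per
    parent; the hypergeometric variance simplifies to 1 / (8 * (n - 1))."""
    return 1 / math.sqrt(8 * (n_segments - 1))

for n in (55, 100, 1000, 10000):
    print(n, round(shared_sd(n), 4))  # falls toward zero as n grows
```

This also shows why 55 segments overshoot the target (≈0.048) while very large segment counts drive the standard deviation toward zero.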

It became apparent that the right method would be to use whatever number of segments resulted in a standard deviation of 0.036, letting the simulation update the number of segments until it converged on the right value. Another problem was that the number of segments a parent has available needs to be an even whole number if half of them are to be passed to each child. This constraint doesn’t allow the resolution needed to fine-tune the standard deviation. The solution was to use random numbers to slightly vary the number of segments from trial to trial: if an odd number of segments turned out to be closer to what was needed, the simulation could use one segment fewer half of the time and one segment more the other half.

The next number of segments I tried was 101. Each parent would then pass either 50 or 51 of these segments to a child. The 101 segments would be halved. The decimal portion (0.5) would be stored in a new variable. If a pseudo-randomly generated number from a uniform distribution was greater than the decimal portion, 50 segments would be used for each parent to each child. Otherwise, 51 segments would be used. This was allowed to vary for each trial, but remained constant within a trial. If the standard deviation overshot the target value, the number of segments needed to be increased. Otherwise, it needed to be decreased. To reduce the variability of the standard deviation in a simulation run, I used a high number of trials (200,401), each of which represents a comparison of one pair of siblings.
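
The dithering step might look like this (the function name is mine):

```python
import random

def per_parent_segments(n_input):
    """Dither the halved segment count to an integer for one trial.

    101 segments halve to 50.5, so a parent passes 51 segments about
    half the time and 50 the other half; the long-run mean stays 50.5.
    """
    half = n_input / 2
    base = int(half)                               # whole part, e.g. 50
    return base + (random.random() < half - base)  # +1 with prob = decimal part

random.seed(3)
draws = [per_parent_segments(101) for _ in range(40000)]
print(round(sum(draws) / len(draws), 2))  # ~50.5
```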

Using 101 segments moved the standard deviation in the wrong direction. Using the same principle as for 101 segments, decimal numbers could be introduced into the number of segments. If the input number was 99.5, the halved number would be 49.75; thus ~25% of trials would pass 49 segments and ~75% would pass 50. As the number of segments appeared to converge in the high 90s, I re-adjusted the input number of segments and restarted the simulation, allowing it to change the number of segments in steps of 0.05. Converging on the number of segments that yields the target standard deviation is, in effect, a way for the simulation to learn the right input for a given output. The segments were simply labeled: for a simulated mother, m = {m0, m1, m2, …, m(n−1)}, where n is the halved number of segments rounded (up or down) to an integer and multiplied by two; for the father, p = {p0, p1, p2, …, p(n−1)}.
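
Putting the pieces together, the calibration loop can be sketched as follows. The trial count, step size, and starting point are scaled down from the post’s values (200,401 pairs and 0.05 steps) so this sketch runs quickly, which makes its convergence noisier:

```python
import random
import statistics

def sibling_shared_fraction(half):
    """One sibling pair: each parent has 2 * half labeled segments and
    passes a random half of them to each child."""
    pool = 2 * half
    mom = range(pool)
    dad = range(pool, 2 * pool)
    a = set(random.sample(mom, half)) | set(random.sample(dad, half))
    b = set(random.sample(mom, half)) | set(random.sample(dad, half))
    return len(a & b) / pool

def tune_segments(n=100.0, target=0.036, step=0.1, rounds=60, pairs=800):
    """If the simulated standard deviation overshoots the target, use
    more segments next round; otherwise use fewer."""
    for _ in range(rounds):
        half = int(n / 2) + (random.random() < (n / 2) % 1)  # dithered count
        shared = [sibling_shared_fraction(half) for _ in range(pairs)]
        n += step if statistics.stdev(shared) > target else -step
    return n

random.seed(4)
n_final = tune_segments()
print(round(n_final, 1))  # should drift down toward the high 90s
```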


Predicting the mean percentage of shared DNA between siblings is trivial; it nearly always comes out to 50.0%. Whether the input number of segments is initialized higher or lower, it tends to converge around 97.5. Since the target standard deviation of 0.036 has only three decimal places, any simulated standard deviation above 0.0355 and below 0.0365 counts as an accurate prediction. Whether I use an input of 95 segments (standard deviation = 0.0364) or 100 (standard deviation = 0.03551), the result falls within that range. However, using the number of segments that the simulation converges to (97.5) is preferable. If the target standard deviation had more decimal places, I believe the number of segments could be tuned to even finer resolution; however, the number of sibling pairs would also need to be increased, which would make the simulation run for a burdensome length of time.

Is there any significance to the number 97.5? It’s tempting to think that a parent’s genome is broken up into an average of 97.5 available segments, half of which end up being passed to a child. But that number is already known to be about 55 segments. The difference arises because, as I stated above, a different set of real-life segments (chopped in different places) is passed to each child, not just other segments from the same fixed bunch as in my model. To make the model work more like real life, the available segments would have to differ even before half of them are selected. (But my goal is to make the model simpler than real life while reporting an accurate result.) In my model, the number of segments has to be greater than 55 in order to introduce the extra variance that accounts for those different ways of breaking up the genome. The best way I can describe the meaning of the number found by the model is this: if a parent’s genome were broken into a fixed set of pieces, and those same pieces were the ones available to pass on each time the parent had a child, i.e. they couldn’t be broken smaller and parts of segments couldn’t be combined, then an average (across all parents) of almost exactly 97.5 available segments would provide the right amount of variance among all of a person’s descendants. What the number may show is that ~55/97.5 = ~56% of the variance in the percentage of inherited genes is due to randomly halving the genome for each child, and the other ~44% is due to segments being broken up in different ways each time. Regardless of what the number means, the important thing is that, in finding the number that produces the right variance, the model becomes useful for other tasks.

Simulating the shared percentage of DNA and standard deviation between siblings was considered a calibration step of the model. Most of the same code could then be used as a tool to simulate the mean and standard deviation of DNA passed down from ancestors to descendants. Some examples of how that tool can be used and the subsequent results can be found here.

The next step would be to allow the simulation to treat recombination from mothers and fathers differently for improved accuracy. Presumably, 97.5 segments would lie halfway between the ideal number of segments to use for maternal recombination and that of paternal recombination. However, that would require a large dataset of great-grandparent to great-grandchild relationships. Unfortunately, I think we’re a long way off from having such a dataset, and I’ve already learned that nobody would be willing to share it with me if they had it. However, since the number of paternal and maternal recombination events has been observed from some studies, I may be able to interpolate those numbers in the same way that the number 55 was stretched to 97.5.

I’d be happy to share my Python code for this simulation. I set out to do this project as a way to learn Python, so I’m sure it could be optimized if there were a need for that. It would then probably be easier to tune the number of input segments and standard deviation a little better. Feel free to write a response. Let me know what you think of this post or ask me about genetic genealogy or genealogical research. To see my other articles, click here.


