How Much of an Ancestor’s DNA Do You Have?
Or, more interestingly, how much will that vary and how can you easily increase the number and quality of your DNA matches?
The model used to produce these numbers has been updated a few times. More accurate numbers can be found here.
The DNA testing industry is booming, often as people attempt to find close relatives or ancestors whom they couldn’t previously identify. If a person knows how to use the tools available for analyzing the DNA segments they share with matches, these goals can be fairly easily achieved. However, many people would be surprised to learn how small a piece of the puzzle they actually have within their own genome. It would be hard to overstate how important it is to get other relatives to test their DNA if you really want to find answers about your ancestors.
I’ve made a model, described in detail here, that not only predicts the percentage of an ancestor’s DNA that you could reproduce, which is a trivial calculation that you can often perform in your head, but also gives the range over which that percentage could vary (with 95% confidence).
These mean percentages are well known and can be found in many places, such as in the chart below.
As you can see in the bottom right corner of each block, you’ll share a fairly predictable percentage of DNA with certain relatives. With parents and siblings, that will be about 50% (exactly 50% for parents). With half-siblings, uncles, aunts, nieces, nephews, and grandparents, you’ll share about 25%. The numbers you see in the blocks above can vary by several percentage points, and they hardly give a clue as to what percentage of an ancestor’s genome you and a relative reproduce when both of you get your DNA tested.
To show the range of expected percentages, I developed a very simple model that calculates the percentage of reproduced DNA, and I let it run 20,000 times. Using the bootstrapping method, percentages that fall within the middle 95% of the simulated values can be said to occur with 95% confidence. While the mean values are trivial (except as a check against errors in the model), the minimum and maximum values show the range outside of which you would not expect percentages to fall. Below are model results for various combinations of relatives and what percentage of DNA they could reproduce for ancestors up to great-grandparents.
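To make the idea of running a model 20,000 times and reading off the middle 95% concrete, here is a minimal sketch in Python. It is not the model from this article. It assumes, purely for illustration, that the half-genome a parent passes on is made of a fixed number of independent blocks (the hypothetical `n_blocks` parameter), each equally likely to come from either grandparent. The function names are also invented for this example.

```python
import random


def grandparent_share(n_blocks, rng):
    """Simulate the fraction of one grandparent's genome a grandchild carries.

    Assumption (illustrative only): the transmitted half-genome consists of
    n_blocks independent blocks, each equally likely to come from either
    of the parent's own parents.
    """
    hits = sum(1 for _ in range(n_blocks) if rng.random() < 0.5)
    # The transmitted half covers at most 50% of the grandparent's genome.
    return 0.5 * hits / n_blocks


def middle_interval(values, level=0.95):
    """Return the bounds of the central `level` portion of the values."""
    ordered = sorted(values)
    n = len(ordered)
    tail = (1.0 - level) / 2.0
    low = ordered[int(tail * n)]
    high = ordered[min(n - 1, int((1.0 - tail) * n))]
    return low, high


def run_model(n_runs=20_000, n_blocks=60, seed=1):
    """Run the simulation n_runs times and summarize the results."""
    rng = random.Random(seed)
    shares = [grandparent_share(n_blocks, rng) for _ in range(n_runs)]
    low, high = middle_interval(shares)
    return {
        "mean": sum(shares) / n_runs,
        "low95": low,
        "high95": high,
        "min": min(shares),
        "max": max(shares),
    }


if __name__ == "__main__":
    for key, value in run_model().items():
        print(f"{key}: {100 * value:.1f}%")
```

The mean should land very close to 25%, which serves as the same kind of sanity check described above. The width of the interval depends entirely on the assumed block count, so the printed range should not be read as a genealogical result.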
The model relies on three rules, the latter two of which increase the variability of shared DNA between relatives. The first is that parents randomly pass half of their genome to their children. The second is that those parents pass their parents’ DNA somewhat randomly, but on average, half from each. The third is that relatives can expect their similar genomes to overlap by about half of what they have of their ancestors. For example, siblings, who each share 50% of a parent’s DNA, should expect about 25 of those percentage points to overlap.
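The third rule, and the sibling example in particular, can be sketched in a few lines of Python. This is an illustration under stated assumptions rather than the model itself: the parent's genome is split into a hypothetical number of independent blocks (`n_blocks`), the parent holds two copies of each block, and each sibling independently receives one of the two copies. The function names are invented for this example.

```python
import random


def sibling_overlap(n_blocks, rng):
    """Return (overlap, combined) as fractions of one parent's genome.

    Assumption (illustrative only): each sibling independently inherits
    one of the parent's two copies of every block.
    """
    same_copy = 0
    for _ in range(n_blocks):
        copy_a = rng.randrange(2)  # copy inherited by sibling A
        copy_b = rng.randrange(2)  # copy inherited by sibling B
        if copy_a == copy_b:
            same_copy += 1
    # A shared block is one of the parent's 2 * n_blocks block copies.
    overlap = 0.5 * same_copy / n_blocks
    # Each sibling has 50%, so together they cover 50% + 50% - overlap.
    combined = 1.0 - overlap
    return overlap, combined


def average_sibling_overlap(n_runs=5_000, n_blocks=60, seed=1):
    """Average the overlap and combined coverage over many simulated pairs."""
    rng = random.Random(seed)
    results = [sibling_overlap(n_blocks, rng) for _ in range(n_runs)]
    mean_overlap = sum(o for o, _ in results) / n_runs
    mean_combined = sum(c for _, c in results) / n_runs
    return mean_overlap, mean_combined


if __name__ == "__main__":
    overlap, combined = average_sibling_overlap()
    print(f"average overlap: {100 * overlap:.1f}%")
    print(f"average combined coverage: {100 * combined:.1f}%")
```

Under these assumptions the average overlap comes out near 25 percentage points, so two siblings together reproduce about 75% of a parent's genome on average. That is why testing a second relative adds so much.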
This model does not differentiate between male and female ancestors, although it would be more accurate to do so. It happens that recombination from mothers to children is greater than that from fathers, resulting in more variability in lines that are majority male and less variability in lines that are majority female. Since this simple model doesn’t include differences in recombination, the results here are more like averages, or what you would expect if the numbers of your ancestors in a particular line were pretty close to half male and half female.
What can you do with a higher percentage of an ancestor’s DNA? Most websites don’t let you analyze relationships between mutual matches, but GEDmatch.com does. When you think that a chromosome segment of your genome came from an ancestor of interest, you can make a list of DNA relatives who share that segment with you. If you still don’t have enough information to prove which ancestor it’s from, you can compare those DNA relatives with each other on GEDmatch, excluding your own DNA this time. What you’ll find, if enough of them have well-populated family trees, is that they share certain segments with each other that came from your ancestor, but that you didn’t inherit. Of course, if you manage your relatives’ kits, you can just analyze their matches at any website to which you’ve uploaded the data.
One final thing that I thought was interesting about these model results is that an adjustment can be applied based on already known percentages. For example, I already know that I have 29% of my maternal grandfather’s DNA and only 21% of my maternal grandmother’s DNA. If I’m wondering what percentage of DNA I share with my maternal grandfather’s father, I should be able to multiply the model results by 29/25. Based on the simulation results, I would’ve expected to share 7.3–17% of my DNA with my maternal grandfather’s father (with 95% confidence), but now I would expect to have inherited about 8.4–20% of that great-grandfather’s DNA. That ratio could be built into the model as a special function for calculating percentages adjusted by known ratios.
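The adjustment described here is a simple rescaling, and it can be sketched as a small Python function. The name `adjust_range` and its parameters are hypothetical. The only inputs are the figures quoted above: the simulated range of 7.3 to 17%, the known 29% share, and the expected 25% share.

```python
def adjust_range(low, high, known_share, expected_share=25.0):
    """Scale a simulated percentage range by a known-to-expected ratio.

    low, high:       bounds of the simulated range, in percent
    known_share:     measured share with the intermediate ancestor, in percent
    expected_share:  average share for that relationship, in percent
    """
    ratio = known_share / expected_share
    return low * ratio, high * ratio


if __name__ == "__main__":
    # Great-grandfather range adjusted for a 29% share with the grandfather.
    low, high = adjust_range(7.3, 17.0, known_share=29.0)
    print(f"adjusted range: {low:.1f}-{high:.1f}%")
```

Applied to the rounded inputs, the ratio of 29/25 = 1.16 gives bounds close to the adjusted figures quoted above. Small differences in the last digit can come from rounding the simulated range before scaling it.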
As a next step, I would like to treat recombination from mothers and fathers differently; however, that would require a dataset of grandparent-grandchild relationships, with the additional constraint that the sex of the parent would need to be known. Because recombination occurs more in a mother’s genome than in a father’s, the shared DNA for maternal grandparent-grandchild relationships would have a lower standard deviation. The shared percentage of DNA for paternal grandparents would vary more from the expected 25%. I thought it would be a great idea to get standard deviations for sex-specific relationships in order to train a future model on those values, so I sent messages to just about everyone who has a dataset of shared DNA.
Update: I am so very grateful that nobody provided me with the simple aggregated statistics that I requested. In 2019, Carl Veller et al. finally released the standard deviations I had been waiting for. These are calculated from mathematical formulas and are therefore much more accurate than what I would’ve gotten from empirical data. (Empirical data are very accurate in some fields, but they’re wildly inaccurate in genetic genealogy. It’s a messy field for data.) This means that I was ready to make my model, but hadn’t started training it, when the peer-reviewed statistics came out. I was checking the literature very frequently. When the standard deviations were finally available, I was able to make the most accurate shared DNA data that have ever existed.
Cover photo by Sharon McCutcheon. Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. That model was also used to make a very accurate relationship prediction tool. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits.