Can You Just Add the Averages and Ranges for Double Relationships?
People often ask how their DNA match with a cousin will be inflated if they have a double relationship with that cousin. Some helpful people have offered that you can look up shared averages and ranges for certain relationships and double all of them, or for two different types of relationship you can add the different averages and ranges together. I’ve been curious about how accurate this method is, so I’ve finally decided to test the idea in a definitive way.
Some examples of double relationships are double first cousins, in which case two people share the same four grandparents, but different parents; or double second cousins, in which case two of your four great-grandparent pairs are shared with your second cousin, rather than just one pair. It’s also possible for one of these two cousins to be ‘removed’ some number of times from some or all of the shared ancestors. Or both cousins could be removed: You are the second cousin, once removed, to your cousin, who has an additional relationship to you in which they’re your second cousin, once removed. I’ll note that all of the above relationships can, and often do, occur without anyone’s parents being related to each other.
I’ve figured out the answers to the above questions. I did so using a simulation that I developed that can’t get averages wrong for any relationships, for multiple relationships, or for combinations of DNA kits. Additionally, although I have near exact values coming soon, the ranges are the closest there are to reproducing standard deviations in the peer-reviewed literature. The parameters used for this article were similar to those used in my previously reported shared percentage charts.
The comparisons in this article are valid because the methodologies are the same throughout. If some of the ranges are a bit too narrow, all of the ranges are proportionately narrow. The results are the same as if you used perfect data (which do not exist), doubled the averages and ranges, and then compared those values to perfect data for double cousins.
Below are some tables that will show you when you can and when you can’t add averages, and that you can probably never accurately add ranges.
Table 1. All results for all tables are from simulations with 100k trials, and are for sex-averaged relationships. (Paternal relationships actually have much wider ranges than maternal relationships. I usually report them separately.) The first row is for the simple case of first cousins. The values in the second row come from doubling the first row. Below that are values for double first cousins. The third row shows half-identical regions (HIR) plus fully-identical regions (FIR) shared between cousins, which is the most scientific way to report shared percentages or centiMorgans (cM). 23andMe reports results this way. The last row shows HIR only values, which is the way AncestryDNA reports shared DNA, and the way that GEDmatch reports it unless you check a box to see FIR only shared DNA. The averages are shown in the center column. The 99% confidence intervals are shown on either side of the average, and signify the range that most values will take for this relationship. 0.5% of values would be below the lower end and 0.5% of values would be above the higher end.
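To make the "0.5% below, 0.5% above" description concrete, here is a minimal sketch of how such a range could be pulled out of a list of simulated values. The function name `central_interval` and the toy normal distribution are assumptions made for illustration; this is not the simulation used for the tables.

```python
import random
import statistics

def central_interval(values, coverage=0.99):
    """Return (lower, mean, upper), cutting an equal share off each tail."""
    ordered = sorted(values)
    tail = (1.0 - coverage) / 2.0              # 0.005 per tail for 99%
    lo_idx = int(tail * (len(ordered) - 1))
    hi_idx = round((1.0 - tail) * (len(ordered) - 1))
    return ordered[lo_idx], statistics.fmean(ordered), ordered[hi_idx]

# Toy data only: a normal distribution stands in for simulated shared
# percentages. The mean and spread here are placeholders, not table values.
rng = random.Random(0)
toy = [rng.gauss(12.5, 2.0) for _ in range(100_000)]

low, mean, high = central_interval(toy)
print(round(low, 2), round(mean, 2), round(high, 2))
```

With 100k trials, about 500 simulated values fall below the lower end and about 500 above the higher end.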
An interesting problem has arisen here. The regular first cousin averages can be doubled and then compared to double first cousin averages, but only on platforms that report HIR + FIR. Unfortunately, AncestryDNA only reports shared HIR DNA amounts. Doubling the known average for first cousins and comparing it to a match at AncestryDNA would cause you to be off by 3.1 percentage points, and that’s if you knew the match was a double first cousin, and they happened to share the exact average percentage with you. So you’re likely to be off by a lot more than 3.1 percentage points. The average will be right if you use 23andMe. Or, if you’re using GEDmatch, you could separately check the ‘FIR only’ box and add the resulting value to that of the unchecked (default) result.
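The size of that gap can be checked with some back-of-the-envelope arithmetic. The sketch below uses textbook expectations rather than the simulation: it assumes each of the two first-cousin links makes one chromosome copy identical by descent with probability 1/4 at any position, and that the two links are independent because the parents are unrelated. The variable names are my own.

```python
from fractions import Fraction

# Assumption: each first-cousin link matches one chromosome copy with
# probability 1/4 at a given position, independently of the other link.
p_link = Fraction(1, 4)

p_fir = p_link * p_link                   # both copies match: 1/16
p_hir_only = 2 * p_link * (1 - p_link)    # exactly one copy matches: 6/16

# HIR + FIR reporting counts a fully-identical region twice (both copies).
hir_plus_fir = (p_hir_only + 2 * p_fir) / 2
# HIR-only reporting counts a fully-identical region just once.
hir_only = (p_hir_only + p_fir) / 2

doubled_first_cousin = 2 * Fraction(1, 8)  # 2 x 12.5%

print(float(hir_plus_fir) * 100)                     # 25.0
print(float(hir_only) * 100)                         # 21.875
print(float(doubled_first_cousin - hir_only) * 100)  # 3.125
```

Under these assumptions the doubled first cousin average matches the HIR + FIR expectation exactly, while the HIR-only expectation falls about 3.1 percentage points short, which lines up with the gap described above.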
The rule of doubling has proved inconvenient for averages. But it’s even worse for the ranges. In this case, no matter what platform you’re using, the values are going to be off. It turns out that doubling ranges, or adding the ranges for two different cousin levels, underestimates values at the lower end and overestimates values at the higher end. There’s a simple explanation for this. It’s the same reason that avuncular relationships have much narrower ranges than grandparent/grandchild relationships, despite having the same average of 25%. More meiosis events result in less variation. Since DNA is variable, there will always be some DNA relatives who share extreme values, ones that are much higher or lower than average. But, when it’s a double cousin relationship, what’s the probability that both relations are extremely high or that both are extremely low? If one relationship is extremely low in shared DNA, it’s much more likely that the second relationship is average or even high, thus balancing the low value. Even a moderately low value will balance out an extremely low value, so the vast majority of extreme values will be moderated by the other shared relationship. This reduces variation in double cousin relationships.
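The balancing effect can be seen in a toy experiment. The sketch below is not the simulation behind the tables: it assumes, purely for illustration, that a single cousin relationship follows a normal distribution with a made-up spread, and that the two relationships vary independently. Real shared DNA distributions are not normal, and FIR is ignored here.

```python
import random
import statistics

rng = random.Random(42)
N = 100_000
MEAN, SD = 12.5, 2.0   # hypothetical single-cousin distribution (SD is made up)

single = [rng.gauss(MEAN, SD) for _ in range(N)]
other = [rng.gauss(MEAN, SD) for _ in range(N)]

doubled = [2 * x for x in single]                # the "double everything" shortcut
summed = [x + y for x, y in zip(single, other)]  # two independent relationships

def low_high(values):
    """0.5th and 99.5th percentiles, the ends of a central 99% range."""
    cuts = statistics.quantiles(values, n=200)   # cut points at 0.5% steps
    return cuts[0], cuts[-1]

doubled_lo, doubled_hi = low_high(doubled)
summed_lo, summed_hi = low_high(summed)

print("doubled:", round(statistics.stdev(doubled), 2), doubled_lo, doubled_hi)
print("summed: ", round(statistics.stdev(summed), 2), summed_lo, summed_hi)
```

Both lists have the same average, but the spread of the doubled values is twice that of a single relationship, while the spread of the summed values grows only by a factor of about the square root of two. The doubled range therefore sits too low at the bottom and too high at the top.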
Table 2. All of the values and methodology are the same as in Table 1, except that percentages have been converted to cM. This was done by converting the percentages in Table 1 to fractions and then multiplying them by 7,174 cM, which is the total possible cM count of both copies of the genome at GEDmatch. The classic example people give for double first cousins is ‘when two brothers marry two sisters.’ But it can also occur when a man and a woman have a child, and then the man’s sister has a child with the woman’s brother. Since these two scenarios produce slightly different results, the values in Tables 1 & 2 are what you would see in a dataset in which one occurred half of the time, and the other occurred the other half.
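The conversion described in the caption is a single multiplication. Here is a minimal sketch; the function name `percent_to_cm` is my own, and the example inputs are illustrative expected averages rather than values copied from the tables.

```python
GEDMATCH_TOTAL_CM = 7174  # both copies of the genome, as quoted above

def percent_to_cm(percent, total_cm=GEDMATCH_TOTAL_CM):
    """Convert a shared percentage to cM: percent -> fraction -> cM."""
    return percent / 100 * total_cm

print(percent_to_cm(25.0))   # 1793.5
print(percent_to_cm(12.5))   # 896.75
```

A different site would need its own total passed as `total_cm`, since the cM total differs between platforms.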
If you prefer centiMorgans (cM) over percentages, you can consult Table 2 instead of Table 1. Using cM, the values obtained by doubling regular first cousin averages are likely to be 223 cM higher than what AncestryDNA would report for double first cousins, on average. This could lead to a lot of false conclusions. One should analyze suspected double first cousin matches at a different site.
Using the conventional wisdom of doubling ranges would lead you to underestimate double first cousins at the lower end by 173 cM and overestimate them at the higher end by 109 cM. Because the doubled range is too wide, a true double first cousin is unlikely to be mistaken for some other relationship, but some other relationship is more likely to be mistaken for double first cousins. Surely, the best solution is to use highly accurate double cousin averages and ranges, as can be found here.
Table 3. This table shows the differences between double second cousins and simply multiplying the values for second cousins by two. The methodology is the same as for all other tables. Sex-averaging for the third row was done by choosing alternating paths: The shared great-grandparents are the cousin’s mother’s father’s parents and the cousin’s father’s mother’s parents. The fourth row shows a different type of double second cousin. In all of the other above relationships, parents were not related to each other. In the fourth row, however, two pairs of great-grandparents are shared as a result of one cousin’s parents being first cousins to each other.
Table 3 shows, yet again, that doubling regular cousin values can work for averages. And, with double second cousins, one doesn’t have to worry about FIR. But the ranges are still much too wide after multiplying by two. This is for the same reason discussed above. In the case of double second cousins, the lower end of the range, after multiplying by two, underestimates the value by about 1.3 percentage points and the higher end overestimates by 2.6 percentage points.
And now there’s another problem. The bottom row in Table 3 shows a case in which doubling the values for one second cousin doesn’t get the right average. That’s because one cousin’s parents are first cousins to each other, and both of those great-grandparent pairs are shared with the other cousin. The resulting average is lower than what you get if you multiply the regular value by two. And the range is shifted lower. This is going to happen any time that the double relationship arises due to parents being related.
Table 4. All of the values and methodology are the same as in Table 3, except that percentages have been converted to cM. This was done in the same way that the percentages in Table 1 were converted for Table 2.
Table 4 shows the values for double second cousins in cM rather than percentage. We see that the value at the lower end of the range underestimates by 92 cM and the value at the upper end overestimates by 186 cM. Additionally, we see a case in which the average is 28 cM less because of a different configuration leading to the double relationship.
We’ve only looked at three scenarios: double first cousins and two types of double second cousins. But I believe that that’s enough to show the important points. Doubling the values for a single cousin relationship can get you close to the right averages, but sometimes it doesn’t, and it never gets the ranges right. I’ll reiterate the lessons learned from this experiment:
- If you find yourself investigating whether or not a match is a double cousin of some type, it’s important to determine if there might be fully-identical regions (FIR) shared. If so, AncestryDNA won’t be of much help.
- For any double cousin relationship, doubling the value for a single cousin relationship or adding together the values for two different levels of relationship is going to create a much wider range than what is actually possible.
- If any parents in the tree are related to each other, other than the farthest back ancestor pairs or the testers themselves, the shared average and ranges for double cousins are going to be lower than what you would get if you doubled the values or added the values from two different single cousin relationships.
I hope that you’ve found this information useful. If you have a request for me to add a certain multiple cousin relationship to my tables, I’d be glad to do so. It will be much better than adding averages and ranges together.
Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. That model was also used to make a very accurate relationship prediction tool. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits.
Originally published at http://www.dna-sci.com.