AutoClusters from MyHeritage
One company finally did it. They’ve released a tool that automates the very task that we’ve all been doing manually.
In the same week that I found out GEDmatch had brought back its X-Matching and family trees, I was surprised to see that MyHeritage had attempted to pull off the grandest of feats as far as genealogical tools are concerned. It isn’t often that I give a company such high praise. In the past year, MyHeritage has gone from one of the DNA testing companies I used and thought about the least to the one I use and think about the most. Here I’ll talk about why this feature is such a good idea and what its limitations are.
I first got the idea of automatically sorting DNA matches into clusters in February 2017. I’m sure I was nowhere near the first person to think of it. I was new to genetic genealogy at the time, but I had just left work at a university’s human behavior lab in the modeling and simulation department, so I had some experience with such ideas. I had been using a free program called Gephi, which draws beautiful graphs of social networks from nodes (such as people) and edges (the fact that two nodes are linked, optionally weighted by the strength of that connection).
I thought, “Why not let Gephi sort my DNA matches for me?” Among a lot of other parameters, Gephi lets you specify the number of clusters that your nodes will be grouped into. In theory, I thought, if you specified two clusters, it could separate all of your DNA matches into those that are maternal and those that are paternal. If you specify five clusters, you might get three that are paternal and two that are maternal, or some other combination. If three of your grandparents were from the U.S. and one was from somewhere else, you might get no clusters at all from that grandparent whose family hadn’t been in the U.S. for a long time.
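To make the idea concrete, here is a minimal sketch in Python of what "letting software sort the matches" could look like. It is not Gephi's algorithm (Gephi detects communities in more sophisticated ways) and it is not what MyHeritage does. It is a simple stand-in that drops weak links and then groups whoever is still connected. The names, the cM values, and the `min_cm` cutoff are all invented for illustration.

```python
# Hypothetical data: (match_a, match_b, shared_cM) for pairs of my matches
# who also match each other. All names and numbers are made up.
edges = [
    ("Ann", "Bob", 85.0),
    ("Bob", "Cat", 42.0),
    ("Ann", "Cat", 60.0),
    ("Dan", "Eve", 120.0),
    ("Eve", "Fay", 33.0),
    ("Cat", "Dan", 9.0),  # a weak link between the two groups
]


def cluster(edges, min_cm=20.0):
    """Group matches into connected components, ignoring edges below min_cm.

    This is a simple stand-in for real community detection, not Gephi's
    or MyHeritage's method. min_cm is an assumed cutoff.
    """
    neighbors = {}
    for a, b, cm in edges:
        # Register both people so a match with no strong links
        # still ends up in a cluster of its own.
        neighbors.setdefault(a, set())
        neighbors.setdefault(b, set())
        if cm >= min_cm:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen = set()
    clusters = []
    for start in sorted(neighbors):
        if start in seen:
            continue
        # Depth-first walk collecting everyone reachable from `start`.
        stack, group = [start], set()
        while stack:
            person = stack.pop()
            if person in group:
                continue
            group.add(person)
            stack.extend(neighbors[person] - group)
        seen |= group
        clusters.append(sorted(group))
    return clusters


clusters = cluster(edges)
for number, members in enumerate(clusters, start=1):
    print(f"Cluster {number}: {', '.join(members)}")
```

With the default cutoff, the weak Cat–Dan link is ignored and the six people fall into two groups. Lower `min_cm` far enough and that one weak link merges everything into a single cluster, which is a small preview of the threshold trade-off discussed further down.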
One problem that I had right away was that an enormous amount of data preprocessing was necessary. That February I spent a lot of my free time entering the number of centiMorgans (cM) that my mutual matches shared with each other. The number of cM I shared with my matches was easily available in a spreadsheet-like format, but while you can look at mutual matches on a webpage, there’s no way to export or copy and paste your mutual matches’ relationships with each other.
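For anyone curious what all that typing was working toward, the end product is just an edge list: one row per pair of matches, with the shared cM as the weight. Gephi's spreadsheet import accepts a CSV edge table with `Source`, `Target`, and `Weight` columns, so a rough sketch of producing one might look like this. The pairs below are invented, and `write_edge_list` is a helper name I made up for the example.

```python
import csv
import io


def write_edge_list(pairs, out):
    """Write (match_a, match_b, shared_cM) rows as a Gephi-style edge table."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for a, b, cm in pairs:
        writer.writerow([a, b, cm])


# Hypothetical hand-entered data: how many cM two of my matches share
# with each other. Names and numbers are made up.
pairs = [
    ("Ann", "Bob", 85.0),
    ("Bob", "Cat", 42.5),
]

# Written to an in-memory buffer here; passing an open file works the same way.
buffer = io.StringIO()
write_edge_list(pairs, buffer)
edge_csv = buffer.getvalue()
print(edge_csv)
```

The slow part was never the file format. It was collecting each row of `pairs` by hand from a web page.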
MyHeritage’s new feature promises to do all of this for you, although not with the beautiful kinds of graphs that you can create in Gephi. Needless to say, when I found out it was available at the end of February this year, I requested it right away. It took just over 24 hours for my results to come in.
The results came as a zipped file attached to an email. After unzipping it, I found that the contents were a CSV file, an HTML file, and a PDF titled “ReadMe.” You can view the results in either the CSV file or the HTML file. They showed me a list of 16 clusters, averaging about 5.5 matches per cluster.
MyHeritage would have no trouble compiling the data on not only your matches but also the relationships between your mutual matches, since they already have that information stored in a readily usable format. The AutoCluster tool only includes people who share 30 cM or more with you. This turned out to be a pretty big problem. Except for one group, I ended up only seeing matches from one side of the tree: the side that had ancestors from the U.S. However, I already know a lot about the ancestors on that side of the tree.
Within a few hours I was able to tell which part of my tree every group but one had come from. In most of those cases, I could place the cluster at least three generations back. In a few cases, I could only place it two generations back. I realized that this tool would be much more useful for someone who hasn’t studied their DNA matches very much. I, on the other hand, have a text file that covers almost all of my DNA segments, ordered from chromosome 1 to chromosome X and from segment start to segment end, with lists of the people who match most strongly on each segment. I’ve already done a lot of the hard work that this tool attempts to do, and I’ve done it in far greater detail. However, this tool would have been a great jumping-off point for my research, and it would have saved me quite a bit of time early on.
Another problem with the 30 cM threshold was that many people matched on one segment in the 20–30 cM range and then one or two segments in the 6–7 cM range. Those smaller segments may have been identical by chance, i.e., they may not reflect any real shared ancestry, yet they still helped push the total over the threshold.
The 30 cM match threshold was chosen to ensure that all of the matches in the results were really good, definite matches. If I could have specified the threshold, I would certainly have chosen something more like 20 cM, but I wouldn’t have allowed 6 cM segments to contribute to that total. An even better measure than centiMorgans would probably be the number of SNPs. I think of SNPs as indicating whether or not a match is a definite one, while cM give a better indication of what the relationship is.
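Here is a rough sketch of the rule I have in mind: add up only the segments that are long enough to trust, then compare that total to the threshold. The 20 cM total comes from what I said above. The 8 cM per-segment cutoff is an assumption I picked to sit just above the doubtful 6–7 cM range, and the matches and segment lengths are invented. This is my own wish-list logic, not how AutoCluster works.

```python
def qualifying_total(segments_cm, min_segment_cm=8.0):
    """Sum only the segments at or above the per-segment cutoff.

    min_segment_cm is an assumed value, chosen to exclude segments in
    the 6-7 cM range that may be identical by chance.
    """
    return sum(cm for cm in segments_cm if cm >= min_segment_cm)


def passes(segments_cm, min_total_cm=20.0, min_segment_cm=8.0):
    """True if the trusted segments alone reach the total threshold."""
    return qualifying_total(segments_cm, min_segment_cm) >= min_total_cm


# Hypothetical matches, each with a list of shared segment lengths in cM.
matches = {
    "Ann": [24.0, 6.5],       # 30.5 cM raw, 24.0 cM from trusted segments
    "Bob": [18.0, 6.2, 6.0],  # 30.2 cM raw, but only 18.0 cM trusted
    "Cat": [35.0],            # one solid segment
}

kept = sorted(name for name, segs in matches.items() if passes(segs))
print(kept)
```

All three of these invented matches clear a plain 30 cM total. Under the stricter rule Bob drops out, because more than a third of his total comes from tiny segments.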
A lower threshold would also have let me look at some of the matches who I suspect share distant ancestors of interest to me. It’s important to understand, though, that the lower the threshold, the less confidence you can have in your results.
I think that this new tool was a really good idea and I applaud MyHeritage for doing it before any other companies. I’m sure that its ability to predict which matches have which mutual ancestors will greatly improve over time.