AutoClusters from MyHeritage

MyHeritage gives us free access to Evert-Jan Blom’s tool that does what we’ve all been waiting for!

Brit Nicholson
Mar 7, 2019

In the same week that I found out GEDMatch had brought back its X-Matching and family trees, I was surprised to see that MyHeritage has attempted to pull off the grandest of feats as far as genealogical tools are concerned. It isn’t often that I give such high praise to any company. In the past year MyHeritage has gone from one of the DNA testing companies I used and thought about the least to the one I use and think about the most. Here I’ll talk about why this feature is such a good idea and what its limitations are.

I first got the idea of automatically sorting DNA matches into clusters in February 2017. I’m sure I was nowhere near the first person. I was new to genetic genealogy at the time, but I had just left work at a university’s human behavior lab in the modeling and simulation department, so I had some experience with such ideas. I had been using a free program called Gephi, which makes beautiful graphs of social networks based on nodes (such as people) and edges (the fact that two nodes are linked, or even a quantified strength of that connection).

I thought, “Why not let Gephi sort my DNA matches for me?” Among a lot of other parameters, Gephi lets you specify the number of clusters that your nodes will be grouped into. In theory, I thought, if you specified two clusters, it could separate all of your DNA matches into those that are maternal and those that are paternal. If you specify five clusters, you might get three that are paternal and two that are maternal, or some other combination. If three of your grandparents were from the U.S. and one was from somewhere else, you might get no clusters at all from that grandparent whose family hadn’t been in the U.S. for a long time.

A graph that I created in Gephi in February, 2017. Matches for which I had a decent amount of data were automatically clustered into groups, meaning that I probably shared a recent common ancestor with all of the people in a particular group.
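To make the idea of nodes, edges, and clusters concrete, here is a minimal Python sketch. It is not Gephi's own clustering method and not what AutoClusters does internally; as a simplifying assumption, it just drops weak links and groups together whoever remains connected. The names, the cM values, and the 20 cM cutoff are all invented for illustration.

```python
from collections import defaultdict

# Hypothetical shared-cM values between pairs of my matches (invented data).
shared_cm = {
    ("Ann", "Bob"): 85.0,
    ("Bob", "Cara"): 40.0,
    ("Ann", "Cara"): 62.0,
    ("Dev", "Eli"): 120.0,
    ("Cara", "Dev"): 9.0,  # weak link, treated as noise below
}


def cluster_matches(edges, min_cm=20.0):
    """Group matches into connected components, ignoring edges below min_cm.

    This is a stand-in for real community detection, assumed here only
    to keep the example short.
    """
    neighbors = defaultdict(set)
    for (a, b), cm in edges.items():
        neighbors[a]  # make sure every person is a node, even if isolated
        neighbors[b]
        if cm >= min_cm:
            neighbors[a].add(b)
            neighbors[b].add(a)

    seen, clusters = set(), []
    for person in sorted(neighbors):
        if person in seen:
            continue
        # Walk outward from this person to collect everyone reachable.
        stack, group = [person], set()
        while stack:
            current = stack.pop()
            if current in group:
                continue
            group.add(current)
            stack.extend(neighbors[current] - group)
        seen |= group
        clusters.append(sorted(group))
    return clusters


clusters = cluster_matches(shared_cm)
for number, members in enumerate(clusters, start=1):
    print(number, members)
# 1 ['Ann', 'Bob', 'Cara']
# 2 ['Dev', 'Eli']
```

With the cutoff lowered to 5 cM, the weak Cara and Dev link survives and everyone collapses into a single group, which is why the choice of cutoff matters so much.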

One problem that I had right away was that an enormous amount of data pre-processing was necessary. That February I spent a lot of my free time entering the number of centiMorgans (cM) that my mutual matches shared with each other. The number of cM I shared with my matches was easily available in a spreadsheet-like format, but while you can look at mutual matches on a webpage, there’s no way to export or copy and paste your mutual matches’ relationships with each other.
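The end product of all that typing is an edge list: one row per pair of matches, with the cM they share. Gephi's spreadsheet import reads CSV edge lists with Source, Target, and Weight columns. The sketch below shows roughly what that file looks like. The records and the helper name `to_edge_list_csv` are hypothetical, and keeping only one copy of a pair that was entered twice is my own assumption about how to tidy hand-entered data.

```python
import csv
import io

# Hypothetical hand-entered records: (match A, match B, shared cM).
pair_records = [
    ("Ann", "Bob", 85.0),
    ("Bob", "Ann", 85.0),  # same pair entered twice, in reverse order
    ("Ann", "Cara", 62.0),
]


def to_edge_list_csv(records):
    """Return CSV text with one Source,Target,Weight row per unique pair."""
    unique = {}
    for a, b, cm in records:
        key = tuple(sorted((a, b)))  # (Ann, Bob) and (Bob, Ann) are one edge
        unique[key] = cm

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for (a, b), cm in sorted(unique.items()):
        writer.writerow([a, b, cm])
    return buffer.getvalue()


edge_csv = to_edge_list_csv(pair_records)
print(edge_csv)
# Source,Target,Weight
# Ann,Bob,85.0
# Ann,Cara,62.0
```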

MyHeritage’s new feature promises to do all of this for you, although not with the beautiful kinds of graphs that you can create in Gephi. Needless to say, when I found out it was available at the end of February this year, I requested it right away. It took just over 24 hours for my results to come in.

The results came as a zipped file in an email. After unzipping it, I found that the contents were a CSV file, an HTML file, and a PDF titled “ReadMe.” You can view the results in either the CSV file or the HTML file. Mine showed a list of 16 clusters, averaging about 5.5 matches per cluster.

There is a graph that comes with the AutoCluster results. You’ll see it if you open the HTML file. It may appear that all of your clusters are in one ancestral line, but that isn’t the case. However, the graph does attempt to predict ancestral time starting from most recent on the top left and most distant at the bottom right.

MyHeritage would have no trouble compiling the data of not only your matches but also the relationships between your mutual matches, since they already have that information stored in a readily usable format. The AutoCluster tool only includes people who share 30 cM or more with you. This turned out to be a pretty big problem. Except for one group, I ended up only seeing matches from one side of my tree: the side that had ancestors from the U.S. However, I already know a lot about the ancestors on that side of the tree.

Within a few hours I was able to tell which part of my tree every group but one had come from. In most of those cases, I could tell which side the cluster was from at least three generations back. In a few cases, I could only tell for two generations back. I realized that this tool would be much more useful for someone who hasn’t studied their DNA matches very much. I, on the other hand, have a text file that covers almost all of my segments of DNA, ordered from chromosome 1 to chromosome X and from segment start to segment end, with lists of the people who match most strongly on those segments. I’ve already done a lot of the hard work that this tool attempts to do. On top of that, I’ve done it in far greater detail. However, this tool would have been a great jumping-off point in my research, and it would have saved me quite a bit of time early on.

Another problem with the 30 cM threshold was that many people matched on one segment in the 20–30 cM range and then one or two segments in the 6–7 cM range, which could push their total past 30 cM. Those smaller segments may have been identical by chance, i.e., they don’t indicate real shared ancestry.

The 30 cM match threshold was chosen to ensure that all of the matches in the results were really good, definite matches. If I could have specified the threshold, I would have certainly chosen something more like 20 cM. But I wouldn’t have allowed segments that were 6 cM or less to contribute to that total.
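Here is a rough sketch of the rule I have in mind, compared with a plain 30 cM total. It reflects my own preference, not how AutoClusters works, and the segment lengths, names, and function names are made up for the example.

```python
# Hypothetical segment lengths in cM for each match (invented data).
match_segments = {
    "Ann": [24.0, 6.0, 5.5],             # one solid segment plus small ones
    "Bob": [31.0, 12.0],
    "Cara": [14.0, 5.0],
    "Dev": [12.0, 6.0, 6.0, 6.0, 5.0],   # mostly small segments
}


def qualifying_total(segments, max_ignored_segment_cm=6.0):
    """Sum only the segments longer than max_ignored_segment_cm."""
    return sum(cm for cm in segments if cm > max_ignored_segment_cm)


def matches_passing(matches, total_threshold_cm=20.0, max_ignored_segment_cm=6.0):
    """Names whose total, counting only the larger segments, meets the threshold."""
    return sorted(
        name
        for name, segments in matches.items()
        if qualifying_total(segments, max_ignored_segment_cm) >= total_threshold_cm
    )


# A plain 30 cM total that counts every segment, however small.
print(matches_passing(match_segments, total_threshold_cm=30.0,
                      max_ignored_segment_cm=0.0))
# ['Ann', 'Bob', 'Dev']

# A 20 cM total that ignores segments of 6 cM or less.
print(matches_passing(match_segments))
# ['Ann', 'Bob']
```

In this made-up data, Dev clears the plain 30 cM bar only because of a pile of small segments, and drops out once those are ignored.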

A lower threshold would also have let me look at some of the matches who I suspect share distant ancestors of interest to me. But it’s important to understand that the lower the threshold, the less confidence you can have in your results.

Releasing this tool for free was a great idea. I applaud MyHeritage and Evert-Jan Blom for being the first to the game. I’m sure that its ability to predict which matches have which mutual ancestors will greatly improve over time.

Feel free to ask me about modeling & simulation, genetic genealogy, or genealogical research. And make sure to check out these ranges of shared DNA percentages or shared centiMorgans, which are the only published values that match peer-reviewed standard deviations. That model was also used to make a very accurate relationship prediction tool. Or, try a calculator that lets you find the amount of an ancestor’s DNA you have when combining multiple kits.
