Friday, 30 June 2017

Criteria for Grouping People into Y-DNA Genetic Families

One of the main tasks of Surname Project administrators is to place new members into the appropriate genetic group within their surname project.

Having run a variety of surname projects for the last few years, I have come up with a set of criteria I use on a routine basis to place newcomers into existing genetic groups and also to identify new genetic groups. I call these criteria Markers of Potential Relatedness (MPRs). And (not surprisingly) these can be thought of as indicators that two people may be "related" to each other, which for the purposes of surname projects means somewhere in the last 1000 years or so. This arbitrary timepoint is chosen because many European surnames were introduced about 1000 years ago (in particular British and Irish surnames), although they only became commonplace several centuries thereafter.

This approach to grouping works best with hereditary surnames (i.e. passed from father to son) but should also work with patronymic (and other) surnames, except that (in these latter cases) criteria 1 and 8 will not apply. The discussion below is very much from the standpoint of hereditary surname projects.

Not all criteria have to be met. But the more criteria that are met, the higher the likelihood of two people being related. This is particularly important in relation to SDSs (Surname or DNA Switches; also known as NPEs, Non-Paternity Events), as it may be difficult to distinguish a match that is an SDS (e.g. adoption, illegitimacy) from one that is due to Convergence.

Below is a list of these criteria and we will consider each one in turn. Some of these Markers of Potential Relatedness (MPRs) have nothing to do with DNA. If two people have the same surname, or the same unusual surname variant, or have a similar ancestral homeland, or even an ancestor with the exact same name, then these can be indicators that the two people are related. And because they don't rely on genetics I simply call them "traditional markers" as opposed to "genetic markers". 

MPRs for deciding if two or more people are related within the last 1000 years

In practice, the most useful indicators (or at least the ones I most frequently use) are Markers 1, 2, 6 and 7. And if a new project member is grouped on the basis of these "main" markers, it usually becomes apparent that they meet many of the remaining criteria also.

1. The members have the same surname

This is an obvious criterion, especially for surname projects that deal with hereditary surnames. If two people share the same surname, the next question is: are they related? And it would seem a reasonable supposition that there is a much higher probability that they are related on their direct male lines (within the last 1000 years) if they do share a surname than if they don't.

Problems tend to arise when there is some doubt over what is a valid surname variant and what is not. For example, are Malley and Malloy surname variants? Are Farrell and Farris surname variants? What happens when you get both types of variant testing positive for M222? Do you group them together or keep them apart? Only other MPRs (such as downstream SNP testing) can answer these questions.

2. The Genetic Distance (GD) between two people indicates a (very) close relationship

The threshold for "declaring a match" between two people varies with the number of STR markers tested (see below). These thresholds are arbitrary, but the intention is to get the right balance between false positives and false negatives - in other words, letting the wrong people in and keeping the right people out (known more technically as specificity and sensitivity).

Most people do the Y-DNA-37 test initially and I would usually feel very confident grouping together people with the same surname if their GD was 2/37 or less, and reasonably confident of grouping them together if the GD was 4/37 or less. The exception is where there is evidence of Convergence, as indicated (for example) by the terminal SNPs of their matches sitting on a wide variety of distantly related "upstream" branches of the Y-Haplotree (Tree of Mankind). We'll talk about this some more in item 7 below.

In addition, Convergence is a common occurrence in certain subclades, such as M222 and L226. When I see these terminal SNPs in a new project member, alarm bells start ringing, my level of conservatism increases, and I start looking to MPRs other than Genetic Distance to decide if two people belong in the same genetic family.

from www.familytreedna.com/privacy-policy.aspx

This technique for grouping people together will miss outliers - people who do indeed belong in the same genetic family but whose ancestors branched away from the main group many many generations ago. For example, in the Gleeson DNA Project, several of the members of Lineage II (all confirmed to be related by Big-Y SNP testing) have a GD of 10/37 compared to other group members, and that would usually preclude them being grouped together.
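The GD check described above can be sketched in a few lines. This is a minimal illustration, assuming a simple stepwise count (the sum of the absolute differences at each marker); FTDNA's own GD figure may treat some markers, such as multi-copy markers, differently. The marker values, the function names and the "look to other MPRs" label are all made up for the example, and the thresholds only apply to people who share a surname and show no signs of Convergence.

```python
def genetic_distance(a, b):
    """Sum of absolute differences over the markers both men have tested."""
    shared = a.keys() & b.keys()
    return sum(abs(a[m] - b[m]) for m in shared)


def confidence_at_37(gd):
    """Arbitrary thresholds from the text: <= 2/37 and <= 4/37."""
    if gd <= 2:
        return "very confident"
    if gd <= 4:
        return "reasonably confident"
    return "look to other MPRs"


# Hypothetical values for four markers (a real comparison would use all 37)
newcomer = {"DYS393": 13, "DYS390": 24, "DYS19": 14, "DYS391": 11}
member = {"DYS393": 13, "DYS390": 23, "DYS19": 14, "DYS391": 10}

gd = genetic_distance(newcomer, member)
print(gd, confidence_at_37(gd))
```

An outlier with a GD of 10/37 would fall into the last category here, which is exactly why the other MPRs are needed.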

3. The TiP24 score is >80% compared to the group modal haplotype

I don't use this marker so much anymore but it can be a useful way of assessing if a newcomer belongs in a given genetic family, especially if there is insufficient data regarding SNP markers among their STR matches. The potential benefit of this method is that it takes into account the varying mutation rates of STR markers whereas GD does not.

It involves generating a TiP Report between a new project member and the member closest to the modal haplotype for a given genetic family within the project, and then looking at the percentage probability of being related within 24 generations. We call this the TiP24 Score (for lack of a better term). If this is >80% (an arbitrary figure, which can be adjusted to suit your personal preference), then the newcomer can be considered to be "likely to be related" and therefore placed in that specific genetic family.



It is important to note that the use of the TiP24 Score is not an attempt to date when two people are related, merely to ascertain if two people are likely to be related. The TiP24 Score is simply an attempt to standardise GD comparisons, given that we know that a GD of (say) 4/37 on slow-mutating markers is much more significant than a GD of 4/37 on fast-mutating markers. The former (probably) indicates a much more distant relationship than the latter.

This technique works best for those related within the last several hundred years, but will miss outliers. I have several people in the Gleeson DNA Project (confirmed to be related via SNP testing) whose TiP24 Score with other members is as low as 1%.

Also, the TiP24 Score is likely to be tripped up by Convergence (in the same way that GD is) and is therefore of limited utility in such circumstances.

4. There is a clear Genetic Distance Demarcation between project members within a genetic cluster & project members outside it

Administrators have access to a tool called the "Y-DNA Genetic Distance" tool. This permits comparisons between any person in the project and every other person in the project. Often there will be a clear demarcation between a newcomer's range of GDs to a particular genetic family and all other genetic families within the project.

In the example below, the newcomer matches 9 members of R1b-Genetic Family 2 with a GD ranging from 4/67 to 9/67. Thereafter, the GD jumps to 16/67 and higher. This stark demarcation in GD suggests strongly that the newcomer falls within R1b-Genetic Family 2.

This also suggests that Convergence is unlikely to be an issue here (otherwise we might expect to see a more gradual increase in GD values, rather than the jump from 9 to 16 that we see here).

This technique works best with 111 or 67 marker comparisons. Demarcations are much less obvious using 37 marker comparisons.

The GD between the newcomer & other members
shows a clear demarcation between
one particular genetic family and all others
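One rough way of spotting such a demarcation is to sort the newcomer's GDs and look for the biggest jump between neighbouring values. The sketch below assumes you have copied the GDs out of the tool by hand; the list of values is hypothetical (loosely modelled on the 4 to 9, then 16 and upward, pattern described above) and the function name is my own.

```python
def largest_gap(gds):
    """Return (gap, lower, upper) for the biggest jump between consecutive sorted GDs."""
    ordered = sorted(gds)
    gaps = [(hi - lo, lo, hi) for lo, hi in zip(ordered, ordered[1:])]
    return max(gaps)


# Hypothetical GDs (at 67 markers) between a newcomer and other project members
gds = [4, 5, 5, 6, 7, 7, 8, 9, 9, 16, 17, 19, 21]

gap, lower, upper = largest_gap(gds)
print(f"largest jump: {lower} -> {upper} (gap of {gap})")
```

A large single jump supports the grouping; a steady climb with no obvious jump would be more in keeping with Convergence.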

5. Presence of Rare Marker Values or a Relatively Unique STR Signature among genetic group members

The idea here is that if two or more people share a Rare Marker Value, then it stands to reason that they are more likely to be related to each other, especially if they all share the same surname.

Leo Little's spreadsheet of STR marker value frequencies is very useful for identifying those values which are particularly rare, even though the spreadsheet only covers seven of the main haplogroups (E3a, E3b, G, I, J2, R1a, R1b). What constitutes "rare" is a moveable feast but a frequency less than 5% would not be unreasonable.

Usually these rare marker values emerge after several people have been grouped together. Any newcomers thereafter who share this rare marker value can be further assessed for membership of the specific genetic family wherein the rare marker value occurs. A famous example is Group B of the Wheaton Surname Project where 3 "rare" marker values occur within the first 12 markers (with incidences of 5%, 1% & 8% in the "general" R1b population). The chances of these occurring within the general population are 1 in 62,000. And therefore, any Wheaton who matches these 3 STR marker values can be automatically allocated to Group B (with 99.99% confidence). And they only need a 12-marker test to do so.
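The arithmetic behind this kind of estimate is simply the product of the individual frequencies. The sketch below assumes the three marker values occur independently of one another, which is a simplification. With the rounded incidences quoted above it gives roughly 1 in 25,000, the same order of magnitude as the 1 in 62,000 cited; the cited figure presumably rests on more precise frequencies than the rounded percentages.

```python
from math import prod


def joint_frequency(frequencies):
    """Multiply individual marker-value frequencies, assuming they are independent."""
    return prod(frequencies)


# Rounded incidences of the three rare values in the "general" R1b population
incidences = [0.05, 0.01, 0.08]

p = joint_frequency(incidences)
print(f"about 1 in {round(1 / p):,}")
```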

Leo Little's spreadsheet of marker value frequencies

An allied concept is that of the Relatively Unique STR Signature (also known by various other terms such as STR Motif). In short, these are a selection of STR marker values (usually between 3 and 8 in number) that are "unique" to just a few people within a surname project and which indicate that the people concerned are likely to be related to each other.

A good example from the Gleeson DNA Project shows that several members had relatively unique STR Signatures which predicted that they were related (Branch E and F below). This was later confirmed by SNP testing of the two branches.

Relatively Unique STR Signatures predict the existence of a Branch E and F (last 6 entries)
Branch E signature ... 464b=17, 607=14, 576=17
Branch F signature ... 391=10, 458=17, 459=9-9, 576=17
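Checking a newcomer against a set of signatures like these is easy to do by eye, but it can also be sketched in code. The example below uses the two signatures above; the newcomer's values are hypothetical, and storing values as text (so that a multi-copy value like 9-9 can be compared directly) is just a convenience for the illustration.

```python
SIGNATURES = {
    "Branch E": {"464b": "17", "607": "14", "576": "17"},
    "Branch F": {"391": "10", "458": "17", "459": "9-9", "576": "17"},
}


def matching_branches(haplotype):
    """Return the branches whose signature values all appear in the haplotype."""
    return [
        branch
        for branch, signature in SIGNATURES.items()
        if all(haplotype.get(marker) == value for marker, value in signature.items())
    ]


# Hypothetical newcomer: carries the full Branch F signature but not Branch E's
newcomer = {"391": "10", "458": "17", "459": "9-9", "576": "17", "607": "15", "464b": "16"}
print(matching_branches(newcomer))
```

A signature match is only a prediction, and (as with Branches E and F) it is SNP testing that confirms it.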

Robert Casey has developed this concept extensively and you can hear him talk about it in this video here.

6. SNP testing is consistent among the members of the particular group

The advent of Next Generation Sequencing (producing tests like the Big Y and the array of SNP Packs) has created a SNP tsunami. And as more people SNP test, their predicted red SNP is being converted to a green confirmed SNP on the project's Y-DNA Results page.

As a result, many groups within a surname project are having their "Terminal SNP" characterised. And this allows us to compare any SNP markers that the newcomer has tested with the SNP markers that characterise the various groups within our surname project. If they are discordant, then the newcomer is ruled out from membership of those particular genetic families. But if they agree with each other, especially if they are SNPs quite far downstream, then this is further supportive evidence that the newcomer belongs in a specific genetic family.

The phrase terminal SNP is a bit of a misnomer. It should be restated as "current terminal SNP" and simply means the "most downstream" SNP marker that you have currently tested. And what is meant by "most downstream"? Imagine the Tree of Mankind (the Y-Haplotree) as starting with genetic Adam (upstream) about 250,000 years ago and the various branches emerging from him and continuously branching over many thousands of years into finer and finer "more downstream" branches, until these finer branches start approaching the origin of surnames (roughly 1000 years ago) and a genealogical timeframe. So your "most downstream" branch would be the branch characterised by your "most downstream" SNP marker ... which in turn is determined by your current level of SNP testing. For example, your Y-DNA 37 STR results will predict which Haplogroup branch you sit on (let's say it is R-M269, which arose about 13.5K years ago), and the R-M269 SNP Pack will take you a little further down Branch R (say to Z255, 4000 years ago), and the R-Z255 SNP Pack will take you even further downstream (maybe to 2000 years ago), but the Big Y test will take you the furthest (maybe down to 500 years ago).

In the example below, all the green confirmed SNPs sit below the SNP marker that defines Gleeson Lineage II, namely A5631. Therefore any newcomer who matches any of these SNPs (even if he has a large GD to everyone in the project) can be reliably grouped into Lineage II. The abbreviated SNP Progressions (or SNP Signatures) for each of the individual SNPs are detailed below:

  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629
  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 
  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 > BY5707 
  • R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > A5628 > Y16880
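The check being made here (does the group's defining SNP sit on the path down to the newcomer's terminal SNP?) can be sketched as follows. The progressions are the four listed above; the function name is my own, and the last example is a hypothetical man who has only been tested as far as Z16437.

```python
PROGRESSIONS = [
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 > BY5707",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > A5628 > Y16880",
]


def sits_under(progression, defining_snp):
    """True if the defining SNP appears on the path down to the terminal SNP."""
    path = [snp.strip() for snp in progression.split(">")]
    return defining_snp in path


for progression in PROGRESSIONS:
    terminal = progression.split(">")[-1].strip()
    print(terminal, sits_under(progression, "A5631"))

# Hypothetical: tested only as far as Z16437, so Lineage II membership
# is neither confirmed nor ruled out by his SNP results
print(sits_under("R-Z255 > Z16437", "A5631"))
```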

Predicted SNPs (red) and Confirmed SNPs (green)

The predicted red SNPs are almost always much further upstream on the Tree of Mankind than the green confirmed SNPs. Think of the upstream SNPs as closer to Genetic Adam (250,000 years ago) and the downstream SNPs as closer to a genealogical timeframe (say, 1000 years ago).

7. SNP predictions are consistent (Matches’ Terminal SNP Analysis) 


NB: SNP Predictions does not mean the red predicted SNP you get in the Haplogroup column (see figure above) when you first get your Y-DNA-37 results. It refers to SNPs much further downstream than that, usually within the last 5000 years and frequently within the last 2000 years.

If a newcomer to the surname project has not undertaken downstream SNP testing, it is still possible to guess what his downstream "terminal SNP" will be by simply analysing the terminal SNPs of his STR matches. I call this the Matches' Terminal SNP Analysis. It is a relatively simple technique that takes a little time to complete. Here are the steps in the process:

1) First, open up the Y-DNA Matches page and adjust the Matches Per Page setting so that all the matches are on the one page.

2) Next click on the heading in the Y-DNA Haplogroup column so that all of the matches are sorted by their terminal SNP.


3) Make a list of all the terminal SNPs (you can ignore the SNPs that are way upstream e.g. M269, P312, L21, etc)

4) Find out where each SNP sits on the Y-Haplotree, and (most importantly) the major subclade to which it belongs. You can do this in either of two ways: a) launch FTDNA's Haplotree, press Ctrl+F (Cmd+F on a Mac) and enter the SNP name. Once you find it, trace the branch back up to the previous branching point, make a note of the SNP there, and repeat the process until you arrive at a known subclade SNP; or b) google the following: "ytree" and the SNP name ... and this will bring you to the relevant page on the Big Tree. Then simply copy and paste the SNP Progression from the top of the page.

A google search for: ytree a5631

5) Both of the above methods will result in you having a SNP Progression for each SNP in the Matches List (see example below). If all (or most) of these SNP Progressions fall below a certain subclade, then the likelihood is that the newcomer will also test positive for some SNP below this subclade level. It may even be possible to predict that he sits on one of maybe two or three "way downstream" branches. And this can be strong supportive evidence that he is related to certain project members and should be grouped in a particular genetic family.

If on the other hand, the various SNP Progressions associated with this list of SNPs indicate that the newcomer is matching to multiple distinct upstream branches of the Haplotree, then no firm conclusions can be drawn about the newcomer's likely terminal SNP and therefore this information cannot be used to help place him in a specific genetic family.

6) As a result of this analysis, I may write to the newcomer and suggest they skip the upstream SNP Pack (e.g. R-M269) and move down to the more relevant downstream subclade SNP Pack (e.g. R-Z255) and purchase that one ... warning them that there is a 1% chance that my assessment may be wrong (but I haven't been wrong yet).

Output of the MTSA for a new project member
(he was advised to do the R-L1065 SNP Pack)
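Step 5 of the analysis boils down to finding the furthest-downstream SNP that all (or most) of the matches' SNP Progressions have in common. Here is a minimal sketch of that idea. The 80% cut-off for "most" is an arbitrary assumption of mine, the first example reuses the Lineage II progressions from item 6, and the second uses made-up placeholder names to mimic matches scattered across distinct upstream branches.

```python
from collections import Counter


def parse(progression):
    """Split 'R-Z255 > Z16437 > ...' into a list of SNP names, upstream first."""
    return [snp.strip() for snp in progression.split(">")]


def deepest_shared_snp(progressions, min_share=0.8):
    """Walk downstream and return the last SNP carried by at least min_share of matches."""
    paths = [parse(p) for p in progressions]
    total = len(paths)
    deepest = None
    level = 0
    while True:
        at_level = [p[level] for p in paths if len(p) > level]
        if not at_level:
            break
        snp, count = Counter(at_level).most_common(1)[0]
        if count / total < min_share:
            break
        deepest = snp
        # keep following only the matches that sit on this branch
        paths = [p for p in paths if len(p) > level and p[level] == snp]
        level += 1
    return deepest


lineage_ii = [
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > BY5706 > BY5707",
    "R-Z255 > Z16437 > Z16438 > BY2852 > A5631 > A5629 > A5628 > Y16880",
]
print(deepest_shared_snp(lineage_ii))

# Placeholder names (not real SNPs): matches spread over three distinct branches
scattered = ["R-TOP > BranchX > X1", "R-TOP > BranchY > Y1", "R-TOP > BranchZ > Z1"]
print(deepest_shared_snp(scattered))
```

In the second case only the far-upstream SNP is shared, which corresponds to the situation where no firm conclusion can be drawn.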

There are SNP Packs available for most of the major subclades and it is important to know what these are. You can see a list of them by logging in to your FTDNA account, clicking on Upgrade, then Advanced Tests, then SNP Packs from the drop-down menu.

Surprisingly, this analysis works best at the 25-marker level (because there are usually too few matches at the 37, 67 and 111 marker levels).

Occasionally I will have to use www.Ybrowse.org to check for the existence of equivalent SNPs or alternative names (if the SNP in question does not turn up in the FTDNA Haplotree or the Big Tree).

8. The same surname variant is predominant in a genetic group

This usually emerges after the new project member has been grouped on the basis of the previous MPRs described above. This serves to support and validate the decision to group the newcomer in the specific genetic family.

9. The same MDKA location is present in the particular genetic group

As above. This serves to illustrate how essential it is to encourage all project members to include the birth location of their Most Distant Known Ancestor (MDKA / EKA) in the Genealogy section of their personal FTDNA webpages. After their surname, their ancestor's birth location is the single most important piece of information.


Always include the birth location of the EKA / MDKA

10. The same MDKA is present in the particular genetic group

This is the ultimate validation that the grouping based on the preceding MPRs is valid and accurate.


For a more detailed discussion of these various criteria, watch the video below. This is suitable for beginners, those who have already done the Y-DNA test and want to find out what it means, and for Surname Project Administrators.




Maurice Gleeson
June 2017








Tuesday, 13 June 2017

WDYTYA 2017 - videos going online

This year was the last year of Who Do You Think You Are? - Live! The event was an annual staple of the British genealogical calendar for the last 10 years. Starting in 2008 in Olympia in London, it moved to the National Exhibition Centre in Birmingham in 2014. The event attracted thousands of attendees year on year, and in 2009 Brian Swann, ISOGG-UK representative, persuaded FamilyTreeDNA to sponsor a stand at the event.

Shortly thereafter, the DNA Workshop began. And for the past several years this has been kindly sponsored by FamilyTreeDNA and run by volunteers from ISOGG. Each year the lecture schedule has attracted a host of international and local speakers, both academics and citizen scientists. And in addition, videos of the presentations have been made available free of charge on our dedicated YouTube channel as a service to the genetic genealogy community.



It is sad to see the demise of the WDYTYA event. It was a wonderful way of keeping in touch with friends and colleagues, and everyone looked forward to the manic three days of early mornings and late nights. Hopefully another annual event will rise to take its place. Nevertheless, at least the videos of the presentations will serve as a lasting legacy of the ten year run of WDYTYA.

The last batch of videos ever are ready to be uploaded to the YouTube channel and this will happen every Monday and Thursday over the coming 6 weeks or so. Three videos have already been uploaded and are attracting a large audience:

The Science of Admixture Percentages (Garrett Hellenthal)

DNA, emigration and shipping (Brian Swann)

Autosomal DNA demystified (Debbie Kennett)


The schedule of lectures from this year's event is indicated below (click the image to enlarge). Most of the presenters gave permission to upload their lectures and a big thank you is due to the speakers for their generosity.

And this year we managed to get better audio recordings than ever before. Who would have thought that dangling a small microphone in front of the loudspeaker and recording a separate audio track on your iPhone would be the best way of conquering the ever-present background noise from the 10,000 people in the auditorium?!

Enjoy.

Maurice Gleeson
June 2017











Thursday, 1 June 2017

Convergence - quantifying Back & Parallel Mutations (Part 1)

In a recent post I explored the concept of Convergence and made the point that the mechanism by which Convergence arises is via a combination of Parallel Mutations and Back Mutations in the STR marker values. These mutations are changes that occurred at some time in the past but because they remain hidden to us in the present, we cannot tell when they occurred or how frequently they occurred just by looking at two sets of STR results from people living today.

However, there is a way around this problem. Or at least a partial solution.

By using a combination of STR data and SNP data we can build a Mutation History Tree that is a more accurate representation of the branching structure of the "family tree" for a specific genetic group. And this type of tree allows us to more easily (and more accurately) spot Back Mutations and Parallel Mutations.

I did this for one particular genetic family in one of my surname projects - the North Tipperary Gleesons (Lineage II of the Gleason DNA Project). This tree is a "best fit" tree, by which I mean a tree constructed in such a way as to explain the STR & SNP data in the most parsimonious way i.e. with the fewest number of branches that will accommodate or "fit" the data. This approach is also called the "maximum parsimony" approach and is often used when building cladograms or phylogenetic trees. The Mutation History Tree (MHT) is simply another type of cladogram. You can read about the process of how the tree was developed in this blog post here and subsequent posts.

But a key point here is that this "best fit" tree is likely to change as more data becomes available. And to illustrate this point, I'm going to compare the current version of the tree (Dec 2016) with the next version that is being prepared following the recent availability of new data from 12 sets of Z255 SNP Pack results.

Below is the current version of the MHT for Lineage II. By comparing each mutation in the tree with every other one, we can identify which mutations are Back Mutations (occurring on a single line of descent) and which are Parallel Mutations (occurring on two or more lines of descent). I have highlighted the Back Mutations in yellow and the Parallel Mutations in green.


Back Mutations in yellow, Parallel Mutations in green
from Gleeson Lineage II MHT (version Dec 2016)

Parallel Mutations occur in the following lines of descent:
  • CDYb 40-39 ... A, E, D, F (4 times)
  • CDYa 39-38 ... A, B, C, F (4 times)
  • 464c 17-16 ... A x2, D (3 times)
  • 461 12-11 ... A, B (2 times)
  • 576 18-19 ... A, D (2 times)
  • 390 23-24 ... A, B, C (3 times)
  • 390 24-23 ... B, C (2 times)
  • 456 16-15 ... B, D (2 times)
  • and so on ...

Back Mutations are more difficult to count, and to conceptualise. Whether you consider the value as mutating forward or back is entirely dependent on your reference point. If our anchor is the upstream Z255 branch, then the original value of marker 390 (for example) is 24, mutating (forward) to 23 on the Z16438 branch, and then back to 24 (in parallel) on Branches A, B & C, and then back to 23 (again in parallel) on Branches B & C. So there are several points to make here:
  • this is in fact a Back Mutation that occurs in parallel in 3 separate lines of descent. It is thus both a Back Mutation (relative to its earlier value of 24 on the Z255 branch) and a Parallel Mutation, occurring at (presumably) different time points in Branches A, B & C. It is thus coloured yellow and green.
  • It can also be considered a Triple Mutation relative to the Z255 branch - in the sense that it mutates forward to 23 then back to 24, then back to 23 again. But what happens if it flips forward and back 5 times? What would we call that? And what do we call it if it goes two steps forward and one step back? This is where terminology fails us. I'm not sure if there is a standardised way of describing these different kinds of mutation (if there is, please leave a comment below).
  • the mutation 390 24-23 occurs in Branches B & C ... relative to its value of 24 in the Z255 branch, this could be considered a Parallel Forward Back Forward Mutation ... for Pete's Sake!!
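The dependence on the reference point is easier to see if the mutations are written down as data. The sketch below records only the marker 390 events described above, on the assumption that each one arose independently on the branch where it is listed; the data layout and function names are mine. A mutation counts as Parallel if the same change turns up on two or more lines, and as a Back Mutation if it returns the marker to its value on whichever anchor branch you choose.

```python
from collections import defaultdict

# branch -> list of (marker, old value, new value), oldest first
EVENTS = {
    "A": [("390", 23, 24)],
    "B": [("390", 23, 24), ("390", 24, 23)],
    "C": [("390", 23, 24), ("390", 24, 23)],
}


def parallel_mutations(events):
    """Mutations that occur on two or more lines of descent."""
    lines = defaultdict(set)
    for branch, mutations in events.items():
        for mutation in mutations:
            lines[mutation].add(branch)
    return {m: sorted(b) for m, b in lines.items() if len(b) >= 2}


def back_mutations(events, marker, anchor_value):
    """Mutations that return the marker to its value on the chosen anchor branch."""
    return sorted(
        (branch, old, new)
        for branch, mutations in events.items()
        for (m, old, new) in mutations
        if m == marker and new == anchor_value
    )


print(parallel_mutations(EVENTS))
print(back_mutations(EVENTS, "390", 24))  # anchor value of 24
print(back_mutations(EVENTS, "390", 23))  # anchor value of 23
```

Changing the anchor value from 24 to 23 changes which events are counted as Back Mutations, even though the underlying data are identical.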

But let's just focus on the Back Mutations that occur downstream of the branch characterised by the STR mutation (710 36-37), just above the A5627 SNP Block. This "710 branch" incorporates all the Gleesons of Lineage II, from Branch A to F.* On this overarching branch for Lineage II, the value of the STR marker 390 is 23 and Back Mutations are as follows:
  • 390 24-23 ... B, C ... this is the only Back Mutation below the "710 branch"
  • And it is also a Parallel Mutation
  • All the other yellow Back Mutations are relative to the upstream Z255 branch, and not our downstream "710 branch", and so are not counted in this particular exercise.

So, let's generate some statistics from these numbers:
  • The total number of mutations below the "710 branch" (irrespective of whether they are forward or back) is 71.
  • There are 69 Forward Mutations (i.e. away from the original value of the relevant marker on the "710 branch")
    • 31 Forward Mutations show an increase in the number (e.g. 9 to 10)
    • 38 Forward Mutations show a decrease in the number (e.g. 9 to 8)
  • There are 2 Back Mutations 
    • both Back Mutations show a decrease in the number (i.e. 24 to 23)
  • There are 26 Parallel Mutations
  • Forward Mutations outnumber Back Mutations by a ratio of 34.5 : 1
  • Parallel Mutations outnumber Back Mutations by a ratio of 13 : 1
  • There are 16 people in this tree, and if we make the big assumption that the "710 branch" starts 1000 years ago (i.e. roughly at the time of the introduction of the Gleeson surname), then over the course of 1000 years, the rate of each type of mutation is (crudely) as follows:
    • Forward Mutations = 69/16 = 4.3125 mutations per "line of descent" per 1000 years
    • Back Mutations = 2/16 = 0.125 mutations per "line of descent" per 1000 years
    • Parallel Mutations = 26/16 = 1.625 mutations per "line of descent" per 1000 years
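For anyone who wants to repeat the exercise with their own tree, the sums are easily scripted. This sketch simply re-derives the ratios and rates from the counts listed above (69 forward divided by 2 back is 34.5); the variable names are mine, and the rates carry the same big assumption about the 1000-year timeframe.

```python
forward_up, forward_down = 31, 38   # Forward Mutations that increase / decrease the value
back = 2                            # Back Mutations below the "710 branch"
parallel = 26                       # Parallel Mutations
lines_of_descent = 16               # people in the tree

forward = forward_up + forward_down
total = forward + back

print(f"total mutations = {total}")
print(f"forward : back  = {forward / back} : 1")
print(f"parallel : back = {parallel / back} : 1")

# crude rate per "line of descent" per 1000 years
for label, count in [("forward", forward), ("back", back), ("parallel", parallel)]:
    print(f"{label:8s} {count / lines_of_descent}")
```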

These are crude estimates but they give some idea of the relative importance of Parallel Mutations compared to Back Mutations. And applying this information to the phenomenon of Convergence, it would seem that Back Mutations play a very minor role compared to Parallel Mutations.

This conjecture is supported by some recent modelling work undertaken by Dave Vance and written up for the L21 Yahoo Discussion Forum. In Dave's simple model, which is an extremely useful basis for further discussion, the "average tree" could expect to have a ratio of Parallel to Back Mutations in the range of 25:1 to 50:1.

This is a lot higher than what I have shown in my MHT for the Lineage II Gleesons, but this can be partly explained by the fact that there are only 16 people in my Gleeson sample, and we are looking at (perhaps) only the last 1000 years. I would predict that the ratio will increase further as 1) I add more people to the sample; and 2) the duration of observation is extended backward from 1000 years ago (the 710 Branch) to 4300 years ago (the Z255 Branch).

In subsequent posts we will see how these calculations stand up when we add in additional data from 12 SNP Pack results and reconfigure the MHT for Gleeson Lineage II into the next version of the "best fit" model. And we will also attempt to quantify the total number of Back & Parallel Mutations below the upstream marker Z255. And lastly, we will attempt to quantify Convergence itself.

Maurice Gleeson
June 2017

* the Big Y results of a 10th member of the group indicate that this branch is characterised by the SNP A5631 although this result is not reflected in this version of the MHT






Thursday, 25 May 2017

Convergence - what is it?

There are several phenomena encountered in the analysis of Y-DNA STR data that can throw a genetic spanner in the works, and Convergence is one of them!

In genetic genealogy, Convergence occurs when two men have DNA signatures that are exactly or nearly identical, but have evolved that way purely by chance. As a result, the two men will show up in each other's lists of matches and will give the false impression that they may be closely related (e.g. within the last several hundred years) when in fact they are much more distantly related (e.g. within the last several thousand years). The problem is we cannot tell that Convergence has occurred simply by looking at the two men's STR results. It is hidden from our view. We cannot see it just by looking at the present-day STR data. And the danger is that if the two men think they are closely related, they may start chasing their common connection, thinking that they will find the answer via further documentary research, when in fact there is little hope of that at all. Their "close match" is a red herring. And their pursuit of the Common Ancestor is a wild goose chase.

So what can we do about it? How can we recognise it? How can we avoid it wasting our precious research time?

Confusion

The concept is occasionally discussed in Facebook groups or on various blogs, but there tends to be quite a lot of confusion around what it actually means. And there are a variety of quite understandable reasons for this. 

Firstly, there isn't a standard definition for Convergence, so how it is used varies from person to person. Some people apply it only to exact matches, others apply it to exact and close matches. Moreover, the concept of Convergence is closely tied up with the concept of lack of Divergence. The two are different phenomena, but their effects and consequences are very similar. Another contributing factor is the fact that it is difficult to see it or detect it in practice. We know that it exists, but we have no way of identifying it just by comparing two sets of STR results. In other words, it's largely a hidden phenomenon (like Black Holes). It is only when we do SNP testing that the extent of Convergence becomes apparent. And the problem is that not enough people have done SNP testing.

The good news is that more and more people are doing SNP testing and as they do, the extent of Convergence becomes more apparent. The Lineage II members in the Gleason DNA Project are trailblazers in this regard and we will explore the results of the recent Z255 SNP Pack testing in subsequent blog posts.

But in this post, we will look at an example of Convergence from the Gleason DNA Project in order to illustrate some of the key characteristics and consequences of Convergence. In later posts, we will look at clues that may indicate that Convergence is present, attempt to quantify the number of Back Mutations & Parallel Mutations that occur over time (using the Mutation History Tree that we have previously constructed for Lineage II - the North Tipperary Gleesons), and finally we will attempt to quantify Convergence itself.

But first of all, let's look at some of the aspects of the definition of the term.

Definition

A general definition for the term convergence from the Concise Oxford English Dictionary illustrates some general characteristics of convergence that are worth exploring because they are of relevance to how the term is applied in genetic genealogy and to the analysis of Y-DNA STR data in particular:
converge 1. come together from different directions so as eventually to meet

convergent 2. Biology (of unrelated animals and plants) showing a tendency to evolve superficially similar characteristics ...

There are several important aspects to these definitions that we can apply to the analysis of STR data (e.g. your 37 marker data). First of all, the sense that things were initially apart, but then they come together. Secondly, the idea that two things can look the same or similar on the surface, but in fact they have come from very different directions. And thirdly, the idea that two things can evolve from something different into something the same.

Let's look at how this more general concept can be applied to the analysis of Y-STR data.

And a good starting point is the description of Convergence on the ISOGG Wiki:
Convergence (also known as evolutionary convergence) is a term used in genetic genealogy to describe the process whereby two different genetic signatures (usually Y-STR-based haplotypes) have mutated over time to become identical or near identical resulting in an accidental or coincidental match.
One can think of convergence as producing misleading matches – two men appear to be more closely related than they actually are. The same situation may result (very occasionally) if there is an exceptional lack of divergence. In other words, so few mutations occurred in the descendants of a common ancestor over the course of time that the common ancestor may appear to have lived only a few hundred years ago when in fact he lived much further back than that, perhaps several thousand years ago.
So let's pick apart some of the key elements of this definition. You might like to refamiliarise yourself with some basic concepts, such as the different types of DNA markers (STRs and SNPs), and what you are actually seeing when you look at the DNA Results page.

Basic Concepts

Firstly, the above description of Convergence refers to the genetic signature - the Y-STR haplotype. This is the string of numbers you see associated with your results on the DNA Results page of the project. I like to think of it as if the Y-chromosomes of all the men in the group were stacked up on top of each other, in such a way that the individual markers along the chromosome were aligned, with one column for each marker. Thus in the diagram below, each of the men has a value of 13 for the first marker. The values for the second marker are a mixture of 23 and 24. And so on.

The Y-STR results for the men of Lineage II
(click to enlarge)

Another key point in the above description is the concept that some markers mutate over time e.g. the number changes from 14 to 15. These mutations are identified by comparing the value in each square to the modal value for the entire group (i.e. the most frequent value among the men in that group). The most frequent values for each of the markers are used to generate the "modal haplotype" which is a virtual signature constructed from these most frequent values (and is represented by the row marked "MODE", the 3rd row from the top in the diagram above).

Mutations are indicated by coloured squares. If the value for any marker is the same as the modal value for that marker (i.e. the most common value among the men in that group), then the square that the value is in will not have a colour. If however, the value is higher than the norm, it will be coloured pink; if it is lower than the norm, it will be coloured purple.

If you and someone else have exactly the same string of numbers, you will have the same coloured squares and the same "no-colour" squares. If you are not exactly identical, you will have some coloured squares that the other person does not have ... and vice versa. In other words, the sequence of numbers, and hence colours, will be different. Each coloured square represents a mutation - a small minor increase or decrease in the number (compared to the norm) for that particular marker, in that particular individual.
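For readers who like to see the logic spelled out, here is a minimal sketch of how a modal haplotype and the pink/purple colouring could be worked out. The five-marker values and the function names are made up for illustration (they are not the Lineage II results), and ties for the most frequent value are simply resolved in favour of the value seen first, which is an assumption on my part.

```python
from collections import Counter

# Hypothetical 5-marker haplotypes for four men (illustrative values only,
# not taken from the Lineage II results).
haplotypes = {
    "Man 1": [13, 24, 14, 11, 15],
    "Man 2": [13, 23, 14, 11, 15],
    "Man 3": [13, 24, 15, 11, 14],
    "Man 4": [13, 24, 14, 11, 15],
}


def modal_haplotype(rows):
    """Return the most frequent value for each marker (one per column).

    If two values tie, the one encountered first is used (an assumption).
    """
    return [Counter(column).most_common(1)[0][0] for column in zip(*rows)]


def colour_row(row, mode):
    """Label each value: 'pink' if above the mode, 'purple' if below, '' if equal."""
    return ["pink" if v > m else "purple" if v < m else "" for v, m in zip(row, mode)]


mode = modal_haplotype(haplotypes.values())
colours = {name: colour_row(row, mode) for name, row in haplotypes.items()}

print("MODE ", mode)
for name, row in colours.items():
    print(name, row)
```

Counting the non-empty labels in a row gives the number of coloured squares for that man.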

Convergence in theory

Let's imagine that some distant ancestor living 10,000 years ago gave rise to four distinct lines of descent surviving today (represented by the men A, B, C, and D in the diagram below). Let's look at what happened to their first 37 STR markers over time, and let's assume that mutations only occurred in 5 of these STR markers, as shown in the diagram below. How did the values change over the passage of time, from 10,000 years ago to the present day? And how many of the descendants of this ancestor "match" each other today?

In descendant A, only one of these 5 STR markers mutated. It underwent a single mutation (from 13 to 14) about 6000 years ago, and that was the only mutation over the span of 10,000 years. This is a rather extreme example of "lack of Divergence".

Descendant B had several mutations in his line of descent, but only affecting the first and the fifth markers. These show progressive "forward mutations" away from their original values. With the first marker, the mutations go forward in an upward direction (14,15,16,17) whilst with the fifth marker they go forward in a downward direction (15,14,13,12). This latter may seem counterintuitive but it serves to emphasise that "forward" means "away from" the original value, no matter if it is up numerically or down numerically.

Descendant C also has experienced mutations in only the first and fifth marker. But here we see two examples of a Back Mutation. The first marker shows a forward mutation 6000 years ago (13 becomes 12) but this has gone back to 13 by 4000 years ago. It then undergoes another forward mutation by the time of the present day (13 to 14). Similarly, the fifth marker undergoes a forward mutation (16 to 17) by 4000 years ago but a Back Mutation by 2000 years ago.

Descendant D undergoes mutations on all 5 of his STR markers. A Back Mutation occurs with the second marker between 2000 years ago and the present day (15 to 14); and likewise with the third marker (12 to 13); and likewise with the fifth marker (17 to 16). Two Back Mutations occur with the fourth marker (29 to 30 by 4000 years ago; and 31 to 30 by the present day).

Mutations over time in 4 distinct lines of descendants

Remember, these are four distinct lines of descent, with the MRCA (Most Recent Common Ancestor) represented by the first row of 5 STR markers in the diagram above. So now let's look to see if any of the mutations that occurred in these four individual lines of descent occurred in parallel i.e. the same mutational change occurred in two completely separate lines of descent.

Have a look at the first marker in A, B and C. All three men developed the same mutation on this marker - a change from a value of 13 to 14. In Lines A and B this change occurred in parallel around 6000 years ago. In Line C, the change occurred in parallel around about the present day.

There is a similar parallel mutation between Line C and D. Look at the fifth marker - it increases in value from 16 to 17 around about 6000 years ago in Line D and 4000 years ago in Line C.

And there is a parallel back mutation present in Lines C and D also - the fifth marker switches from 17 to 16 about 2000 years ago in Line C and around about the present day in Line D.

With Back Mutations you are only looking at a single line of descent. With Parallel Mutations we are comparing two or more lines of descent. And we will see that in practice Parallel Mutations are much more common than Back Mutations and have a much greater role to play in the development of Convergence.
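The distinction between forward, back and parallel mutations can be sketched in a few lines of code. This is only an illustration under my own assumptions: the function names are hypothetical, the sequences are the values quoted in the text above (Line D's first marker is not described, so it is left out), and a step counts as "back" whenever it moves the marker closer to the ancestral value.

```python
def classify_steps(values):
    """Label each change in one line of descent as 'forward' or 'back'.

    values[0] is the ancestral value. A step is 'back' when it moves the
    marker closer to that ancestral value, and 'forward' when it moves away.
    """
    ancestral = values[0]
    labels = []
    for old, new in zip(values, values[1:]):
        if new == old:
            continue  # no mutation in this interval
        closer = abs(new - ancestral) < abs(old - ancestral)
        labels.append("back" if closer else "forward")
    return labels


def parallel_mutations(lines):
    """Return the steps (old, new) that occur in two or more separate lines."""
    steps = {}
    for name, values in lines.items():
        for old, new in set(zip(values, values[1:])):
            if old != new:
                steps.setdefault((old, new), []).append(name)
    return {step: names for step, names in steps.items() if len(names) > 1}


# First marker of Descendant C: 13 -> 12 -> 13 -> 14
c_first = classify_steps([13, 12, 13, 14])
# Fifth marker of Descendant B: 16 -> 15 -> 14 -> 13 -> 12
b_fifth = classify_steps([16, 15, 14, 13, 12])

# First marker in Lines A, B and C, as described in the text.
first_marker = {
    "A": [13, 14],
    "B": [13, 14, 15, 16, 17],
    "C": [13, 12, 13, 14],
}
shared = parallel_mutations(first_marker)

print(c_first)  # forward, back, forward
print(b_fifth)  # four forward steps, even though the numbers go down
print(shared)   # the 13 -> 14 step is shared by A, B and C
```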

The STR results of living people today tell us nothing about their evolutionary history - it is hidden from view

Which brings us to Convergence itself. Let's look at the Genetic Distance between each of these lines of descent. This helps to make the point that the DNA results from living people are only a snapshot in time. They do not tell us anything about how those STR values have evolved over the past 10,000 years:
  • A and B have a Genetic Distance (GD) of 7. This is made up of a 3-step difference on the first marker (14 vs 17) and a 4-step difference on the fifth marker (16 vs 12). And as these were the only changes on their first 37 markers, the GD would be written as 7/37. This exceeds FTDNA's threshold for declaring a match (i.e. 4 steps or less over the first 37 markers; written as 0-4/37) and so A and B would not appear in each other's list of matches.
  • A and C have a GD of zero. They are an exact match. Their GD for the first 37 markers is thus 0/37. They appear in each other's match list and the match looks really close. They think they have a common ancestor in the last few hundred years. They start comparing family trees, looking for the elusive ancestor. They will never find him. This is a wild goose chase. This is the consequence of Convergence.
  • A and D have a GD of 2 (or 2/37). This GD falls within the threshold for declaring a match. They both appear in the other's match list. They email each other, looking for the common ancestor - another wild goose chase. Another example of Convergence and its consequences.
  • B and C have a GD of 7/37. No match.
  • B and D have a GD of 9/37. No match.
  • C and D have a GD of 2/37. It's a match. It's Convergence. They don't know that. They spend months researching their connection. It's a wild goose chase.
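The Genetic Distance arithmetic in the list above can be reproduced with a short sketch. I have assumed a simple step-wise count (add up the difference on each marker), which may not be exactly how FTDNA treat every marker, and I have only included the first and fifth markers for A, B and C because those are the only present-day values spelled out in the text. All the other markers are assumed to be identical.

```python
MATCH_THRESHOLD_37 = 4  # FTDNA's threshold at 37 markers, as quoted above


def genetic_distance(a, b):
    """Sum of the step differences across all markers (simple step-wise count)."""
    return sum(abs(x - y) for x, y in zip(a, b))


def is_match(gd, threshold=MATCH_THRESHOLD_37):
    """True if the pair would appear in each other's list of matches."""
    return gd <= threshold


# Present-day values for the first and fifth markers only.
present = {"A": (14, 16), "B": (17, 12), "C": (14, 16)}

for x, y in [("A", "B"), ("A", "C"), ("B", "C")]:
    gd = genetic_distance(present[x], present[y])
    print(f"{x} vs {y}: GD {gd}/37, match: {is_match(gd)}")
```

Note that the calculation only ever sees the present-day values. Nothing in it can distinguish the A and C match (Convergence) from a genuinely recent common ancestor.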

The STR results of people living today tell us nothing about how those STR marker values have evolved over time. They may have come from a relatively recent common source, or they may have come from widely differing directions.

Below is another way of conceptualising how the numerical value of a single STR marker might evolve over time. This marker started out with a value of 8 for the common ancestor of 4 distinct lines of descent. But by the time of the present day, two lines had a value of 9, one had a value of 13 and one had a value of 5. But the evolutionary history of these 4 lines of descent is peppered with Back Mutations and Parallel Mutations:
  • Back Mutations
    • Line 2 (red) - 14 becomes 13 some time between 1000 years ago and the present day (0)
    • Line 4 (purple) - 4 to 5 between 1000 and 0 years ago
    • Line 3 (green) - 5 to 6, 6 to 7, and 7 to 8 between 7000 (7K) and 4000 (4K) years ago
  • Parallel Mutations
    • 8 to 9 in Line 2 (10K to 9K), Line 1 (7K to 6K), and Line 3 (2K to 1K)
    • 8 to 7 in Line 3 (10K to 9K) and Line 4 (9K to 8K)
    • 7 to 6 in Line 3 (9K to 8K) and Line 4 (7K to 6K)
    • 6 to 5 in Line 3 (8K to 7K) and Line 4 (4K to 3K)

The evolution of values in a single STR marker over time in 4 descendant lines
of a common ancestor who lived some 10,000 years ago

The consequence of all these Parallel & Back Mutations is that the present day descendants of two of the lines (green Line 3 & blue Line 1) have exactly the same numerical value for this STR marker despite the fact that their evolutionary histories are so different.

This is an example of the evolutionary history for a single STR marker. And if this is representative of all STR markers, then the chances that the values for a particular marker will converge over time are really quite high. But our DNA results usually consist of 37 markers (the standard test most people start with) so what are the chances of the first 37 markers evolving in such a way as to result in convergence of a sufficient number of STR values to cause a coincidental match? ... well, the probability of that happening would be a lot lower. And the probability would be lower still with 67 markers, and lower still with 111 markers. But because so many people have tested (over 600,000 currently), we do see the phenomenon occurring even at higher marker levels (67 and 111).
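One way to get a feel for these probabilities is a small simulation. The sketch below is a toy model, not a validated estimate: it assumes every marker mutates at the same rate, one step at a time, up or down with equal chance, and that the two lines evolve independently from the same ancestor. The function names, the mutation rate and the number of generations are all hypothetical choices of mine.

```python
import random


def simulate_line(n_markers, n_generations, rate, rng):
    """Return each marker's net change after n_generations.

    Assumes a simple stepwise model: in every generation each marker has
    probability `rate` of changing by one step, up or down with equal chance.
    """
    offsets = [0] * n_markers
    for _ in range(n_generations):
        for i in range(n_markers):
            if rng.random() < rate:
                offsets[i] += rng.choice((-1, 1))
    return offsets


def coincidental_match_rate(n_markers, n_generations, rate, max_gd, trials, seed=0):
    """Fraction of simulated pairs of lines whose GD is max_gd or less."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        a = simulate_line(n_markers, n_generations, rate, rng)
        b = simulate_line(n_markers, n_generations, rate, rng)
        if sum(abs(x - y) for x, y in zip(a, b)) <= max_gd:
            hits += 1
    return hits / trials
```

As a usage example, coincidental_match_rate(1, 400, 0.003, 0, 1000) looks at a single marker over roughly 10,000 years (assuming 25 years per generation), whilst coincidental_match_rate(37, 400, 0.003, 4, 1000) applies the 4/37 threshold to all 37 markers. The second call takes noticeably longer to run, and the figures it produces depend entirely on the assumed rate.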

And in a subsequent post we will look at clues to the presence of Convergence, so that you can look at your own or anyone's list of matches and adjust your suspicion level accordingly.

Convergence in practice

And to illustrate these points, I have temporarily moved one of the ungrouped project members into Lineage II, namely member Jim Treacy (B38804)*. He is third from the end in the diagram below. Don't worry about not being able to read the text (you can click to enlarge the diagram if you like) - just focus on the coloured squares. 

The Y-STR results for the men of Lineage II (with a Treacy third from the end)
(click to enlarge)

And Jim has no coloured squares for the first half of the markers. It is only when we reach the 19th marker in the row that he has a pink square with the value 16 inside it - everyone else in that column has a value of 15 for that marker, except for one person who has a value of 14. And as we continue along Jim's row, there are 4 other coloured squares, bringing the total to 5. This can be expressed as a Genetic Distance of 5/37 from the modal haplotype (i.e. the 3rd row from the top, which - to remind you - is a virtual signature constructed from the most frequent values for each of the markers).

Now a GD of 5/37 between two men would mean that they do not appear in each other's list of matches (because FTDNA have set the threshold for "declaring" a match to be 4/37 or less). But among Jim's list of matches at the 37 marker level, there are two members of Lineage II (with a GD of 4/37). And at the 67 marker level, Jim has 6 members of Lineage II among his matches (with a GD of 6 to 7/67). So it looks (on the surface) as if Jim is relatively closely related to our Lineage II group. And this suggests (on the surface) that there may be a common ancestor some time in the past several hundred years, maybe somewhere between 1700-1850 (on the basis of TMRCA calculations based on the TiP Report).

So what do we do next? Do we start looking for documentary evidence? Do we go back to the church records and land records and old newspapers to see if there is mention of a Gleeson-Treacy connection? 

We could do. But it would be a wild goose chase. Because the Treacy-Gleeson connection is a red herring. And we know this because we have done SNP testing.

Jim has done the Big Y test, as have 10 of the members of Lineage II. Both Jim and Lineage II members belong to Haplogroup R, and both share some SNP markers in common. Each marker characterises a branching point in the Tree of Mankind and a SNP Progression is a list of these SNP markers down to the finer "more downstream" branches of the Tree. Here are the SNP Progressions for Jim and for the Lineage II Gleeson's:
  • Jim Treacy: R-P312 > Z290 > L21 > DF13 > ZZ10 > Z255 > Z16437 > A557 > Z29008 > A10891
  • Lineage II Gleeson's: R-P312 > Z290 > L21 > DF13 > ZZ10 > Z255 > Z16437 > Z16438 > BY2852 > A5631

You can see that the branching points are exactly the same ... until marker Z16437. Thereafter, Jim goes down one branch and the Gleeson's go down another one. Now, let's be clear: the Gleeson's and Jim do share a common ancestor. And if he was around today he would test positive for the SNP marker Z16437. But his children would have evolved along different paths - one path taking us down to our present-day Jim Treacy, the other taking us down to our present-day Gleeson's. You can see where Jim and the Gleeson's are placed on the Tree of Mankind in the diagram below.
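Finding the point where two SNP Progressions part company is easy to do by eye, but here is a minimal sketch of the same comparison in code. The function name is hypothetical. The two lists are the progressions quoted above, with the A557 branch being Jim's.

```python
jim = ["R-P312", "Z290", "L21", "DF13", "ZZ10", "Z255",
       "Z16437", "A557", "Z29008", "A10891"]
gleeson = ["R-P312", "Z290", "L21", "DF13", "ZZ10", "Z255",
           "Z16437", "Z16438", "BY2852", "A5631"]


def last_shared_snp(path_a, path_b):
    """Walk both progressions from the top and return the last SNP they share."""
    shared = None
    for a, b in zip(path_a, path_b):
        if a != b:
            break
        shared = a
    return shared


branch_point = last_shared_snp(jim, gleeson)
print(branch_point)  # Z16437
```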

And when did this common ancestor live? YFULL date the formation of Z16437 as 1650 years ago. The two markers downstream of this, A557 (Jim Treacy) and A5631 (Gleeson), both have formation dates of 1400 years ago. So from this we can say that the common ancestor of Treacy & the Gleeson's lived somewhere between 1400 and 1650 years ago. Or to give it an actual date (by subtracting from 1950, the approximate birth year for members of Lineage II), sometime between 300 and 550 AD.

This is clearly a lot further back in time than the 1700-1850 AD estimate suggested by the STR data.
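The conversion from "years ago" to a calendar date is a simple subtraction, sketched here so the arithmetic can be checked. The reference year of 1950 is the approximate birth year used above; the function name is my own.

```python
REFERENCE_YEAR = 1950  # approximate birth year for members of Lineage II


def years_ago_to_ad(years_ago, reference_year=REFERENCE_YEAR):
    """Convert a 'years before the reference year' estimate to a year AD."""
    return reference_year - years_ago


earliest = years_ago_to_ad(1650)  # formation of Z16437
latest = years_ago_to_ad(1400)    # formation of A557 and A5631

print(f"Common ancestor lived between {earliest} and {latest} AD")
```

That gives 1950 - 1650 = 300 AD at one end and 1950 - 1400 = 550 AD at the other.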

So this is a great example of Convergence. By chance, Jim's STR signature has evolved over time to approximate that of the Gleeson's of Lineage II and as a result, he looks a lot more closely related to the group than he actually is.

Maurice Gleeson
May 2017

* a big thank you to Jim for allowing me to use his name and his results in this example


Gleeson's to the left, Treacy's to the right, & about 1500 years in between






Friday, 19 May 2017

23andMe Transition arrives in UK & Ireland

Some time ago, 23andMe transitioned their US customers to a new website format, whilst those of us in Europe remained with the old format. That was quite some time ago! But just this week, I have received an email informing me that I will be transitioned to the new format in June 2017. 

Below is the email I received. Of note, all Health Reports will be archived as pdf documents. I received mine before the FDA (Food & Drug Administration) put the extended hold on 23andMe's Health Reports, so I have 63 reports on physical traits, 53 on carrier status for inherited conditions, 25 on drug response, and 122 on health risks for a variety of medical conditions including Alzheimer's Disease and Parkinson's. 




The first bullet point talks about "Ethnicity" but on my screen it is described as "Ancestry" - click on your name (top right), then Edit Profile, & you will see it directly under the Ancestry Information heading. Click on Update.

You can also enter or update your ethnicity by clicking on the green button above (in the email you receive). Of particular note, if you manage several kits, after filling out the survey for your first kit, be sure to switch profiles and complete the survey for each one of your kits.

The new 23andMe experience is discussed on their international webpages here, and additional information for European customers is available on this link here and is abstracted below.

Some of the key features that stand out for me include:
  • some Health Reports may be available (depending on which chip was used - you can find this information on your Download Raw Data page in the Profile box toward the end of the page)
  • the maximum number of matches has increased to 2000
  • linking to online trees is allowed, even if they are with other companies (saves you the hassle of having to upload a gedcom ... which anyway is no longer available with the new experience)
  • when defining haplogroup subclades, they have switched from the old terminology (e.g. R1b1a) to the new one (e.g. I-M253)
  • any connections you currently share with your matches will be maintained in the new experience

One of the best additional features of the new experience will be the Relatives in Common feature. This is similar to the Shared Matches feature on Ancestry and the ICW (In Common With) Matches feature on FamilyTreeDNA.







Maurice Gleeson
May 2017








Thursday, 20 April 2017

7-Day Sale till 27 April 2017


National DNA Day is April 25.
Celebrate with us! 

FamilyTreeDNA are having a 7-Day Sale to celebrate National DNA Day (which marks the discovery of the structure of DNA in 1953). It runs from April 20th until April 27th 2017. The promotion ends at 11:59 pm Central Time on Thursday, April 27th (which is 5:59 am on April 28th in Britain & Ireland). Please note that all items must be paid for by that time, including items ordered through the invoice system (Bill Me Later).

Below are the items that are on Sale. Of particular note are the Family Finder test for a mere $59 (55 euro, £46), the Y-DNA-37 test for $129 (121 euro, £101), and the SNP Packs for an unbelievably low price of $89 (83 euro, £70).
If you want to dip your toe in the genepool, now is the time - do the Family Finder test. It will give you your ethnic makeup estimates and connect you with cousins on all of your ancestral lines.

If you have thought about researching your surname, or buying a Y-DNA test for a relative, now is the time!

And if you have been advised to upgrade to a SNP Pack, they have never been cheaper and you should take advantage of this limited 7-day offer.

Happy DNA Day!
Maurice Gleeson
April 2017


Tuesday, 31 January 2017

Getting the Most out of your Y-DNA test (from FamilyTreeDNA)

The advice below pertains mainly to people who have tested their Y-DNA at FamilyTreeDNA, but some of the general principles apply to everybody, no matter which test you have done or which company you have tested with.

There are a few essential actions you should take to get the most out of your DNA test. You may not be able to do all of them all at once, so come back to this page often and check it out again to see if there is anything else you could be doing to maximise the value you get from your DNA test.

You may wish to share the link to this page with anyone else who might be interested in doing a DNA test so that they can see what they will get if they do.

Make yourself visible to your cousins

If no one can see you, you won't be able to connect with your cousins. So try to make yourself as visible as possible (or as visible as you feel comfortable with).

1) Prepare your surname's Ancestral Line (from you up to your surname's MDKA). This is the single most important piece of information that you can share. You will need this in your collaborations with other project members. In addition, many projects have a facility for posting this information somewhere on the project-related webpages. For example, in our Gleason/Gleeson DNA Project, these will go up on our Patriarchs & Matriarchs Page on the blog or the Patriarchs Page on the WFN website. This will potentially help other people to connect with you. It would help if you could provide it in the following format:
1) James GLEESON b c1835 Shallee, Co. Tipperary, d 12 Nov 1879 Longstone, Co. Tipperary, m 13 Apr 1860 Maria COYLE, Silvermines, Co. Tipperary
2) Morty GLEESON ...
3) John GLEESON ...
4) Abigail GLEESON … but not including dates for a) births <100 years ago, b) marriages <75 years ago, or c) deaths <50 years ago
Researcher: (insert your initials here)
Your email address
DNA Kits: (insert your DNA kit numbers)
Link to online tree: www.some-website.com
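If you keep your tree in a spreadsheet or database and want to apply the date cut-offs above automatically, the rule could be sketched like this. The function and dictionary names are hypothetical, and I have read "<100 years ago" as meaning that an event exactly 100 years old may be shared.

```python
# Minimum age (in years) before a date is shared, per the guideline above.
MIN_AGE = {"birth": 100, "marriage": 75, "death": 50}


def may_share_date(event, year, current_year):
    """True if the event is old enough for its date to be included."""
    return current_year - year >= MIN_AGE[event]


print(may_share_date("birth", 1835, 2017))     # True - well over 100 years ago
print(may_share_date("marriage", 1960, 2017))  # False - only 57 years ago
print(may_share_date("death", 1990, 2017))     # False - only 27 years ago
```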

2) Use your kit number and password to Log in to your personal webpage and explore it. There are a lot of bits & pieces of information you can include on your personal webpage that will optimise your chances of successful collaboration with your DNA matches. And knowing what your DNA results can tell you will help you get the most out of them.

3) You should add your MDKA information (Most Distant Known Ancestor) including dates & locations for both birth and death. The format we recommend is the same as the one above, but you may have to abbreviate it as only a certain number of letters are allowed in this field. Location of birth is the most important piece of information. Here is an example:
James GLEESON b1835 Shallee, Tipp, d1879 Longstone, Tipp
To add this information, simply click on your name in the top right of your homepage - Account Settings - Genealogy - Most Distant Ancestors ... I have posted instructions on how to do this on the following link ... http://farrelldna.blogspot.co.uk/2015/01/essential-information-everyone-should.html

4) Fill out your MDKA Profile. In essence, this is your Brick Wall. And the more information you can give about it, the better the chance of breaking through it. There are lots of clues and circumstantial evidence from documentary data that may help you identify a possible connection with other members of the group.  This applies to all project members but is most relevant to members with Irish ancestry given that the records tend to peter out about 1800. Check out the MDKA Profile page of the Gleeson DNA Project for instructions on how to complete the profile for your own MDKA. You can also view an example of it here.

5) Add your Ancestral Surnames (click on your name in the top right - Account Settings - Genealogy - Surnames). I suggest putting SURNAMES in capital letters and Locations in normal text, as this makes the surnames "jump out" and easier for the reader to scan through.

6) Upload your Family Tree as a GEDCOM file so that you have a version of your family tree on your FTDNA webpages. This is particularly important if you have done a Family Finder test (autosomal DNA). You can also add your Family Tree manually if it is easier for you. And if you have a Family Tree online, leave a link to it in the About Me section of your Personal Profile. Click here for specific instructions on uploading a GEDCOM file - https://www.familytreedna.com/learn/ftdna/how-to-family-tree/

7) Optimise your Privacy settings so that your potential cousins can see your results:
  • Hover over your Name in the top right
  • Click on Account Settings, then the Privacy & Sharing tab at the end of the menu bar above
  • Then simply change the settings under My DNA Results by clicking on the words "Project Members" at the end, and on the next screen checking the box beside "Make my mtDNA & Y-DNA data public". Then press Save.

Before the change
After the change


Check out Project-related Resources

There are a lot of resources that are particularly relevant for Surname DNA Projects and you should check out and use these as you feel appropriate.

8) Join the relevant Surname Project. There are over 9200 of them at FamilyTreeDNA (FTDNA). You can either search for it via Google (simply type in: FTDNA & your surname) or you can search for it via the FTDNA Search page. Once you have joined, the Project Administrator should look at your results (within a week or so) and assign you to a particular group within the project. You can also email the Admin if you have any questions. Their email address is usually on the Home Page of the project.

9) If you join a surname project, check out the various pages of the project website - they usually have a lot of useful information that will help you understand your results. See the Gleason / Gleeson DNA Project blog as an example.

10) Join the relevant Haplogroup projects
Your results will reveal your haplogroup (your branch of the human Y-DNA tree and/or human mtDNA tree). Once your results arrive, make sure you join all the relevant projects as these will assist in the further analysis of your data and in particular your deep ancestry (where in the world your particular ancestors originated several thousand years ago). The projects are run by volunteer project administrators and they are a rich source for advice, guidance, and support. Frequently there is an associated mailing list or Facebook group you can join to keep abreast of up-to-date developments (this is a fast-moving field).

As an example, relevant Y-DNA haplogroup projects for each of the Gleason Lineages identified thus far include the following:

If your haplogroup project is not listed here, you can see if there is a specific project for your haplogroup on this list: http://www.isogg.org/wiki/Y-DNA_haplogroup_projects

11) Join the relevant Geographical Projects
As an example, relevant Y-DNA geographical projects for each of the Gleason Lineages identified thus far include the following:
There may be other geographical projects that are relevant to your ancestral line and you can find them on this list: http://www.isogg.org/wiki/Geographical_DNA_projects


Check out General Resources

There is a lot of information out there about genetic genealogy in general and it can be a bit confusing knowing where to find it. Below is a selection of our "best bits".

12) FTDNA have a lot of useful information in their Learning Centre. Be sure to check out the FAQs (Frequently Asked Questions).

13) The ISOGG wiki is a great place to start looking for general information about any topic related to genetic genealogy, including your particular type of test.

14) Read Kelly Wheaton's beginners’ guide to genetic genealogy: https://sites.google.com/site/wheatonsurname/beginners-guide-to-genetic-genealogy

15) Download and read the e-book from the resources tab on your myFTDNA homepage.

16) There are a variety of different YouTube videos on genetic genealogy which have been prepared by ISOGG members and Project Administrators.

17) Sign up to the relevant genetic genealogy mailing lists, forums and Facebook groups. These can be great sources of help if you have a specific question. See the list here: http://www.isogg.org/wiki/Genetic_genealogy_mailing_lists.
I particularly recommend:

18) Read blogs written by experienced genetic genealogists. See this list of genetic genealogy blogs: http://www.isogg.org/wiki/Genetic_genealogy_blogs

19) Read the relevant articles about your specific DNA-test ...

Y-DNA - traces your father's father's father's line
Y-DNA basics: http://www.familytreedna.com/learn/dna-basics/ydna

Mitochondrial DNA (mtDNA) - traces your mother's mother's mother's line
mtDNA testing for advanced users: http://www.familytreedna.com/learn/mtdna-testing

These two pages are relevant if you have taken the full mitochondrial sequence (FMS) test:
mtDNA Community: http://www.familytreedna.com/learn/mtdna-community
mtDNA scientific collaboration: http://www.familytreedna.com/learn/mtdna-results-donation

Autosomal DNA (atDNA) - traces all your ancestral lines
Understanding Family Finder results: http://www.familytreedna.com/faq/answers.aspx?id=17
Understanding Population Finder results: http://www.familytreedna.com/faq/answers.aspx?id=22


Please let me know if any of these links are broken or cease working.


Maurice Gleeson
Jan 2017