
Index

1. Introduction to Standard & Generalized MKT

2. Web interfaces

Standard MKT
Advanced MKT

Multi-locus MKT
Main parameters

3. Analysis of the Sequences

Treatment of sites or codons with multiple changes
Estimation of the number of synonymous and non-synonymous changes
Estimation of the number of degenerate sites/changes
Estimation of the number of sites/changes in non-coding regions

4. MKT Output

Standard / Advanced MKT
Multi-locus MKT
 

 

1. Introduction to Standard & Generalized MKT

The comparison of patterns of polymorphism and divergence is one of the most powerful approaches to investigate selection at the DNA level. The McDonald-Kreitman test (MKT) is close to a mandatory step within this approach. It compares the amount of variation within a species to the divergence between species at two types of site, one of which is putatively neutral and used as the reference to detect selection at the other type of site. As the test was initially described (McDonald and Kreitman 1991), these sites were synonymous (putatively neutral) and non-synonymous sites in a coding region. However, the test for selection can potentially be extended to any two types of site, provided that one of them is assumed to evolve neutrally and that both types of site are linked in the genome. Furthermore, the rationale of the test can be used to analyze multiple loci in the genome simultaneously, provided that appropriate multi-locus statistical tests are applied.

The Standard and Generalized MKT website is the first Web-based resource where users can perform standard or generalized McDonald-Kreitman tests using different interfaces:

The Standard MKT allows users to create 2x2 contingency tables by comparing two types of coding sites, such as synonymous and non-synonymous changes in a coding region.

The Advanced MKT is an extension of the standard test which allows comparing any two linked regions in the genome, including non-coding DNA. Note that for the analysis to make sense the two regions must be tightly linked in the genome. The Neutral Class selected will be compared to any other type of site to test for selection.

The Multi-locus MKT is an interface where users can analyze multiple coding regions in a single multi-locus MKT.

The Main Parameters page contains selectable parameters that apply to any of the three distinct interfaces for the MKT website.



The MKT Menu on the left allows you to navigate through the different pages of the Web site:

  1. The Main page, from which you can input your data.
  2. The Help page (this page).
  3. The Example page, which contains a pre-computed example for each type of test.
  4. The Contact us page, where you can find information about the authors.

The Resources menu contains quick links to some related resources.

 

2. Web interfaces

Standard MKT

NAME OF THE ANALYSIS: Enter a name to identify the current analysis.

(1) TYPES OF SITE TO ANALYZE: Select here the types of site you want to analyze. The classical McDonald-Kreitman Test compares synonymous and non-synonymous changes in the coding region (selected by default), but you can also choose to analyze four-fold degenerate, two-fold degenerate or non-degenerate sites.

For a single analysis you can only select those classes of sites that are mutually exclusive, i.e.:

  • Synonymous sites can only be compared to non-synonymous sites or non-degenerate sites.

  • Non-synonymous sites can only be compared to synonymous sites or four-fold degenerate sites.

  • Four-fold degenerate sites can only be compared to non-synonymous sites or two-fold and/or non-degenerate sites.

  • Two-fold degenerate sites can only be compared to four-fold and/or non-degenerate sites.

  • Non-degenerate sites can only be compared to synonymous sites or four-fold and/or two-fold degenerate sites.

 

(2) SET AS NEUTRAL: Select here which class of site will be taken as the neutral reference. In the classical McDonald-Kreitman Test, non-synonymous changes are compared to synonymous changes, the latter being the neutral reference. However, if you use the Advanced MKT interface to compare one gene to its pseudogene, you will want to choose the pseudogene sequence as the neutral reference. You can also compare four-fold degenerate sites of a coding region to a nearby non-coding region; in this case you will probably want to choose four-fold degenerate sites as the neutral reference.

 

(3) PASTE THE SEQUENCES: Paste your sequences here, either unaligned or already aligned. The format must be FASTA (or gapped FASTA if the sequences are already aligned). You need to enter at least 2 sequences for at least one species (from which polymorphism will be calculated) and at least 1 sequence for the other species (for divergence estimates). However, you can also include polymorphism data for both species, in which case the polymorphism counts of the two species will be added together. You can also upload the sequences in two separate FASTA files.

 

(4) ANNOTATIONS: When you enter the sequences in the corresponding species box, the annotation box is automatically filled as follows: 'Sequence X --> 1..n' where X is the sequence number (following the order in the species box) and n is the sequence length. You can modify these annotations in order to specify which part of each sequence you want to include in the analysis.
If you uploaded a file with the sequences, you have to enter the annotations manually. For each sequence, enter a line as follows: 'Sequence X --> n..m', where X is the number of the sequence and the expression n..m defines the range of bases that will be analyzed (if you want to join different parts of the sequence you can enter 'Sequence X --> n..m,o..p,q..r', where o..p and q..r specify different base ranges in the sequence).
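The annotation syntax described above can be illustrated with a minimal Python sketch. The function names are hypothetical, and the sketch assumes that coordinates are 1-based and inclusive, which this page does not state explicitly.

```python
import re

def parse_annotation(line):
    """Parse 'Sequence X --> n..m,o..p' into (X, [(n, m), (o, p)])."""
    match = re.fullmatch(r"\s*Sequence\s+(\d+)\s*-->\s*(.+?)\s*", line)
    if match is None:
        raise ValueError(f"not an annotation line: {line!r}")
    ranges = []
    for part in match.group(2).split(","):
        start, end = part.strip().split("..")
        ranges.append((int(start), int(end)))
    return int(match.group(1)), ranges

def extract(sequence, ranges):
    """Join the annotated ranges (assumption: 1-based, inclusive coordinates)."""
    return "".join(sequence[start - 1:end] for start, end in ranges)
```

For instance, parse_annotation("Sequence 2 --> 1..6,10..12") yields sequence number 2 with two ranges, and extract() joins those ranges into the stretch that would be analyzed.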

 

Advanced MKT

The advanced MKT form is similar to the standard MKT (and the statistical procedure for the calculations is exactly the same) but here you can analyze two separate regions that can be either coding or non-coding. Note that for the analysis to make sense the two regions must be tightly linked in the genome.

There are two boxes similar to the box in the standard MKT form in which you can enter the two different regions.

In each box you will find:

  • NAME OF THE REGION TO COMPARE: Enter the name of the region.

  • TYPE OF SITES TO ANALYZE: Select the types of site you want to analyze. In this case, you first have to determine if your sequences are coding or non-coding. If they are coding, you have to select which classes of sites you want to analyze (see above). In this form, there is a new category that adds the number of synonymous and non-synonymous changes altogether. Remember that only comparisons involving two mutually exclusive types of site are allowed and this category cannot be compared to any other type of site from the same region, since sites would not be mutually exclusive.

  • SET AS NEUTRAL: Determine the class of site you want to use as the neutral reference (see above). Note that only one neutral class will be allowed to be selected at either the first OR the second region, and that this selected neutral class will be compared to any other classes of sites selected from the two regions.

  • PASTE THE SEQUENCES: Paste or upload your sequences in FASTA format (see above).

  • ANNOTATIONS: Write here annotations corresponding to the bases on your sequences you want to analyse (see above).

 

Multi-locus MKT

Here you can analyze multiple loci in a single multi-locus MKT. You can analyze only Coding Regions in a form very similar to the Standard MKT, or Coding and/or Non-Coding Regions in a form very similar to the Advanced MKT, but in both cases sequences must be entered in a new FASTA-based format that supports multi-locus data.

This new FASTA-based format contains two different types of heading lines:

  • Lines starting with '>>' contain each locus name and are used to separate the sequences of each locus.

  • Lines starting with '>' are normal headings for FASTA sequences within a locus.

Sequences must thus be introduced following a certain order, e.g.:

>>Name_of_locus_1
>Sequence_1
actactactacta...
>Sequence_2
actactactacta...
>Sequence_3
actactactacta...

>>Name_of_locus_2
>Sequence_1
ggggcgcgtat...
>Sequence_2
ggggcgcgtat...

Please note that each locus name in species 1 must match exactly a locus name in species 2 (names are case-sensitive!), and that the order of the input loci must be the same in both species. In the form where you can analyze also non-coding regions, the locus name must match exactly not only for the two species, but also for the two regions analyzed.
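As an illustration of the multi-locus format, the following Python sketch reads the '>>' and '>' heading lines into a nested dictionary and checks the locus-matching rule. It is only a sketch under the rules stated above; the function names are hypothetical and this is not the parser used by the website.

```python
def parse_multilocus_fasta(text):
    """Return {locus_name: {sequence_name: sequence}}, keeping input order."""
    loci = {}
    locus = None
    name = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line.startswith(">>"):          # locus heading
            locus = line[2:].strip()
            loci[locus] = {}
            name = None
        elif line.startswith(">"):         # ordinary FASTA heading
            if locus is None:
                raise ValueError("sequence heading found before any '>>' line")
            name = line[1:].strip()
            loci[locus][name] = ""
        else:                              # sequence data, possibly multi-line
            if locus is None or name is None:
                raise ValueError("sequence data found before a heading line")
            loci[locus][name] += line
    return loci

def check_loci_match(species1, species2):
    """Locus names must match exactly (case-sensitive) and in the same order."""
    return list(species1) == list(species2)
```

Running check_loci_match() on the parsed input of both species before submitting is a quick way to catch a misspelled or reordered locus name.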

See other parameters above.

 

Main Parameters

Finally, a set of main parameters apply to any of the forms described above:

(1) EXCLUDE LOW FREQUENCY VARIANTS: you can exclude variants under a given threshold frequency (i.e. rare polymorphisms).

(2) CHOOSE THE GENETIC CODE: available genetic codes include (from NCBI):

  • The Universal Code.

  • The Vertebrate Mitochondrial Code.

  • The Yeast Mitochondrial Code: for Saccharomyces cerevisiae, Candida glabrata, Hansenula saturnus and Kluyveromyces thermotolerans.

  • The Mold, Protozoan and Coelenterate Mitochondrial Code and the Mycoplasma/Spiroplasma Code: for Mollicutes (Entomoplasmatales and Mycoplasmatales), Fungi (Emericella nidulans, Neurospora crassa, Podospora anserina, Acremonium, Candida parapsilosis, Trichophyton rubrum, Dekkera/Brettanomyces, Eeniella and Ascobolus immersus, Aspergillus amstelodami, Claviceps purpurea, Cochliobolus heterostrophus), other Eukaryotes (Gigartinales among the red algae, and the protozoa Trypanosoma brucei, Leishmania tarentolae, Paramecium tetraurelia, Tetrahymena pyriformis and Plasmodium gallinaceum) and Metazoa (Coelenterata: Ctenophora and Cnidaria).

  • The Invertebrate Mitochondrial Code: for Nematoda (Ascaris and Caenorhabditis), Mollusca (Bivalvia and Polyplacophora), Arthropoda/Crustacea (Artemia) and Arthropoda/Insecta (Drosophila, Locusta migratoria and Apis mellifera).

  • The Ciliate, Dasycladacean and Hexamita Nuclear Code: for Ciliata (Oxytricha and Stylonychia, Paramecium, Tetrahymena, Oxytrichidae and Glaucoma chattoni), Dasycladaceae (Acetabularia and Batophora), and Diplomonadida (Hexamita inflata, Diplomonadida ATCC50330 and ATCC50380).

  • The Echinoderm and Flatworm Mitochondrial Code: for Asterozoa (starfishes), Echinozoa (sea urchins) and Rhabditophora among the Platyhelminthes.

  • The Euplotid Nuclear Code: for Ciliata (Euplotidae).

  • The Bacterial and Plant Plastid Code: for Bacteria, Archaea, prokaryotic viruses and chloroplast proteins.

  • The Alternative Yeast Nuclear Code: for Candida albicans, Candida cylindracea, Candida melibiosica, Candida parapsilosis and Candida rugosa.

  • The Ascidian Mitochondrial Code: for a phylogenetically diverse sample of tunicates (Urochordata).

  • The Alternative Flatworm Mitochondrial Code: for Platyhelminthes (flatworms).

  • The Blepharisma Nuclear Code: for Blepharisma.

  • The Chlorophycean Mitochondrial Code: for Chlorophyceae and Spizellomyces punctatus.

  • The Trematode Mitochondrial Code: for Trematoda.

  • The Scenedesmus obliquus Mitochondrial Code: for Scenedesmus obliquus.

  • The Thraustochytrium Mitochondrial Code: for Thraustochytrium aureum.

(3) ALIGN SEQUENCES: if this checkbox is selected, sequences will be aligned prior to the analysis following the selected parameters.

(4) ALIGNMENT ORDER: you can also select the alignment order:

  • Species independently, then join: Sequences from species 1 and 2 will be aligned independently, and then the two alignments will be joined together. This option is recommended when sequences are divergent.

  • All sequences at the same time: All sequences from species 1 and 2 will be aligned together in a single step.

(5) CHOOSE THE ALIGNMENT PROGRAM: you can choose between two alignment programs: Muscle (Edgar 2004) or ClustalW2.0 (Larkin et al. 2007).

(6) SET THE PARAMETERS: set parameters for each alignment program:

  • MUSCLE

    • Output Format

      The alignments can be obtained in these formats: ClustalW, FASTA, HTML or Phylip sequential. If you want to select more than one format use Ctrl or Ctrl+Alt.

    • Create a Log file

      Select this checkbox if you want to get a log file. This file contains information about when the program started and finished, any error messages and warnings. It also contains the command line that has been executed, the internal parameters and the progress messages.

    • Output Tree

      You can get the tree resulting from the first or the second iteration.

    • Penalties

      • Gap Open

      • Gap Extend

      • Center

    • Diagonals

      Instead of aligning sequences in pairs, the program looks for "diagonals" (short regions of high similarity between two sequences). When this option is activated, the accuracy decreases but the speed increases. This option is recommended for large groups of related sequences.
      With 'diags' -> diagonals are always activated.
      With 'diags1' -> diagonals are activated for the first iteration only. The main objective of the first iteration is to rapidly construct a multiple alignment to improve the distance matrix, but it is not very sensitive to the quality of the alignment.
      With 'diags2' -> diagonals are activated for the second iteration only. The objective of the second iteration is to make the best possible progressive multiple alignment.

    • Maximum Trees

      This is the maximum number of new trees that are created in the second iteration. If the value is greater than 1 (1 is the default value), the process is repeated until it converges or until the specified number of trees is reached.

    • Maximum Iterations

      As this value decreases, the accuracy also decreases but the speed increases. It can range from 1 to 16. When 1-3 is chosen, the program performs the number of iterations selected. For values of 4 or more, the program continues iterating until it converges or until the specified number of iterations is reached. For huge alignments, 2 is recommended.

  • CLUSTALW2.0

    • FAST PAIRWISE ALIGNMENT

      Control the speed-sensitivity of the initial alignments.

      • Ktup

        The size of the exactly matching fragment that is used. Can range from 1 to 4 for DNA. Increase this to increase speed; decrease to improve sensitivity.

      • Window Length

        The number of diagonals around each "top" diagonal that are considered. Decrease for speed; increase for greater sensitivity.

      • Score Type

        The similarity scores may be expressed as raw scores (number of identical residues minus a "gap penalty" for each gap) or as percentage scores. If sequences are of very different lengths, percentage scores make more sense.

      • Top Diagonals

        The number of best diagonals in the imaginary dot-matrix plot that are considered. Decrease to increase speed; increase to improve sensitivity.

      • Pairgap

        The number of matching residues that must be found in order to introduce a gap. This should be larger than K-Tuple Size. This has little effect on speed or sensitivity.

    • MULTIPLE ALIGNMENT

      Control the gaps in the final multiple alignments.

      • Gap Open

        Reduce this to encourage gaps of all sizes; increase it to discourage them. Terminal gaps are penalized the same as all others, unless this is changed with the No End Gaps option. BEWARE of making this too small (approx. 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.

      • No End Gaps

        Here you can select if you want the terminal gaps to be penalized or not.

      • Gap Extension

        Reduce this to encourage longer gaps; increase it to shorten them. Terminal gaps are penalized the same as all others. BEWARE of making this too small (approx 5 or so); if the penalty is too small, the program may prefer to align each sequence opposite one long gap.

      • Gap Distances

        Penalization for the distance between gaps. Gaps that are less than this distance apart are penalized more than other gaps. This does not prevent close gaps; it makes them less frequent, promoting a block-like appearance of the alignment.

      • Transition Weight

        Gives transitions a weight between 0 and 1. A weight of 0 means that transitions are scored as mismatches, a weight of 1 gives the transitions the match score.
        For distantly related DNA sequences, the weight should be near to 0, for closely related sequences it can be useful to assign a higher score.

      • Delay Divergent Sequences

        This switch delays the alignment of the most distantly related sequences until after the most closely related sequences have been aligned. The setting shows the percent identity level required to delay the addition of a sequence; sequences that are less identical than this level to any other sequence will be aligned later.

      • Iteration

        With TREE you can iterate at each step of the progressive alignment.
        With ALIGNMENT you can iterate just on the final alignment.

      • Number Iterations

        The default number of iterations is 3. If you increase this value, the program will iterate until the score converges or until the maximum number of iterations is reached.

    • OUTPUT FORMAT

      The alignments can be obtained in one of these formats: Clustal with numbers, Clustal without numbers, GCG, GDE, Phylip, PIR or Nexus.

 

3. Analysis of the Sequences

Coding sequences are analyzed codon by codon. If one codon has a gap in any of the sequences, this codon is totally excluded from the analysis. If one codon is a stop codon, counts are performed as if it was another amino acid, but a warning is shown in the output page.

Non-coding sequences are analyzed position by position. If one position has a gap in any of the sequences, this position is totally excluded from the analysis.
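The gap rules for coding and non-coding alignments can be sketched in a few lines of Python. This is an illustrative sketch only: the function name is hypothetical, '-' is assumed to be the gap character, and a trailing incomplete codon is simply dropped, which is an assumption not stated on this page.

```python
def drop_gapped_units(aligned, unit=3):
    """Remove every unit (codon if unit=3, position if unit=1) that contains
    a gap ('-') in any of the aligned sequences."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("sequences must be aligned to the same length")
    keep = [
        start
        for start in range(0, length - length % unit, unit)
        if all("-" not in seq[start:start + unit] for seq in aligned)
    ]
    return ["".join(seq[s:s + unit] for s in keep) for seq in aligned]
```

With unit=3 a single gap removes the whole codon from every sequence; with unit=1 only the affected position is removed.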

In all cases, divergence counts are corrected with the Jukes and Cantor method (Jukes and Cantor 1969), and results are shown with and without this correction.
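For reference, the Jukes and Cantor correction converts an observed proportion of differing sites p into an estimated distance K = -(3/4) ln(1 - 4p/3). The Python sketch below implements that formula; the helper corrected_count(), which scales the distance back to a number of changes, is an assumption about how a corrected count could be obtained and is not taken from this page.

```python
import math

def jukes_cantor(p):
    """Jukes-Cantor distance for an observed proportion p of differing sites."""
    if not 0 <= p < 0.75:
        raise ValueError("p must be in [0, 0.75) for the correction to be defined")
    return -0.75 * math.log(1 - 4 * p / 3)

def corrected_count(observed_differences, n_sites):
    """Assumed usage: scale the corrected distance back to a count of changes."""
    return jukes_cantor(observed_differences / n_sites) * n_sites
```

The correction is small when divergence is low and grows as p approaches 0.75, where multiple hits at the same site become likely.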

 

Treatment of sites or codons with multiple changes

When performing the test, counts in the contingency table can be either:

Sites: positions in the alignment which are either polymorphic within a species or divergent among species. Each site can be counted only once. E.g., if a site has more than two variants in one species, it will be counted as 1 polymorphic site. For the same reason, when the same site is polymorphic and divergent at the same time, since it cannot be categorized into a single class, this site is not taken into account.

Changes: the estimated number of mutations that have occurred in the position since the ancestor of both analyzed species. When changes are analyzed, one site (position) might eventually involve several changes (mutations) if it has more than two variants, and thus that site will be counted more than once in the contingency table. In this case, there is no statistical problem when the same site is divergent and polymorphic at the same time.

EXAMPLE:
Species 1   ATG TTC CTA GTT
            ATG TTA CTA GTT
            ATG TTT CTT GTT
Species 2   ATG TTC CTG GTT

In the example, the third position of the second codon has 3 variants; this is counted as 1 SITE but as 2 CHANGES. The third position of the third codon is polymorphic and divergent at the same time; this position is thus not taken into account for SITES, but it is counted as 1 POLYMORPHIC CHANGE + 1 DIVERGENT CHANGE.
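The difference between counting SITES and CHANGES can be reproduced with a simplified Python sketch applied to the example alignment. The counting rules here are assumptions made for illustration: each extra variant within a species is taken as one polymorphic change, and a position is taken as divergent (one divergent change) when the two species share no variant. The function names are hypothetical.

```python
def count_column(sp1, sp2):
    """Classify one alignment column.

    sp1, sp2: strings holding the nucleotide of each sequence of species 1 and 2.
    Returns (polymorphic_sites, divergent_sites,
             polymorphic_changes, divergent_changes).
    """
    alleles1, alleles2 = set(sp1), set(sp2)
    polymorphic = len(alleles1) > 1 or len(alleles2) > 1
    divergent = alleles1.isdisjoint(alleles2)  # no variant shared by the species
    poly_changes = (len(alleles1) - 1) + (len(alleles2) - 1)
    div_changes = 1 if divergent else 0
    if polymorphic and divergent:
        poly_sites = div_sites = 0  # ambiguous position: skipped for SITES
    else:
        poly_sites, div_sites = int(polymorphic), int(divergent)
    return poly_sites, div_sites, poly_changes, div_changes

def count_alignment(species1, species2):
    """Sum the four counts over all columns of the alignment."""
    totals = [0, 0, 0, 0]
    for i in range(len(species1[0])):
        column = count_column("".join(s[i] for s in species1),
                              "".join(s[i] for s in species2))
        totals = [t + c for t, c in zip(totals, column)]
    return tuple(totals)
```

Applied to the four sequences of the example (without spaces), this gives 1 polymorphic site and 0 divergent sites, but 3 polymorphic changes and 1 divergent change.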

In analyses that include synonymous or non-synonymous changes, only changes are computed. For any other type of analysis, two different tests are performed: one for changes and one for sites.

Estimation of the number of synonymous and non-synonymous changes

The numbers of synonymous and non-synonymous polymorphic changes are estimated for species 1 and 2 independently, and then both counts are added up together. Then the numbers of synonymous and non-synonymous divergent changes are estimated.

We use a maximum parsimony criterion for these estimates. For each codon in the alignment, we get all the different codons represented in a species and compute the shortest path that connects all the codons; among these, we keep the path that involves the least number of replacements (sometimes different paths involve the same number of replacements and they are equally parsimonious; the corresponding number of replacements are added up to the contingency table).

Some special cases apply. First, when one of the codons at the same position in the other species equals an intermediate codon in a path, this path is chosen as the most parsimonious one. Second, stop codons within the alignment are treated as another amino acid, but a warning is shown in the results when a stop codon appears in the alignment; stop codons are only excluded if they are the last codon in the alignment. Furthermore, any path involving intermediate stop codons is excluded unless both end codons are stop codons.

For example:

ATC
ATT
The most conservative path is: ATC (Ile) → ATT (Ile)
1 synonymous change.

AGT
AGC
AGA
AGG
One of the most conservative paths (tied with others) is: AGT (Ser) → AGC (Ser) → AGA (Arg) → AGG (Arg)
2 synonymous changes and 1 non-synonymous change.

CCC
CAG
One of the most conservative paths (tied with others) is: CCC (Pro) → CCG (Pro) → CAG (Gln)
1 synonymous change and 1 non-synonymous change.

But if the second species contains an intermediate codon of another path (CAC), we will assume that the most parsimonious path in this case is the one that includes that codon: CCC (Pro) → CAC (His) → CAG (Gln)
2 non-synonymous changes.

AAT
AGG
ACT
One of the most conservative paths (tied with others) is: AAT (Asn) → ACT (Thr) → ACG (Thr) → AGG (Arg)
1 synonymous change and 2 non-synonymous changes.

The approach is the same for divergence. We compute all the possible paths from each codon of species 1 to each codon of species 2 and choose the path with the least number of replacements.  
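For the simplest case of two codons, the path search can be sketched in Python as follows. The sketch enumerates every order in which the differing positions can be changed and keeps the path with the fewest non-synonymous steps, which is how the examples above read. It is a simplification under stated assumptions: it uses the standard genetic code only, it always skips paths through an intermediate stop codon, and it covers neither sets of more than two codons nor the special case in which the other species contains an intermediate codon. The function name is hypothetical.

```python
from itertools import permutations

# Standard genetic code, codons ordered TCAG at each position ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def most_conservative_path(start, end):
    """Return (synonymous, non_synonymous) steps of the shortest path between
    two codons that involves the fewest non-synonymous steps."""
    diff = [i for i in range(3) if start[i] != end[i]]
    best = None
    for order in permutations(diff):
        codon, syn, nonsyn, valid = start, 0, 0, True
        for pos in order:
            nxt = codon[:pos] + end[pos] + codon[pos + 1:]
            if CODE[nxt] == "*" and nxt != end:
                valid = False  # skip paths through an intermediate stop codon
                break
            if CODE[nxt] == CODE[codon]:
                syn += 1
            else:
                nonsyn += 1
            codon = nxt
        if valid and (best is None or nonsyn < best[1]):
            best = (syn, nonsyn)
    return best
```

For CCC and CAG, the path through CCG (1 synonymous and 1 non-synonymous step) is preferred over the path through CAC (2 non-synonymous steps).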

Estimation of the number of degenerate sites/changes

To estimate the number of degenerate sites/changes we get the first sequence as a reference. For each codon we determine the degree of degeneracy as represented in the table below and compute the number of sites/changes for all the sequences:

In this representation of the standard genetic code, N stands for any nucleotide (T, C, A or G), Y for any pyrimidine (T or C), and R for any purine (A or G). The H in the set of codons for isoleucine (Ile) stands for “not-G” (T, C or A). Degeneracies are as follows: N represents a four-fold degenerate site, Y and R represent two-fold degenerate sites. The H is considered as a two-fold degenerate site, and also the first nucleotides in four leucine codons (TTA, TTG, CTA, and CTG) and four arginine codons (CGA, CGG, AGA, and AGG). All other nucleotides are non-degenerate.

For example:

Sequence 1:            ATG TTA TCA CAA
Degree of degeneracy:  000 202 004 002
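The degeneracy classes in the example can be derived directly from the genetic code. The Python sketch below does this for the standard code: for each codon position it counts how many of the four nucleotides leave the amino acid unchanged, and classifies the position as four-fold (all four), two-fold (two or three, so that the third position of the isoleucine codons counts as two-fold, as described above) or non-degenerate. The function name is hypothetical and other genetic codes are not handled.

```python
# Standard genetic code, codons ordered TCAG at each position ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODE = {a + b + c: AMINO[16 * i + 4 * j + k]
        for i, a in enumerate(BASES)
        for j, b in enumerate(BASES)
        for k, c in enumerate(BASES)}

def degeneracy(codon):
    """Return the degeneracy class (0, 2 or 4) of each codon position."""
    classes = []
    for pos in range(3):
        # Number of nucleotides at this position that keep the amino acid;
        # the codon itself is included, so the count is at least 1.
        same = sum(
            CODE[codon[:pos] + base + codon[pos + 1:]] == CODE[codon]
            for base in BASES
        )
        classes.append("4" if same == 4 else "2" if same >= 2 else "0")
    return "".join(classes)
```

Calling degeneracy() on ATG, TTA, TCA and CAA returns 000, 202, 004 and 002, matching the example.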

Estimation of the number of sites/changes in non-coding regions

We count the number of polymorphic and divergent sites/changes for each position in the alignment.

 

4. MKT Output

Standard and Advanced MKT:

The output of the analysis includes:

  • A table with a summary of the comparisons performed

  • Information about the main input parameters:

    • The Genetic Code used
    • If low-frequency variants are excluded, the threshold value is indicated
  • Basic information about the input sequences at each region (in the Advanced MKT this information is repeated for each region):

    1. The number of sequences for each species
    2. The length of the alignment
    3. The percentage of gaps within the alignment. Note that end gaps are not taken into account. There is a warning when the percentage is >30%
    4. A JalView button to visualize the aligned sequences
    5. The aligned input sequences in any selected formats. When species are aligned independently, both alignments are also shown

  • A 2x2 contingency table for each comparison performed:


    When a comparison includes two classes of sites from which the program has analyzed both changes and sites, the results show the two analyses: one for changes and another for sites.

  • From this table, the following estimates are computed:

    • Neutrality Index: Indicates the extent to which the levels of amino acid polymorphism depart from the expected in the neutral model (Rand and Kann 1996).
      • Under neutrality, Dn/Ds equals Pn/Ps and thus NI = 1
      • If NI < 1, there is an excess of fixation of non-neutral replacements due to positive selection (Dn is higher than expected)
      • If NI > 1, negative selection is preventing the fixation of harmful mutations (Dn is lower than expected)
    • α: Proportion of adaptive substitutions (Smith and Eyre-Walker 2002), which ranges from -∞ to 1 and is estimated as 1 - NI.
    • χ2
    • p-value

    Both the contingency table and the estimates are computed with the divergence corrected by Jukes&Cantor and without any correction for divergence. The default results shown are corrected by Jukes&Cantor, but the results without correction can be viewed by selecting the button 'Without any correction for divergence' in the output page.
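The estimates listed above can be reproduced from the four cells of the contingency table with a short Python sketch. NI and α follow the definitions given above; the χ2 statistic is computed here as the ordinary Pearson statistic for a 2x2 table without continuity correction, which is an assumption, since this page does not say which variant the website uses. The function name is hypothetical.

```python
import math

def mkt(pn, ps, dn, ds):
    """Return (NI, alpha, chi2, p_value) for one 2x2 MKT table.

    pn, ps: non-synonymous and synonymous polymorphic counts
    dn, ds: non-synonymous and synonymous divergent counts
    """
    ni = (pn / ps) / (dn / ds)
    alpha = 1 - ni
    n = pn + ps + dn + ds
    chi2 = n * (pn * ds - ps * dn) ** 2 / (
        (pn + ps) * (dn + ds) * (pn + dn) * (ps + ds)
    )
    # Upper tail of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return ni, alpha, chi2, p_value
```

With illustrative counts Pn = 2, Ps = 42, Dn = 7 and Ds = 17, NI is well below 1 and α is positive, which would be read as an excess of non-synonymous divergence.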

Multi-locus MKT:

The output of the analysis includes:

  • A table with a summary of the comparisons performed

  • Information about the main input parameters:

    • The Genetic Code used
    • If low-frequency variants are excluded, the threshold value is indicated
  • Basic information on the input sequences for each locus:

    1. The number of sequences for each species
    2. The length of the alignment
    3. The percentage of gaps within the alignment. Note that end gaps are not taken into account. There is a warning when the percentage is >30%
    4. A JalView button to see the aligned sequences
    5. The aligned input sequences in any selected formats. When the species are aligned independently, both alignments are also shown

  • A 2x2 contingency table for each comparison performed and locus:


    When a comparison includes two classes of sites from which the program has analyzed both changes and sites, the results show the two analyses: one for changes and another for sites.

  • From this table, the following estimates are computed:

    • The Mantel-Haenszel Test of Homogeneity indicates whether there is homogeneity among the loci. When the p-value is significant, the loci are heterogeneous and combining them in a single 2x2 contingency table is not appropriate.

      • χ2

      • p-value

    • The Mantel-Haenszel Estimator is equivalent to the Neutrality Index (Rand and Kann 1996) shown for one-locus tests and indicates the extent to which the levels of amino acid polymorphism depart from the expected in the neutral model.

      • ωMH

      • χ2

       
      • p-value

       
    • : the mean proportion of adaptive substitutions (Smith and Eyre-Walker 2002), ranging from -∞ to 1 and being estimated as:

Note that if only one locus is input to this interface, output estimates will be the same as in the Standard or the Advanced MKT, and a warning will be displayed.

Both the contingency tables and the estimates are computed with the divergence corrected by Jukes&Cantor and without any correction for divergence. The default results shown are corrected by Jukes&Cantor, but the results without correction can be viewed by selecting the button 'Without any correction for divergence' in the output page.
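As a rough illustration of the multi-locus estimator, the Python sketch below computes the standard Mantel-Haenszel common odds ratio over a set of 2x2 tables, oriented so that it reduces to the Neutrality Index when a single locus is given. Whether the website computes ωMH exactly in this way is an assumption, and the function name is hypothetical.

```python
def mantel_haenszel(tables):
    """Mantel-Haenszel common odds ratio over several loci.

    tables: iterable of (pn, ps, dn, ds) tuples, one per locus.
    """
    numerator = denominator = 0.0
    for pn, ps, dn, ds in tables:
        n = pn + ps + dn + ds
        numerator += pn * ds / n
        denominator += ps * dn / n
    return numerator / denominator
```

For a single locus the weights cancel and the result equals (Pn/Ps)/(Dn/Ds), in line with the note above that one-locus input gives the same estimates as the Standard or Advanced MKT.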


DGM UAB