Mapping tags to a genome with LAST
==================================

LAST has many adjustable parameters, providing many ways of mapping
tags to a genome.  We cannot tell you which way is best, but here are
some ideas that might be helpful.

1. A simple mapping procedure
-----------------------------

Suppose we wish to map tags of length 36 to the mouse genome.  Here is
one way to do it:

  lastdb -s16G mousedb mouse/chr*.fa
  lastal -a2 -e30 -f0 mousedb tags.fa

Here, we used -s16G to indicate that 16 gigabytes of memory are
available.  This will make lastal run faster.  If you don't have 16
gigabytes, omit this option.  We then used -a2 to set the gap
existence cost to 2, and -e30 to get alignments with score >= 30.  We
left the other score parameters at their default values: match score =
1, mismatch cost = 1, gap extension cost = 1.  These parameters allow
a few mismatches and/or a few small gaps.  The -f0 option simply
selects the compact tabular output format.
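
To see how these parameters interact with the score threshold, the
following minimal sketch tallies an alignment score by hand.  It
assumes the usual affine gap convention, where a gap of length k costs
the existence cost plus k times the extension cost.  The function name
and its arguments are illustrative and are not part of LAST.

```python
def alignment_score(matches, mismatches, gap_lengths=(),
                    match_score=1, mismatch_cost=1,
                    gap_exist=2, gap_extend=1):
    """Score an alignment from its column counts (illustrative helper)."""
    # Assumed affine gap cost: existence + length * extension, per gap.
    gap_cost = sum(gap_exist + gap_extend * k for k in gap_lengths)
    return matches * match_score - mismatches * mismatch_cost - gap_cost


if __name__ == "__main__":
    # A 36-base tag aligned end to end with 3 mismatches.
    print(alignment_score(matches=33, mismatches=3))
    # 34 matches, 1 mismatch, and one gap of length 1.
    print(alignment_score(matches=34, mismatches=1, gap_lengths=[1]))
```

Under these assumptions, both example alignments score exactly 30, so
they just reach the -e30 threshold; a fourth mismatch would drop the
first one to 28.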

2. Using sequence quality data
------------------------------

You can use sequence quality data in FASTQ or PRB format to get more
accurate mappings.  For example:

  lastal -Q3 mousedb tags_prb.txt

Using quality data improves mapping accuracy only if the quality data
itself is accurate, which it might not be.

3. Dealing with multi-mapping tags
----------------------------------

Often, one tag will align to more than one genome location.  You can
use last-map-probs.py in the scripts directory to help judge which
location the tag really maps to.  This script calculates a mapping
probability for each alignment.  For example, if one tag aligns to two
locations with identical scores, then the probabilities will be 50:50.
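
The idea can be illustrated with a minimal sketch.  It assumes that
each alignment's probability is proportional to exp(score / t) for
some scale factor t.  This model, the function name, and the default
value of t are assumptions made for illustration, and may differ from
what last-map-probs.py actually computes.

```python
import math


def mapping_probabilities(scores, t=1.0):
    """Turn the alignment scores of one tag into probabilities.

    Assumed model: probability proportional to exp(score / t).
    """
    top = max(scores)
    # Subtract the top score before exponentiating, to avoid overflow.
    weights = [math.exp((s - top) / t) for s in scores]
    total = sum(weights)
    return [w / total for w in weights]
```

With two equal scores, e.g. mapping_probabilities([30, 30]), each
location gets probability 0.5; if one score is higher, that location
gets the larger share.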

4. How does the simple mapping procedure work, and what are its limitations?
----------------------------------------------------------------------------

If you want to understand how this mapping procedure works in more
detail, read on.  LAST uses a two-step approach: first find initial
matches, then extend alignments from these matches.  In this case, the
"initial matches" are: all exact matches of any part of a tag to the
genome, of any size, where the match occurs at most ten times in the
genome.

One consequence of this is that repetitive tags will not be mapped: if
a tag perfectly matches more than ten locations in the genome, it gets
dropped at the first step.

Another wrinkle is the effect of database volumes.  LAST is designed
to work with 2 gigabytes of memory, so it splits large
(e.g. mammalian) genomes into "volumes", and maps tags to each volume
in turn as if they were separate genomes.  If a tag perfectly matches
more than ten locations in one volume, but ten or fewer in another,
then the former matches will not be reported but the latter will.  You
can avoid this inconsistency by using -s16G to put the whole mouse
genome into one volume.  Even if the genome is in one volume, however,
the two strands get searched separately.
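
The volume effect can be reproduced with a toy model.  The sketch
below is a deliberate simplification: it treats the whole tag as the
only candidate initial match, ignores the reverse strand, and
represents each volume as a plain string.  The function names and the
max_hits parameter are hypothetical.

```python
def count_occurrences(text, pattern):
    """Count exact occurrences of pattern in text, overlaps included."""
    count = 0
    start = text.find(pattern)
    while start != -1:
        count += 1
        start = text.find(pattern, start + 1)
    return count


def reported_volumes(volumes, tag, max_hits=10):
    """Return indices of volumes where the tag passes the repeat limit."""
    counts = [count_occurrences(volume, tag) for volume in volumes]
    return [i for i, c in enumerate(counts) if 0 < c <= max_hits]


if __name__ == "__main__":
    tag = "ACGT"
    volume1 = "ACGT" * 12            # 12 exact matches: over the limit
    volume2 = "ACGT" * 3 + "TTTT"    # 3 exact matches: within the limit
    print(reported_volumes([volume1, volume2], tag))
    print(reported_volumes([volume1 + "N" + volume2], tag))
```

With two volumes, only the matches in the second volume are reported.
With everything in one volume, the tag has 15 matches and is dropped
consistently.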

The main point is that this procedure is not guaranteed to find all
alignments with score >= 30.  It is more likely to miss alignments
that have uniformly spaced mismatches/gaps, because these leave only
short exact matches, which tend to occur more than ten times in the
genome.  It is less likely to miss alignments with mismatches/gaps
concentrated at the ends.  We think it does a good job in practice.

5. Counting exact matches
-------------------------

We can gain information on repetitive tags as follows:

  lastal -j0 -l36 mousedb tags.fa

Here, -j0 tells lastal to just report counts of initial matches.  In
this case, there is no limit on how often the matches occur: matches
that occur more than ten times in the genome are counted too.  So
nothing is missed, and there is no effect from database volumes.  The
-l36 option requests matches of size >= 36 only: this makes it faster
and makes the output smaller.  (Without -l36, it counts all matches of
size >= 1: this is still quite fast.)

6. Finding all matches with up to N mismatches
----------------------------------------------

One approach to tag mapping is to guarantee finding all matches with
up to N mismatches.  The "guarantee" part sounds good, but there are
some drawbacks to this approach:

* It does not allow for insertions or deletions.

* It does not allow for higher error rates near the ends of tags.

* It is not suitable for partial matches, e.g. if a tag crosses a
  splice junction.

* Usually, some tags match repetitively to millions of genome
  locations: finding all these matches is slow and produces huge
  output.

You can mitigate the last drawback by counting exact matches (as
explained above) and then removing tags with many exact matches.

Suppose we wish to find all matches of our length-36 tags to the
genome, allowing up to two mismatches.  A naive approach is to start
by finding all exact matches of size 12, and extend alignments from
these.  This works because two mismatches split a length-36 tag into
at most three exactly matching segments totalling at least 34 bases,
so at least one segment has size >= 12.  It will be very slow,
however, because there will be many unproductive size-12 matches.

We can do better by finding matches using a spaced seed, and then
extending alignments.  For example, our tags are guaranteed to have a
match using this spaced seed pattern: 11111011000111110110001111.
Since this seed has 18 matched positions (18 "1"s), we will get far
fewer unproductive matches.  With LAST, we can do this as follows:

  lastdb -m11111011000 mydb genome.fa
  lastal -l26 -m4000000000 -j1 -q0 -d34 mydb tags.fa

In the lastdb command, the seed pattern gets cyclically repeated, so
we only need to specify the repeating unit of the pattern.  In the
lastal command, we used -l26 to get length-26 initial matches, and
-m4000000000 to accept hugely repeated initial matches.  We also used
-j1 to request gapless alignments, -q0 to set the mismatch cost to 0,
and -d34 to request alignments with score >= 34.  With a match score
of 1 and a mismatch cost of 0, a full-length alignment with two
mismatches scores 36 - 2 = 34, so this will give us all 36-mer
alignments with at most two mismatches.
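
The guarantee for this seed can be checked by brute force.  The
following sketch expands the repeating unit to the full seed, then
tries every placement of up to two mismatches in a length-36 tag and
confirms that the seed still fits somewhere.  It assumes the seed may
start at any tag position; the function names are illustrative and not
part of LAST.

```python
from itertools import combinations


def expand_seed(unit, length):
    """Cyclically repeat the seed unit and cut it to the given length."""
    return (unit * (length // len(unit) + 1))[:length]


def seed_hits(seed, tag_length, mismatch_positions):
    """True if some seed placement avoids all mismatches at its '1's."""
    bad = set(mismatch_positions)
    for offset in range(tag_length - len(seed) + 1):
        if all(offset + k not in bad
               for k, c in enumerate(seed) if c == "1"):
            return True
    return False


def seed_is_lossless(unit, seed_length, tag_length, max_mismatches):
    """Check every placement of 0..max_mismatches mismatches."""
    seed = expand_seed(unit, seed_length)
    for n in range(max_mismatches + 1):
        for positions in combinations(range(tag_length), n):
            if not seed_hits(seed, tag_length, positions):
                return False
    return True


if __name__ == "__main__":
    print(expand_seed("11111011000", 26))
    print(seed_is_lossless("11111011000", 26, 36, 2))
```

The same check works for the naive approach: a contiguous seed
(unit "1") of length 12 passes for two mismatches in a length-36 tag,
whereas length 13 does not.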

The following table shows optimal spaced seed patterns for various tag
sizes and numbers of mismatches.  Each entry shows the match length
(e.g. 26) and the pattern (e.g. 11111011000).

====  ===========  ================  ==================  ======================
Tag   1 mismatch   2 mismatches      3 mismatches        4 mismatches
size
====  ===========  ================  ==================  ======================
16    12 11110     10 1110100         9 11010000          3 1110
17    13 11110     10 1110100        10 11010000          5 1110
18    14 11110     12 1110100        10 11010000          5 1110
19    14 11110     12 1110100        12 11010000          5 1110
20    16 11110     12 1110100        12 11010000         11 1100010000
21    17 11110     15 1110100        12 11010000         12 1100010000
22    18 11110     16 1110100        13 1110100000       12 1100010000
23    19 11110     17 1110100        13 11101001000      12 1100010000
24    19 111110    17 1110100        14 11101001000      12 1100010000
25    20 111110    19 1110100        14 11101001000      16 1100010000
26    21 111110    19 1110100        16 11101001000      16 1100010000
27    22 111110    19 1110100        16 11101001000      15 1110100000000
28    23 111110    22 1110100        16 11101001000      16 1110100000000
29    23 111110    23 1110100        19 11101001000      16 1110100000000
30    25 111110    24 1110100        19 11101001000      18 1110100000000
31    26 111110    24 1110100        19 1110110100000    18 1110100000000
32    27 111110    26 1110100        18 111101011001000  18 111010010000000
33    28 111110    26 1110100        19 111101011001000  18 111010010000000
34    29 111110    22 1111101110010  19 111101011001000  20 111010010000000
35    29 1111110   25 11111011000    21 111101011001000  20 111010010000000
36    30 1111110   26 11111011000    21 111101011001000  20 11110010000001000
37    31 1111110   27 11111011000    23 111101011001000  21 11110010000001000
38    32 1111110   27 11111011000    24 111101011001000  21 11110010000001000
39    33 1111110   29 11111011000    24 111101011001000  21 11110010000001000
40    34 1111110   30 11111011000    24 111101011001000  24 11110010000001000
41    34 1111110   30 11111011000    27 111101011001000  24 11110010000001000
42    36 1111110   30 1111110101100  27 111101011001000  24 1101110100000010000
43    37 1111110   31 1111110101100  27 111101011001000  25 1101110100000010000
44    38 1111110   32 1111110101100  32 1110110100000    25 1101110100000010000
45    39 1111110   32 1111110101100  31 111101011001000  27 1101110100000010000
46    40 1111110   34 1111110101100  32 111101011001000  28 1101001110100000000
47    41 1111110   35 1111101110010  33 111101011001000  28 1101001110100000000
48    41 11111110  35 1111101110010  34 111101011001000  30 1101001110100000000
49    42 11111110  37 1111110101100  34 111101011001000  30 1101001110100000000
50    43 11111110     ?              36 111101011001000  30 1101001110100000000
====  ===========  ================  ==================  ======================

This table was made using software kindly provided by the authors of
these publications:

* G Kucherov, L Noé, M Roytberg (2005) IEEE/ACM Trans Comput Biol
  Bioinform 2:51-61.
* S Burkhardt, J Kärkkäinen (2003) Fundamenta Informaticae 56:51-70.

For longer tags, it becomes harder to determine the optimal seed patterns.
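
As a convenience, a table entry can be turned into command lines that
follow the same pattern as the length-36 example earlier in this
section.  The sketch below only assembles strings and does not run
LAST.  It assumes that the options used in that example carry over
unchanged to other tag sizes, which is an extrapolation; the function
name and the default file names are illustrative.

```python
def seed_commands(tag_size, mismatches, match_length, pattern,
                  db="mydb", genome="genome.fa", tags="tags.fa"):
    """Build lastdb and lastal command lines from one table entry."""
    # With match score 1 and mismatch cost 0, a full-length alignment
    # with N mismatches scores tag_size - N.
    min_score = tag_size - mismatches
    lastdb = f"lastdb -m{pattern} {db} {genome}"
    lastal = (f"lastal -l{match_length} -m4000000000 -j1 -q0 "
              f"-d{min_score} {db} {tags}")
    return lastdb, lastal


if __name__ == "__main__":
    # Table entry for tag size 36, 2 mismatches: "26 11111011000".
    for command in seed_commands(36, 2, 26, "11111011000"):
        print(command)
```

For the entry shown, this reproduces the two commands given above.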

7. Merging identical tag sequences
----------------------------------

Suppose we have tag sequences in a FASTA format file called "tags.fa":

  >tagA
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >tagB
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >tagC
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >tagD
  GGCACTCTTTCCCTACACGACGCTCTTCCGATCTGG

If there are many identical sequences, we can speed up the mapping by
merging them.  The following Unix pipeline merges identical sequences
(assuming each sequence is all on one line):

  grep -v '>' tags.fa | sort | uniq -c | awk '{print ">" NR ":" $1 "\n" $2}'

The output of this command looks as follows:

  >1:3
  GGACAAAAACCAAAAAAAACAAAAAAAAAAAAAAAA
  >2:1
  GGCACTCTTTCCCTACACGACGCTCTTCCGATCTGG

The number after the colon is the count of the tag, and the number
before the colon is just a serial number.
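
If the sequences may span several lines, the pipeline above will not
work as is.  The following sketch does the same merging in Python and
joins multi-line sequences first.  It is an illustrative alternative,
not a script that ships with LAST, and the function names are made up
for this example.

```python
from collections import Counter


def read_fasta_sequences(lines):
    """Yield each sequence from FASTA lines, joining wrapped lines."""
    parts = []
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            if parts:
                yield "".join(parts)
            parts = []
        elif line:
            parts.append(line)
    if parts:
        yield "".join(parts)


def merge_identical(lines):
    """Return FASTA lines with one ">serial:count" record per sequence."""
    counts = Counter(read_fasta_sequences(lines))
    merged = []
    # Sort the distinct sequences, as "sort | uniq -c" would.
    for serial, sequence in enumerate(sorted(counts), start=1):
        merged.append(f">{serial}:{counts[sequence]}")
        merged.append(sequence)
    return merged


if __name__ == "__main__":
    with open("tags.fa") as handle:
        print("\n".join(merge_identical(handle)))
```

For the example tags.fa above, this produces the same four output
lines as the Unix pipeline.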

8. How does lastal use sequence quality data?
---------------------------------------------

The quality scores have no effect on finding initial matches, but they
do affect extending alignments from the initial matches.  If quality
scores are used, the default alignment scoring scheme is +6 for a
high-quality match and -18 for a high-quality mismatch.  Low-quality
matches and mismatches get scores between these values, as shown in
the following table.  If your tags are very short, make sure that the
alignment score threshold does not exceed the tag length times 6 (the
highest possible score per base); otherwise you will not get any
matches!

======  ========  ========      ======  ========  ========
Solexa  Match     Mismatch      Phred   Match     Mismatch
score   score     score         score   score     score
======  ========  ========      ======  ========  ========
]  29       6       -18         >  29       6       -18
\  28       6       -17         =  28       6       -17
[  27       6       -17         <  27       6       -17
Z  26       6       -17         ;  26       6       -17
Y  25       6       -17         :  25       6       -17
X  24       6       -17         9  24       6       -17
W  23       6       -17         8  23       6       -17
V  22       6       -16         7  22       6       -16
U  21       6       -16         6  21       6       -16
T  20       6       -15         5  20       6       -15
S  19       6       -15         4  19       6       -15
R  18       6       -14         3  18       6       -14
Q  17       6       -14         2  17       6       -14
P  16       6       -13         1  16       6       -13
O  15       6       -13         0  15       6       -12
N  14       6       -12         /  14       6       -12
M  13       6       -11         .  13       6       -11
L  12       6       -10         -  12       6       -10
K  11       6       -10         ,  11       6        -9 
J  10       6        -9         +  10       6        -8 
I   9       5        -8         *   9       5        -7 
H   8       5        -7         )   8       5        -7 
G   7       5        -6         (   7       5        -6 
F   6       5        -6         '   6       5        -5 
E   5       5        -5         &   5       4        -4 
D   4       5        -4         %   4       4        -3 
C   3       4        -3         $   3       3        -2 
B   2       4        -3         #   2       2        -1 
A   1       3        -2         "   1      -1         0  
@   0       3        -2         !   0     -18         1  
?  -1       2        -1
>  -2       2        -1
=  -3       1        -1
<  -4       1         0
;  -5       0         0
   -6      -1         0
   -7      -2         0
   -8      -3         1
   -9      -3         1
  -10      -4         1
  -11      -5         1
  -12      -6         1
  -13      -7         1
  -14      -8         1
  -15      -9         1
  -16     -10         1
  -17     -10         1
  -18     -11         1
  -19     -12         1
  -20     -13         1
  -21     -13         1
  -22     -14         1
  -23     -15         1
  -24     -15         1
  -25     -16         1
  -26     -16         1
  -27     -16         1
  -28     -17         1
  -29     -17         1
  -30     -17         1
  -31     -17         1
  -32     -17         1
  -33     -17         1
  -34     -18         1
======  ========  ========      ======  ========  ========
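
To make the table concrete, the following sketch scores a gapless
alignment of one tag from its Phred quality symbols.  It copies only a
few rows of the table, and it decodes symbols with an ASCII offset of
33, which is consistent with the Phred symbols listed above.  The
names are illustrative, and this is not how lastal is implemented
internally.

```python
# (match score, mismatch score) for a few Phred scores, copied from the
# table above; add further rows as needed.
PHRED_SCORES = {29: (6, -18), 20: (6, -15), 10: (6, -8), 5: (4, -4)}


def phred_from_char(symbol):
    """Decode a Phred quality symbol, assuming an ASCII offset of 33."""
    return ord(symbol) - 33


def gapless_score(quality_string, mismatch_positions=()):
    """Sum per-base scores for a gapless alignment of one tag."""
    mismatches = set(mismatch_positions)
    total = 0
    for i, symbol in enumerate(quality_string):
        match, mismatch = PHRED_SCORES[phred_from_char(symbol)]
        total += mismatch if i in mismatches else match
    return total


if __name__ == "__main__":
    print(gapless_score(">" * 36))                  # all high quality
    print(gapless_score(">" * 36, [0, 1]))          # 2 confident mismatches
    print(gapless_score("&&" + ">" * 34, [0, 1]))   # 2 low-quality mismatches
```

A perfect length-36 tag scores 36 x 6 = 216.  Two mismatches at
high-quality bases reduce this to 168, but two mismatches at bases
with Phred score 5 reduce it only to 196.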
