LAST: Genome-Scale Sequence Comparison
======================================

Introduction
------------

LAST is software for comparing and aligning sequences, typically DNA
or protein sequences.  LAST is similar to BLAST, but it copes better
with huge amounts of sequence data.  It can also report probabilities
for every pair of aligned letters, indicating the reliability of each
pairing.


Requirements
------------

To handle mammalian genomes, you will need at least 2 gigabytes of
RAM, and a few tens of gigabytes of disk space.  To install the
software, you need a C++ compiler.

Optional: to run the scripts, you need a Unix-like environment with
Python.  To make dotplots, you need the Python Imaging Library.

Luxury: to handle mammalian genomes with maximum efficiency, it's good
to have about 16 gigabytes of RAM (and use it with "lastdb -s16G").


Installation
------------

Go into the src directory and type 'make'.  This should build two
programs: lastdb and lastal.  (If you checked the code out using
Subversion, type 'make' in the top-level directory instead of the src
directory.)  Run either program without arguments to get a usage
message.


Example 1: Compare the human and fugu mitochondrial genomes
-----------------------------------------------------------

You can find these sequences in the examples directory: humanMito.fa
and fuguMito.fa.  Firstly, make a LAST database of the human
sequence::

  lastdb -c -m110 humanMito humanMito.fa

This will make some new files whose names begin with "humanMito".
Here, we used "-c" to soft-mask lowercase letters, and "-m110" to skip
every third position in initial matches: this makes the search more
sensitive for protein-coding DNA (and for non-coding DNA to some
extent).
Secondly, compare the fugu sequence to the human database::

  lastal -o myalns.maf -u2 humanMito fuguMito.fa

This will write alignments in a file called "myalns.maf".  Here, we
used "-u2" to mask lowercase letters when finding initial matches and
making gapless extensions, but not when making gapped extensions.  To
view the alignments, you'll want to avoid text-wrapping, e.g. 'less -S
myalns.maf'.

For an example of aligning multiple mitochondrial genomes, see
multiMito.sh in the examples directory.
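
To picture what a pattern such as "-m110" does, here is a minimal
Python sketch.  It assumes that the pattern repeats cyclically along
the match, and the function names are hypothetical: this is not
lastdb's actual code::

  def compared_offsets(pattern, length):
      """Offsets (0-based) within a match of this length that must agree.

      The pattern, e.g. "110101", is assumed to repeat cyclically.
      """
      if not pattern.startswith("1"):
          raise ValueError("the first position cannot be skipped")
      return [i for i in range(length) if pattern[i % len(pattern)] == "1"]

  def is_initial_match(a, b, pattern):
      """True if equal-length strings a and b agree at every non-skipped offset."""
      return all(a[i] == b[i] for i in compared_offsets(pattern, len(a)))

  print(compared_offsets("110", 9))     # [0, 1, 3, 4, 6, 7]
  print(compared_offsets("110101", 6))  # [0, 1, 3, 5]
  print(is_initial_match("GCTGCAGCC", "GCAGCGGCT", "110"))  # True

In this toy case the two strings differ only at every third position,
so they still count as a match under the "110" pattern.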


Example 2: Compare the cat and mouse genomes
--------------------------------------------

Let's assume you have the cat and mouse genomes in FASTA-format files:
cat/chr*.fa and mouse/chr*.fa.  We'll assume also that repetitive
regions are in lowercase.  We can compare them using the same steps as
above::

  lastdb -c -m110 -v mousedb mouse/chr*.fa
  lastal -o myalns.maf -u2 -v mousedb cat/chr*.fa

The "-v" (verbose) option just makes each program write progress
messages on the screen.  Next, we might want to remove paralogs or
make a dotplot: see the accompanying document last-scripts.txt.


Example 3: Map short sequence tags to the human genome
------------------------------------------------------

Let's assume you have the human genome and tag sequences in
FASTA-format files: human/chr*.fa and tags.fa.  This time, we will not
mask repeats, because we want to map repetitive tags too::

  lastdb -v humandb human/chr*.fa
  lastal -o myalns.maf -a2 -e30 -v humandb tags.fa

Here, we used "-a2" to set the gap existence cost to 2, and "-e30" to
get alignments with score >= 30.  The appropriate score parameters
depend on how long the tags are and how many errors you want to allow:
the default scoring scheme assigns +1 to each match and -1 to each
mismatch.  For more ideas on tag mapping, see the accompanying
document tag-seeds.txt.
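
As a rough guide to how the score threshold relates to tag length and
errors, the following Python sketch computes the score of a
full-length gapless alignment under the default +1/-1 scheme.  The
function names are hypothetical, and the sketch ignores gaps::

  def gapless_score(tag_length, mismatches, match=1, mismatch=-1):
      """Score of a full-length gapless alignment with this many mismatches."""
      return (tag_length - mismatches) * match + mismatches * mismatch

  def max_mismatches(tag_length, min_score, match=1, mismatch=-1):
      """Largest mismatch count whose gapless score still reaches min_score."""
      best = -1  # -1 means that even a perfect match scores too low
      for k in range(tag_length + 1):
          if gapless_score(tag_length, k, match, mismatch) >= min_score:
              best = k
      return best

  print(max_mismatches(36, 30))  # 3

So with "-e30", a 36-base tag can have up to 3 mismatches, whereas a
25-base tag cannot reach the threshold at all.  Because alignments are
local, mismatches near the ends of a tag may simply be left out of the
alignment, so treat these numbers as approximate.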


Output Formats
--------------

lastal can write alignments in two formats: tabular and MAF.  MAF
format looks like this::

  a score=15
  s chr3L        19433515 23 + 24543557 TTTGGGAGTTGAAGTTTTCGCCC
  s H04BA01F1907        2 21 +       25 TTTGGGAGTTGAAGGTT--GCCC
  p 1 1 1 1 1 1 1 1 1 1 1 1 1 1 0.85 0.759 0.662 - - 0.533 0.574 0.593 0.564

Lines starting with "s" contain: the sequence name, the start position
of the alignment, the number of letters of that sequence in the
alignment (gaps are not counted), the strand, the total size of the
sequence, and the aligned letters.  Start positions are zero-based: if
the alignment starts at the beginning of the sequence, the start
position is zero.  If the strand is "-", the start position is as if
we had used the reverse-complemented sequence.  The line starting with
"p" contains the probability of each pair of aligned letters, with "-"
in gap columns; it appears only when probabilities are calculated (see
the -j option).  The same alignment in tabular format looks like
this::

  15 chr3L 19433515 23 + 24543557 H04BA01F1907 2 21 + 25 17,2:0,4

The final column shows the sizes and offsets of gapless blocks in the
alignment.  In this case, we have a block of size 17, then an offset
of size 2 in the upper sequence and 0 in the lower sequence, then a
block of size 4.  Probabilities are not shown in this format.
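
The following Python sketch shows one way to read the final column of
the tabular format.  The function names are hypothetical; the column
layout is taken from the example line above::

  def parse_blocks(blocks):
      """Split a blocks column such as '17,2:0,4' into a list of items.

      Gapless block sizes become ints; offsets become (upper, lower) tuples.
      """
      items = []
      for field in blocks.split(","):
          if ":" in field:
              upper, lower = field.split(":")
              items.append((int(upper), int(lower)))
          else:
              items.append(int(field))
      return items

  def aligned_spans(items):
      """Number of letters used from the upper and lower sequences."""
      upper_span = lower_span = 0
      for item in items:
          if isinstance(item, tuple):
              upper_span += item[0]
              lower_span += item[1]
          else:
              upper_span += item
              lower_span += item
      return upper_span, lower_span

  line = "15 chr3L 19433515 23 + 24543557 H04BA01F1907 2 21 + 25 17,2:0,4"
  fields = line.split()
  items = parse_blocks(fields[11])
  print(items)                 # [17, (2, 0), 4]
  print(aligned_spans(items))  # (23, 21)

The two totals agree with the alignment sizes in the fourth and ninth
columns (23 and 21), which is a handy consistency check.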


Steps in lastal
---------------

1) Find initial matches:
     keep those with multiplicity <= m and depth >= l.

2) Extend gapless alignments from the initial matches:
     keep those with score >= d.

3) Extend gapped alignments from the gapless alignments:
     keep those with score >= e.

4) Non-redundantize the gapped alignments:
     remove those that share an endpoint with a higher-scoring alignment.

5) Calculate probabilities (OFF by default).

6) Redo the gapped extensions using centroid alignment (OFF by default).
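
Step 4 can be pictured with the Python sketch below.  It is only an
illustration under assumptions: an "endpoint" is taken to be the
(database position, query position) pair at the start or at the end of
an alignment, the Alignment record is hypothetical, and lastal's real
implementation may differ::

  from collections import namedtuple

  # Hypothetical record: start and end are (database, query) positions.
  Alignment = namedtuple("Alignment", "score start end")

  def remove_redundant(alignments):
      """Drop alignments sharing a start or end point with a higher-scoring one."""
      best = {}  # highest score seen at each endpoint
      for aln in alignments:
          for point in (("start", aln.start), ("end", aln.end)):
              best[point] = max(best.get(point, aln.score), aln.score)
      return [aln for aln in alignments
              if best[("start", aln.start)] <= aln.score
              and best[("end", aln.end)] <= aln.score]

  a = Alignment(50, (100, 5), (160, 65))
  b = Alignment(30, (100, 5), (130, 35))   # same start as a, lower score
  c = Alignment(40, (200, 10), (240, 50))
  print(remove_redundant([a, b, c]) == [a, c])  # True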


How Probabilities are Calculated
--------------------------------

We assume that each gapped extension has probability proportional to:
exp(lambda * score).  Here, lambda is the scale parameter of the
scoring matrix (YK Yu et al. 2003, PNAS 100(26):15688-93).  Then, the
probability of each letter-pair is the sum of the probabilities of all
possible gapped extensions that include this pairing.  Gapped
extensions are made from fixed "seeds", which are perfect matches:
each pairing within a seed is assigned a probability of 1.
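
The definition can be illustrated with a toy Python sketch that simply
enumerates a few candidate extensions.  The extensions, their scores,
and the value of lambda are made up, and the function name is
hypothetical.  Enumeration is not feasible for real sequences, where
the number of possible extensions is enormous::

  import math

  def pair_probabilities(extensions, lam):
      """Return {pair: probability} from (score, pairs) candidate extensions.

      Each extension has weight exp(lam * score).  A pair's probability is
      the total weight of the extensions that contain it, divided by the
      total weight of all extensions.
      """
      weights = [math.exp(lam * score) for score, _ in extensions]
      total = sum(weights)
      probs = {}
      for weight, (_, pairs) in zip(weights, extensions):
          for pair in pairs:
              probs[pair] = probs.get(pair, 0.0) + weight / total
      return probs

  # Three made-up extensions; pairs are (database, query) positions.
  extensions = [
      (10, {(0, 0), (1, 1), (2, 2)}),
      (9, {(0, 0), (1, 1), (3, 2)}),
      (7, {(0, 0), (2, 1), (3, 2)}),
  ]
  probs = pair_probabilities(extensions, lam=1.0)
  for pair in sorted(probs):
      print(pair, round(probs[pair], 3))

The pair (0, 0) occurs in every extension, so its probability is 1,
just as for the pairings within a seed.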


Options for lastdb
------------------

Main Options
~~~~~~~~~~~~

-p  Interpret the sequences as proteins.  The default is to interpret
    them as DNA.

-c  Read the sequences case-sensitively.  Lowercase letters are then
    forbidden in initial matches (except in skipped positions), but
    they may participate in gapless and gapped alignments, depending
    on the -u option of lastal.  The default is to convert all letters
    to uppercase on reading.

-m  Specify skipped positions in initial matches, e.g. "-m 110101". In
    this example, every third and fifth position out of six will be
    skipped.  The first position cannot be skipped, i.e. it must be
    "1".


Advanced Options
~~~~~~~~~~~~~~~~

-w  Allow initial matches to start only at every "w"th position in each
    database sequence.  This reduces time and storage requirements, at
    the expense of sensitivity.  To emulate BLAT, use "-w 11".

-s  Split large databases into "volumes" of at most the specified
    number of bytes (excluding buckets).  If a single sequence exceeds
    this amount, however, it is not split.  The default is tuned for 2
    gigabytes of RAM: if you have more, increase this to make lastal
    go faster.  You can use suffixes K, M, and G to specify kibibytes,
    mebibytes, and gibibytes.

-a  Specify your own alphabet, e.g. "-a ABCDE".  In this example, only
    A, B, C, D, E will be allowed in initial matches (except in
    skipped positions).  Other letters will be allowed in alignments,
    but will receive the mismatch score.

-b  Specify the depth of "buckets" used to accelerate initial match
    finding.  The deeper the faster, but the more memory is needed.
    The default is to use the maximum depth that consumes at most one
    byte per possible match start position.  This option has no effect
    on the results.

-v  Be verbose: write messages about what lastdb is doing.
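
To make the size suffixes of the -s option concrete, here is a small
Python sketch of the arithmetic.  The function name is hypothetical,
and this is not how lastdb itself parses the option::

  SUFFIX_FACTORS = {"K": 2**10, "M": 2**20, "G": 2**30}

  def parse_size(text):
      """Convert a size such as "16G" to a number of bytes."""
      suffix = text[-1:]
      if suffix in SUFFIX_FACTORS:
          return int(text[:-1]) * SUFFIX_FACTORS[suffix]
      return int(text)

  print(parse_size("16G"))  # 17179869184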


Options for lastal
------------------

Main Options
~~~~~~~~~~~~

-h  Show all options and their default settings.

-o  Write output to the specified file, instead of the screen.

-u  Specify treatment of lowercase letters in the query sequences.  0
    means convert them to uppercase; 1 means mask when finding initial
    matches but not thereafter; 2 means mask when finding initial
    matches and performing gapless extensions but not when performing
    gapped extensions; 3 means mask at all stages.  If lastdb was run
    with "-c", then this treatment will also apply to the database
    sequences, except that lowercase regions in the database are
    always masked when finding initial matches.

-s  Specify which query strand should be used: 0 means reverse only, 1
    means forward only, and 2 means both.

-f  Choose the output format: 0 means tabular and 1 means MAF.
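
The four settings of the -u option can be summarized by the Python
sketch below.  It covers query sequences only, and the names are
hypothetical::

  # Alignment stages, in the order that lastal runs them.
  STAGES = ("initial", "gapless", "gapped")

  def lowercase_is_masked(u, stage):
      """True if lowercase query letters are masked at this stage for "-u u"."""
      if u not in (0, 1, 2, 3):
          raise ValueError("-u must be 0, 1, 2, or 3")
      # "-u N" masks lowercase letters in the first N stages.
      return STAGES.index(stage) < u

  for u in range(4):
      print(u, [s for s in STAGES if lowercase_is_masked(u, s)])

Remember that if lastdb was run with "-c", lowercase regions in the
database are masked when finding initial matches regardless of this
setting.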


Score Parameters
~~~~~~~~~~~~~~~~

-r  Match score.

-q  Mismatch score.

-p  Obtain match and mismatch scores from the specified file.  The -r
    and -q options will then be ignored.  For examples of the format,
    see HOXD70 and TiTv212 in the examples directory.

-a  Gap existence cost.

-b  Gap extension cost.  A gap of size k costs: a + b*k.

-c  This option allows use of "generalized affine gap costs" (SF
    Altschul 1998, Proteins 32(1):88-96).  Here, a "gap" may consist
    of unaligned regions of both sequences.  If these unaligned
    regions have sizes j and k, where j <= k, the cost is: a + b*(k-j)
    + c*j.  If c >= a + 2b (the default), it reduces to standard
    affine gaps.

-x  Maximum score dropoff for gapped alignments.  Gapped alignments
    are forbidden from having any internal region with score < -x.
    This serves two purposes: accuracy (avoid spurious internal
    regions in alignments) and speed (the smaller the faster).

-y  Maximum score dropoff for gapless alignments.

-d  Minimum score for gapless alignments.  For guidance on choosing
    this parameter, see the accompanying E-value tables.

-e  Minimum score for gapped alignments.  For guidance on choosing
    this parameter, see the accompanying E-value tables.
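
The gap cost formulas for the -a, -b, and -c options can be checked
with a short Python sketch.  The function names are hypothetical and
the cost values are purely illustrative, not necessarily lastal's
defaults::

  def generalized_gap_cost(j, k, a, b, c):
      """Cost of one generalized gap with unaligned regions of sizes j and k."""
      j, k = sorted((j, k))  # the formula assumes j <= k
      return a + b * (k - j) + c * j

  def two_affine_gaps_cost(j, k, a, b):
      """Cost of the same regions as ordinary gaps, each costing a + b*size."""
      return sum(a + b * size for size in (j, k) if size > 0)

  a, b = 7, 1  # illustrative values only
  c = a + 2 * b
  never_cheaper = all(
      generalized_gap_cost(j, k, a, b, c) >= two_affine_gaps_cost(j, k, a, b)
      for j in range(0, 20) for k in range(0, 20)
  )
  print(never_cheaper)  # True

With c = a + 2b, a generalized gap never costs less than two ordinary
gaps covering the same regions, which is why this setting behaves like
standard affine gaps.  A smaller c makes it cheaper to leave regions
of both sequences unaligned opposite each other.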


Miscellaneous Options
~~~~~~~~~~~~~~~~~~~~~

-m  Maximum multiplicity for initial matches.  Each initial match is
    lengthened until it occurs at most this many times in the database
    volume.

-l  Minimum depth for initial matches.  "Depth" is the number of
    matched, non-skipped nucleotides.

-k  Look for initial matches starting only at every "k"th position in
    the query.  This increases speed at the expense of sensitivity.

-i  Search queries in batches of at most this many bytes.  If a single
    sequence exceeds this amount, however, it is not split.  You can
    use suffixes K, M, and G to specify kibibytes, mebibytes, and
    gibibytes.  This option has no effect on the results (apart from
    their order).  Higher values can reduce disk reads.

-w  This option is a kludge to avoid catastrophic time and memory
    usage when self-comparing a large sequence.  If a large identical
    match is found, then gapped alignments will not be triggered from
    repeats (typically tandem repeats) within the identical match
    whose start positions are offset by this distance or less.  Use
    "-w 0" to turn this off.

-t  Temperature for calculating probabilities: make the probability
    of each gapped extension proportional to exp(score / t).  The
    default is 1/lambda, which gives exp(lambda * score).

-g  This option allows use of "gamma-centroid alignment" (M Hamada et
    al. 2009, Bioinformatics 25(4):465-73).  Such alignments only
    include pairings with probability > 1/(1+g).  When g=1, this is
    the same as "centroid alignment" (LE Carvalho & CE Lawrence 2008,
    PNAS 105(9):3209-14).  When lastal does (gamma-)centroid
    alignment, it does not report the usual alignment score.  Instead,
    it reports: sum[prob * (1+g) - 1].

-v  Be verbose: write messages about what lastal is doing.

-j  Output type: 0 means counts of initial matches (of all sizes); 1
    means gapless alignments; 2 means gapped alignments before
    non-redundantization; 3 means gapped alignments after
    non-redundantization; 4 means alignments with probabilities; 5
    means centroid alignments.  Match counts (-j 0) respect the
    minimum depth option but not the maximum multiplicity option.
    It's a bad idea to try -j 0 when comparing a large sequence to
    itself.

-F  This option allows lastal to use sequence quality scores for the
    queries.  0 means read queries in FASTA format (without quality
    scores); 1 means FASTQ-Sanger format; 2 means FASTQ-Solexa format;
    3 means PRB format.  The FASTQ formats look like this::

      @mySequenceName
      TTTTTTTTGCCTCGGGCCTGAGTTCTTAGCCGCG
      +
      55555555*&5-/55*5//5(55,5#&$)$)*+$

    The "+" may optionally be followed by a name (ignored), and the
    sequence and quality codes are allowed to wrap onto more than one
    line.  For FASTQ-Sanger, the quality scores are obtained by
    subtracting 33 from the ASCII values of the characters below the
    "+", and for FASTQ-Solexa, they are obtained by subtracting 64.
    PRB format stores four quality scores (A, C, G, T) per position,
    with one sequence per line, like this::

      -40   40  -40  -40      -12    1  -12   -3      -10   10  -40  -40

    Since PRB does not store sequence names, lastal uses the line
    number (starting from 1) as the name.  In FASTQ-Sanger format, the
    quality scores are related to error probabilities like this:
    qScore = -10log10[p].  In FASTQ-Solexa and PRB, however, qScore =
    -10log10[p/(1-p)].  In lastal's MAF output, the quality scores are
    written on lines starting with "q".  For FASTQ, they are written
    with the same encoding as the input.  For PRB, they are written in
    the FASTQ-Solexa (ASCII-64) encoding.

    The quality scores influence alignment scores as follows.  Let Qiy
    be the probability that the base at position i is y (y = A, C, G,
    or T).  Let Sxy be the scoring matrix, and let T be the
    "temperature" parameter (by default 1/lambda).  Then, the score
    for aligning base x (A, C, G, or T) to position i is::

      Rix = T * ln[ sum(y){ Qiy * exp[ Sxy / T ] } ]
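
The quality-score formulas for the -F option can be tried out with the
Python sketch below.  The function names are hypothetical.  One
further assumption is labeled in the code: FASTQ gives a single
quality score per position, so the sketch gives the called base
probability 1 - p and splits p equally among the other three bases,
which may not be exactly what lastal does.  (PRB input supplies a
score for each of the four bases directly.)  The temperature value is
arbitrary::

  import math

  def sanger_error_prob(char):
      """Error probability from a FASTQ-Sanger quality character."""
      q = ord(char) - 33
      return 10 ** (-q / 10)

  def solexa_error_prob(char):
      """Error probability from a FASTQ-Solexa quality character."""
      q = ord(char) - 64
      odds = 10 ** (-q / 10)  # this is p / (1 - p)
      return odds / (1 + odds)

  def base_probs(called, p_error):
      """Assumed Qiy: the called base gets 1 - p, the others share p equally."""
      return {y: (1 - p_error if y == called else p_error / 3) for y in "ACGT"}

  def quality_aware_score(x, q, matrix, t):
      """Rix = T * ln[ sum(y){ Qiy * exp[ Sxy / T ] } ]."""
      return t * math.log(sum(q[y] * math.exp(matrix[x][y] / t) for y in "ACGT"))

  # A +1/-1 scoring matrix.
  matrix = {x: {y: (1 if x == y else -1) for y in "ACGT"} for x in "ACGT"}
  q = base_probs("T", sanger_error_prob("5"))  # "5" is ASCII 53: qScore = 20
  print(round(quality_aware_score("T", q, matrix, t=1.0), 3))
  print(round(quality_aware_score("A", q, matrix, t=1.0), 3))

When the error probability is zero, the score equals the ordinary
matrix entry; as the error probability grows, match scores decrease
and mismatch scores increase.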


Credits & Citation
------------------

LAST was developed by Martin C. Frith, Michiaki Hamada, and Paul
B. Horton in the Computational Biology Research Center.  Many thanks
to Hajime Harada for setting up the repository and website, and Takako
Sugawara for making the logo.  LAST includes public domain code kindly
provided by Yi-Kuo Yu and Stephen Altschul at the NCBI.  There is no
journal publication yet, so please cite the website:
http://last.cbrc.jp/.


Questions, Comments, Problems
-----------------------------

Please email: last (ATmark) cbrc (dot) jp.  If reporting a problem,
please describe exactly how to trigger the problem.
