DO SELECT 1;
SELECT id,seq FROM sequences;
SELECT description FROM sequences WHERE id=#;
SELECT seq FROM sequences WHERE id=#;
```
### PostgreSQL
To compile the FASTA programs:
* install the package `libpq-dev` (`sudo apt-get install libpq-dev`)
* `cd src` and `make -f ../make/Makefile.linux_pgsql all`
Program execution example:
```
../bin/fasta36 -q ../seq/mgstm1.aa "/path/to/library_file 17"
```
### MariaDB (fork of MySQL)
To compile the FASTA programs:
* install the package `libmariadb-dev` (`sudo apt-get install libmariadb-dev`)
* `cd src` and `make -f ../make/Makefile.linux_mariadb all`
Program execution example:
```
../bin/fasta36 -q ../seq/mgstm1.aa "/path/to/library_file 16"
```
## The FASTA package - protein and DNA sequence similarity searching and alignment programs
Changes in **fasta-36.3.8d** released 13-April-2016:
1. Various bug fixes to `pssm_asn_subs.c` that avoid coredumps when
reading NCBI PSSM ASN.1 binary files. `pssm_asn_subs.c` can now read
IUPACAA sequences.
2. Default gap penalties for VT40 (from -14/-2 to -13/-1), VT80 (from
-14/-2 to -11/-1), and VT120 (from -10/-1 to -11/-1) have changed
slightly.
3. Introduction of `scripts/m9B_btop_msa.pl` and
`scripts/m8_btop_msa.pl`, which use the BTOP (`-m 9B` or `-m 8CB`)
encoded alignment strings to produce a query-driven multiple
sequence alignment (MSA) in ClustalW format. This MSA can be used
as input to `psiblast` to produce an ASN.1 PSSM.
4. The `scripts/annot_blast_btop2.pl` script replaces
`scripts/annot_blast_btop.pl` and allows annotation of both the query
and subject sequences.
5. Various domain annotation scripts have been renamed for clarity.
For example, `ann_feats_up_sql.pl` uses an SQL implementation of
Uniprot features tables to annotate domains. Likewise,
`ann_pfam_www.pl` gets domain information from the Pfam web site,
while `ann_pfam27.pl` gets the information from the downloaded
Pfam27 mySQL tables, and `ann_pfam28.pl` uses the Pfam28 mySQL
tables.
6. percent identity in sub-alignment scores is calculated like a BLAST
percent identity -- gaps are not included in the denominator.
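
The difference that item 6 describes can be sketched in a few lines of Python. This is only an illustration of "gaps are not included in the denominator"; the function name, the `count_gaps` switch, and the use of `-` as the gap character are assumptions made for the example, not names taken from the FASTA source.

```python
def percent_identity(query_row: str, subject_row: str, count_gaps: bool = False) -> float:
    """Percent identity over two gapped alignment rows of equal length.

    With count_gaps=False, columns that contain a gap are left out of
    the denominator; with count_gaps=True every column is counted.
    """
    if len(query_row) != len(subject_row):
        raise ValueError("alignment rows must have the same length")
    identical = 0
    denominator = 0
    for q, s in zip(query_row, subject_row):
        is_gap = q == "-" or s == "-"
        if is_gap and not count_gaps:
            continue  # gapped column: skipped entirely
        denominator += 1
        if not is_gap and q.upper() == s.upper():
            identical += 1
    return 100.0 * identical / denominator if denominator else 0.0
```

For the rows `ACDE-FG` and `ACDEHFG`, six of the seven columns are identical and one contains a gap, so the two conventions give different values.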
For more detailed information, see `doc/readme.v36`.
## The FASTA package - protein and DNA sequence similarity searching and alignment programs
This directory contains the source code for the FASTA package of
programs (W. R. Pearson and D. J. Lipman (1988), "Improved Tools
for Biological Sequence Analysis", *PNAS 85:2444-2448*). The current version of the program is `fasta-36.3.8i`.
If you are reading this at
[fasta.bioch.virginia.edu/wrpearson/fasta/fasta36](https://fasta.bioch.virginia.edu/wrpearson/fasta/fasta36),
links are available to executable binaries for Linux, MacOS, and
Windows. The source code is also available from
[github.com/wrpearson/fasta36](https://github.com/wrpearson/fasta36).
The FASTA package offers many of the same programs as `BLAST`, but
takes a different approach to statistical estimates, and provides
additional optimal programs for local (`ssearch36`) and global
(`ggsearch36`, `glsearch36`) alignment, and for non-overlapping
internal local alignments (`lalign36`).
The programs available include:

| FASTA | BLAST | description |
| :--- | :--- | :--- |
| fasta36 | blastp/blastn | Protein and DNA local similarity search |
| ssearch36 | | optimal Smith-Waterman search -- vectorized on Intel and Arm architectures |
| ggsearch36 | | optimal global Needleman-Wunsch search -- vectorized on Intel and Arm architectures |
| glsearch36 | | optimal global (query)/local (library) search -- vectorized on Intel and Arm architectures |
| fastx36 / fasty36 | blastx | DNA query search against protein sequence database (fasty36 uses a slower, more sophisticated frame shift aligner) |
| tfastx36 / tfasty36 | tblastn | protein query search against DNA database |
| fastf36 / tfastf36 | | compares an ordered peptide mixture against a protein (fastf36) or DNA (tfastf36) database |
| fastm36 / tfastm36 | | compares a set of ordered peptides against a protein (fastm36) or DNA (tfastm36) database, or oligonucleotides against a DNA database |
| fasts36 / tfasts36 | | compares an unordered set of peptides against a protein (fasts36) or DNA (tfasts36) database |
| lalign36 | | looks for non-overlapping internal alignments, similar to a "dot-plot," but with statistical significance |

Changes in **fasta-36.3.8i** Nov, 2022
1. bug fix to remove duplicate variant annotations
2. update to scripts/get_protein.py and annotation scripts.
3. modify code to reduce mktemp compilation warning messages
4. changes to annotation scripts for Pfam shutdown; new ann_pfam_www.py, ann_pfam_sql.py
5. a new option, `r` for `-m 8CB` that displays the raw optimal alignment score (typically Smith-Waterman).
Changes in **fasta-36.3.8i** Sept, 2021
1. Enable translation table -t 9 for Echinoderms. This bug has existed
since alternate translation tables were first made available.
Changes in **fasta-36.3.8i** May, 2021
1. Add an option, -Xg, that preserves the gi|12345 string in the score
summary and alignment output.
Changes in **fasta-36.3.8i** Nov, 2020
1. fasta-36.3.8i (November, 2020) incorporates the SIMDe
(SIMD-everywhere,
https://github.com/simd-everywhere/simde/blob/master/simde/x86/sse2.h)
macro definitions that allow the smith\_waterman\_sse2.c,
global\_sse2.c, and glocal\_sse2.c code to be compiled on non-Intel
architectures (currently tested on ARM/NEON). Many thanks to
Michael R. Crusoe (https://orcid.org/0000-0002-2961-9670) for the
SIMDe code conversion, and to Evan Nemerson for creating SIMDe.
2. The code to read FASTA format sequence files now ignores lines with
'#' at the beginning, for compatibility with PSI Extended FASTA
Format (PEFF) files (http://www.psidev.info/peff).
Changes in **fasta-36.3.8h** May, 2020
1. fasta-36.3.8h (May 2020) fixes a bug that appeared when
multiple query sequences were searched against a large library
that would not fit in memory. In that case, the number of
library sequences and residues increased by the library size
with each new search.
2. More consistent formats for **ERROR** and **Warning** messages.
3. Corrections to code to address compiler warnings with gcc8/9.
4. addition of 's' option to show similarity in -m8CBls (or -m8CBs, -m8CBsl) and 'd' option to show raw (unaligned) domain information.
Changes in **fasta-36.3.8h** February, 2020
1. The license for Michael Farrar's Smith-Waterman sse2 code and global/glocal sse2 code is now open source (BSD), see COPYRIGHT.sse2 for details.
Changes in **fasta-36.3.8h** August, 2019
1. Modifications to support makeblastdb format v5 databases. Currently, only simple database reads have been tested.
Changes in **fasta-36.3.8h** March, 2019
1. Translation table 1 (`-t 1`) now translates 'TGA'->'U' (selenocysteine).
2. New script for extracting DNA sequences from genomes (`scripts/get_genome_seq.py`). Currently works with human (hg38), mouse (mm10), and rat (rn6).
Changes in **fasta-36.3.8h** January, 2019
1. Bug fixes: `fastx`/`tfastx` searches done with the `-t t` option (which adds a `*` to protein sequences so that termination codons can be matched), did not work properly with the `VT` series of matrices, particularly `VT10`. This has been fixed.
2. New features: Both query and library/subject sequences can be generated by specifying a program script, either by putting a `!` at the start of the query/subject file name, or by specifying library type `9`. Thus, `fasta36 \\!../scripts/get_protein.py+P09488+P30711 /seqlib/swissprot.fa` or `fasta36 "../scripts/get_protein.py+P09488+P30711 9" /seqlib/swissprot.fa` will compare two query sequences, `P09488` and `P30711`, to SwissProt, by downloading them from Uniprot using the `get_protein.py` script (which can download sequences using either Uniprot or RefSeq protein accessions). Often, the leading `!` must be escaped from shell interpretation with `\\!`.
New scripts that return FASTA sequences using accessions or genome coordinates are available in `scripts/`. `get_protein.py`, `get_uniprot.py`, `get_up_prot_iso_sql.py` and `get_refseq.py`. `get_refseq.py` can download either protein or mRNA RefSeq entries. `get_up_prot_iso_sql.py` retrieves a protein and its isoforms from a MySQL database.
`get_genome_seq.py` extracts genome sequences using coordinates from local reference genomes (`hg38` and `mm10` included by default).
Changes in **fasta-36.3.8h** December, 2018
The `scripts/ann_exons_up_www.pl` and `ann_exons_up_sql.pl` now include the option `--gen_coord` which provides the associated genome coordinate (including chromosome) as a feature, indicated by `'<'` (start of exon) and `'>'` (end of exon).
Changes in **fasta-36.3.8h** released November, 2018
**fasta-36.3.8h** provides new scripts and modifications to the `fasta` programs that normalize the process of merging sub-alignment scores and region information into both FASTA and BLAST results. To move BLASTP towards FASTA with respect to alignment annotation and sub-alignment scoring:
1. The `blastp_annot_cmd.sh` runs a blast search, finds and scores domain information for the alignments, and merges this information back into the blast output `.html` file. This script uses:
1. `annot_blast_btab2.pl --query query.file --ann_script annot_script.pl --q_ann_script annot_script.pl blast.btab_file > blast.btab_file_ann` (a blast tabular file with one or two new fields: an annotation field and, optionally with `--dom_info`, a raw domain content field).
2. `merge_blast_btab.pl --btab blast.btab_file_ann blast.html > blast_ann.html` (merges the annotations and domain content information in the `blast.btab_file_ann` file with the standard blast output file to produce annotated alignments).
3. In addition, `rename_exons.py` is available to rename exons (later other domains) in the subject sequences to match the exon labeling in the aligned query sequence.
4. `relabel_domains.py` can be used to adjust color sets for homologous domains.
2. There is also an equivalent `fasta_annot_cmd.sh` script that provides similar functionality for the FASTA programs. This script does not need to use `annot_blast_btab2.pl` to produce domain subalignment scores (that functionality is provided in FASTA), but it also can use `merge_fasta_btab.pl` and `rename_exons.py` to modify the names of the aligned exons/domains in the subject sequences.
3. To support the independence of the `blastp`/`fasta` output from html annotation, the FASTA package includes some new options:
1. The `-m 8CBL` option includes query sequence length and subject sequence length in the blast tabular output. In addition, if domain annotations are available, the raw domain coordinates are provided in an additional field after the annotation/subalignment scoring field. `-m 8CBl` provides the sequence lengths, but does not add the raw domain coordinates.
2. The `-Xa` option prevents annotation information from being included in the html output -- it is only available in the `-m 8CB` (or `-m 8CBL/l`) output
3. To reduce problems with spaces in script arguments, annotation scripts with spaces separating arguments can use '+' instead of ' '.
4. The `fasta_annot_cmd.sh` script produces both a conventional alignment on `stdout` and a `-m 8CBL` alignment that is sent to a separate file. The file name follows the `-m F8CBL` option, separated by `=`; thus, `-m F8CBL=tmp_output.blast_tab`.
Changes in **fasta-36.3.8g** released 23-Oct-2018
1. (Oct. 2018) Improvements to scripts in the `psisearch2/` directory:
1. `psisearch2/m89_btop_msa2.pl`
1. the `--clustal` option produces a "CLUSTALW (1.8)", which is required for some downstream programs
2. the `--trunc_acc` option removes the database and accession from identifiers of the form: `sp|P09488|GSTM1_HUMAN` to produce `GSTM1_HUMAN`.
3. the `--min_align` option specifies the fraction of the query sequence that must be aligned, `(q_end-q_start+1)/q_length`
Together, these changes make it possible for the output of `m89_btop_msa2.pl` to be used by the EMBOSS program `fprotdist`.
2. A more general implementation of `psisearch2_msa_iter.sh`, which does `psisearch2` one iteration at a time, and a new equivalent `psisearch2_msa_iter_bl.sh`, which uses `psiblast` to do the search.
* (Oct. 2018) A small restructuring of the `make/Makefiles` to remove the `-lz` dependence for non-debugging scripts (and add it back when -DDEBUG is used).
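
As a rough illustration of the `--trunc_acc` and `--min_align` options described above, the following Python sketch shows the two calculations. It is not the Perl code in `m89_btop_msa2.pl`; the function names are hypothetical, and treating the threshold as inclusive (`>=`) is an assumption.

```python
def trunc_acc(identifier: str) -> str:
    """Drop the database and accession fields from an identifier of the
    form db|accession|name, keeping only the final field."""
    return identifier.rsplit("|", 1)[-1]


def aligned_fraction(q_start: int, q_end: int, q_length: int) -> float:
    """Fraction of the query covered: (q_end - q_start + 1) / q_length."""
    return (q_end - q_start + 1) / q_length


def passes_min_align(q_start: int, q_end: int, q_length: int, min_align: float) -> bool:
    """True if the aligned fraction reaches the threshold.
    Assumption: the comparison is inclusive."""
    return aligned_fraction(q_start, q_end, q_length) >= min_align
```

With these definitions, `trunc_acc("sp|P09488|GSTM1_HUMAN")` gives `GSTM1_HUMAN`, and an alignment covering query positions 1-218 of a 218-residue query has an aligned fraction of 1.0.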
Changes in **fasta-36.3.8g** released 5-Aug-2018
1. (Apr 2018) incorporation of `-t t1` termination codes ("*") in `-m 8CB`, `-m 8CC`, and `-m9C` so that aligned termination codons are indicated as `**` (`-m8CB`) or `*1` (`-m8CC`, `-m9C`).
2. (Mar 2018) Updates to scripts/annot_blast_btop2.pl to provide subalignment scoring for blastp searches (BLOSUM62 only). (see doc/readme.v36)
3. (Feb. 2018) a new extended option, `-XB`, which causes percent identity, percent similarity, and alignment length to be calculated using the BLAST model, which does not count gaps in the alignment length.
see readme.v36 for other bug fixes.
Changes in **fasta-36.3.8g** released 31-Dec-2017
1. (December, 2017) -- Make statistical thresholds more robust for small E()-values with normally distributed scores (`ggsearch36`,`glsearch36`).
2. (September, 2017) Treat lower-case queries with no upper-case residues as uppercase with `-S` option.
3. (May, 2017) Improvements/fixes to sub-alignment scoring strategies.
4. Improvements/fixes to psisearch2 scripts.
For more detailed information, see `doc/readme.v36`.
ChangeLog - FASTA v34
$Id: changes_v34.html 120 2010-01-31 19:42:09Z wrp $
$Revision: 210 $
May 28, 2007
Small modification for GCG ASCII (libtype=5) header line.
October 6, 2006 CVS fa34t26b3
New Windows programs available using Intel C++ compiler. First
threaded programs for Windows; first SSE2 acceleration of SSEARCH for
Windows.
July 18, 2006 CVS fa34t26b2
More powerful environment variable substitutions for FASTLIBS files.
The library file name parsing programs now provide the option for
environment variable substitutions. For example, with SLIB2=/slib2 set as an
environment variable (e.g. export SLIB2=/slib2 for ksh and bash),
fasta34 -q query.aa '${SLIB2}/swissprot.fa' expands as expected.
While this is not important for command lines, where the Unix shell
would expand things anyway, it is very helpful for various
configuration files, such as files of file names, where:
<${SLIB2}/blast
swissprot.fa
now expands properly, and in FASTLIBS files the line:
NCBI/Blast Swissprot$0S${SLIB2}/blast/swissprot.fa
expands properly. Currently, environment variable expansion only
takes place for library file names, and the <directory in a file of
file names.
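
The substitution described above can be sketched in Python. This is an illustration, not the C code in the FASTA library-name parser: the function name is hypothetical, only the `${NAME}` form shown in the examples is handled, and leaving undefined variables untouched is an assumption.

```python
import re

# Matches ${NAME}; a bare "$0S" (as in a FASTLIBS line) is not touched.
_ENV_VAR = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(text: str, env: dict) -> str:
    """Replace each ${NAME} in text with env[NAME].

    Assumption: a variable that is not defined is left as written.
    """
    return _ENV_VAR.sub(lambda m: env.get(m.group(1), m.group(0)), text)
```

Passing `os.environ` as `env` would use the real environment; with `{"SLIB2": "/slib2"}`, the directory line `<${SLIB2}/blast` becomes `</slib2/blast`.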
July 2, 2006 fa34t26b0
This release provides an extremely efficient implementation of the
Smith-Waterman algorithm for the SSE2 vector instructions, written
by Michael Farrar (farrar.michael@gmail.com). The SSE2 code speeds up
Smith-Waterman 8 - 10-fold in my tests, making it comparable to Erik
Lindahl's Altivec code for the Apple/IBM G4/G5 architecture.
May 24, 2006 fa34t25d8
In addition, support for ASN.1 PSSM:2 files provided by the NCBI
PSI-BLAST WWW site is included. This code will not work with
iteration 0 PSSM's (which have no PSSM information). For ASN.1
PSSM's, which provide the matrix name (and in some cases the gap
penalties), the scoring matrix and gap penalties are set appropriately
if they were not specified on the command line. ASN.1 PSSM's are type 2:
ssearch34 -P "pssm.asn1 2" .....
May 18, 2006
Support for NCBI Blast formatdb databases has been expanded. The
FASTA programs can now read some NCBI *.pal and *.nal files, which are
used to specify subsets of databases. Specifically, the
swissprot.00.pal and pdbaa.00.pal files are supported. FASTA supports
files that refer to *.msk files (i.e. swissprot.00.pal refers to
swissprot.00.msk); it does not currently support .pal files that
simply list other .pal or database files (e.g. FASTA does not support
nr.pal or swissprot.pal).
Nov 20, 2005
Changes to support asymmetric matrices - a scoring matrix read in from
a file can be asymmetric. Default matrices are all symmetric.
Sept 2, 2005
The prss34 program has been modified to use the same display routines
as the other search programs. To be more consistent with the other
programs, the old "-w shuffle-window-size" is now "-v window-size".
prss34/prfx34 will also show the optimal alignment for which the
significance is calculated by using the "-A" option.
Since the new program reports results exactly like other
fasta/ssearch/fastxy34 programs, parsing for statistical significance
is considerably different. The old format program can be made using
"make prss34o".
May 5, 2005 CVS fa34t25d1
Modification to the -x option, so that both an "X:X" match score and
an "X:not-X" mismatch score can be specified. (This score is also used
to give a positive score to a "*:*" match - the end of a reading frame,
while giving a negative score to "*:not-*".)
Jan 24, 2005
Include a new program, "print_pssm", which reads a blastpgp binary
checkpoint file and writes out the frequency values as text. These
values can be used with a new option with ssearch34(_t) and prss34,
which provides the ability to read a text PSSM file. To specify a
text PSSM, use the option -P "query.ckpt 1" where the "1" indicates a
text, rather than a binary checkpoint file. "initfa.c" has also been
modified to work with PSSM files with zeros in the frequency
table. Presumably these positions (at the ends) do not provide
information. (Jan 26, 2005) blastpgp actually uses BLOSUM62 values
when zero frequencies are provided, so read_pssm() has been modified
to use scoring matrix values for zero frequencies as well.
Nov 4-8, 2004
Incorporation of Erik Lindahl "anti-diagonal" Altivec code for
Smith-Waterman, only. Altivec SSEARCH is now faster than FASTA for
Aug 25,26, 2004 CVS fa34t24b3
Small change in output format for p34comp* programs in
">>>query_file#1 string" line before alignments. This line is not present
in the non-parallel versions - it would be better for them to be consistent.
Dec 10, 2003 CVS fa34t23b3
Cause default ktup to drop for short query sequences. For protein queries < 50,
ktup = 1; for DNA queries < 20, 50, 100, ktup = 1, 2, 3, respectively.
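
The length thresholds above can be written as a small lookup. This Python sketch is illustrative only: the function name is hypothetical, and the ktup used for longer queries is passed in as `normal_ktup` because this entry does not state it.

```python
def default_ktup(query_length: int, is_dna: bool, normal_ktup: int) -> int:
    """Return the reduced default ktup for short queries.

    normal_ktup is the value used when the query is not short; it is a
    parameter here because the changelog entry does not give it.
    """
    if is_dna:
        # DNA queries shorter than 20, 50, 100 use ktup 1, 2, 3.
        for limit, ktup in ((20, 1), (50, 2), (100, 3)):
            if query_length < limit:
                return ktup
        return normal_ktup
    # Protein queries shorter than 50 use ktup 1.
    return 1 if query_length < 50 else normal_ktup
```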
Dec 7, 2003
A new option, "-U" is available for RNA sequence comparison. "-U"
functions like "-n", indicating that the query is an RNA sequence. In
addition, to account for "G:U" base pairs, "-U" modifies the scoring
matrices so that a "G:A" match has the same score as a "G:G" match,
and "T:C" match has the same score as a "T:T" match.
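
A minimal Python sketch of the matrix adjustment that "-U" is described as making. The nested-dict layout, the function name, and the toy +5/-4 scores in the usage note are assumptions for illustration; which index corresponds to the query and which to the library is also assumed to follow the order written in the text.

```python
def apply_rna_scoring(matrix: dict) -> dict:
    """Return a copy of a nucleotide scoring matrix in which a G:A pair
    scores like G:G and a T:C pair scores like T:T.

    matrix[a][b] is the score for residue a aligned with residue b.
    Only the two listed cells change, so the result is asymmetric.
    """
    adjusted = {row: dict(cols) for row, cols in matrix.items()}
    adjusted["G"]["A"] = matrix["G"]["G"]
    adjusted["T"]["C"] = matrix["T"]["T"]
    return adjusted
```

Starting from a toy matrix with +5 for identities and -4 for mismatches, the adjusted matrix scores `G:A` and `T:C` as +5 while `A:G` and `C:T` stay at -4.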
Nov 2, 2003
Support for more sophisticated display options. Previously, one could
have only one "-m #" option, even though several of the options were
orthogonal (-m 9c is independent of -m 1 and -m2, which is independent
of -m 6 (HTML)). In particular -m 9c can be combined with -m 6, which
can be very helpful for runs that need HTML output but can also
exploit the encoding provided by -m 9c.
The "-m 9" option now also allows "-m 9i", which shows the standard
best score information, plus percent identity and alignment length.
Sept 25, 2003
A new option is available for annotating alignments. -V '@#?!'
can be used to annotate sites in a sequence, e.g:
>GTM1_HUMAN ...
PMILGYWDIRGLAHAIRLLLEYTDS@S?YEEKKYT@MG
DAPDYDRS@QWLNEKFKLGLDFPNLPYLIDGAHKIT
might mark known and expected (S,T) phosphorylation sites. These
symbols are then displayed on the query coordinate line:
10 20 @? 30 @ 40 @ 50 60
GTM1_H PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
::::::::::::::::::::::::::::::::::::::::::::::::::::::::::::
gtm1_h PMILGYWDIRGLAHAIRLLLEYTDSSYEEKKYTMGDAPDYDRSQWLNEKFKLGLDFPNLP
10 20 30 40 50 60
This annotation is mostly designed to display post-translational
modifications detected by MassSpec with FASTS, but is also available
with FASTA and SSEARCH.
June 16, 2003 version: fasta34t22
ssearch34 now supports PSI-BLAST PSSM/profiles. Currently, it only
supports the "checkpoint" file produced by blastall, and only on
certain architectures where byte-reordering is unnecessary. It has not
been tested extensively with the -S option.
ssearch34 -P blast.ckpt -f -11 -g -1 -s BL62 query.aa library
Will use the frequency information in the blast.ckpt file to do a
position specific scoring matrix (PSSM) search using the
Smith-Waterman algorithm. Because ssearch34 calculates scores for
each of the sequences in the database, we anticipate that PSSM
ssearch34 statistics will be more reliable than PSI-Blast statistics.
The Blast checkpoint file is mostly double precision frequency
numbers, which are represented in a machine specific way. Thus, you
must generate the checkpoint file on the same machine that you run
ssearch34 or prss34 -P query.ckpt. To generate a checkpoint file,
run:
blastpgp -j 2 -h 1e-6 -i query.fa -d swissprot -C query.ckpt -o /dev/null
(This searches swissprot for 2 iterations ("-j 2") using an E()
threshold of 1e-6 ("-h 1e-6"), saving the resulting position specific
frequencies in query.ckpt. Note that the original query.fa and
query.ckpt must match.)
Apr 11, 2003 CVS fa34t21b3
Fixes for "-E" and "-F" with ssearch34, which were inadvertently disabled.
A new option, "-t t", is available to specify that all the protein
sequences have implicit termination codons "*" at the end. Thus, all
protein sequences are one residue longer, and full length matches are
extended one extra residue and get a higher score. For
fastx34/tfastx34, this helps extend alignments to the very end in
cases where there may be a mismatch at the C-terminal residues.
-m 9c has also been modified to indicate locations of termination
codons ( *1).
Mar 17, 2003 CVS fa34t21b2
A new option on scoring matrices "-MS" (e.g. "BL50-MS") can be used to
turn the I/L, K/Q identities on or off. Thus, to make "fastm34" use
the isobaric identities, use "-s M20-MS". To turn them off for "fasts34",
use "-s M20".
Jan 25, 2003
Add option "-J start:stop" to pv34comp*/mp34comp*. "-J x" used to
allow one to start at query sequence "x"; now both start and stop can
be specified.
Nov 14-22, 2002 CVS fa34t20b6
Include compile-time define (-DPGM_DOC) that causes all the fasta
programs to provide the same command line echo that is provided by the
PVM and MPI parallel programs. Thus, if you run the program:
fasta34_t -q -S gtt1_drome.aa /slib/swissprot 12
the first lines of output from FASTA will be:
# fasta34_t -q gtt1_drome.aa /slib/swissprot
FASTA searches a protein or DNA sequence data bank
version 3.4t20 Nov 10, 2002
Please cite:
W.R. Pearson & D.J. Lipman PNAS (1988) 85:2444-2448
This has been turned on by default in most FASTA Makefiles.
Aug 27, 2002
Modifications to mshowbest.c and drop*.c (and p2_workcomp.c,
compacc.c, doinit.c, etc.) to provide more information about the
alignment with the -m 9 option. There is now a "-m 9c" option, which
displays an encoded alignment after the -m 9 alignment information.
The encoding is a string of the form: "=#mat+#ins=#mat-#del=#mat".
Thus, an alignment over 218 amino acids with no gaps (not necessarily
100% identical) would be =218. The alignment:
10 20 30 40 50 60 70
GT8.7 NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ
:.:: . :: :: . .::: : .: ::.: .: : ..:.. ::: :..:
XURTG NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ
20 30 40 50 60
would be encoded: "=23+9=13-2=10-1=3+1=5". The alignment encoding is
with respect to the beginning of the alignment, not the beginning of
either sequence. The beginning of the alignment in either sequence is
given by the an0/an1 values. This capability is particularly useful
for [t]fast[xy], where it can be used to indicate frameshift positions
"/#\#" compactly. If "-m 9c" is used, the "The best scores" title
line includes "aln_code".
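
To make the encoding concrete, here is a small Python sketch that rebuilds the encoded string from the two gapped alignment rows shown above. It is not the code in `mshowbest.c` or `drop*.c`; the function names are hypothetical, and the meaning of `+` (query residues opposite gaps in the library row) and `-` (gaps in the query row) is inferred from the worked example.

```python
from itertools import groupby


def column_op(query_char: str, library_char: str) -> str:
    """Classify one alignment column as '=', '+', or '-'."""
    if library_char == "-":
        return "+"  # query residue with no library partner
    if query_char == "-":
        return "-"  # library residue with no query partner
    return "="      # aligned pair (identical or not)


def encode_alignment(query_row: str, library_row: str) -> str:
    """Run-length encode the columns of a gapped pairwise alignment."""
    if len(query_row) != len(library_row):
        raise ValueError("alignment rows must have the same length")
    ops = (column_op(q, s) for q, s in zip(query_row, library_row))
    return "".join(f"{op}{sum(1 for _ in run)}" for op, run in groupby(ops))


query = "NVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKL--GLDFPNLPYL-IDGSHKITQ"
library = "NARGRMECIRWLLAAAGVEFDEK---------FIQSPEDLEKLKKDGNLMFDQVPMVEIDG-MKLAQ"
print(encode_alignment(query, library))  # =23+9=13-2=10-1=3+1=5
```

An ungapped alignment over 218 residues produces a single run, `=218`, matching the description above.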
Aug 14, 2002 CVS tag fa34t20
Changes to nmgetlib.c to allow multiple query searches coming from
STDIN, either through pipes or input redirection. Thus, the command
cat prot_test.lseg | fasta34 -q -S @ /seqlib/swissprot
produces 11 searches. If you use the multiple query functions, the
query subset applies only to the first sequence.
Unfortunately, it is not possible to search against a STDIN library,
because the FASTA programs do not keep the entire library in memory
and need to be able to re-read high-scoring library sequences. Since
it is not possible to fseek() against STDIN, searching against a STDIN
library is not possible.
Aug 5, 2002
fasts34(_t) and fastm34(_t) have been modified to allow searches with
DNA sequences. This gives a new capability to search for DNA motifs,
or to search for ordered or unordered DNA sequences spaced at
arbitrary distances.
June 25, 2002
Modify the statistical estimation strategy to sample all the sequences
in the database, not just the first 60,000. The histogram is still
based only on the first 60,000 scores and lengths, though all scores
and lengths are shown. The fit to the data may be better than the
histogram indicates, but it should not be worse.
June 19, 2002
Added "-C #" option, where 6 <= # <= MAX_UID (20), to specify the
length of the sequence name display on the alignment labels. Until
now, only 6 characters were ever displayed. Now, up to MAX_UID
characters are available.
Mar 16, 2002
Added create_seq_demo.sql, nt_to_sql.pl to show how to build an SQL
protein sequence database that can be used with the mySQL
versions of the fasta34 programs. Once the mySQL seq_demo database
has been installed, it can be searched using the command:
fasta34 -q mgstm1.aa "seq_demo.sql 16"
mysql_lib.c has been modified to remove the restriction that mySQL
protein sequence unique identifiers be integers. This allows the
program to be used with the PIRPSD database. The RANLIB() function
call has been changed to include "libstr", to support SQL text keys.
Due to the size of libstr[], unique ID's must be < MAX_UID (20)
characters.
A "pirpsd.sql" file is available for searching the mySQL distribution
of the PIRPSD database. PIRPSD is available from
ftp://nbrfa.georgetown.edu/pir_databases/psd/mysql.
ChangeLog - FASTA v35
$Id: changes_v35.html 120 2010-01-31 19:42:09Z wrp $
$Revision: 210 $
Summary - Major Changes in FASTA version 35 (August, 2007)
- Accurate shuffle based statistics for searches of small libraries (or pairwise comparisons).
- Inclusion of lalign35 (SIM) into FASTA3. Accurate statistics for
lalign35 alignments. plalign has been replaced by
lalign35 and lav2ps.
- Two new global alignment programs: ggsearch35 and glsearch35.
February 7, 2008
Allow annotations in library, as well as query sequences. Currently,
annotations are only available within sequences (i.e., they are not
read from the feature table), but they should be available in FASTA
format, or any of the other ascii text formats (EMBL/Swissprot,
Genbank, PIR/GCG). If annotations are present in a library and the
annotation characters include '*', then
the -V '*' option MUST be used. However, special characters other
than '*' are ignored, so annotations of '@', '%', or '@' should be
transparent.
In translated sequence comparisons, annotations are only available for
the protein sequence.
January 25, 2007
Support protein queries and sequence
libraries that contain 'O' (pyrrolysine) and 'U' (selenocysteine).
('J' was supported already). Currently, 'O' is mapped automatically to
'K' and 'U' to 'C'.
Dec. 13, 2007 CVS fa35_03_02m
Add ability to search a subset of a library using a file name and a
list of accession/gi numbers. This version introduces a new filetype,
10, which consists of a first line with a target filename, format, and
accession number format-type, and optionally the accession number
format in the database, followed by a list of accession numbers. For
example:
</slib2/blast/swissprot.lseg 0:2 4|
3121763
51701705
7404340
74735515
...
Tells the program that the target database is swissprot.lseg, which is
in FASTA (library type 0) format.
The accession format comes after the ":". Currently, there are four
accession formats, two that require ordered accessions (:1, :2), and
two that hash the accessions (:3, :4) so they do not need to be
ordered. The number and character after the accession format
(e.g. "4|") indicate the offset of the beginning of the accession and
the character that terminates the accession. Thus, in the typical
NCBI Fasta definition line:
>gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)
The offset is 4 and the termination character is '|'. For databases
distributed in FASTA format from the European Bioinformatics
Institute, the offset depends on the name of the database, e.g.
>SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).
and the delimiter is ' ' (space, the default).
Accession formats 1 and 3 expect strings; accession formats 2 and 4
work with integers (e.g. gi numbers).
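
The header line and the offset/terminator rule can be sketched in Python as follows. This is an illustration of the format as described here, not the FASTA library-reading code: the names are hypothetical, the offset is assumed to be a 0-based position that counts the leading `>`, and both header fields (`libtype:format`) are assumed to be present.

```python
import re
from typing import NamedTuple


class SubsetHeader(NamedTuple):
    filename: str
    lib_type: int
    acc_format: int
    offset: int
    terminator: str


def parse_subset_header(line: str) -> SubsetHeader:
    """Parse a first line such as '</slib2/blast/swissprot.lseg 0:2 4|'."""
    parts = line.rstrip("\n").lstrip("<").split(" ", 2)
    filename = parts[0]
    lib_type, _, acc_format = parts[1].partition(":")
    offset, terminator = 0, " "  # space is the default delimiter
    if len(parts) == 3:
        match = re.match(r"(\d+)(.?)", parts[2])
        if match:
            offset = int(match.group(1))
            terminator = match.group(2) or " "
    return SubsetHeader(filename, int(lib_type), int(acc_format), offset, terminator)


def extract_accession(defline: str, offset: int, terminator: str) -> str:
    """Return the accession that starts at offset and ends at terminator."""
    end = defline.find(terminator, offset)
    return defline[offset:] if end == -1 else defline[offset:end]
```

Applied to the NCBI definition line above with offset 4 and terminator `|`, `extract_accession` returns `1170095`.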
December 10, 2007
Provide encoded annotation information with
-m 9c alignment summaries. The encoded alignment information makes it
much simpler to highlight changes in critical residues.
August 22, 2007
A new program is available, lav2svg, which creates SVG (Scalable Vector
Graphics) output. In addition, ps_lav, which was introduced May 30, 2007,
has been replaced by lav2ps. SVG files are more easily edited with Adobe
Illustrator than postscript (lav2ps) files.
July 25, 2007 CVS fa35_02_02
Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.
July 23, 2007
Add code to support sub-sequence ranges for "library"
sequences - necessary for fully functional prss (ssearch35) and
lalign35. For all programs, it is now possible to specify a subset of
both the query and the library, e.g.
lalign35 -q mchu.aa:1-74 mchu.aa:75-148
Note, however, that the subset range applied to the library will be
applied to every sequence in the library - not just the first - and
that the same subset range is applied to each sequence. This probably
makes sense only if the library contains a single sequence (this is
also true for the query sequence file).
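
A short Python sketch of how a `file:start-stop` specification could be split and applied. The function names are hypothetical, and the coordinates are assumed to be 1-based and inclusive, which is consistent with `1-74` and `75-148` dividing a 148-residue sequence into two halves.

```python
import re

_RANGE = re.compile(r"^(?P<name>.+):(?P<start>\d+)-(?P<stop>\d+)$")


def split_range(spec: str):
    """Split 'mchu.aa:1-74' into ('mchu.aa', (1, 74)).
    A spec without a range returns (spec, None)."""
    match = _RANGE.match(spec)
    if match is None:
        return spec, None
    return match["name"], (int(match["start"]), int(match["stop"]))


def apply_range(sequence: str, rng):
    """Return the sub-sequence for a 1-based inclusive range; the same
    range would be applied to every sequence in the file."""
    if rng is None:
        return sequence
    start, stop = rng
    return sequence[start - 1:stop]
```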
July 3, 2007 CVS fa35_02_01
Merge of previous fasta34 with development version fasta35.
June 26, 2007
Add amino-acid 'J' for 'I' or 'L'.
Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix,
"-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).
June 7, 2007
ggsearch35(_t), glsearch35(_t) can now use PSSMs.
May 30, 2007 CVS fa35_01_04
Addition of ps_lav (now lav2ps or lav2svg) -- which can be used to plot
the lav output of lalign35 -m 11.
lalign35 -m 11 | lav2ps
replaces plalign (from FASTA2).
May 2, 2007
The labels on the alignment scores are much more informative (and more
diverse). In the past, alignment scores looked like:
>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer (218 aa)
s-w opt: 1497 Z-score: 1857.5 bits: 350.8 E(): 8.3e-97
Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
^^^^^^^^^^^^^^
where the highlighted text was either: "Smith-Waterman" or "banded
Smith-Waterman". In fact, scores were calculated in other ways,
including global/local for fasts and fastf. With the addition of
ggsearch35, glsearch35, and lalign35, there are many more ways to
calculate alignments: "Smith-Waterman" (ssearch and protein fasta),
"banded Smith-Waterman" (DNA fasta), "Waterman-Eggert",
"trans. Smith-Waterman", "global/local", "trans. global/local",
"global/global (N-W)". The last option is a global global alignment,
but with the affine gap penalties used in the Smith-Waterman
algorithm.
April 19, 2007 CVS fa34t27br_lal_3
Two new programs, ggsearch35(_t) and glsearch35(_t), are now available.
ggsearch35(_t) calculates an alignment score that is global in the
query and global in the library; glsearch35(_t) calculates an alignment
that is global in the query and local in the library sequence. The
latter program is designed for global alignments to domains.
Both programs assume that scores are normally distributed. This
appears to be an excellent approximation for ggsearch35 scores, but
the distribution is somewhat skewed for global/local (glsearch)
scores.
ggsearch35(_t) only compares the query to library sequences
that are between 80% and 125% of the length of the query; glsearch
limits comparisons to library sequences that are longer than 80% of
the query. Initial results suggest that there is relatively little
length dependence of scores over this range (scores go down
dramatically outside these ranges).
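The length limits just described can be written as a simple filter. This is a minimal sketch, not the FASTA source: the function names are hypothetical, and treating the 80% and 125% boundaries as inclusive for ggsearch (and strict for glsearch, following "longer than") is an assumption.

```python
def ggsearch_compares(query_len, lib_len):
    """True if a library sequence falls inside the global/global length window.

    Integer arithmetic avoids floating-point surprises at the boundaries:
    lib_len >= 0.80 * query_len  and  lib_len <= 1.25 * query_len.
    """
    return 5 * lib_len >= 4 * query_len and 4 * lib_len <= 5 * query_len


def glsearch_compares(query_len, lib_len):
    """True if a library sequence is longer than 80% of the query (no upper limit)."""
    return 5 * lib_len > 4 * query_len
```

For a 200-residue query, the global/global window under these assumptions runs from 160 to 250 residues, while the global/local filter accepts any library sequence of 161 residues or more.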
March 29, 2007 CVS fa34t27br_lal_1
At last, the lalign (SIM) algorithm has been moved from FASTA21 to
FASTA35. A plalign equivalent is also available using
lalign -m 11 | lav2ps or | lav2svg.
The statistical estimates for lalign35 should be much more accurate
than those from the earlier lalign, because lambda and K are estimated
from shuffles.
In addition, all programs can now generate accurate statistical
estimates with shuffles if the library has fewer than 500 sequences.
If the library contains more than 500 sequences and the sequences are
related, then the -z 11 option should be used.
FASTA v34 Change Log
fasta36-36.3.8i_14-Nov-2020/doc/changes_v36.html
ChangeLog - FASTA v36
Updates - FASTA version 36.3.8i (Nov, 2022)
- Enable translation table -t 9 for Echinoderms. This bug has existed
since alternate translation tables were first made available.
- Add an option, -Xg, that preserves the gi|12345 string in the score
summary and alignment output.
- Changes in scripts (get_protein.py, ann_pfam_www.pl, ann_pfam_www.py)
to address changes in web addresses. Addition of ann_pfam_sql.py
(Python version of ann_pfam_sql.pl).
Updates - FASTA version 36.3.8i (Nov, 2020)
- fasta-36.3.8i (November, 2020) incorporates the SIMDe (SIMD-everywhere,
https://github.com/simd-everywhere/simde/blob/master/simde/x86/sse2.h)
macro definitions that allow the smith_waterman_sse2.c,
global_sse2.c, and glocal_sse2.c code to be compiled on non-Intel
architectures (currently tested on ARM/NEON). Many thanks to
Michael R. Crusoe (https://orcid.org/0000-0002-2961-9670) for the
SIMDe code conversion, and to Evan Nemerson for creating SIMDe.
- The code to read FASTA format sequence files now ignores lines with
'#' at the beginning, for compatibility with PSI Extended FASTA
Format (PEFF) files (http://www.psidev.info/peff).
Updates - FASTA version 36.3.8h (May, 2020)
- Correct bug where library sequence and residue
count was not reset when large memory mapped databases
that did not fit into memory were searched with multiple query sequences.
- Regularization of ***ERROR and ***Warning messages
- Changes to reduce compiler warnings
- The SSE2 implementations of the Smith-Waterman algorithm and a
corresponding global alignment algorithm are now available under the
BSD open source license.
Updates - FASTA version 36.3.8h (March, 2019)
- The FASTA programs have been released under the Apache2.0 Open
Source License. The COPYRIGHT file, and copyright notices in
program files, have been updated to reflect this change.
- FASTA can now use shell-scripts to produce both query and library sequence sets.
- [Feb, 2019] Scripts are available for extracting genomic DNA sequences
using BEDTools. Combined with the ability to specify sequences
using shell-scripts, this greatly simplifies the process of aligning
a protein or DNA sequence to a region of a genome.
- preliminary code is available to read NCBI BLAST version 5 format libraries.
- fasta-36.3.8h includes bug fixes for translated alignments
with termination codons, the ability to use scripts as query
and library sequences, and new scripts for extracting genomic
DNA sequences given chromosome coordinates.
- fasta-36.3.8g includes bug fixes for sub-alignment scoring and
psisearch2 scripts, new annotation scripts for exons, and
fixes enabling very low statistical thresholds with ggsearch36
and glsearch36.
- fasta-36.3.8e/scripts includes updated scripts for
capturing domain and feature annotations using the
EBI/proteins API (https://www.ebi.ac.uk/proteins/api/) to get
Uniprot annotations and exon locations.
- The fasta-36.3.8e/psisearch2/ directory now provides psisearch2_msa.pl
and psisearch2_msa.py, functionally identical scripts for iterative
searching with psiblast or ssearch36. psisearch2_msa.pl offers an
option, --query_seed, that can dramatically reduce false-positives
caused by alignment overextension, with very little loss of search
sensitivity.
- The fasta-36.3.8d/scripts/ directory now provides a script,
annot_blast_btop2.pl, that allows annotations and sub-alignment scoring
on BLAST alignments that use the tabular format with BTOP alignment
encoding.
- Alignment sub-scoring scripts have been extended to allow
overlapping domains. This requires a modified annotation file format.
The "classic" format placed the beginning and end of a domain on different lines:
1 [ - GST_N
88 ] -
90 [ - GST_C
208 ] -
Since the closing "]" was associated with the previous "[", domains could not overlap.
The new format is:
1 - 88 GST_N
90 - 208 GST_C
which allows annotations of the form:
1 - 88 GST_N
75 - 123 GST-middle
90 - 208 GST_C
- New annotation scripts are available in the fasta-36.3.8/scripts
directory, e.g. ann_pfam_www_e.pl (Pfam) and ann_up_www2_e.pl
(Uniprot), to support this new format. If the domain annotations
provided by Pfam or Uniprot overlap, then overlapping domains are
provided. The new _e.pl scripts can be directed to provide
non-overlapping domains, using the boundary averaging strategy in
the older scripts, by specifying the --no-over option.
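The interval-style annotation format shown above ("start - end name") can be read and checked for overlaps with a few lines of code. This is only an illustrative sketch: the function names are hypothetical, the fields are assumed to be whitespace-separated, and the real annotation files may carry additional columns or header lines that this sketch ignores.

```python
def parse_domains(text):
    """Parse 'start - end name' lines into (start, end, name) tuples."""
    domains = []
    for line in text.splitlines():
        fields = line.split()
        # skip anything that is not in the 'start - end name' layout
        if len(fields) < 4 or fields[1] != "-":
            continue
        domains.append((int(fields[0]), int(fields[2]), fields[3]))
    return domains


def overlapping_pairs(domains):
    """Return the names of every pair of domains that share at least one residue."""
    ordered = sorted(domains)
    pairs = []
    for i, (start1, end1, name1) in enumerate(ordered):
        for start2, end2, name2 in ordered[i + 1:]:
            if start2 <= end1:  # second domain begins before the first one ends
                pairs.append((name1, name2))
    return pairs


example = """1 - 88 GST_N
75 - 123 GST-middle
90 - 208 GST_C"""
print(overlapping_pairs(parse_domains(example)))
```

With the three-domain example from the text, GST-middle overlaps both GST_N and GST_C, which the "classic" bracket format could not express.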
Updates - FASTA version 36.3.6f (August, 2014)
FASTA version 36.3.6f extends previous versions in several ways:
- There is a new command line option, -XI, that causes the
alignment programs to report 100% identity only when there are no
mismatches. In previous versions, one mismatch in 10,000 would round
up to 100.0% identity; with -XI, the identity will be
reported as 99.9%.
- The option to provide alignment encodings (-m 9c, or -m 9C for CIGAR
strings) has been extended to provide mismatch information in the
alignment encoding using the -m 9d (classic FASTA alignment encoding)
or -m 9D (CIGAR string). For protein alignments, which are often < 40%
identical, enabling mismatch encoding produces very long CIGAR
strings.
- Provide more scripts for annotating proteins using either UniProt or
Pfam web resources.
Additional bug fixes are documented in fasta-36.3.6f/doc/readme.v36
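The effect of -XI on the reported identity can be illustrated with a short sketch. It is not taken from the FASTA source: the function name and the `strict` flag are hypothetical, and truncating (rather than rounding) to one decimal place is an assumption chosen to reproduce the 99.9% figure quoted above.

```python
def percent_identity(n_ident, aln_len, strict=False):
    """Format percent identity with one decimal place.

    strict=False uses ordinary rounding, so 9999/10000 prints as 100.0.
    strict=True (a stand-in for -XI) never prints 100.0 unless every
    aligned position is identical.
    """
    if strict and n_ident < aln_len:
        # integer arithmetic: identity in tenths of a percent, truncated
        tenths = (1000 * n_ident) // aln_len
        return f"{tenths // 10}.{tenths % 10}"
    return f"{100.0 * n_ident / aln_len:.1f}"


print(percent_identity(9999, 10000))               # conventional rounding
print(percent_identity(9999, 10000, strict=True))  # -XI-like behavior
```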
Updates - FASTA version 36.3.6 (July, 2013)
FASTA version 36.3.6 provides two new features:
- A new script-based strategy for including annotation information.
- Domain annotation information can be used to partition the
alignment, and partition the scores of the alignment (sub-alignment
scores). Sub-alignment scores can be used to identify regions of
alignment over-extension, where a homologous domain aligns, but the
alignment extends beyond the homologous region into an adjacent
non-homologous domain.
Several scripts are provided (e.g. scripts/ann_feats_up_www.pl) that
can be used to add Uniprot feature and domain annotations to searches
of SwissProt and Uniprot.
(fasta-36.3.5 January 2013)
The NCBI's transition from BLAST to BLAST+ several years ago broke the
ability of ssearch36 to use PSSMs, because psiblast did not produce
the binary ASN.1 PSSMs that ssearch36 could parse. With the January
2013 fasta-36.3.5f release, ssearch36 can read binary ASN.1 PSSM files
produced by the NCBI datatool utility.
See fasta_guide.pdf for more information (look for the -P option).
Summary - Major Changes in FASTA version 36.3.5 (May, 2011)
- By default, the FASTA36 programs are no longer interactive. Typing
fasta36 presents a short help message, and fasta36 -help
presents a complete list of options. To see the interactive prompts, use
fasta36 -I.
Likewise, the score histogram is no longer shown by default; use
the -H option to show the histogram (or compile with
-DSHOW_HIST for previous behavior).
The _t (fasta36_t) versions of the programs are
built automatically on Linux/MacOSX machines and
named fasta36, etc. (the programs are threaded by default,
and only one program version is built).
Documentation has been significantly revised and updated.
See doc/fasta_guide.pdf for a description of the programs and options.
- Display of all significant alignments between query and library
sequence. BLAST has always displayed multiple high-scoring
alignments (HSPs) between the query and library sequence; previous
versions of the FASTA programs displayed only the best alignment,
even when other high-scoring alignments were present. This is the
major change in FASTA36. For most programs
(fasta36, ssearch36, [t]fast[xy]36), if the library sequence contains
additional significant alignments, they will be displayed with the
alignment output, and as part of -m 9 output (the initial list of high
scores).
By default, the statistical threshold for alternate alignments
(HSPs) is the E()-threshold / 10.0. For proteins, the default
expect threshold is E() < 10.0, so the secondary threshold for showing
alternate alignments is E() < 1.0. For translated
comparisons, the E()-thresholds are 5.0/0.5; for DNA:DNA 2.0/0.2.
Both the primary and secondary E()-thresholds are set with the
-E "prim sec" command line option. If the secondary
value is between zero and 1.0, it is taken as the actual
threshold. If it is > 1.0, it is taken as a divisor for the primary
threshold. If it is negative, alternative alignments are disabled
and only the best alignment is shown.
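The three cases for the secondary value of -E "prim sec" can be summarized in code. The sketch below follows the rules stated above but is not the FASTA implementation; the function names are hypothetical, and the default divisor of 10.0 is taken from the "E()-threshold / 10.0" default.

```python
def alternate_threshold(primary, secondary):
    """Return the E() cutoff for alternate alignments, or None if they are disabled."""
    if secondary < 0:
        return None                 # negative: show only the best alignment
    if secondary > 1.0:
        return primary / secondary  # > 1.0: divisor for the primary threshold
    return secondary                # 0 .. 1.0: used directly as the threshold


def parse_e_option(arg, default_secondary=10.0):
    """Split a -E "prim sec" argument into (primary, alternate-alignment cutoff)."""
    fields = arg.split()
    primary = float(fields[0])
    secondary = float(fields[1]) if len(fields) > 1 else default_secondary
    return primary, alternate_threshold(primary, secondary)


print(parse_e_option("10.0"))       # protein default: secondary cutoff is primary / 10
print(parse_e_option("2.0 0.001"))  # explicit secondary threshold
print(parse_e_option("10.0 -1"))    # alternate alignments disabled
```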
- New statistical options, -z 21, 22, 26, provide a second E()-value
estimate based on shuffles of the highest scoring sequences.
- New output options. -m 8 provides the same output format as
tabular BLAST; -m 8C mimics tabular blast with comment
lines. -m 9C provides CIGAR encoded alignments.
(fasta-36.3.4) Alignment option -m B provides BLAST-like alignments
(no context, coordinates at the beginning and end of the alignment
line, Query/Sbjct).
- Improved performance using statistics-based thresholds for
gap-joining and band-optimization in the heuristic FASTA local
alignment programs (fasta36, [t]fast[xy]36). By
default (fasta36.3), fasta36 and [t]fast[xy]36 can use
a similar strategy to BLAST to set the thresholds for combining
ungapped regions and performing band alignments. This dramatically
reduces the number of band alignments performed, for a speed increase
of 2 - 3X. The original statistical thresholds can be enabled with
the -c O (upper-case letter 'O') command line option.
Protein and translated protein alignment programs can also use ktup=3
for increased speed, though ktup=2 is still the default.
Statistical thresholds can dramatically reduce the number of
"optimized" scores, from which statistical estimates are calculated.
To address this problem, the statistical estimation procedure has
been adjusted to correct for the fraction of scores that were
optimized. This process can dramatically improve statistical accuracy
for some matrices and gap penalties, e.g. BLOSUM62 -11/-1.
With the new joining thresholds, the -c "E-opt E-join"
options have expanded meanings. -c "E-opt E-join"
calculates a threshold designed (but not guaranteed) to do band
optimization and joining for that fraction of sequences. Thus, -c
"0.02 0.1" seeks to do band optimization (E-opt) on 2% of alignments,
and joining on 10% of alignments. -c "40 10" sets the gap
threshold as in earlier versions.
- A new option (-e expand_script.sh) is available that allows
the set of sequences that are aligned to be larger than the set of
sequences searched. When the -e expand_script.sh option is
used, the expand_script.sh script is run with an input
argument that is a file of accession numbers and E()-values; this
information can be used to produce a fasta-formatted list of
additional sequences, which will then be compared and aligned (if they
are significant), and included in the list of high scoring sequences
and the alignments. The expanded set of sequences does not change the
database size or statistical parameters; it simply expands the set of
high-scoring sequences.
- The -m F option can be used to produce multiple output formats in
different files from the same search. For example,
-m "F9c,10 m9c10.output" -m "FBB blastBB.output"
produces two output files in addition to the normally formatted output
sent to stdout. The m9c10.output file contains -m 9c score descriptions
and -m 10 alignments, while blastBB.output contains BLAST-like output
(-m BB).
- Scoring matrices can vary with query sequence length. In large-scale
searches with metagenomics reads, some reads may be too short to
produce statistically significant scores against comprehensive
databases (e.g. a DNA read of 90 nt is translated into 30 aa, which
would require a scoring matrix with at least 1.3 bits/position to
produce a 40 bit score). fasta-36.3.* includes the option to specify
a "variable" scoring matrix by including '?' as the first letter of
the scoring matrix abbreviation, e.g. fasta36_t -q -s '?BP62' would
use BP62 for sequences long enough to produce significant alignment
scores, but would use scoring matrices with more information content
for shorter sequences. The FASTA programs include BLOSUM50 (0.49
bits/pos) and BLOSUM62 (0.58 bits/pos) but can range to MD10 (3.44
bits/position). The variable scoring matrix option searches down the
list of scoring matrices to find one with information content high
enough to produce a 40 bit alignment score. (Several bugs in the
process are fixed in fasta-36.3.2.)
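The arithmetic behind the variable scoring matrix option can be sketched as follows. This is an illustration under stated assumptions, not the FASTA code: the function names are hypothetical, and the matrix list is abbreviated to the three matrices whose information content is quoted above (the real list presumably contains intermediate matrices).

```python
TARGET_BITS = 40.0  # alignment score the text treats as the significance target

# (name, bits per aligned position), ordered by increasing information content
MATRICES = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]


def bits_per_position_needed(query_len):
    """Information content required for a full-length alignment to reach 40 bits."""
    return TARGET_BITS / query_len


def pick_matrix(query_len, start="BLOSUM62"):
    """Search down the list from `start` for a matrix that can reach 40 bits."""
    names = [name for name, _ in MATRICES]
    for name, bits in MATRICES[names.index(start):]:
        if bits * query_len >= TARGET_BITS:
            return name
    return MATRICES[-1][0]  # fall back to the most informative matrix


print(round(bits_per_position_needed(30), 2))  # a 90 nt read translated to 30 aa
print(pick_matrix(30), pick_matrix(100))
```

A 30-residue query needs about 1.33 bits per position, which neither BLOSUM50 nor BLOSUM62 provides, so the search moves down the list; a 100-residue query can stay with the requested matrix.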
- Several less-used options (-1, -B, -o, -x, -y) have
become extended options, available via the -X (upper case X) option.
The old -X off1,off2 option is now -o off1,off2.
By default, the program will read up to 2 GB (32-bit systems) or 12 GB
(64-bit systems) of the database into memory for multi-query searches.
The amount of memory available for databases can be set with
the -XM4G option.
- Much greater flexibility in specifying combinations of library files
and subsets of libraries. It has always been possible to search a
list of libraries specified by an indirect (@) file; the FASTA36
programs can include indirect files of library names inside of
indirect files of library names.
- fasta-36.3.2 ggsearch36 (global/global)
and glsearch36 now incorporate SSE2 accelerated global
alignment, developed by Michael Farrar. These programs are now about
20-fold faster.
- fasta-36.2.1 (and later versions) are fully threaded, both for
searches, and for alignments. The programs routinely run 12 - 15X
faster on dual quad-core machines with "hyperthreading".
Summary - Major Changes in FASTA version 35 (August, 2007)
- Accurate shuffle based statistics for searches of small libraries (or pairwise comparisons).
- Inclusion of lalign35 (SIM) into FASTA3. Accurate statistics for
lalign35 alignments. plalign has been replaced by
lalign35 and lav2ps.
- Two new global alignment programs: ggsearch35 and glsearch35.
February 7, 2008
Allow annotations in library, as well as
query sequences. Currently, annotations are only available within
sequences (i.e., they are not read from the feature table), but they
should be available in FASTA format, or any of the other ascii text
formats (EMBL/Swissprot, Genbank, PIR/GCG). If annotations are
present in a library and the annotation characters include '*', then
the -V '*' option MUST be used. However, special characters other
than '*' are ignored, so annotations of '@', '%', or '@' should be
transparent.
In translated sequence comparisons, annotations are only available for
the protein sequence.
January 25, 2007
Support protein queries and sequence
libraries that contain 'O' (pyrrolysine) and 'U' (selenocysteine).
('J' was supported already). Currently, 'O' is mapped automatically to
'K' and 'U' to 'C'.
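The residue mapping described above amounts to a simple character translation. The sketch below is illustrative only; the function name is hypothetical, and handling lower-case residues the same way is an assumption.

```python
# 'O' (pyrrolysine) is treated as 'K'; 'U' (selenocysteine) is treated as 'C'
_RESIDUE_MAP = str.maketrans("OUou", "KCkc")


def map_rare_residues(seq):
    """Replace 'O' and 'U' with the standard residues they are scored as."""
    return seq.translate(_RESIDUE_MAP)


print(map_rare_residues("MKOLUGA"))
```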
Dec. 13, 2007 CVS fa35_03_02m
Add ability to search a subset of a library using a file name and a
list of accession/gi numbers. This version introduces a new filetype,
10, which consists of a first line with a target filename, format, and
accession number format-type, and optionally the accession number
format in the database, followed by a list of accession numbers. For
example:
</slib2/blast/swissprot.lseg 0:2 4|
3121763
51701705
7404340
74735515
...
Tells the program that the target database is swissprot.lseg, which is
in FASTA (library type 0) format.
The accession format comes after the ":". Currently, there are four
accession formats, two that require ordered accessions (:1, :2), and
two that hash the accessions (:3, :4) so they do not need to be
ordered. The number and character after the accession format
(e.g. "4|") indicate the offset of the beginning of the accession and
the character that terminates the accession. Thus, in the typical
NCBI Fasta definition line:
>gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)
The offset is 4 and the termination character is '|'. For databases
distributed in FASTA format from the European Bioinformatics
Institute, the offset depends on the name of the database, e.g.
>SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104).
and the delimiter is ' ' (space, the default).
Accession formats 1 and 3 expect strings; accession formats 2 and 4
work with integers (e.g. gi numbers).
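The offset and termination character can be illustrated by extracting an accession from a definition line. This is a minimal sketch with a hypothetical function name; it assumes the offset is counted from the start of the line, including the leading '>', which matches the offset of 4 given for the NCBI example. Applying the same offset to the EBI example line is also an assumption, since the text only says the offset depends on the database name.

```python
def extract_accession(defline, offset=4, terminator="|"):
    """Return the accession that starts at `offset` and ends at `terminator`."""
    rest = defline[offset:]
    end = rest.find(terminator)
    return rest if end < 0 else rest[:end]


ncbi = ">gi|1170095|sp|P46419|GSTM1_DERPT Glutathione S-transferase (GST class-mu)"
ebi = ">SW:104K_THEAN Q4U9M9 104 kDa microneme/rhoptry antigen precursor (p104)."

print(extract_accession(ncbi))                 # gi number, suitable for formats 2 and 4
print(extract_accession(ebi, terminator=" "))  # string accession, formats 1 and 3
```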
December 10, 2007
Provide encoded annotation information with
-m 9c alignment summaries. The encoded alignment information makes it
much simpler to highlight changes in critical residues.
August 22, 2007
A new program is available, lav2svg, which creates SVG (Scalable Vector
Graphics) output. In addition, ps_lav, which was introduced May 30, 2007,
has been replaced by lav2ps. SVG files are more easily edited with Adobe
Illustrator than postscript (lav2ps) files.
July 25, 2007 CVS fa35_02_02
Change default gap penalties for OPTIMA5 matrix to -20/-2 from -24/-4.
July 23, 2007
Add code to support sub-sequence ranges for "library"
sequences - necessary for fully functional prss (ssearch35) and
lalign35. For all programs, it is now possible to specify a subset of
both the query and the library, e.g.
lalign35 -q mchu.aa:1-74 mchu.aa:75-148
Note, however, that the subset range applied to the library will be
applied to every sequence in the library - not just the first - and
that the same subset range is applied to each sequence. This probably
makes sense only if the library contains a single sequence (this is
also true for the query sequence file).
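The "file:start-end" notation and its effect on every sequence in a file can be sketched as below. The function names are hypothetical, and treating the range as 1-based and inclusive is an assumption (it matches the alignment coordinates FASTA reports, but is not stated in the text).

```python
import re


def split_subset(spec):
    """Split 'mchu.aa:1-74' into ('mchu.aa', (1, 74)); no range gives (spec, None)."""
    match = re.fullmatch(r"(.+):(\d+)-(\d+)", spec)
    if match is None:
        return spec, None
    return match.group(1), (int(match.group(2)), int(match.group(3)))


def apply_subset(sequences, subset):
    """Apply the same 1-based inclusive range to every sequence in a library."""
    if subset is None:
        return list(sequences)
    start, end = subset
    return [seq[start - 1:end] for seq in sequences]


print(split_subset("mchu.aa:75-148"))
print(apply_subset(["ACDEFGHIKL", "MNPQRSTVWY"], (3, 5)))
```

Because the range is applied to each sequence in turn, a multi-sequence library yields one slice per entry, which is why the option is mainly useful for single-sequence files.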
July 3, 2007 CVS fa35_02_01
Merge of previous fasta34 with development version fasta35.
June 26, 2007
Add amino-acid 'J' for 'I' or 'L'.
Add Mueller and Vingron (2000) J. Comp. Biol. 7:761-776 VT160 matrix,
"-s VT160", and OPTIMA_5 (Kann et al. (2000) Proteins 41:498-503).
June 7, 2007
ggsearch35(_t) and glsearch35(_t) can now use PSSMs.
May 30, 2007 CVS fa35_01_04
Addition of ps_lav (now lav2ps or lav2svg), which can be used to plot the
lav output of lalign35 -m 11.
lalign35 -m 11 | lav2ps replaces plalign (from FASTA2).
May 2, 2007
The labels on the alignment scores are much more informative (and more
diverse). In the past, alignment scores looked like:
>>gi|121716|sp|P10649|GSTM1_MOUSE Glutathione S-transfer (218 aa)
s-w opt: 1497 Z-score: 1857.5 bits: 350.8 E(): 8.3e-97
Smith-Waterman score: 1497; 100.0% identity (100.0% similar) in 218 aa overlap (1-218:1-218)
^^^^^^^^^^^^^^
where the highlighted text was either: "Smith-Waterman" or "banded
Smith-Waterman". In fact, scores were calculated in other ways,
including global/local for fasts and fastf. With the addition of
ggsearch35, glsearch35, and lalign35, there are many more ways to
calculate alignments: "Smith-Waterman" (ssearch and protein fasta),
"banded Smith-Waterman" (DNA fasta), "Waterman-Eggert",
"trans. Smith-Waterman", "global/local", "trans. global/local",
"global/global (N-W)". The last option is a global/global alignment,
but with the affine gap penalties used in the Smith-Waterman
algorithm.
April 19, 2007 CVS fa34t27br_lal_3
Two new programs, ggsearch35(_t) and glsearch35(_t), are now available.
ggsearch35(_t) calculates an alignment score that is global in the
query and global in the library; glsearch35(_t) calculates an alignment
that is global in the query and local in the library sequence. The
latter program is designed for global alignments to domains.
Both programs assume that scores are normally distributed. This
appears to be an excellent approximation for ggsearch35 scores, but
the distribution is somewhat skewed for global/local (glsearch)
scores.
ggsearch35(_t) only compares the query to library sequences
that are between 80% and 125% of the length of the query; glsearch
limits comparisons to library sequences that are longer than 80% of
the query. Initial results suggest that there is relatively little
length dependence of scores over this range (scores go down
dramatically outside these ranges).
March 29, 2007 CVS fa34t27br_lal_1
At last, the lalign (SIM) algorithm has been moved from FASTA21 to
FASTA35. A plalign equivalent is also available using
lalign -m 11 | lav2ps or | lav2svg.
The statistical estimates for lalign35 should be much more accurate
than those from the earlier lalign, because lambda and K are estimated
from shuffles.
In addition, all programs can now generate accurate statistical
estimates with shuffles if the library has fewer than 500 sequences.
If the library contains more than 500 sequences and the sequences are
related, then the -z 11 option should be used.
FASTA v34 Change Log
fasta36-36.3.8i_14-Nov-2020/doc/fasta.defaults
#pgm mol matrix g_open g_ext fr_shft e_cut ktup
# -n/-p -s -e -f -h/-j -E argv[3]
fasta prot BL50 -10 -2 - 10.0 2
fasta dna +5/-4 -14 -4 - 2.0 6
ssearch prot bl50 -10 -2 - 10.0 -
ssearch dna +5/-4 -14 -4 - 2.0 -
fastx prot BL50 -12 -2 -20 5.0 2
fasty prot BL50 -12 -2 -20/-24 5.0 2
tfastx dna BL50 -14 -2 -20 5.0 2
tfasty dna BL50 -14 -2 -20/-24 5.0 2
fasts prot MD20-MS - - - 5.0 -
tfasts prot MD10-MS - - - 2.0 -
fastf prot MD20 - - - 5.0 -
tfastf prot MD10 - - - 2.0 -
fastm prot MD20 - - - 5.0 -
tfastm prot MD10 - - - 2.0 -
lalign prot BL50 -12 -2 10.0 -
fasta36-36.3.8i_14-Nov-2020/doc/fasta.history.tex
\begin{longtable}{p{0.75 in}p{5.25 in}}
\multicolumn{2}{c}{\textbf{FASTA version history (cont.)}} \\
\hline\\[-1.0ex]
% \textbf{Date} & {\bf Improvements} \\[0.5ex] \hline \\[-1.5ex]
\endhead
\multicolumn{2}{l}{{\Large {\bf FASTA version history}}} \\[2 ex]
\hline\\[-1.0ex]
% {\bf Date} & {\bf Improvements} \\[0.5ex] \hline \\[-1.5ex]
\endfirsthead
\hline\\
& \\
\endfoot
\hline\\
& \\
\endlastfoot
\multicolumn{2}{c}{ \FASTA v33, Oct, 1999 -- Dec, 2000 } \\[1 ex]
\hline \\[-0.5 ex]
Oct 1999 & Add support for NCBI Blast2.0 formatted libraries, and
memory mapped databases. \FASTA now reads both \texttt{BLAST1.4} and
\texttt{BLAST2.0} formatted databases. (version 3.2t08)\\ & Include
Maximum Likelihood Estimates for Lambda and K ( -z 2) \\
& Include a new strategy for searching with low
complexity regions. The \texttt{pseg} program can produce libraries
with low complexity regions as lower case characters, which can be
ignored during the initial \texttt{FASTA}/\texttt{SSEARCH} scan, but are considered when
producing the final alignments. (3.3t01)\\
& Change output to report bit scores, which are also used by BLAST. \\
Mar 2000 & Another new statistics option, -z 6, uses Mott's
approach \cite{mot921} for calculating a
composition dependent Lambda for each sequence. (3.3t05) \\
Dec 2000 & Automatically change the gap penalties when alternate
(known) scoring matrices are used using Reese and Pearson gap
penalties \cite{wrp022}. First implementation to read from MySQL
databases. \\ May 2001 & change all \FASTA gap penalties from
first-residue, additional residue to the gap-open, gap-extend values
used by BLAST. \\[0.5ex]
\hline \\[-0.5 ex]
\multicolumn{2}{c}{ \FASTA v34, Jan, 2001 -- Jan, 2007 } \\[1 ex]
\hline \\[-0.5 ex]
Jun 2002 & Modify statistical estimation strategy to sample all the
sequences in the database, not just the first 60,000. (3.4t11) \\
Jan 2003 & Implementation of vector-accelerated (Altivec) code for
Smith-Waterman ({\tt SSEARCH}) and banded Smith-Waterman (\FASTA)
using the Rognes and Seeberg \cite{rog003} algorithm. This code was
removed in Sept, 2003, because of possible conflict with a patent
application, but was restored using a different algorithm in
Nov. 2004. \\
Jun 2003 & Provide \texttt{PSI-SEARCH} --- an implementation of
\texttt{SSEARCH} that can search with \texttt{PSI-BLAST} PSSM profile
files. \texttt{PSI-SEARCH} estimates statistical significance from
the distribution of actual alignment scores; thus the estimates are
much more reliable than \texttt{PSI-BLAST} estimates. Also, change
the similarity display to work with profiles. (3.4t22) \\
July 2003 & Provide ASN.1 definition line parsing for \texttt{BLAST}
{\tt formatdb} v.4 libraries. Restructure the programs to use a table-driven
approach to parameter setting. Two tables now define the algorithm,
query sequence type, library type, scoring matrix, and gap penalties for
all programs. \\
Sept 2003 & A new option {\tt -V} for annotating alignments
provided. Designed for highlighting post-translational modifications
with {\tt fasts}, it can also be used to highlight active sites and
other conserved residues. (3.4t23) \\
Dec 2003 & Addition of {\tt -U} option for RNA sequence
comparison. {\tt G:A} matches score like {\tt G:G} matches to account
for {\tt G:U} basepairs. Change default {\it ktup} for short query
sequences. Increase band-width for DNA banded final alignments. \\
July 2004 & Allow searching of \texttt{Postgres}, as well as
\texttt{MySQL} database queries. \\
Nov 2004 & (\texttt{fa34t24}) Incorporation of Erik Lindahl's ``anti-diagonal'' Altivec
implementation of \cite{woz974} for Smith-Waterman only. Altivec
{\tt ssearch34} is now faster than {\tt fasta34} for query sequences $<$ 250 amino acids. \\
Jan 2005 & Change {\tt FASTS} to accommodate very large numbers of
peptides ($>$100) for full coverage on long proteins \\
Jun. 2006 & (\texttt{fa34t26}) Incorporation of Smith-Waterman
algorithm for the SSE2 vector instructions written by Michael Farrar
\cite{farrar2007}. The SSE code speeds up Smith-Waterman 8 --
16-fold. \\[1.0 ex]
\hline \\[-0.5 ex]
\multicolumn{2}{c}{ \FASTA v35, March, 2007 -- March, 2010 } \\[1 ex]
\hline \\[-0.5 ex]
Mar. 2007 & fasta v35 -- Accurate shuffle-based $E()$-values for all searches and alignments; statistics from searches against small libraries are supplemented with shuffled alignments.\\[1 ex]
& More efficient threading strategies on multi-core computers, for 12X speedup on 16-core machines.\\[1 ex]
& Inclusion of \texttt{lalign} (\texttt{SIM}) local domain alignments. \texttt{lalign} alignments now have accurate shuffle-based $E()$-values.\\[1 ex]
Apr. 2007 & Introduction of \texttt{ggsearch}, for global alignment searches, and \texttt{glsearch}, for searches with scores that are global in the query and local in the library. \texttt{ggsearch} and \texttt{glsearch} calculate $E()$-values using the normal distribution. Both programs can search with \texttt{PSI-BLAST} PSSMs.\\[1 ex]
Dec. 2007 & Efficient strategy for searching subsets of databases (lists of GI or accession numbers) \\[1 ex]
Feb. 2008 & Annotations in either query or library sequences can be highlighted in the alignment, and the state of annotated residues is compactly summarized with \texttt{-m 9c}. \\[1 ex]
Oct. 2008 & Modification of \texttt{lsim4.c} (\texttt{lalign35}) provided by Xiaoqiu Huang to ensure
that self-alignments do not cross the identity diagonal. \\[1ex]
%\pagebreak
\hline \\[-0.5 ex]
\multicolumn{2}{c}{ \FASTA v36, March, 2010 -- } \\[1 ex]
\hline \\[-0.5 ex]
Mar. 2010 & \FASTA v36 displays all significant alignments between
query and library sequence. BLAST has always displayed multiple
high-scoring alignments (HSPs) between the query and library sequence;
previous versions of the FASTA programs displayed only the best
alignment, even when other high-scoring alignments were present.\\[1
ex]
& New statistical options, \texttt{-z 21, 22, 26}, provide a second $E2()$-value
estimate based on shuffles of the highest scoring sequences. \\[1 ex]
& Improved performance using statistics-based thresholds for
gap-joining and band-optimization in the heuristic FASTA local
alignment programs, increasing speed 2 - 3X. \\[1 ex]
& Greater flexibility in specifying combinations of library files
and subsets of libraries. \FASTA v36
programs can include indirect files of library names inside of
indirect files of library names. \\[1 ex]
& \FASTA36 programs are fully threaded, both for
searches, and for alignments. The programs routinely run 12 - 15X
faster on 8-core machines with "hyperthreading" (effectively 16 cores).
\\[1 ex]
& \texttt{-z 21} .. \texttt{26} E2() statistical estimates from
shuffled best scores.\\[1.0ex]
Sep. 2010 & \texttt{-m 8}, \texttt{-m 8C} BLAST tabular output. \\[1.0ex]
Nov, 2010 & Variable scoring matrices (\texttt{-s ?BP62}).\\[1.0ex]
Dec, 2010 & (\texttt{fasta-36.3.1}) SSE2 vectorized \texttt{ggsearch36}, \texttt{glsearch36} (Michael Farrar).\\[1.0ex]
Jan, 2011 & (\texttt{fasta-36.3.2}) MPI versions implemented and tested.\\[1ex]
Feb, 2011 & Introduce \texttt{-m B}, \texttt{-m BB} BLAST-like output.\\[1.0ex]
Mar, 2011 & (\texttt{fasta-36.3.4}) Program is no longer interactive by
default. \texttt{fasta36 -h} and \texttt{fasta36 -help} provide
common/complete options, with many defaults. \texttt{doc/fasta\_guide.pdf} available.\\[1.0ex]
May, 2011 & (\texttt{fasta-36.3.5}) Introduce (1) \texttt{-e
expand.sh} scripts to extend the effective size of the database
searched, based on significant hits; (2) \texttt{-m "F\# output.file"}
to send different output formats to different files; and (3)
\texttt{-X} expanded options, \texttt{-o} replaces the old \texttt{-X}
and \texttt{-Xo} replaces \texttt{-o}. \\[1.0ex]
Jan, 2012 & Include \texttt{.fastq} files as library type 7 \\[1.0ex]
May, 2012 & allow reverse-complement alignments with \texttt{ggsearch} and \texttt{glsearch} \\[1.0ex]
Jun, 2012 & Introduce \texttt{-V !script.pl} driven alignments, and variant scoring.\\[1.0ex]
Aug, 2012 & Introduce \texttt{-V !ann\_feats.pl} sub-alignment (region-based) scoring.\\[1.0ex]
Apr, 2013 & Extend \texttt{ENV} options to introduce a domain-plotting option for FASTA web sites.\\[1.0ex]
Nov, 2014 & (\texttt{fasta-36.3.7}) Allow overlapping domains in annotation scripts.\\[1.0ex]
Nov, 2015 & (\texttt{fasta-36.3.8}) Improvements in overlapping domain
code. Introduction of \texttt{scripts/annot\_blast\_btop.pl} to
provide annotations and subalignment scoring to \texttt{blast}
alignments. Provide annotations in \texttt{-m 8CB} BLAST tabular
output. \\[1.0ex]
May, 2016 & Implement \texttt{psisearch2\_msa.pl} and \texttt{psisearch2\_msa.py}.\\[1.0ex]
Feb, 2018 & Introduce \texttt{-X B}, which causes the \texttt{FASTA}
programs to ignore gaps when calculating percent identities, as
\texttt{BLASTP} does.\\[1.0ex]
\hline
\end{longtable}
fasta36-36.3.8i_14-Nov-2020/doc/fasta.options
##
## updated 13-Nov-2022 to correct extended options in initfa.c
doinit.c
case 'B': m_msg->z_bits = 0;
case 'C': m_msg->nmlen
case 'D': ppst->debug_lib = 1;
case 'F': m_msg->e_low
case 'H': m_msg->nohist = 0
case 'i': m_msg->revcomp = 1
case 'l': m_msg->flstr
case 'L': m_msg->long_info = 1
case 'm': m_msg->markx
case 'N': m_msg->maxn
case 'O': m_msg->outfile
case 'q':
case 'Q': m_msg->quiet = 1;
case 'R': m_msg->dfile
case 'T': max_workers
PCOMPLIB: worker_1,worker_n
case 'v': ppst->zs_win
case 'w': m_msg->aln.llen
case 'W': m_msg->aln.llcntx
case 'z': ppst->zsflag
case 'V': m_msg->ann_arr
case 'Z': ppst->zdb_size
initfa.c
case '3': m_msg->nframe = 3; /* TFASTA */
m_msg->nframe = 1; /* for TFASTXY */
m_msg->qframe = 1; /* for FASTA, FASTX */
case 'a': m_msg->aln.showall = 1;
case 'A': ppst->sw_flag= 1;
case 'b': m_msg->mshow
case 'c': ppst->param_u.fa.optcut
case 'd': m_msg->ashow;
case 'E': m_msg->e_cut, m_msg->e_cut_r
case 'f': ppst->gdelval
case 'g': ppst->ggapval
case 'h': help /ppst->gshift (-USHOW_HELP)
case 'I': m_msg->self = 1
case 'j': ppst->gshift, ppst->gsubs
case 'k': m_msg->shuff_max
case 'K': ppst->max_repeat
case 'M': m_msg->n1_low,&m_msg->n1_high
case 'n': m_msg->qdnaseq = SEQT_DNA (1)
case 'p': m_msg->qdnaseq = SEQT_PROT (0);
case 'r': ppst->p_d_mat,&ppst->p_d_mis
case 's': standard_pam(smstr); ppst->pamoff=atoi(bp+1);
case 'S': ppst->ext_sq_set = 1; /* treat upper/lower case residues differently */
case 't': ppst->tr_type
case 'X': initfa.c/parse_ext_opts() /* extended options */
'X1' : ppst->param_u.fa.initflag = 1 /* sort by init1 */
'Xa' : m_msg->m8_show_annot = 1
'XB' : m_msp->blast_ident = 1 /* count identities like BLAST (gaps not in divisor) */
'Xb' : m_msp->z_bits = 0 /* show z-scores, not bit-scores in best score list */
'Xg' : m_msp->gi_save = 1 /* do not remove gi|12345 from output */
'XI' : m_msp->tot_ident = 1 /* do not round 99.999% identity to 100% */
'XM' : m_msp->max_memK = l_arg /* specify maximum amount of memory for library */
'XN/XX' : ppst->pam_x_id_sim = 1/-1 /* modify treatment of N:N or X:X in identities */
'Xo' : ppst->param_u.fa.optflag = 0 /* do not calculate opt score */
'Xx' : ppst->pam_xx, ppst->pam_xm /* modify score for match to X */
'Xy' : ppst->param_u.fa.optwid /* modify width of fasta optimization window for opt score */
.TH fasta36/ssearch36/[t]fast[x,y]36/lalign36 1 local
.SH NAME
fasta36 \- scan a protein or DNA sequence library for similar
sequences
fastx36 \- compare a DNA sequence to a protein sequence
database, comparing the translated DNA sequence in forward and
reverse frames.
tfastx36 \- compare a protein sequence to a DNA sequence
database, calculating similarities with frameshifts to the forward and
reverse orientations.
fasty36 \- compare a DNA sequence to a protein sequence
database, comparing the translated DNA sequence in forward and reverse
frames.
tfasty36 \- compare a protein sequence to a DNA sequence
database, calculating similarities with frameshifts to the forward and
reverse orientations.
fasts36 \- compare unordered peptides to a protein sequence database
fastm36 \- compare ordered peptides (or short DNA sequences)
to a protein (DNA) sequence database
tfasts36 \- compare unordered peptides to a translated DNA
sequence database
fastf36 \- compare mixed peptides to a protein sequence database
tfastf36 \- compare mixed peptides to a translated DNA
sequence database
ssearch36 \- compare a protein or DNA sequence to a
sequence database using the Smith-Waterman algorithm.
ggsearch36 \- compare a protein or DNA sequence to a
sequence database using a global alignment (Needleman-Wunsch)
glsearch36 \- compare a protein or DNA sequence to a
sequence database with alignments that are global in the query and
local in the database sequence (global-local).
lalign36 \- produce multiple non-overlapping alignments for protein
and DNA sequences using the Huang and Miller sim implementation of the
Waterman-Eggert algorithm.
prss36, prfx36 \- discontinued; all the FASTA programs will estimate
statistical significance using 500 shuffled sequence scores if two
sequences are compared.
.SH DESCRIPTION
Release 3.6 of the FASTA package provides a modular set of sequence
comparison programs that can run on conventional single processor
computers or in parallel on multiprocessor computers. More than a
dozen programs \- fasta36, fastx36/tfastx36, fasty36/tfasty36,
fasts36/tfasts36, fastm36, fastf36/tfastf36, ssearch36, ggsearch36,
and glsearch36 \- are currently available.
All the comparison programs share a set of basic command line options;
additional options are available for individual comparison functions.
Threaded versions of the FASTA programs (built by default under
Unix/Linux/MacOS) run in parallel on modern Linux and Unix multi-core
or multi-processor computers. Accelerated versions of the
Smith-Waterman algorithm are available for processors with Intel SSE2
or PowerPC Altivec vector instructions, which can speed up
Smith-Waterman calculations 10- to 20-fold.
In addition to the serial and threaded versions of the FASTA programs,
MPI parallel versions are available as \fCfasta36_mpi\fP,
\fCssearch36_mpi\fP, \fCfastx36_mpi\fP, etc. The MPI parallel versions
use the same command line options as the serial and threaded versions.
.SH Running the FASTA programs
.LP
By default, the FASTA programs are no longer interactive; they are run
from the command line by specifying the program, query.file, and
library.file. Program options \fImust\fP precede the
query.file and library.file arguments:
.sp
.ti 0.5i
\fCfasta36 -option1 -option2 -option3 query.file library.file > fasta.output\fP
.sp
The "classic" interactive mode, which prompts for a query.file and
library.file, is available with the \fC-I\fP option. Typing a program
name without any arguments (\fCssearch36\fP) provides a short help
message; \fCprogram_name -help\fP provides a complete set of program
options.
.LP
Program options \fIMUST\fP precede the query.file and library.file arguments.
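.LP
The argument order described above can be sketched in Python. This is
only an illustration: \fCbuild_fasta_command\fP is a hypothetical helper,
not part of the FASTA package. It places every option group before
the query.file and library.file arguments.
.nf
```python
def build_fasta_command(program, options, query_file, library_file):
    """Return an argument list with all options ahead of the two file names.

    options is a list of token groups, e.g. [["-q"], ["-E", "0.001"]].
    """
    argv = [program]
    for group in options:
        # every option group must start with a dash-prefixed flag
        if not group or not group[0].startswith("-"):
            raise ValueError("each option group must begin with a -flag")
        argv.extend(group)
    # query and library always come last
    argv.extend([query_file, library_file])
    return argv


cmd = build_fasta_command("fasta36", [["-q"], ["-E", "0.001"]],
                          "query.file", "library.file")
print(" ".join(cmd))
```
.fi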
.SH FASTA program options
.LP
The default scoring matrix and gap penalties used by each of the
programs have been selected for high sensitivity searches with the
various algorithms. The default program behavior can be modified by
providing command line options \fIbefore\fP the query.file and
library.file arguments. Command line options can also be used in
interactive mode.
Command line arguments come in several classes.
(1) Commands that specify the comparison type. FASTA, FASTS, FASTM,
SSEARCH, GGSEARCH, and GLSEARCH can compare either protein or DNA
sequences, and attempt to recognize the comparison type by looking at the
residue composition. \fC-n\fP, \fC-p\fP specify DNA (nucleotide) or
protein comparison, respectively. \fC-U\fP specifies RNA comparison.
(2) Commands that limit the set of sequences compared: \fC-1\fP,
\fC-3\fP, \fC-M\fP.
(3) Commands that modify the scoring parameters: \fC-f gap-open penalty\fP, \fC-g
gap-extend penalty\fP, \fC-j inter-codon frame-shift, within-codon frameshift\fP,
\fC-s scoring-matrix\fP, \fC-r
match/mismatch score\fP, \fC-x X:X score\fP.
(4) Commands that modify the algorithm (mostly FASTA and [T]FASTX/Y):
\fC-c\fP, \fC-w\fP, \fC-y\fP, \fC-o\fP. The \fC-S\fP option can be used to
ignore lower-case (low complexity) residues during the initial score
calculation.
(5) Commands that modify the output: \fC-A\fP, \fC-b number\fP, \fC-C
width\fP, \fC-d number\fP, \fC-L\fP, \fC-m 0-11,B\fP, \fC-w
line-width\fP, \fC-W context-width\fP, \fC-o offset1,offset2\fP.
(6) Commands that affect statistical estimates: \fC-Z\fP, \fC-k\fP.
.SH Option summary:
.TP
\-1
Sort by "init1" score (obsolete)
.TP
\-3
([t]fast[x,y] only) use only forward frame translations
.TP
\-a
Displays the full length (including unaligned regions) of both
sequences with fasta36, ssearch36, glsearch36, and fasts36.
.TP
\-A
(fasta36 only) For DNA:DNA, force Smith-Waterman alignment for
output. Smith-Waterman is the default for FASTA protein alignment and
[t]fast[x,y], but not for DNA comparisons with FASTA. For
protein:protein, use the band-alignment algorithm.
.TP
\-b #
number of best scores/descriptions to show (must be <
expectation cutoff if -E is given). By default, this option is no
longer used; all scores better than the expectation (E()) cutoff are
listed. To guarantee the display of # descriptions/scores, use \fC-b
=#\fP, i.e. \fC-b =100\fP ensures that 100 descriptions/scores will be
displayed. To guarantee at least 1 description, but possibly many
more (limited by \fC-E e_cut\fP), use \fC-b >1\fP.
.TP
\-c "E-opt E-join"
threshold for band optimization (E-opt) and gap joining (E-join) in
FASTA and [T]FASTX/Y. FASTA36 now uses BLAST-like statistical
thresholds for joining and band optimization. The default statistical
thresholds for protein and translated comparisons are E-opt = 0.2 and
E-join = 0.5; for DNA, E-opt = 0.02 and E-join = 0.1. The actual number
of joins and optimizations is reported after the E-join and E-opt
scoring parameters. Statistical thresholds improve search speed 2- to
3-fold and provide much more accurate statistical estimates for matrices
other than BLOSUM50. The "classic" joining/optimization thresholds
that were the default in fasta35 and earlier programs are available
using -c O (upper case O), possibly followed by a value > 1.0 to set
the optcut optimization threshold.
.TP
\-C #
length of name abbreviation in alignments, default = 6. Must be less
than 20.
.TP
\-d #
number of best alignments to show ( must be < expectation (-E) cutoff
and <= the -b description limit).
.TP
\-D
turn on debugging mode. Enables checks on sequence alphabet that
cause problems with tfastx36, tfasty36 (only available after compile
time option). Also preserves temp files with -e expand_script.sh option.
.TP
\-e expand_script.sh
Run a script to expand the set of sequences displayed/aligned based on
the results of the initial search. When the -e expand_script.sh
option is used, expand_script.sh is run after the initial scan and
statistics calculation, but before the "Best scores" are shown. It is
given a single argument: the name of a file that contains the accession
information (the text on the fasta description line between > and the
first space) and the E()-value for each sequence. expand_script.sh then
uses this information to send a library of additional sequences to
stdout. These additional sequences are included in the list of
high-scoring sequences (if their scores are significant) and aligned.
The additional sequences do not change the statistics or database size.
.TP
\-E e_cut e_cut_r
expectation value upper limit for score and alignment display.
Defaults are 10.0 for FASTA36 and SSEARCH36 protein searches, 5.0 for
translated DNA/protein comparisons, and 2.0 for DNA/DNA
searches. FASTA version 36 now reports additional alignments between
the query and the library sequence; the second value (e_cut_r) sets the
threshold for these subsequent alignments. If the second value is not
given, e_cut_r = e_cut/10.0. If it is given and is > 1.0, e_cut_r =
e_cut/value; if it is < 1.0, e_cut_r = value. If e_cut_r < 0, the
additional alignment option is disabled.
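.IP
The rule for the second threshold can be written out as a short Python
sketch. The function name \fCsecondary_e_cutoff\fP is hypothetical, and
the handling of a value of exactly 1.0 is an assumption, since the
description above does not cover that case.
.nf
```python
def secondary_e_cutoff(e_cut, value=None):
    """Return (e_cut_r, enabled) for the additional-alignment threshold."""
    if value is None:
        # no second value given: default to one tenth of e_cut
        e_cut_r = e_cut / 10.0
    elif value < 1.0:
        # values below 1.0 are used directly as the threshold
        e_cut_r = value
    else:
        # values above 1.0 act as a divisor; exactly 1.0 is treated
        # the same way here (assumption, not stated in the text)
        e_cut_r = e_cut / value
    # a negative threshold disables additional alignments
    return e_cut_r, e_cut_r >= 0.0


print(secondary_e_cutoff(10.0))
print(secondary_e_cutoff(10.0, 100.0))
print(secondary_e_cutoff(10.0, 0.001))
print(secondary_e_cutoff(10.0, -1.0))
```
.fi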
.TP
\-f #
penalty for opening a gap.
.TP
\-F #
expectation value lower limit for score and alignment display.
-F 1e-6 prevents library sequences with E()-values lower than 1e-6
from being displayed. This allows the user to focus on more distant
relationships.
.TP
\-g #
penalty for additional residues in a gap
.TP
\-h
Show short help message.
.TP
\-help
Show long help message, with all options.
.TP
\-H
show histogram (with fasta-36.3.4, the histogram is not shown by default).
.TP
\-i
(fasta DNA, [t]fast[x,y]) compare against
only the reverse complement of the library sequence.
.TP
\-I
interactive mode; prompt for query filename, library.
.TP
\-j # #
([t]fast[x,y] only) penalty for a frameshift between two codons,
([t]fasty only) penalty for a frameshift within a codon.
.TP
\-J
(lalign36 only) show identity alignment.
.TP
\-k
specify number of shuffles for statistical parameter estimation (default=500).
.TP
\-l str
specify FASTLIBS file
.TP
\-L
report long sequence description in alignments (up to 200 characters).
.TP
\-m 0,1,2,3,4,5,6,8,9,10,11,B,BB,"F# out.file"
alignment display options. \fC-m 0, 1, 2, 3\fP display different types
of alignments. \fC-m 4\fP provides an alignment "map" on the query.
\fC-m 5\fP combines the alignment map and a \fC-m 0\fP alignment.
\fC-m 6\fP provides an HTML output.
.TP
\fC-m 8\fP seeks to mimic BLAST -m 8 tabular output. Only query and
library sequence names, and identity, mismatch, starts/stops,
E()-values, and bit scores are displayed. \fC-m 8C\fP mimics BLAST
tabular format with comment lines. \fC-m 8\fP formats do not show
alignments.
.TP
\fC-m 9\fP does not change the alignment output, but provides
alignment coordinate and percent identity information with the best
scores report. \fC-m 9c\fP adds encoded alignment information to the
\fC-m 9\fP; \fC-m 9C\fP adds encoded alignment information as a CIGAR
formatted string. To accommodate frameshifts, the CIGAR format has been
supplemented with F (forward) and R (reverse). \fC-m 9i\fP provides
only percent identity and alignment length information with the best
scores. With current versions of the FASTA programs, independent
\fC-m\fP options can be combined; e.g. \fC-m 1 -m 9c -m 6\fP.
.TP
\-m 11 provides \fClav\fP format output from lalign36. It does not
currently affect other alignment algorithms. The \fClav2ps\fP and
\fClav2svg\fP programs can be used to convert \fClav\fP format output
to postscript/SVG alignment "dot-plots".
.TP
\-m B provides \fCBLAST\fP-like alignments. Alignments are labeled as
"Query" and "Sbjct", with coordinates on the same line as the
sequences, and \fCBLAST\fP-like symbols for matches and
mismatches. \fC-m BB\fP extends BLAST similarity to all the output,
providing an output that closely mimics BLAST output.
.TP
\-m "F# out.file"
allows one search to write different alignment formats to different
files. The 'F' indicates separate file output; the '#' is the output
format (1-6,8,9,10,11,B,BB). Multiple compatible formats can be
combined, separated by commas.
.TP
\-M #-#
molecular weight (residue) cutoffs. -M "101-200" examines only
library sequences that are 101-200 residues long.
.TP
\-n
force query to nucleotide sequence
.TP
\-N #
break long library sequences into blocks of # residues. Useful for
bacterial genomes, which have only one sequence entry. -N 2000 works
well for bacterial genomes. (This option was required when
FASTA only provided one alignment between the query and library
sequence. It is not as useful, now that multiple alignments are
available.)
.TP
\-o "#,#"
offsets query, library sequence for numbering alignments
.TP
\-O file
send output to file.
.TP
\-p
force query to protein alphabet.
.TP
\-P pssm_file
(ssearch36, ggsearch36, glsearch36 only). Provide blastpgp checkpoint
file as the PSSM for searching. Two PSSM file formats are available,
which must be provided with the filename. 'pssm_file 0' uses a binary
format that is machine specific; 'pssm_file 1' uses the "blastpgp -u 1
-C pssm_file" ASN.1 binary format (preferred).
.TP
\-q/-Q
quiet option; do not prompt for input (on by default)
.TP
\-r "+n/-m"
(DNA only) values for match/mismatch for DNA comparisons. \fC+n\fP is
used for the maximum positive value and \fC-m\fP is used for the
maximum negative value. Values between max and min are rescaled, but
residue pairs having the value -1 continue to be -1.
.TP
\-R file
save all scores to statistics file (previously -r file)
.TP
\-s name
specify substitution matrix. BLOSUM50 is used by default; PAM120,
PAM250, and BLOSUM62 can be specified by setting -s P120, P250, or
BL62. Additional scoring matrices include: BLOSUM80 (BL80), and
MDM10, MDM20, MDM40 (Jones, Taylor, and Thornton, 1992 CABIOS
8:275-282; specified as -s MD10, -s MD20, -s MD40), OPTIMA5 (-s OPT5,
Kann and Goldstein, (2002) Proteins 48:367-376), and VTML160 (-s
VT160, Mueller and Vingron (2002) J. Comp. Biol. 19:8-13). Each
scoring matrix has associated default gap penalties. The BLOSUM62
scoring matrix and -11/-1 gap penalties can be specified with -s BP62.
.IP
Alternatively, a BLASTP format scoring matrix file can be specified,
e.g. -s matrix.filename. DNA scoring matrices can also be specified
with the "-r" option.
.IP
With fasta36.3, variable scoring matrices can
be specified by preceding the scoring matrix abbreviation with '?',
e.g. -s '?BP62'. Variable scoring matrices allow the FASTA programs to
choose an alternative scoring matrix with higher information content
(bit score/position) when short queries are used. For example, a 90
nucleotide FASTX query can produce only a 30 amino-acid alignment, so
a scoring matrix with 1.33 bits/position is required to produce a 40
bit score. The FASTA programs include BLOSUM50 (0.49 bits/pos) and
BLOSUM62 (0.58 bits/pos) but can range to MD10 (3.44
bits/position). The variable scoring matrix option searches down the
list of scoring matrices to find one with information content high
enough to produce a 40 bit alignment score.
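.IP
The arithmetic behind the variable scoring matrix choice can be
sketched in Python. The function names are hypothetical, the matrix
list holds only the three bits/position values quoted above, and the
search order is a simplification of whatever list the programs
actually use.
.nf
```python
# Bits/position values quoted in the text; the real list is longer.
MATRICES = [("BLOSUM50", 0.49), ("BLOSUM62", 0.58), ("MD10", 3.44)]


def required_bits_per_position(query_len, translated=False, target_bits=40.0):
    """Information content needed for an alignment to reach target_bits."""
    # a translated DNA query yields one aligned residue per three nucleotides
    aligned_len = query_len // 3 if translated else query_len
    return target_bits / aligned_len


def choose_matrix(query_len, translated=False, matrices=MATRICES):
    """Return the first matrix whose bits/position meets the requirement."""
    need = required_bits_per_position(query_len, translated)
    for name, bits in matrices:
        if bits >= need:
            return name
    return matrices[-1][0]  # fall back to the most informative matrix


print(round(required_bits_per_position(90, translated=True), 2))
print(choose_matrix(90, translated=True))
```
.fi
For the 90 nucleotide FASTX example, the sketch gives 1.33
bits/position and, from this three-entry list, selects MD10.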
.TP
\-S
treat lower case letters in the query or database as low complexity
regions that are equivalent to 'X' during the initial database scan,
but are treated as normal residues for the final alignment display.
Statistical estimates are based on the 'X'ed out sequence used during
the initial search. Protein databases (and query sequences) can be
generated in the appropriate format using John Wootton's "pseg"
program, available from ftp://ftp.ncbi.nih.gov/pub/seg/pseg. Once you
have compiled the "pseg" program, use the command:
.IP
\fCpseg database.fasta -z 1 -q > database.lc_seg\fP
.TP
\-t #
Translation table - [t]fastx36 and [t]fasty36 support the BLAST
translation tables. See
\fChttp://www.ncbi.nih.gov/htbin-post/Taxonomy/wprintgc?mode=c/\fP.
.TP
\-T #
(threaded, parallel only) number of threads or workers to use (on
Linux/MacOS/Unix, the default is to use as many processors as are
available; on Windows systems, 2 processors are used).
.TP
\-U
Do RNA sequence comparisons: treat 'T' as 'U', allow G:U base pairs (by
scoring "G-A" and "T-C" as score(G:G)-3). Search only one strand.
.TP
\-V "?$%*"
Allow special annotation characters in query sequence. These characters
will be displayed in the alignments on the coordinate number line.
.TP
\-w #
line width for similarity score and sequence alignment output.
.TP
\-W #
context length (default is 1/2 of line width -w) for alignment programs,
like fasta and ssearch, that provide additional sequence context.
.TP
\-X
extended, less frequently used options, including
\fC-XB\fP, \fC-XM4G\fP, \fC-Xo\fP, \fC-Xx\fP, and \fC-Xy\fP; see
\fBfasta_guide.pdf\fP.
.TP
\-z 1, 2, 3, 4, 5, 6
Specify the statistical calculation. Default is -z 1 for local
similarity searches, which uses regression against the length of the
library sequence. -z -1 disables statistics. -z 0 estimates
significance without normalizing for sequence length. -z 2 provides
maximum likelihood estimates for lambda and K, censoring the 250
lowest and 250 highest scores. -z 3 uses Altschul and Gish's
statistical estimates for specific protein BLOSUM scoring matrices and
gap penalties. -z 4,5: an alternate regression method. \-z 6 uses a
composition based maximum likelihood estimate based on the method of
Mott (1992) Bull. Math. Biol. 54:59-75.
.TP
\-z 11,12,14,15,16
compute the regression against scores of randomly
shuffled copies of the library sequences. Twice as many comparisons
are performed, but accurate estimates can be generated from databases
of related sequences. -z 11 uses the -z 1 regression strategy, etc.
.TP
\-z 21, 22, 24, 25, 26
compute two E()-values. The standard (library-based) E()-value is
calculated in the standard way (-z 1, 2, etc), but a second E2()
value is calculated by shuffling the high-scoring sequences (those
with E()-values less than the threshold). For "average" composition
proteins, these two estimates will be similar (though the
best-shuffle estimates are always more conservative). For biased
composition proteins, the two estimates may differ by 100-fold or
more. A second -z option, e.g. -z "21 2", specifies the estimation
method for the best-shuffle E2()-values. Best-shuffle E2()-values
approximate the estimates given by PRSS (or in a pairwise SSEARCH).
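.IP
The numbering of the -z values follows a simple pattern: the tens
digit selects the shuffling strategy and the units digit selects the
base estimation method. A minimal Python sketch of that reading
follows. The function name \fCdecode_z\fP and the mode labels are
hypothetical, and the sketch does not check that a given combination
is actually supported.
.nf
```python
SHUFFLE_MODES = {
    0: "library scores only",
    1: "shuffled library sequences",
    2: "library scores plus best-shuffle E2()",
}


def decode_z(z):
    """Split a -z value into (shuffle mode, base method number)."""
    if z < 1:
        # -z 0 and -z -1 are described without shuffled variants
        return SHUFFLE_MODES[0], z
    # tens digit: shuffle strategy; units digit: base method
    return SHUFFLE_MODES[z // 10], z % 10


for z in (1, 11, 26):
    print(z, decode_z(z))
```
.fi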
.TP
\-Z db_size
Set the apparent database size used for expectation value calculations
(used for protein/protein FASTA and SSEARCH, and for [T]FASTX/Y).
.SH Reading sequences from STDIN
.LP
The FASTA programs can accept a query sequence from
the unix "stdin" data stream. This makes it much easier to use
fasta36 and its relatives as part of a WWW page. To indicate that
stdin is to be used, use "@" as the query sequence file name. "@" can
also be used to specify a subset of the query sequence to be used,
e.g:
.sp
.ti 0.5i
cat query.aa | fasta36 @:50-150 s
.sp
would search the 's' database with residues 50-150 of query.aa. FASTA
cannot automatically detect the sequence type (protein vs DNA) when
"stdin" is used and assumes protein comparisons by default; the '-n'
option is required for DNA queries read from stdin.
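.LP
The "@" query syntax can be illustrated with a small Python sketch that
splits a query specification into its parts. \fCparse_query_spec\fP is
a hypothetical helper for illustration only; it does not reproduce the
parsing code in the FASTA programs.
.nf
```python
def parse_query_spec(spec):
    """Return (use_stdin, start, stop); start/stop are None without a range."""
    name, sep, rng = spec.partition(":")
    use_stdin = (name == "@")
    if not sep:
        # no ":start-stop" suffix, so the whole sequence is used
        return use_stdin, None, None
    start, _, stop = rng.partition("-")
    return use_stdin, int(start), int(stop)


print(parse_query_spec("@:50-150"))
print(parse_query_spec("@"))
```
.fi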
.SH Environment variables:
.TP
FASTLIBS
location of library choice file (-l FASTLIBS)
.TP
SRCH_URL1, SRCH_URL2
format strings used to define options to re-search the
database.
.TP
REF_URL
the format string used to define the option to lookup the library
sequence in entrez, or some other database.
.SH AUTHOR
Bill Pearson
.br
wrp@virginia.EDU
Version: $ Id: $
Revision: $Revision: 210 $
Overall structure of the fasta3 program. (Some functions
are different for translated comparisons FASTX, FASTY, TFASTX, TFASTY.)
main() { /* complib.c structure */
/* get command line arguments, set up initial parameter values */
initenv (argc, argv, &m_msg, &pst,&aa0[0],outtty);
/* allocate space for sequence arrays */
/* get the query file name if not on command line */
/* get query */
m_msg.n0 = getseq (m_msg.tname,aa0[0], MAXTOT, m_msg.libstr,&pst.dnaseq,
&m_msg.sq0off);
/* reset some parameters if DNA */
resetp (aa0[0], m_msg.n0, &m_msg, &pst);
/* get a library name if not on command line */
libchoice(m_msg.lname,sizeof(m_msg.lname),&m_msg);
/* use library name to build list of library files */
libselect(m_msg.lname, &m_msg);
/* get additional options (ktup, prss-window) if not specified */
query_parm (&m_msg, &pst);
/* do final parameter initializations */
last_init(&m_msg, &pst);
/* set up structures for saved scores[20000], statistics[50000] */
nbest = 0;
/* initialize the comparison function */
init_work (aa0[0], m_msg.n0, &pst, &f_str[0]);
/* open the library */
for (iln = 0; iln < m_msg.nln; iln++) {
if (openlib(m_msg.lbnames[iln],m_msg)!=1) {continue;}
}
/* get the library sequence and do the comparison */
while ((n1=GETLIB(aa1ptr,maxt,libstr,&lmark,&lcont))>0) {
do_work (aa0[itt], m_msg.n0, aa1, n1, itt, &pst, f_str[itt], &rst);
/* save the scores */
/* save the scores for statistics */
}
/* all done with all libraries */
process_hist(stats,nstats,pst);
/* sort the scores by z-value */
sortbestz (bptr, nbest);
/* sort the scores by E-value */
sortbeste (bptr, nbest);
/* print the histogram */
prhist (stdout,m_msg,pst,gstring2);
/* show the high scoring sequences */
showbest (stdout, aa0, aa1, maxn, bptr, nbest, qlib, &m_msg, pst,
f_str, gstring2);
/* show the high-scoring alignments */
showalign(outfd, aa0, aa1, maxn, bptr, nbest, qlib, m_msg, pst,
f_str, gstring2);
/* thats all folks !!! */
}
================
complib.c /* version set as mp_verstr */
main()
printsum() /* prints summary of run (residues, entries, time) */
void fsigint() /* sets up interrupt handler for HUP not used */
================
compacc.c
void selectbest() /* select best 15000/20000 based on raw score */
void selectbestz() /* select best 15000/20000 based on z-score */
void sortbest() /* sort based on raw score */
void sortbestz() /* sort based on z-score */
void sortbeste() /* sort based on E() score - different from z-score for DNA */
prhist() /* print histogram */
shuffle() /* shuffle sequence (prss) */
wshuffle() /* window shuffle */
================
showbest.c
void showbest() /* present list of high scoring sequences */
================
showalign.c
void showalign() /* show list of high-scoring alignments */
void do_show() /* show an individual alignment */
void initseq() /* setup seqc0/seqc1 which contain alignment characters */
void freeseq() /* free them up */
================
htime.c
time_t s_time() /* get the time in usecs */
void ptime() /* print elapsed time */
================
apam.c
initpam () /* read in PAM matrix or change default array */
void mk_n_pam() /* make DNA pam from +5/-3 values */
================
doinit.c
void initenv() /* read environment variables, general options */
================
initfa.c /* version set as "verstr" */
alloc_pam() /* allocate 2D pam array */
initpam2() /* fill it up from 1D pam triangle */
f_initenv() /* function-specific environment variables */
f_getopt() /* function-specific options */
f_getarg() /* function specific argument - ktup */
resetp() /* reset scoring matrix, optional parameters for DNA-DNA */
reseta() /* reset scoring matrix, optional parameters for prot-DNA */
query_parm() /* ask for additional program arguments (ktup) */
last_init() /* last chance to set up parameters based on query,lib,parms */
f_initpam() /* not used - could set parameters from pam matrix */
================
scaleswn.c
process_hist() /* do statistics calculations */
proc_hist_r() /* regression fit z=1, also used by z=5 */
float find_z() /* gives z-score for score, length, mu, rho, var */
float find_zr() /* gives z-score for score, length, mu, rho, var */
fit_llen() /* first estimate of mu, rho, var */
fit_llens() /* second estimate of mu, rho, var, mu2, rho2 */
proc_hist_r2() /* regression_i fit z=4 */
float find_zr2() /* gives z-score for score, length, mu, rho, mu2, rho2 */
fit_llen2() /* iterative estimate of mu, rho, var */
proc_hist_ln() /* ln()-scaled z=2 */ /* no longer used */
float find_zl() /* gives z-score from ln()-scaled scores */
proc_hist_ml() /* estimate lambda, K using Maximum Likelihood */
float find_ze() /* z-score from lambda, K */
proc_hist_n() /* no length-scaling z=0 */
float find_zn() /* gives z-score from mu, var (no scaling) */
proc_hist_a() /* Altschul-Gish params z= 3 */
ag_parm() /* match pst.pamfile name, look_p() */
look_p() /* lookup Lambda, K, H given param struct */
float find_za()
eq_s() /* returns (double)score (available for length correction) */
ln_s() /* returns (double)score * ln(200)/ln(length) */
proc_hist_r() /* regression fit z=1, also used by z=5 */
alloc_hist() /* set up arrays for score vs length */
free_hist() /* free them */
inithist() /* calls alloc_hist(), sets some other globals */
addhist() /* update score vs length hist */
inithistz() /* initialize displayed (z-score) histogram hist[]*/
addhistz() /* add to hist[], increment num_db_entries */
addhistzp() /* add to hist[], don't change num_db_entries */
prune_hist() /* remove scores from score vs length */
update_db_size() /* num_db_entries = nlib - ntrimmed */
set_db_size() /* -Z db_size; set nlib */
double z_to_E() /* z-value to E() (extreme value distribution) */
double zs_to_E() /* z-score (mu=50, sigma=10) to E() */
double zs_to_bit() /* z-score to BLAST2 bit score */
float E_to_zs() /* E() to z-score */
double zs_to_Ec() /* z-score to num_db_entries*(1 - P(zs)) */
summ_stats() /* put stat summary in string */
vsort() /* not used, does shell sort */
calc_ks() /* does Kolmogorov-Smirnov calculation for histogram */
================
dropnfa.c /* contains worker comparison functions */
init_work() /* set up struct f_struct fstr - hash query */
get_param() /* actually prints parameters to string */
close_work() /* clean up fstr */
do_work() /* do a comparison */
do_fasta() /* use the fasta() function */
savemax() /* save the best region during scan */
spam() /* rescan the best regions */
sconn() /* try to connect the best regions for initn */
kssort() /* sort by score */
kpsort() /* sort by left end pos */
shscore() /* best self-score */
dmatch() /* do band alignment for opt score */
FLOCAL_ALIGN() /* fast band score-only */
  do_opt() /* do an "optimized" comparison */
do_walign() /* put an alignment into res[] for calcons() */
sw_walign() /* SW alignment driver - find boundaries */
ALIGN() /* actual alignment driver */
nw_align() /* recursive global alignment */
CHECK_SCORE() /* double check */
DISPLAY() /* Miller's display routine */
bd_walign() /* band alignment driver for DNA */
LOCAL_ALIGN() /* find boundaries in band */
B_ALIGN() /* produce band alignment */
bg_align() /* recursively produce band alignment */
BCHECK_SCORE() /* double check */
calcons() /* calculate ascii alignment seqc0,seqc1 from res[]*/
calc_id() /* calculate % identity with no alignment */
================
nxgetaa.c
getseq() /* get a query (prot or DNA) */
getntseq() /* get a nt query (for fastx, fasty) */
gettitle() /* get a description */
int openlib() /* open a library */
closelib() /* close it */
GETLIB() /* get a fasta-format next library entry */
RANLIB() /* jump back in, get description, position for getlib() */
lgetlib() /* get a Genbank flat-file format next library entry */
lranlib() /* jump back in, get description, position for lgetlib() */
pgetlib() /* get CODATA format next library entry */
pranlib() /* jump back in, get description, position for pgetlib() */
egetlib() /* get EMBL format next library entry */
eranlib() /* jump back in, get description, position for egetlib() */
igetlib() /* get Intelligenetics format next library entry */
iranlib() /* jump back in, get description, position for igetlib() */
vgetlib() /* get PIR/VMS/GCG format next library entry */
vranlib() /* jump back in, get description, position for vgetlib() */
gcg_getlib() /* get GCG binary format next library entry */
gcg_ranlib() /* jump back in, get description, position for gcg_getlib() */
int scanseq() /* find %ACGT */
revcomp() /* do reverse complement */
sf_sort() /* sort superfamily numbers */
================
c_dispn.c
discons() /* display alignment from seqc0, seqc1 */
disgraph() /* display graphical representation, -m 4,5 */
aancpy() /* copy a binary sequence to ascii */
r_memcpy()
l_memcpy()
iidex() /* lookup ascii-encoding of residue */
cal_coord() /* calculate coordinates of alignment ends */
================
ncbl_lib.c
ncbl_openlib()
ncbl_closelib()
ncbl_getliba()
ncbl_getlibn()
ncbl_ranlib()
src_ulong_read()
src_long_read()
src_char_read()
src_fstr_read()
newname()
================
lib_sel.c
getlnames()
libchoice()
libselect()
addfile()
ulindex()
================
nrand48.c
irand(time) /* initialize random number generator */
nrand(n) /* get a number 0 - n */
================
url_subs.c
void do_url1() /* setup search links */
@article( WRP881,
author = {W. R. Pearson
and D. J. Lipman},
title = {Improved tools for biological sequence comparison},
year = 1988,
journal = {Proc. Natl. Acad. Sci. USA},
volume = 85,
pages = {2444-2448},
annote = 88190088 )
@incollection( day787,
author = {M. Dayhoff
and R. M. Schwartz
and B. C. Orcutt},
title = {A model of evolutionary change in proteins},
year = 1978,
volume = {5, supplement 3},
booktitle = {Atlas of Protein Sequence and Structure},
editor = {M. Dayhoff},
publisher = {National Biomedical Research Foundation},
pages = {345-352},
address = {Silver Spring, MD} )
@article( WRP960,
author = {W. R. Pearson},
title = {Effective protein sequence comparison},
year = 1996,
journal = {Methods Enzymol.},
volume = 266,
pages = {227-258},
annote = 97422296 )
@article( wrp971,
author = {Z. Zhang
and W. R. Pearson
and W. Miller},
title = {Aligning a {DNA} sequence with a protein sequence},
year = 1997,
journal = {J. Comput. Biol.},
volume = 4,
pages = {339-349},
annote = 97422296 )
@article( wrp973,
author = {W. R. Pearson
and T. C. Wood
and Z. Zhang
and W. Miller},
title = {Comparison of {DNA} sequences with protein sequences},
year = 1997,
journal = {Genomics},
volume = 46,
pages = {24-36},
annote = 98066759 )
@article( wrp951,
author = {W. R. Pearson},
title = {
Comparison of methods for searching protein sequence databases},
year = 1995,
journal = {Prot. Sci.},
volume = 4,
pages = {1145-1160},
annote = 97422296 )
@article( wrp981,
author = {W. R. Pearson},
title = {
Empirical statistical estimates for sequence similarity searches},
year = 1998,
journal = {J. Mol. Biol.},
volume = 276,
pages = {71-84},
annote = 98179551 )
@article( tay925,
author = {D. T. Jones
and W. R. Taylor
and J. M. Thornton},
title = {
The rapid generation of mutation data matrices from protein sequences},
year = 1992,
journal = {Comp. Appl. Biosci.},
volume = 8,
pages = {275-282} )
@article( woo935,
author = {J. C. Wootton
and S. Federhen},
title = {
Statistics of local complexity in amino acid sequences and sequence databases},
year = 1993,
journal = {Comput. Chem.},
volume = 17,
pages = {149-163} )
@article( alt960,
author = {S. F. Altschul
and W. Gish},
title = {Local alignment statistics},
year = 1996,
journal = {Methods Enzymol.},
volume = 266,
pages = {460-480} )
@article( alt915,
author = {S. F. Altschul},
title = {
Amino acid substitution matrices from an information theoretic
perspective},
year = 1991,
journal = {J. Mol. Biol.},
volume = 219,
pages = {555-565} )
@article( WAT815,
author = {T. F. Smith
and M. S. Waterman},
title = {Identification of common molecular subsequences},
year = 1981,
journal = {J. Mol. Biol.},
volume = 147,
pages = {195-197},
annote = 81267385 )
@article( wrp021,
author = {A. J. Mackey
and T. A. J. Haystead
and W. R. Pearson},
title = {
Getting more from less: algorithms for rapid protein identification
with multiple short peptide sequences},
year = 2002,
journal = {Mol. Cell. Proteomics},
volume = 1,
pages = {139-147} )
@article( farrar2007,
author = {M. Farrar},
title = {
Striped {S}mith-{W}aterman speeds database searches six times over
other {SIMD} implementations},
year = 2007,
journal = {Bioinformatics},
volume = 23,
pages = {156-161},
annote = 17110365 )
@article{kan023,
author = {Maricel G Kann and Richard A Goldstein},
journal = {Proteins},
title = {Performance evaluation of a new algorithm for the detection of remote homologs with sequence comparison},
pages = {367--76},
volume = {48},
year = {2002},
month = {Aug},
pmid = {12112703}
}
@article{Muller2002,
author = {Tobias Muller and Rainer Spang and Martin Vingron},
journal = {Mol Biol Evol},
title = {Estimating amino acid substitution models: a comparison of Dayhoff's estimator, the resolvent approach and a maximum likelihood method},
pages = {8--13},
volume = {19},
year = {2002},
date-added = {2011-03-14 22:15:08 -0400},
date-modified = {2011-03-14 22:15:08 -0400},
pmid = {11752185},
URL = {http://mbe.oxfordjournals.org/content/19/1/8.long}
}
@article( hen929,
author = {S. Henikoff
and J. G. Henikoff},
title = {Amino acid substitution matrices from protein blocks},
year = 1992,
journal = {Proc. Natl. Acad. Sci. USA},
volume = 89,
pages = {10915-10919} )
@article( WAT875,
author = {M. S. Waterman
and M. Eggert},
title = {
A new algorithm for best subsequence alignments with application to
t{RNA}-r{RNA} comparisons},
year = 1987,
journal = {J. Mol. Biol.},
volume = 197,
pages = {723-728} )
@article( mil908,
author = {X. Huang
and R. C. Hardison
and W. Miller},
title = {A space-efficient algorithm for local similarities},
year = 1990,
journal = {Comp. Appl. Biosci.},
volume = 6,
pages = {373-381} )
@article( uniprot11,
author = {UniProt Consortium},
title = {
Ongoing and future developments at the Universal Protein Resource.},
year = 2011,
journal = {Nucleic Acids Res},
volume = 39,
pages = {D214-D219},
annote = 21051339 )
@article( wrp022,
author = {J. T. Reese
and W. R. Pearson},
title = {
Empirical determination of effective gap penalties for sequence
comparison},
year = 2002,
journal = {Bioinformatics},
volume = 18,
pages = {1500-1507},
annote = 22310732 )
@article( rog003,
author = {T. Rognes
and E. Seeberg},
title = {
Six-fold speed-up of {S}mith-{W}aterman sequence database searches using
parallel processing on common microprocessors},
year = 2000,
journal = {Bioinformatics},
volume = 16,
pages = {699-706},
annote = 20551510 )
@article( mot921,
author = {R. Mott},
title = {
Maximum-likelihood estimation of the statistical distribution of
{S}mith-{W}aterman local sequence similarity scores},
year = 1992,
journal = {Bull. Math. Biol.},
volume = 54,
pages = {59-75} )
@article( woz974,
author = {A. Wozniak},
title = {
Using video-oriented instructions to speed up sequence comparison},
year = 1997,
journal = {Comput Appl Biosci},
volume = 13,
pages = {145-150},
annote = 97292450 )
@article{wrp136,
Author = {L. J. Mills and W. R. Pearson},
Journal = {Bioinformatics},
Pages = {3007-3013},
Title = {Adjusting scoring matrices to correct overextended alignments.},
Volume = 29,
Year = 2013}
@article( wrp103,
author = {M. W. Gonzalez
and W. R. Pearson},
title = {
Homologous over-extension: a challenge for iterative similarity searches},
year = 2010,
journal = {Nuc. Acids Res.},
volume = 38,
pages = {2177-2189},
pmcid = {PMC2853128} )
@article{wrp171,
author = {Pearson, W. R. and Li, W. and Lopez, R.},
title = {{Query-seeded iterative sequence similarity searching improves selectivity 5-20-fold.}},
journal = {Nucleic Acids Res},
year = {2017},
volume = {45},
number = {7},
pages = {e46--e46},
month = apr
}
\begin{footnotesize}
\begin{quote}
\begin{verbatim}
# ../bin/ssearch36 -q -w 80 ../seq/mgstm1.aa a
SSEARCH performs a Smith-Waterman search
version 36.3.6 June, 2013(preload9)
Please cite:
T. F. Smith and M. S. Waterman, (1981) J. Mol. Biol. 147:195-197;
W.R. Pearson (1991) Genomics 11:635-650
Query: ../seq/mgstm1.aa
1>>>mGSTM1 mouse glutathione transferase M1 - 218 aa
Library: PIR1 Annotated (rel. 66)
5121825 residues in 13143 sequences
Statistics: Expectation_n fit: rho(ln(x))= 7.4729+/-0.000484; mu= 2.0282+/- 0.027
mean_var=56.9651+/-10.957, 0's: 9 Z-trim(119.4): 17 B-trim: 67 in 1/62
Lambda= 0.169930
statistics sampled from 13135 (13143) to 13135 sequences
Algorithm: Smith-Waterman (SSE2, Michael Farrar 2006) (7.2 Nov 2010)
Parameters: BL50 matrix (15:-5), open/ext: -10/-2
Scan time: 3.820
The best scores are: s-w bits E(13143)
sp|P08010|GSTM2_RAT Glutathione S-transferase Mu 2; GST 4-4; GT ( 218) 1248 312.0 7.7e-86
sp|P04906|GSTP1_RAT Glutathione S-transferase P; Chain 7; GST - ( 210) 344 90.4 3.8e-19
sp|P00502|GSTA1_RAT Glutathione S-transferase alpha-1; GST 1-1 ( 222) 237 64.1 3.2e-11
sp|P14942|GSTA4_RAT Glutathione S-transferase alpha-4; GST 8-8 ( 222) 179 49.9 6.1e-07
sp|P12653|GSTF1_MAIZE Glutathione S-transferase 1; GST class-pi ( 214) 120 35.4 0.013
sp|P04907|GSTF3_MAIZE Glutathione S-transferase 3; GST class-pi ( 222) 115 34.2 0.032
sp|P20432|GSTT1_DROME Glutathione S-transferase 1-1; DDT-dehydr ( 209) 100 30.5 0.38
sp|P11277|SPTB1_HUMAN Spectrin beta chain, erythrocytic; Beta- (2137) 108 31.6 1.9
... (alignments deleted) ...
>>sp|P14942|GSTA4_RAT Glutathione S-transferase alpha-4; GST 8-8; (222 aa)
s-w opt: 179 Z-score: 231.0 bits: 49.9 E(13143): 6.1e-07
Smith-Waterman score: 179; 25.6% identity (54.5% similar) in 211 aa overlap (5-207:7-207)
10 20 30 40 50 60 70
mGSTM MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKF-KLG-LDFPNLPYL-IDGSHKITQSNA
: :.. :: . :: :: . ..: .: ... ::. : : : : ..: . ::: .::. :
sp|P14 MEVKPKLYYFQGRGRMESIRWLLATAGVEFEE---------EFLETREQYEKLQKDGCLLFGQVPLVEIDG-MLLTQTRA
10 20 30 40 50 60 70
80 90 100 110 120 130 140 150
mGSTM ILRYLARKHHLDGETEEERIRADIVENQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEF--LGK---RPWFAG
:: ::: :..: :. .::.: :. . ..: :..: .. ::.. : . : . . : . : . ...:
sp|P14 ILSYLAAKYNLYGKDLKERVRIDMYADGTQDLMMMIIGAPFKAPQEKEESLALAVKRAKNRYFPVFEKILKDHGEAFLVG
80 90 100 110 120 130 140 150
160 170 180 190 200 210
mGSTM DKVTYVDFLAYDILDQYRMFEPKCLDAFPNLRDFLARFEGLKKISAYMKSSRYIATPIFSKMAHWSNK
......:. . . . . :. :: :. : .:. .. :. ... . :
sp|P14 NQLSWADIQLLEAILMVEEVSAPVLSDFPLLQAFKTRISNIPTIKKFLQPGSQRKPPPDGHYVDVVRTVLKF
160 170 180 190 200 210 220
... (alignments deleted) ...
218 residues in 1 query sequences
5121825 residues in 13143 library sequences
Tcomplib [36.3.6 May, 2013(preload9)] (4 proc in memory [0G])
start: Thu Jun 6 11:23:28 2013 done: Thu Jun 6 11:23:30 2013
Total Scan time: 3.820 Total Display time: 0.130
Function used was SSEARCH [36.3.6 May, 2013(preload9)]
\end{verbatim}
\end{quote}
\end{footnotesize}
\vspace{-4.0ex}
\begin{footnotesize}
\begin{verbatim}
>>GST26_SCHMA Glutathione S-transferase class-mu (218 aa)
initn: 422 init1: 359 opt: 407 Z-score: 836.8 bits: 162.0 E(437847): 3.7e-39
Smith-Waterman score: 451; 42.4% identity (73.4% similar) in 203 aa overlap (6-208:6-203)
10 20 30 40 50 60 70 80
mGSTM1 MPMILGYWNVRGLTHPIRMLLEYTDSSYDEKRYTMGDAPDFDRSQWLNEKFKLGLDFPNLPYLIDGSHKITQSNAILRYL
:::.:.::..: :.:::. ...:.:. : : ..: : :.::::::.:::::: :::. :.::: ::.::.
GST26_ MAPKFGYWKVKGLVQPTRlllehleetyeeRAY---DRNEIDA--WSNDKFKLGLEFPNLPYYIDGDFKLTQSMAIIRYI
10 20 30 40 50 60 70
90 100 110 120 130 140 150 160
mGSTM1 ARKHHLDGETEEERIRADIVENQVMDTRMQLIMLCYNPDFEKQKPEFLKTIPEKMKLYSEFLGKRPWFAGDKVTYVDFLA
: ::.. : .:: . ...:. :.: :: .. ..:: ..: : .::. .: ..:.... :... .. :. ::. ::.
GST26_ ADKHNMLGACPKERAEISMLEGAVLDIRMGVLRIAYNKEYETLKVDFLNKLPGRLKMFEDRLSNKTYLNGNCVTHPDFML
80 90 100 110 120 130 140 150
170 180 190 200 210
mGSTM1 YDILDQYRMFEPKCLDAFPNLRDFLARFEGLKKISAYMKSSRYIATPIFSKMAHWSNK
:: :: .. .::. ::.: .: .: : .:. :..::::: :.
GST26_ YDALDVVLYMDSQCLNEFPKLVSFKKCIEDLPQIKNYLNSSRYIKWPLQGWDATFGGGDTPPK
160 170 180 190 200 210
\end{verbatim}
\end{footnotesize}