NHGRI-Blastall-0.66/004075000135110000144000000000000742110615000151155ustar00jpearsonusers00003710000001NHGRI-Blastall-0.66/README010064400135110000144000000071120741534407200160120ustar00jpearsonusers00003710000001Send bug reports to webblaster@nhgri.nih.gov See Changes file for revision history. --------------------------------------------------------------------------- Warning: The parsing methods have been tested against GenBank datasets. If you are using private databases that have non-standard deflines, you will want to adjust the defline id regex. See the DB_ID_REGEX section of the pod documentation. There is also a testing script to help develop a regular expression that fits your databases. Look in the scripts subdirectory of the distribution. --------------------------------------------------------------------------- If you have NCBI's BLAST2 or WU-BLAST installed locally and your environment is already set up, you can use Perl's object-oriented capabilities to run your BLASTs. Also, if you have a blastcl3 binary from the toolkit (or binaries from our FTP site), you can run BLAST over the network. There are also methods to blast single sequences against each other using the bl2seq binaries (also in the toolkit and binaries). You can blast one sequence against a library of sequences using the blast_one_to_many method. You can format databases with the formatdb method. You can also have NHGRI::Blastall read existing BLAST reports. If you have a database of repetitive DNA or other DNA you would like to mask out, you can use the mask method to mask the data against these databases. You can then use either the filter or result methods to parse the report and access the various elements of the data. If you need help installing NCBI's local blast check...
http://genome.nhgri.nih.gov/blastall/blast_install/ --------------------------------------------------------------------------- IT IS HIGHLY RECOMMENDED THAT YOU READ THROUGH THE POD DOCUMENTATION embedded in the module to learn how to use NHGRI::Blastall. It won't take you that long. After Installation... perldoc NHGRI::Blastall the on-line manual page is available at http://genome.nhgri.nih.gov/blastall/Blastall.html --------------------------------------------------------------------------- If you have problems, questions, comments send to webblaster@nhgri.nih.gov --------------------------------------------------------------------------- PUBLIC DOMAIN NOTICE This software/database is ``United States Government Work'' under the terms of the United States Copyright Act. It was written as part of the authors' official duties for the United States Government and thus cannot be copyrighted. This software/database is freely available to the public for use without a copyright notice. Restrictions cannot be placed on its present or future use. Although all reasonable efforts have been taken to ensure the accuracy and reliability of the software and data, the National Human Genome Research Institute (NHGRI) and the U.S. Government does not and cannot warrant the performance or results that may be obtained by using this software or data. NHGRI and the U.S. Government disclaims all warranties as to performance, merchantability or fitness for any particular purpose. In any work or product derived from this material, proper attribution of the authors as the source of the software or data should be made, using http://genome.nhgri.nih.gov/webblast as the citation. --------------------------------------------------------------------------- Special thanks to Peter Chines for reporting a bunch of bugs and suggesting some excellent features. Thanks to David Lapointe from UMass Medical School for his suggestion to use blastcl3 for network blasting. 
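---------------------------------------------------------------------------
The warning above about non-standard deflines can be tried out without running BLAST at all. The following is only a minimal sketch, not code from the module: extract_id is a hypothetical helper, and anchoring the regex at the start of the defline is an assumption about how an id would be picked out. The two regexes and the sample deflines are the ones that appear in scripts/id_regex_test.pl and t/blast.report.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper (not part of NHGRI::Blastall): return the leading part
# of a defline matched by $regex, or undef when the regex does not match.
# Anchoring at the start of the defline is an assumption of this sketch.
sub extract_id {
    my ($defline, $regex) = @_;
    return $defline =~ /^($regex)/ ? $1 : undef;
}

# Regexes quoted in scripts/id_regex_test.pl
my $genbank_regex = '[^\|]+(?:\|[^\|,\s]*){1,10}';
my $generic_regex = '[^ ]+';

# One GenBank-style defline and one WashU-style defline
my @deflines = (
    'gb|AI657850.1|AI657850 fc23g10.y1 Zebrafish WashU MPIMG EST',
    'H_DJ0592P03.Contig29 : preliminary Length:47736',
);

# Show what each regex extracts from each defline
foreach my $defline (@deflines) {
    foreach my $regex ($genbank_regex, $generic_regex) {
        my $id = extract_id($defline, $regex);
        printf "%-30s %s\n", $regex, defined $id ? "id: $id" : 'no match';
    }
}
```

If a regex reports no match, or the same id for two different sequences, it is probably not a good candidate for -DB_ID_REGEX with that database.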
NHGRI-Blastall-0.66/scripts/004075000135110000144000000000000742110615000166045ustar00jpearsonusers00003710000001NHGRI-Blastall-0.66/scripts/ncbi/004075000135110000144000000000000742110615000175175ustar00jpearsonusers00003710000001NHGRI-Blastall-0.66/scripts/ncbi/test.nt010064400135110000144000000016300741534407200210550ustar00jpearsonusers00003710000001>gi|3278015|gb|AI038821.1|AI038821 ox96d03.x1 Soares_senescent_fibroblasts_NbHSF Homo sapiens cDNA clone IMAGE:1664165 3' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN);contains element MER22 repetitive element ;, mRNA sequence [Homo sapiens] TTTTTGACCATCCAATAATTGGGTGGGATCCCATCTGTGCCCGACAAGGGCCCACAGAGGCCTGGGAGGG GAGCTAAGGGCTGGGGTTCCGGTGGCATTTGGGATGTTCAAGACAGTCTGTGCACAGCCTCCCTGGGAGG GTCTGCAGTCACCTCGGCCCACGGTCCCGGGGTGACTGGGCTCCAGCAGCCCTTCCTTCCTTCCTTGCTT CCGTCCTTCCTTCCTCCTCCTTCCGTCTGCACCTCCTTCCTGCATCCGGCACCTCCATGTCCTGAGCTTG TGCTGCGTCAGGAGAGCACACACTTGCAGCTCATGCAGCCGGGGCCACTCTCATCAGGAGGGTTCAGCTT CCGCAGCTTGTGCTGCCGGATCTCACGCACCAACGTGTAGAAGGCATCCTCCACTCTCTGCCGGGTCTTG GCCGATGTCTCGATGTANGGGATGCCGTAGCTTCGGGCGAGGTCCTGAGCCTGCCGAGATTCCACAGTGC GTGCAGCCAGGTCACACTTTGTCCCCACCAGCACCATGGGCACGTCATCCGAGTCCTTCACCCCGTTGAT CTGCTCCCTGGTACTGGTGATGTCCTCAAAAGACTGGTGTTGTGATGGAAACACCACAGGAAGCCCTCCC TGTCCCATGACTGGTCC NHGRI-Blastall-0.66/scripts/ncbi/basic.plx010075500135110000144000000005550741534407200213510ustar00jpearsonusers00003710000001#!/usr/local/bin/perl -w use lib qw(/sysdev/users/jfryan/blastall.pm/NHGRI/Blastall); use strict; use NHGRI::Blastall; my $b = new NHGRI::Blastall; print "running ncbi blast...\n"; $b->blastall( { p => 'blastn', d => 'est', e => 0.001, i => 'test.nt', o => 'blastn.est.output' } ); $b->print_report(); NHGRI-Blastall-0.66/scripts/net/004075000135110000144000000000000742110615000173725ustar00jpearsonusers00003710000001NHGRI-Blastall-0.66/scripts/net/basic.plx010075500135110000144000000004600741534407200212170ustar00jpearsonusers00003710000001#!/usr/local/bin/perl -w use strict; use
NHGRI::Blastall; my $b = new NHGRI::Blastall; print "running blastn over the network...\n"; $b->blastcl3( {'p' => 'blastn', 'd' => 'est', 'i' => 'test.nt', 'o' => 'blastn.est.output' } ); $b->print_report(); NHGRI-Blastall-0.66/scripts/net/test.nt010064400135110000144000000016300741534407200207300ustar00jpearsonusers00003710000001>gi|3278015|gb|AI038821.1|AI038821 ox96d03.x1 Soares_senescent_fibroblasts_NbHSF Homo sapiens cDNA clone IMAGE:1664165 3' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN);contains element MER22 repetitive element ;, mRNA sequence [Homo sapiens] TTTTTGACCATCCAATAATTGGGTGGGATCCCATCTGTGCCCGACAAGGGCCCACAGAGGCCTGGGAGGG GAGCTAAGGGCTGGGGTTCCGGTGGCATTTGGGATGTTCAAGACAGTCTGTGCACAGCCTCCCTGGGAGG GTCTGCAGTCACCTCGGCCCACGGTCCCGGGGTGACTGGGCTCCAGCAGCCCTTCCTTCCTTCCTTGCTT CCGTCCTTCCTTCCTCCTCCTTCCGTCTGCACCTCCTTCCTGCATCCGGCACCTCCATGTCCTGAGCTTG TGCTGCGTCAGGAGAGCACACACTTGCAGCTCATGCAGCCGGGGCCACTCTCATCAGGAGGGTTCAGCTT CCGCAGCTTGTGCTGCCGGATCTCACGCACCAACGTGTAGAAGGCATCCTCCACTCTCTGCCGGGTCTTG GCCGATGTCTCGATGTANGGGATGCCGTAGCTTCGGGCGAGGTCCTGAGCCTGCCGAGATTCCACAGTGC GTGCAGCCAGGTCACACTTTGTCCCCACCAGCACCATGGGCACGTCATCCGAGTCCTTCACCCCGTTGAT CTGCTCCCTGGTACTGGTGATGTCCTCAAAAGACTGGTGTTGTGATGGAAACACCACAGGAAGCCCTCCC TGTCCCATGACTGGTCC NHGRI-Blastall-0.66/scripts/wu/004075000135110000144000000000000742110615000172375ustar00jpearsonusers00003710000001NHGRI-Blastall-0.66/scripts/wu/test.nt010064400135110000144000000016300741534407200205750ustar00jpearsonusers00003710000001>gi|3278015|gb|AI038821.1|AI038821 ox96d03.x1 Soares_senescent_fibroblasts_NbHSF Homo sapiens cDNA clone IMAGE:1664165 3' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN);contains element MER22 repetitive element ;, mRNA sequence [Homo sapiens] TTTTTGACCATCCAATAATTGGGTGGGATCCCATCTGTGCCCGACAAGGGCCCACAGAGGCCTGGGAGGG GAGCTAAGGGCTGGGGTTCCGGTGGCATTTGGGATGTTCAAGACAGTCTGTGCACAGCCTCCCTGGGAGG GTCTGCAGTCACCTCGGCCCACGGTCCCGGGGTGACTGGGCTCCAGCAGCCCTTCCTTCCTTCCTTGCTT 
CCGTCCTTCCTTCCTCCTCCTTCCGTCTGCACCTCCTTCCTGCATCCGGCACCTCCATGTCCTGAGCTTG TGCTGCGTCAGGAGAGCACACACTTGCAGCTCATGCAGCCGGGGCCACTCTCATCAGGAGGGTTCAGCTT CCGCAGCTTGTGCTGCCGGATCTCACGCACCAACGTGTAGAAGGCATCCTCCACTCTCTGCCGGGTCTTG GCCGATGTCTCGATGTANGGGATGCCGTAGCTTCGGGCGAGGTCCTGAGCCTGCCGAGATTCCACAGTGC GTGCAGCCAGGTCACACTTTGTCCCCACCAGCACCATGGGCACGTCATCCGAGTCCTTCACCCCGTTGAT CTGCTCCCTGGTACTGGTGATGTCCTCAAAAGACTGGTGTTGTGATGGAAACACCACAGGAAGCCCTCCC TGTCCCATGACTGGTCC NHGRI-Blastall-0.66/scripts/wu/basic.plx010075500135110000144000000005520741534407200210660ustar00jpearsonusers00003710000001#!/usr/local/bin/perl -w use lib qw(/sysdev/users/jfryan/blastall.pm/NHGRI/Blastall); use strict; use NHGRI::Blastall; my $b = new NHGRI::Blastall; print "running wu-blastn...\n"; $b->wu_blastall( {'p' => 'blastn', 'd' => 'est', 'i' => 'test.nt', 'o' => 'blastn.est.output' } ); $b->print_report(); NHGRI-Blastall-0.66/scripts/id_regex_test.pl010064400135110000144000000024450741534407200220070ustar00jpearsonusers00003710000001# this is a little script to help come up with a -DB_ID_REGEX # if you are using non-GenBank formatted databases then odds are # this regex, # # [^\|]+(?:\|[^\|,\s]*){1,10} # # which matches GenBank formatted deflines such as # # sp|Q54152|XASA_SHIFL AMINO ACID ANTIPORTER... # # is not going to work for your database. You may want to use something # a bit more generic like, # # [^ ]+ # # which will match everything before the first space # which would match H_DJ0592P03.Contig29 in this WashU defline # # H_DJ0592P03.Contig29 : preliminary Length:47736 # # This script will let you test your regex. # THE KEY to picking a good regex is to make sure your regex # catches something which is UNIQUE to each sequence in the database. # set $REGEX to the value of the regex you would like to test and # set $BLAST_REPORT to a BLAST report against your formatted databases # After you find a regex you will need to add -DB_ID_REGEX => YOUR_REGEX # to your new declaration.
# # $b = new NHGRI::Blastall(-DB_ID_REGEX => '[^ ]+'); # $REGEX = '[^ ]+'; $BLAST_REPORT = '../t/blast.out'; use NHGRI::Blastall; $b = new NHGRI::Blastall( -DB_ID_REGEX => $REGEX ); $b->read_report( $BLAST_REPORT ); my @results = $b->result(); foreach $rh_r (@results) { print "id: $rh_r->{'id'}\n"; } NHGRI-Blastall-0.66/MANIFEST010064400135110000144000000003660741534407200162670ustar00jpearsonusers00003710000001Blastall.pm Changes MANIFEST INSTALL Makefile.PL README t/blast.report t/CAB96192.aa t/basic.t scripts/wu/test.nt scripts/wu/basic.plx scripts/net/test.nt scripts/net/basic.plx scripts/ncbi/test.nt scripts/ncbi/basic.plx scripts/id_regex_test.pl NHGRI-Blastall-0.66/t/004075000135110000144000000000000742110615000153605ustar00jpearsonusers00003710000001NHGRI-Blastall-0.66/t/blast.report010064400135110000144000000470250741534407200177460ustar00jpearsonusers00003710000001BLASTN 2.0.10 [Aug-26-1999] Reference: Altschul, Stephen F., Thomas L. Madden, Alejandro A. Schaffer, Jinghui Zhang, Zheng Zhang, Webb Miller, and David J. Lipman (1997), "Gapped BLAST and PSI-BLAST: a new generation of protein database search programs", Nucleic Acids Res. 25:3389-3402. Query= HUMRASH|gi|190890|gb|J00277 Human c-Ha-ras1 proto-oncogene. (6453 letters) Database: Non-redundant Database of all other organisms GenBank+EMBL+DDBJ EST sequences: updated on Sat Dec 4 01:15:02 1999 800,021 sequences; 338,270,040 total letters Searching..................................................done Score E Sequences producing significant alignments: (bits) Value gb|AI657850.1|AI657850 fc23g10.y1 Zebrafish WashU MPIMG EST Dani... 139 2e-30 gb|AA497334|AA497334 fa04e08.r1 Zebrafish ICRFzfls Danio rerio c... 123 1e-25 gb|AI877912.1|AI877912 fc55a12.y1 Zebrafish WashU MPIMG EST Dani... 100 2e-18 gb|AI959079.1|AI959079 fd24c02.y1 Zebrafish WashU MPIMG EST Dani... 90 2e-15 gb|AA141102|AA141102 CK01231.5prime CK Drosophila melanogaster e... 
90 2e-15 gb|AW148172.1|AW148172 da13b03.x1 normalized Xenopus laevis gast... 70 2e-09 dbj|AU003591|AU003591 AU003591 Bombyx mori p50(Daizo) Bombyx mor... 64 1e-07 dbj|AU004341|AU004341 AU004341 Bombyx mori p50(Daizo) Bombyx mor... 64 1e-07 gb|AI575394.1|AI575394 UI-R-G0-uh-e-12-0-UI.s1 UI-R-G0 Rattus no... 64 1e-07 gb|AI259307|AI259307 LP02693.3prime LP Drosophila melanogaster l... 64 1e-07 dbj|AU004702|AU004702 AU004702 Bombyx mori p50(Daizo) Bombyx mor... 64 1e-07 gb|AI295927|AI295927 LP09693.5prime LP Drosophila melanogaster l... 62 5e-07 gb|AI405934|AI405934 GH26087.5prime GH Drosophila melanogaster h... 62 5e-07 gb|AI238083|AI238083 GH14071.5prime GH Drosophila melanogaster h... 62 5e-07 gb|AI626207.1|AI626207 fc12c05.y1 Zebrafish WashU MPIMG EST Dani... 58 7e-06 gb|AA900048|AA900048 UI-R-E0-da-e-01-0-UI.s1 UI-R-E0 Rattus norv... 58 7e-06 >gb|AI657850.1|AI657850 fc23g10.y1 Zebrafish WashU MPIMG EST Danio rerio cDNA 5' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN);, mRNA sequence Length = 573 Score = 139 bits (70), Expect = 2e-30 Identities = 145/170 (85%) Strand = Plus / Plus Query: 2052 ggaagcaggtggtcattgatggggagacgtgcctgttggacatcctggataccgccggcc 2111 ||||||||||||| ||||| || || ||||| || ||||||||||||| || || || | Sbjct: 194 ggaagcaggtggtgattgacggagaaacgtgtctactggacatcctggacactgcaggtc 253 Query: 2112 aggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgtg 2171 ||||||| ||||| |||||| |||||||||||||| | || || ||||||||||| |||| Sbjct: 254 aggaggaatacagtgccatgagggaccagtacatgaggacaggagagggcttcctctgtg 313 Query: 2172 tgtttgccatcaacaacaccaagtcttttgaggacatccaccagtacagg 2221 | ||||||||||| ||||| ||||| ||||||||||| ||||| |||||| Sbjct: 314 tctttgccatcaataacacaaagtcctttgaggacattcaccactacagg 363 Score = 79.8 bits (40), Expect = 2e-12 Identities = 94/112 (83%) Strand = Plus / Plus Query: 1664 atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgacc 1723 ||||| ||||||||||| ||||| ||||| || || || || ||||| || || || ||| Sbjct: 73 
atgaccgaatataagcttgtggtcgtgggagctggaggcgtaggcaaaagcgctctcacc 132 Query: 1724 atccagctgatccagaaccattttgtggacgaatacgaccccactatagagg 1775 ||||| || ||||||||||| |||||||| ||||| ||||| |||||||||| Sbjct: 133 atccaactcatccagaaccactttgtggatgaatatgacccaactatagagg 184 Score = 65.9 bits (33), Expect = 3e-08 Identities = 60/69 (86%) Strand = Plus / Plus Query: 2371 cagggagcagatcaaacgggtgaaggactcggatgacgtgcccatggtgctggtggggaa 2430 |||||||||||| || || || |||||||| || ||||| |||||||| ||||||||||| Sbjct: 360 cagggagcagataaagcgagtaaaggactccgaggacgtccccatggttctggtggggaa 419 Query: 2431 caagtgtga 2439 |||||||| Sbjct: 420 taagtgtga 428 >gb|AA497334|AA497334 fa04e08.r1 Zebrafish ICRFzfls Danio rerio cDNA clone 1O24 5' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN); Length = 671 Score = 123 bits (62), Expect = 1e-25 Identities = 143/170 (84%) Strand = Plus / Plus Query: 2052 ggaagcaggtggtcattgatggggagacgtgcctgttggacatcctggataccgccggcc 2111 ||||||||||||| ||||| || || ||||| || ||||||||||||| || || || | Sbjct: 452 ggaagcaggtggtgattgacggagaaacgtgtctactggacatcctggacactgcaggtc 511 Query: 2112 aggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgtg 2171 ||||||| ||||| |||||| |||||||||||||| | || || || |||||||| |||| Sbjct: 512 aggaggaatacagtgccatgagggaccagtacatgaggacaggagaaggcttcctctgtg 571 Query: 2172 tgtttgccatcaacaacaccaagtcttttgaggacatccaccagtacagg 2221 | ||||||||||| ||||| |||| ||||||||||| ||||| |||||| Sbjct: 572 tctttgccatcaataacacaaagtgctttgaggacattcaccactacagg 621 Score = 71.9 bits (36), Expect = 5e-10 Identities = 93/112 (83%) Strand = Plus / Plus Query: 1664 atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgacc 1723 ||||| ||||||||||| ||||| ||||| || || || || ||||| || || || ||| Sbjct: 331 atgaccgaatataagcttgtggtcgtgggagctggaggcgtaggcaaaagcgctctcacc 390 Query: 1724 atccagctgatccagaaccattttgtggacgaatacgaccccactatagagg 1775 ||||| || ||||||||||| |||||| | ||||| ||||| |||||||||| Sbjct: 391 atccaactcatccagaaccactttgtgaatgaatatgacccaactatagagg 442 >gb|AI877912.1|AI877912 
fc55a12.y1 Zebrafish WashU MPIMG EST Danio rerio cDNA 5' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN);, mRNA sequence Length = 647 Score = 99.6 bits (50), Expect = 2e-18 Identities = 116/138 (84%) Strand = Plus / Plus Query: 2052 ggaagcaggtggtcattgatggggagacgtgcctgttggacatcctggataccgccggcc 2111 ||||||||||||| ||||| || |||||||| ||| ||||||||||||| || || |||| Sbjct: 336 ggaagcaggtggtgattgacggcgagacgtgtctgctggacatcctggacactgcaggcc 395 Query: 2112 aggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgtg 2171 |||| ||||||||||| ||| | |||||||||||| | || || ||||| ||||| || | Sbjct: 396 aggaagagtacagcgcaatgagagaccagtacatgaggacaggagagggtttcctctgcg 455 Query: 2172 tgtttgccatcaacaaca 2189 | || || |||||||||| Sbjct: 456 tcttcgctatcaacaaca 473 Score = 73.8 bits (37), Expect = 1e-10 Identities = 94/113 (83%) Strand = Plus / Plus Query: 1663 gatgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgac 1722 |||||| || ||||||||||| || ||||| || || ||||| || ||||| ||| | || Sbjct: 214 gatgactgagtataagctggttgttgtgggagcaggaggtgttgggaagagcgcgttaac 273 Query: 1723 catccagctgatccagaaccattttgtggacgaatacgaccccactatagagg 1775 |||||||| |||||||| || |||||||| ||||| ||||||||||| |||| Sbjct: 274 aatccagctcatccagaatcactttgtggatgaatatgaccccactattgagg 326 Score = 67.9 bits (34), Expect = 8e-09 Identities = 52/58 (89%) Strand = Plus / Plus Query: 2375 gagcagatcaaacgggtgaaggactcggatgacgtgcccatggtgctggtggggaaca 2432 ||||||||||| || ||||||||||||||||| || |||||||| || |||||||||| Sbjct: 506 gagcagatcaagcgtgtgaaggactcggatgatgttcccatggtcctagtggggaaca 563 >gb|AI959079.1|AI959079 fd24c02.y1 Zebrafish WashU MPIMG EST Danio rerio cDNA 5' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN);, mRNA sequence Length = 647 Score = 89.7 bits (45), Expect = 2e-15 Identities = 129/157 (82%) Strand = Plus / Plus Query: 2040 aggattcctaccggaagcaggtggtcattgatggggagacgtgcctgttggacatcctgg 2099 ||||||| ||| | |||||||| || |||||||| ||||| || || |||| ||||||| Sbjct: 397 
aggattcatacagaaagcaggttgtgattgatggagagacttgtttgctggatatcctgg 456 Query: 2100 ataccgccggccaggaggagtacagcgccatgcgggaccagtacatgcgcaccggggagg 2159 | || || || |||||||||||||| |||||| | |||||||| ||| | || || |||| Sbjct: 457 acactgcaggtcaggaggagtacagtgccatgagagaccagtatatgaggacaggagagg 516 Query: 2160 gcttcctgtgtgtgtttgccatcaacaacaccaagtc 2196 | || || |||||||||||||||||||| ||||||| Sbjct: 517 gatttctttgtgtgtttgccatcaacaatgccaagtc 553 >gb|AA141102|AA141102 CK01231.5prime CK Drosophila melanogaster embryo BlueScript Drosophila melanogaster cDNA clone CK01231 5prime similar to 0: Length = 551 Score = 89.7 bits (45), Expect = 2e-15 Identities = 89/104 (85%) Strand = Plus / Plus Query: 1664 atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgacc 1723 ||||||||||| || ||||| || || || ||||| || ||||||||| ||||| ||| Sbjct: 181 atgacggaatacaaactggtcgtcgttggagccggaggcgtgggcaagtccgcgctcacc 240 Query: 1724 atccagctgatccagaaccattttgtggacgaatacgaccccac 1767 |||||||| |||||||||||||| |||||||| ||||||||||| Sbjct: 241 atccagctaatccagaaccatttcgtggacgantacgaccccac 284 Score = 63.9 bits (32), Expect = 1e-07 Identities = 83/101 (82%) Strand = Plus / Plus Query: 2048 taccggaagcaggtggtcattgatggggagacgtgcctgttggacatcctggataccgcc 2107 ||||| ||||| ||||| || ||||| || || |||||| ||||||||||||| |||||| Sbjct: 298 taccgaaagcaagtggttatcgatggaganacctgcctgctggacatcctggacaccgcc 357 Query: 2108 ggccaggaggagtacagcgccatgcgggaccagtacatgcg 2148 ||||| || || | | ||||||||||| ||||| ||||| Sbjct: 358 ggccaagatgantnctcggccatgcgggatcagtatatgcg 398 >gb|AW148172.1|AW148172 da13b03.x1 normalized Xenopus laevis gastrula Xenopus laevis cDNA clone XENOPUS_SOURCE_ID:xlnga001a06 3' similar to gb:V00574_cds1 TRANSFORMING PROTEIN P21/H-RAS-1 (HUMAN); gb:X13664 Mouse mRNA for N-ras protein (MOUSE... 
Length = 652 Score = 69.9 bits (35), Expect = 2e-09 Identities = 134/167 (80%) Strand = Plus / Minus Query: 2055 agcaggtggtcattgatggggagacgtgcctgttggacatcctggataccgccggccagg 2114 |||||||||| || || ||||| || ||||| | ||||| ||||||| || || || | Sbjct: 526 agcaggtggtgatagacggggaaacttgcctcctagacatattggatactgcggggcaag 467 Query: 2115 aggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgtgtgt 2174 |||| ||||| |||||| |||| |||||||||||||| || || || || || ||||| | Sbjct: 466 aggaatacagtgccatgagggatcagtacatgcgcacgggagaagggtttctctgtgtct 407 Query: 2175 ttgccatcaacaacaccaagtcttttgaggacatccaccagtacagg 2221 |||| || || ||||| |||||||| |||||| |||| || |||||| Sbjct: 406 ttgctattaataacacaaagtctttcgaggacgtccatcattacagg 360 >dbj|AU003591|AU003591 AU003591 Bombyx mori p50(Daizo) Bombyx mori cDNA clone ws00308, mRNA sequence [Bombyx mori] Length = 724 Score = 63.9 bits (32), Expect = 1e-07 Identities = 59/68 (86%) Strand = Plus / Plus Query: 2375 gagcagatcaaacgggtgaaggactcggatgacgtgcccatggtgctggtggggaacaag 2434 ||||||||||| || ||||||||| |||| || ||||||||||| || ||||| |||||| Sbjct: 150 gagcagatcaagcgagtgaaggacgcggaagaggtgcccatggtacttgtgggcaacaag 209 Query: 2435 tgtgacct 2442 || ||||| Sbjct: 210 tgcgacct 217 >dbj|AU004341|AU004341 AU004341 Bombyx mori p50(Daizo) Bombyx mori cDNA clone ws20237, mRNA sequence [Bombyx mori] Length = 739 Score = 63.9 bits (32), Expect = 1e-07 Identities = 59/68 (86%) Strand = Plus / Plus Query: 2375 gagcagatcaaacgggtgaaggactcggatgacgtgcccatggtgctggtggggaacaag 2434 ||||||||||| || ||||||||| |||| || ||||||||||| || ||||| |||||| Sbjct: 499 gagcagatcaagcgagtgaaggacgcggaagaggtgcccatggtacttgtgggcaacaag 558 Query: 2435 tgtgacct 2442 || ||||| Sbjct: 559 tgcgacct 566 >gb|AI575394.1|AI575394 UI-R-G0-uh-e-12-0-UI.s1 UI-R-G0 Rattus norvegicus cDNA clone UI-R-G0-uh-e-12-0-UI 3', mRNA sequence Length = 536 Score = 63.9 bits (32), Expect = 1e-07 Identities = 59/68 (86%) Strand = Plus / Plus Query: 2114 gaggagtacagcgccatgcgggaccagtacatgcgcaccggggagggcttcctgtgtgtg 2173 ||||||||||| || ||| 
|||||||||||||| | || ||||||||||| || ||||| Sbjct: 387 gaggagtacagtgcaatgagggaccagtacatgagaactggggagggctttctttgtgta 446 Query: 2174 tttgccat 2181 |||||||| Sbjct: 447 tttgccat 454 >gb|AI259307|AI259307 LP02693.3prime LP Drosophila melanogaster larval-early pupal pOT2 Drosophila melanogaster cDNA clone LP02693 3prime similar to M80535: R FBgn0004636 PID:g158198 SWISS-PROT:P08645, mRNA sequence [Drosophila melanogaster] Length = 542 Score = 63.9 bits (32), Expect = 1e-07 Identities = 50/56 (89%) Strand = Plus / Minus Query: 2387 cgggtgaaggactcggatgacgtgcccatggtgctggtggggaacaagtgtgacct 2442 |||||||||||| |||| || |||||||||||||| ||||| |||||||| ||||| Sbjct: 275 cgggtgaaggacacggacgatgtgcccatggtgctcgtgggcaacaagtgcgacct 220 >dbj|AU004702|AU004702 AU004702 Bombyx mori p50(Daizo) Bombyx mori cDNA clone ws20754, mRNA sequence [Bombyx mori] Length = 747 Score = 63.9 bits (32), Expect = 1e-07 Identities = 59/68 (86%) Strand = Plus / Plus Query: 2375 gagcagatcaaacgggtgaaggactcggatgacgtgcccatggtgctggtggggaacaag 2434 ||||||||||| || ||||||||| |||| || ||||||||||| || ||||| |||||| Sbjct: 485 gagcagatcaagcgagtgaaggacgcggaagaggtgcccatggtacttgtgggcaacaag 544 Query: 2435 tgtgacct 2442 || ||||| Sbjct: 545 tgcgacct 552 >gb|AI295927|AI295927 LP09693.5prime LP Drosophila melanogaster larval-early pupal pOT2 Drosophila melanogaster cDNA clone LP09693 5prime similar to Y07564: Ric FBgn0017549 PID:e276381 SPTREMBL:P91637, mRNA sequence [Drosophila melanogaster] Length = 522 Score = 61.9 bits (31), Expect = 5e-07 Identities = 52/59 (88%) Strand = Plus / Plus Query: 2090 gacatcctggataccgccggccaggaggagtacagcgccatgcgggaccagtacatgcg 2148 |||||||| || ||||||||||||| ||||| || |||||||||||||| |||||||| Sbjct: 413 gacatccttgacaccgccggccaggtggagttcacggccatgcgggaccaatacatgcg 471 >gb|AI405934|AI405934 GH26087.5prime GH Drosophila melanogaster head pOT2 Drosophila melanogaster cDNA clone GH26087 5prime similar to Y07564: Ric FBgn0017549 PID:e276381 SPTREMBL:P91637, mRNA sequence [Drosophila melanogaster] Length = 602 
Score = 61.9 bits (31), Expect = 5e-07 Identities = 52/59 (88%) Strand = Plus / Plus Query: 2090 gacatcctggataccgccggccaggaggagtacagcgccatgcgggaccagtacatgcg 2148 |||||||| || ||||||||||||| ||||| || |||||||||||||| |||||||| Sbjct: 77 gacatccttgacaccgccggccaggtggagttcacggccatgcgggaccaatacatgcg 135 >gb|AI238083|AI238083 GH14071.5prime GH Drosophila melanogaster head pOT2 Drosophila melanogaster cDNA clone GH14071 5prime similar to Y07564: Ric FBgn0017549 PID:e276381 SPTREMBL:P91637, mRNA sequence [Drosophila melanogaster] Length = 535 Score = 61.9 bits (31), Expect = 5e-07 Identities = 52/59 (88%) Strand = Plus / Plus Query: 2090 gacatcctggataccgccggccaggaggagtacagcgccatgcgggaccagtacatgcg 2148 |||||||| || ||||||||||||| ||||| || |||||||||||||| |||||||| Sbjct: 425 gacatccttgacaccgccggccaggtggagttcacggccatgcgggaccaatacatgcg 483 >gb|AI626207.1|AI626207 fc12c05.y1 Zebrafish WashU MPIMG EST Danio rerio cDNA 5' similar to TR:P97913 P97913 N-RAS {EXONS 1 AND 2 ;, mRNA sequence Length = 529 Score = 58.0 bits (29), Expect = 7e-06 Identities = 92/113 (81%) Strand = Plus / Plus Query: 1663 gatgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgac 1722 |||||| || ||||||||||| || ||||| || || |||| | ||||| ||| | || Sbjct: 298 gatgactgagtataagctggttgttgtgggagcaggaggtggttggaagagcgcgttaac 357 Query: 1723 catccagctgatccagaaccattttgtggacgaatacgaccccactatagagg 1775 |||||||| |||||||| || |||||||| ||||| ||||||||||| |||| Sbjct: 358 aatccagctcatccagaatcactttgtggatgaatatgaccccactattgagg 410 >gb|AA900048|AA900048 UI-R-E0-da-e-01-0-UI.s1 UI-R-E0 Rattus norvegicus cDNA clone UI-R-E0-da-e-01-0-UI 3' similar to > gi|191344|gb|M84166|HAMCHARAS Hamster c-Ha-ras protein gene, complete cds, mRNA sequence [Rattus norvegicus] Length = 315 Score = 58.0 bits (29), Expect = 7e-06 Identities = 32/33 (96%) Strand = Plus / Minus Query: 3319 gctgcatgagctgcaagtgtgtgctctcctgac 3351 ||||||||||||||||||||||||| ||||||| Sbjct: 315 gctgcatgagctgcaagtgtgtgctgtcctgac 283 Database: Non-redundant Database of all other 
organisms GenBank+EMBL+DDBJ EST sequences: updated on Sat Dec 4 01:15:02 1999 Posted date: Dec 4, 1999 2:31 AM Number of letters in database: 338,270,040 Number of sequences in database: 800,021 Lambda K H 1.37 0.711 1.31 Gapped Lambda K H 1.37 0.711 1.31 Matrix: blastn matrix:1 -3 Gap Penalties: Existence: 5, Extension: 2 Number of Hits to DB: 890609 Number of Sequences: 800021 Number of extensions: 890609 Number of successful extensions: 160 Number of sequences better than 1.0e-04: 19 length of query: 6453 length of database: 338,270,040 effective HSP length: 21 effective length of query: 6432 effective length of database: 321,469,599 effective search space: 2067692460768 effective search space used: 2067692460768 T: 0 A: 0 X1: 6 (11.9 bits) X2: 25 (49.6 bits) S1: 12 (24.3 bits) S2: 28 (56.0 bits) Identities = 89/104 (85%) Strand = Plus / Plus Query: 1664 atgacggaatataagctggtggtggtgggcgccggcggtgtgggcaagagtgcgctgacc 1723 ||||||||||| || ||||| || || || ||||| || ||||||||| ||||| ||| Sbjct: 181 atgacggaatacaaactggtcgtcgttggagccggaggcgtgggcaagtccgcgctcacc 240 Query: 1724 atccagctgatccagaaccattttgtggacgaatacgaccccac 1767 ||||||||NHGRI-Blastall-0.66/t/basic.t010064400135110000144000000007010741534407200166400ustar00jpearsonusers00003710000001BEGIN { $| = 1; print "1..4\n"; } END {print "not ok 1\n" unless $loaded;} use NHGRI::Blastall; $loaded = 1; print "ok 1\n"; my $t = 't'; my $b = new NHGRI::Blastall; if ($b->read_report("$t/blast.report")) { print "ok 2\n"; } else { print "not ok 2\n"; } if ($b->result('id')) { print "ok 3\n"; } else { print "not ok 3\n"; } if ($b->filter( {'identities' => .70} )) { print "ok 4\n"; } else { print "not ok 4\n"; } NHGRI-Blastall-0.66/t/CAB96192.aa010064400135110000144000000001110741534407200166700ustar00jpearsonusers00003710000001>gi|8919844|emb|CAB96192.1| PSI 9 KDa protein [Aloe vera] YLWHETTRSMGLSY NHGRI-Blastall-0.66/Blastall.pm010064400135110000144000001616060742110614400172300ustar00jpearsonusers00003710000001package 
NHGRI::Blastall; require 5.004; ########################################################################### # See the bottom of this file for the POD documentation. Search for the # string '=head'. # # You can run this file through either pod2man or pod2html to produce pretty # documentation in manual or html file format (these utilities are part of the # Perl 5 distribution). # # The most recent version and complete docs are available at: # ftp://ftp.nhgri.nih.gov/pub/software/ ########################################################################### # This software is "United States Government Work" under the # terms of the United States Copyright Act. It was written as part of the # authors' official duties for the United States Government and thus # cannot be copyrighted. This software/database is freely available to # the public for use without a copyright notice. Restrictions cannot be # placed on its present or future use. # # Although all reasonable efforts have been taken to ensure the accuracy # and reliability of the software and data, the National Human Genome # Research Institute (NHGRI) and the U.S. Government does not and cannot # warrant the performance or results that may be obtained by using this # software or data. NHGRI and the U.S. Government disclaims all # warranties as to performance, merchantability or fitness for any # particular purpose. # # In any work or product derived from this material, proper attribution # of the authors as the source of the software or data should be made, # using http://genome.nhgri.nih.gov/blastall as the citation. 
########################################################################### use strict; use vars qw($VERSION $REVISION $BLASTALL $BLASTCL3 $BL2SEQ %WU_BLAST $FORMATDB %DEFAULT_FILTERS); use IO::File; use Carp qw(carp croak); $VERSION = '0.67'; $REVISION = '# $Id: Blastall.pm,v 1.3 2002/01/15 20:09:08 jpearson Exp $'; # you can set these variables to the full path of the corresponding binaries $BLASTALL = 'blastall'; $BLASTCL3 = 'blastcl3'; $BL2SEQ = 'bl2seq'; $FORMATDB = 'formatdb'; # you can adjust the locations of wu-blast programs # for instance 'wu_blastn' => '/usr/local/wu/blastn', # I haven't added support for wu-blastall (hopefully I will get to this) %WU_BLAST = ('wu_blastn' => 'wu-blastn', 'wu_blastp' => 'wu-blastp', 'wu_blastx' => 'wu-blastx', 'wu_tblastn' => 'wu-tblastn', 'wu_tblastx' => 'wu-tblastx' ); ########################################################################### # You probably shouldn't be mucking below here... ########################################################################### %DEFAULT_FILTERS = ('id' => '=~' ,'defline' => '=~' ,'subject_length' => '>' ,'scores' => '>' ,'expects' => '<' ,'identities' => '>' ,'match_lengths' => '>' ,'subject_strands' => 'eq' ,'query_strands' => 'eq' ,'query_frames' => '==' ,'subject_frames' => '==' ); sub new { my $this = shift; my %args = @_; my $class = ref($this) || $this; my $self = \%args; bless $self, $class; return($self); } sub blastall { my $self = shift; my $rh_opts = _get_opts(@_); my ($ra_args,$tmp_file) = _get_args($rh_opts); unshift @{$ra_args}, $BLASTALL; my $rc = 0xffff & system(@{$ra_args}); if ($rc == 0) { $self->{options} = $rh_opts; push @{$self->{tmp_file}}, $tmp_file; # record so we can DESTROY later $self->{report_file} = $self->{options}->{o} || $tmp_file; return(1); } else { foreach (@{$ra_args}) { print STDERR "$_ "; } print STDERR "\n"; carp qq!the above system call to $BLASTALL failed. Possible problems include 1. 
Is blastall in your path or is \$BLASTALL the full path to blastall? 2. Do you have the DATABASE installed 3. Are you pointing to the correct location of the query sequence 4. Is your DATABASE corrupted 5. Are you running the correct program for the type of database and query sequence you are running. e.g. BLASTP -> PROTEIN 6. Perhaps try blastcl3 instead of blastall. !; return undef; } } sub formatdb { my $self = shift; my $rh_opts = _get_opts(@_); my $ra_args = _get_formatdb_args($rh_opts); unshift @{$ra_args}, $FORMATDB; my $rc = 0xffff & system(@{$ra_args}); if ($rc == 0) { $self->{options} = $rh_opts; my $ra_indexes = _get_formatdb_indexes($rh_opts); push @{$self->{indexes}}, @{$ra_indexes}; #$self->{indexes} = _get_formatdb_indexes($rh_opts); $self->{tmp_file} = []; return(1); } else { foreach (@{$ra_args}) { print STDERR "$_ "; } print STDERR "\n"; carp qq!the above system call to $FORMATDB failed. Possible problems include 1. Is formatdb in your path or is \$FORMATDB the full path to formatdb? 
!; return undef; } } sub remove_formatdb_indexes { my $self = shift; if ($self->{indexes}) { foreach my $i (@{$self->{indexes}}) { unlink $i or carp "could not remove $i:$!"; } } } sub blast_one_to_many { my $self = shift; my $rh_opts = _get_opts(@_); my ($ra_args,$tmp_file) = _get_args($rh_opts); if ($rh_opts->{p} eq 'blastp' or $rh_opts->{p} eq 'blastx') { $self->formatdb(p => 'T', i => $rh_opts->{d}); } else { $self->formatdb(p => 'F', i => $rh_opts->{d}); } $self->blastall($rh_opts); $self->remove_formatdb_indexes(); } sub bl2seq { my $self = shift; my $rh_opts = _get_opts(@_); my ($ra_args,$tmp_file) = _get_args($rh_opts); unshift @{$ra_args}, $BL2SEQ; my $rc = 0xffff & system(@{$ra_args}); if ($rc == 0) { $self->{options} = $rh_opts; push @{$self->{tmp_file}}, $tmp_file; # record so we can DESTROY later $self->{report_file} = $self->{options}->{o} || $tmp_file; return(1); } else { foreach (@{$ra_args}) { print STDERR "$_ "; } print STDERR "\n"; carp qq!the above system call to $BL2SEQ failed. Possible problems include 1. Is bl2seq in your path or is \$BL2SEQ the full path to bl2seq? 2. Are you running the correct program for the type of sequences you are comparing? e.g. 
   BLASTP -> PROTEIN
!;
        return undef;
    }
}

sub wu_blastall {
    my $self = shift;
    my $rh_opts = _get_opts(@_);
    my ($ra_args,$tmp_file) = _get_wu_args($rh_opts);
    my $command = join ' ', @{$ra_args};
    my $output = $rh_opts->{o} || $tmp_file;
    $command .= " > $output";
    #print "$command\n";
    my $rc = 0xffff & system("$command");
    if ($rc == 0) {
        $self->{options} = $rh_opts;
        push @{$self->{tmp_file}}, $tmp_file; # record so we can DESTROY later
        $self->{report_file} = $self->{options}->{o} || $tmp_file;
        return(1);
    } else {
        print STDERR "$command\n";
        carp qq!the above system call failed.!;
        return undef;
    }
}

sub net_blastall {
    carp "net_blastall is deprecated; please use blastcl3 instead\n";
    blastcl3(@_);
}

sub blastcl3 {
    {
        local $BLASTALL = $BLASTCL3;
        blastall(@_);
    }
}

sub mask {
    my $self = shift;
    my $rh_opts;
    if (@_ > 1) {
        my %opts = @_;
        $rh_opts = \%opts;
    } else {
        $rh_opts = shift;
    }
    my ($rc);
    # type can be wu_blastall, blastcl3, blastall or net_blastall (deprecated)
    my $type_of_blast = $rh_opts->{type} || 'blastall';
    delete $rh_opts->{type};
    my ($ra_args,$tmp_file) = _get_args($rh_opts);
    unshift @{$ra_args}, $BLASTALL;
    my $m = new NHGRI::Blastall;
    $rh_opts->{o} = $rh_opts->{o} || $tmp_file;
    if ($type_of_blast eq 'net_blastall') {
        $m->net_blastall($rh_opts);
    } elsif ($type_of_blast eq 'blastcl3') {
        $m->blastcl3($rh_opts);
    } elsif ($type_of_blast eq 'wu_blastall') {
        $m->wu_blastall($rh_opts);
    } else {
        # 2002-01-12 JVP added "return undef unless"
        return undef unless ($m->blastall($rh_opts));
    }
    my $report_file = $m->{report_file};
    my $fa_file = $m->{options}->{i};
    my($fa_line,$sequence) = _get_sequence($fa_file);
    my($ra_mask_positions) = _get_mask_positions($report_file);
    my $masked_sequence = _get_masked_sequence($sequence,$ra_mask_positions);
    return ("$fa_line\n$masked_sequence", $ra_mask_positions) if (wantarray);
    return ("$fa_line\n$masked_sequence");
}

sub print_report {
    my $self = shift;
    my $report_file = $self->{report_file};
    return undef unless ($report_file);
    open REPORT,
        $report_file or return(0);
    my $report = join ('', <REPORT>);
    print $report;
    return;
}

sub filter {
    my $self = shift;
    my $rh_criteria;
    if (@_ > 1) {
        my %criteria = @_;
        $rh_criteria = \%criteria;
    } else {
        $rh_criteria = shift;
    }
    my @filtered_results = ();
    $self->_parse_report() unless ($self->{been_parsed});
    my ($ra_stripped_values,$ra_cmp_ops) =
        _get_stripped_values_and_comparison_operators($rh_criteria);
    foreach my $entry (@{$self->{results}}) {
        if ($entry = _entry_passes_filter($entry,$ra_stripped_values,
                                          $ra_cmp_ops,$rh_criteria)) {
            push @filtered_results,$entry;
        }
    }
    $self->{results} = \@filtered_results;
    return @filtered_results;
}

sub unfilter {
    my $self = shift;
    $self->_parse_report();
}

sub result {
    my($self,@r) = @_;
    my $ra_results = [];
    $self->_parse_report() unless ($self->{been_parsed});
    return $self->_all_results unless @r;
    if (@r == 1) {
        $ra_results = $self->_result_for_one(@r);
    } elsif (@r == 2) {
        $ra_results = $self->_result_for_two(@r);
    } else {
        return undef;
    }
    return(@{$ra_results});
}

sub read_report {
    my $self = shift;
    my $file = shift;
    $self->{report_file} = $file;
}

sub get_database_description {
    my $self = shift;
    $self->_parse_report() unless ($self->{been_parsed});
    return $self->{database_description};
}

sub get_database_sequence_count {
    my $self = shift;
    $self->_parse_report() unless ($self->{been_parsed});
    return $self->{database_sequence_count};
}

sub get_database_letter_count {
    my $self = shift;
    $self->_parse_report() unless ($self->{been_parsed});
    return $self->{database_letter_count};
}

sub get_blast_program {
    my $self = shift;
    $self->_parse_report() unless ($self->{been_parsed});
    return $self->{program};
}

sub get_blast_version {
    my $self = shift;
    $self->_parse_report() unless ($self->{been_parsed});
    return $self->{blast_version};
}

sub get_report {
    my $self = shift;
    return $self->{report_file};
}

sub DESTROY {
    my $self = shift;
    foreach (@{$self->{tmp_file}}) {
        unlink $_ if (-e $_);
    }
}

# END OF PUBLIC METHODS

# BEGIN PRIVATE METHODS

sub
_get_opts {
    my @args = @_;
    my $rh_opts;
    if (@args == 1) {
        $rh_opts = shift;
    } else {
        my %opts = @args;
        $rh_opts = \%opts;
    }
    return $rh_opts;
}

sub _get_formatdb_indexes {
    my $rh_opts = shift;
    my @indexes = ();
    my $x = 'p';
    my $name = $rh_opts->{i};
    $name = $rh_opts->{n} if ($rh_opts->{n});
    $x = 'n' if ($rh_opts->{p} =~ /f/i);
    push @indexes, "${name}.${x}hr", "${name}.${x}in", "${name}.${x}sq";
    if ($rh_opts->{o} && $rh_opts->{o} =~ /t/i) {
        push @indexes, "${name}.${x}nd", "${name}.${x}ni",
                       "${name}.${x}sd", "${name}.${x}si";
    }
    return \@indexes;
}

sub _write_report_to_disk {
    my $report_file = shift;
    my $results = shift;
    open OUT, ">$report_file" or croak "cannot open $report_file:$!";
    print OUT $results;
}

sub _get_matrix_parameters {
    return {
        PAM30    => [9, 1],
        PAM70    => [10, 1],
        BLOSUM45 => [14, 2],
        BLOSUM80 => [10, 1],
        BLOSUM62 => [11, 1]
    };
}

sub _get_genetic_codes {
    return {
        1 => "Standard (1)",
        2 => "Vertebrate Mitochondrial (2)",
        3 => "Yeast Mitochondrial (3)",
        4 => "Mold Mitochondrial; ... (4)",
        5 => "Invertebrate Mitochondrial (5)",
        6 => "Ciliate Nuclear; ...
(6)",
        9 => "Echinoderm Mitochondrial (9)",
        10 => "Euplotid Nuclear (10)",
        11 => "Bacterial (11)",
        12 => "Alternative Yeast Nuclear (12)",
        13 => "Ascidian Mitochondrial (13)",
        14 => "Flatworm Mitochondrial (14)",
        15 => "Blepharisma Macronuclear (15)"
    };
}

sub _get_masked_sequence {
    my $sequence = shift;
    my $ra_mask_positions = shift;
    my ($i,$iplus);
    return $sequence unless ($ra_mask_positions); # nothing masked
    if ($ra_mask_positions->[0] &&
        $ra_mask_positions->[0] > $ra_mask_positions->[1]) {
        @{$ra_mask_positions} = reverse @{$ra_mask_positions};
    }
    for ($i=0; $i < @{$ra_mask_positions}; $i=$i+2) {
        $iplus = ($i + 1);
        $sequence = _mask_from_x_to_y($sequence,$ra_mask_positions->[$i],
                                      $ra_mask_positions->[$iplus]);
    }
    return $sequence;
}

sub _mask_from_x_to_y {
    my $sequence = shift;
    my $x = shift;
    my $y = shift;
    my $length = ($y - $x);
    substr($sequence, $x, $length) = ('N' x $length);
    return($sequence);
}

sub _get_mask_positions {
    my $report_file = shift;
    my @mask_positions = ();
    open REPORT, $report_file or carp "cannot open $report_file:$!";
    while (<REPORT>) {
        if (/^Query:\s+(\d+)\s+.*\s+(\d+)$/) {
            push @mask_positions,($1,$2);
        }
    }
    return(\@mask_positions);
}

sub _get_sequence {
    my $fasta_file = shift;
    my @fasta_lines = ();
    open FASTA, $fasta_file or warn "cannot open $fasta_file:$!";
    chomp(@fasta_lines = <FASTA>);
    my $first_line = shift @fasta_lines;
    my $seq = join '',@fasta_lines;
    return ($first_line,$seq);
}

sub _result_for_one {
    my $self = shift;
    my $attribute = shift;
    my $count = 0;
    my $ra_results = [];
    $attribute = _map_attribute($attribute);
    foreach (@{$self->{results}}) {
        if (ref($self->{results}->[$count]->{$attribute}) eq 'ARRAY'){
            push @{$ra_results},@{$self->{results}->[$count]->{$attribute}};
        } else {
            push @{$ra_results},$self->{results}->[$count]->{$attribute};
        }
        $count++;
    }
    return($ra_results);
}

sub _result_for_two {
    my $self = shift;
    my $attribute = shift;
    my $id = shift;
    my $ra_results = [];
    $attribute = _map_attribute($attribute);
    for (my $i = 0; $i <
            @{$self->{results}}; $i++) {
        next unless ($self->{results}->[$i]->{id} eq $id);
        if (ref($self->{results}->[$i]->{$attribute}) eq 'ARRAY'){
            push @{$ra_results},@{$self->{results}->[$i]->{$attribute}};
        } else {
            push @{$ra_results},$self->{results}->[$i]->{$attribute};
        }
    }
    return($ra_results);
}

sub _map_attribute {
    my $attribute = shift;
    my %attribute_map = (
        'ids'             => 'id',
        'score'           => 'scores',
        'def_line'        => 'defline',
        'def_lines'       => 'defline',
        'def-line'        => 'defline',
        'def-lines'       => 'defline',
        'deflines'        => 'defline',
        'expect'          => 'expects',
        'identity'        => 'identities',
        'gap'             => 'gaps',
        'subject_lengths' => 'subject_length',
        'subject-length'  => 'subject_length',
        'subject-lengths' => 'subject_length',
        'subjectlength'   => 'subject_length',
        'subjectlengths'  => 'subject_length',
        'length'          => 'subject_length',
        'lengths'         => 'subject_length',
        'match_length'    => 'match_lengths',
        'matchlengths'    => 'match_lengths',
        'matchlength'     => 'match_lengths',
        'match-length'    => 'match_lengths',
        'match-lengths'   => 'match_lengths',
        'query-frame'     => 'query_frames',
        'queryframe'      => 'query_frames',
        'query-frames'    => 'query_frames',
        'queryframes'     => 'query_frames',
        'query_frame'     => 'query_frames',
        'subjectframe'    => 'subject_frames',
        'subject-frames'  => 'subject_frames',
        'subjectframes'   => 'subject_frames',
        'subject_frame'   => 'subject_frames',
    );
    $attribute = lc($attribute);
    $attribute = $attribute_map{$attribute} || $attribute;
    return($attribute);
}

sub _all_results {
    my $self = shift;
    return () unless defined($self) && $self->{results};
    return () unless @{$self->{results}};
    return @{$self->{results}};
}

sub _entry_passes_filter {
    my $entry = shift;
    my $ra_stripped_values = shift;
    my $ra_cmp_ops = shift;
    my $rh_criteria = shift;
    my @fields = keys %$rh_criteria;
    my ($i,$code,$pass,@codes,$ra_safe_vals);
    for($i=0; $i < @$ra_stripped_values; $i++) {
        if (ref($entry->{$fields[$i]}) eq "ARRAY") {
            my $n = 0;
            my (@sorted,$best_result);
            foreach (@{$entry->{$fields[$i]}}) {
                if
                    ($ra_cmp_ops->[$i] =~ /\~/) {
                    s/"/'/g; # cannot nest double quotes in regex
                    $code = '$pass' . " = 1 if (\"$_\" $ra_cmp_ops->[$i] ";
                    # escape any "/" not already escaped; the value is
                    # shared across entries, so this must be repeatable
                    $ra_stripped_values->[$i] =~ s{(?<!\\)/}{\\/}g;
                    $code .= "/$ra_stripped_values->[$i]/i)";
                } else {
                    $ra_safe_vals = _check_for_unsafe_e($entry->{$fields[$i]});
                    if ($ra_cmp_ops->[$i] =~ />/) {
                        @sorted = sort { $b <=> $a } @{$ra_safe_vals};
                    } else {
                        @sorted = sort { $a <=> $b } @{$ra_safe_vals};
                    }
                    $best_result = shift @sorted;
                    $code = '$pass' . qq! = 1 if ("$best_result" !;
                    $code .= qq!$ra_cmp_ops->[$i] '$ra_stripped_values->[$i]')!;
                }
                _set_passing_positions_flags ($entry, $ra_safe_vals,
                    $ra_cmp_ops->[$i], $ra_stripped_values->[$i]);
                push @codes, $code;
                $n++;
            }
            foreach (@codes) {
                $pass = 0;
                eval $_;
                carp "eval on $_ failed: $@" if $@;
            }
            return undef unless $pass;
        } else {
            if ($ra_cmp_ops->[$i] =~ /\~/) {
                # cannot nest double quotes in regex
                $entry->{$fields[$i]} =~ s/"/'/g;
                $code = '$pass' . qq! = 1 if ("$entry->{$fields[$i]}" !;
                $code .= "$ra_cmp_ops->[$i] ";
                # escape any "/" not already escaped (see above)
                $ra_stripped_values->[$i] =~ s{(?<!\\)/}{\\/}g;
                $code .= "/$ra_stripped_values->[$i]/i)\;";
            } else {
                $entry->{$fields[$i]} =~ s/^e-(\d*)$/1e-$1/;
                $code = '$pass' . qq!
 = 1 if ("$entry->{$fields[$i]}" !;
                $code .= "$ra_cmp_ops->[$i] ";
                $code .= qq!'$ra_stripped_values->[$i]')\;!;
            }
            $pass = 0;
            eval $code;
            return undef unless $pass;
            croak "eval on $code failed: $@" if $@;
        }
    }
    return $entry;
}

sub _set_passing_positions_flags {
    my $entry = shift;
    my $ra_safe_vals = shift;
    my $cmp_op = shift;
    my $cmp_val = shift;
    my $count = 0;
    my $code = '';
    my @array_positions = ();
    $entry->{passing_positions} = [];
    foreach (@{$ra_safe_vals}) {
        $code = qq^if ("$_" $cmp_op "$cmp_val") { ^;
        $code .= q^ $entry->{passing_positions}->[$count] = "pass"; ^;
        $code .= q^} else { ^;
        $code .= q^ $entry->{passing_positions}->[$count] = "fail"; }^;
        eval $code;
        carp "eval on $code failed: $@" if $@;
        $count++;
    }
    return;
}

###########################################################################
# sub _check_for_unsafe_e                                                 #
###########################################################################
# BLAST values that have an initial e (e.g. `e-122') do not sort
# numerically, so we need to put a 1 in front (e.g.
# `1e-122')
###########################################################################
sub _check_for_unsafe_e {
    my $ra_unsafe_vals = shift;
    my @safe_vals = ();
    foreach (@{$ra_unsafe_vals}) {
        $_ =~ s/^e-(\d*)$/1e-$1/;
        push @safe_vals,$_;
    }
    return \@safe_vals;
}

sub _get_stripped_values_and_comparison_operators {
    my $rh_criteria = shift;
    my ($key,$value,@stripped_values,@cmp_ops);
    my $count = 0;
    while (($key,$value) = each %$rh_criteria) {
        $key = lc $key;
        ($stripped_values[$count],$cmp_ops[$count]) =
            _check_for_comparison_operators($key,$value);
        $count++;
    }
    return(\@stripped_values,\@cmp_ops);
}

sub _check_for_comparison_operators {
    my $key = shift;
    my $value = shift;
    my ($cmp_op,$new_value);
    # `==' must be tried before `=' or it can never match
    if ($value =~ /^(<=|>=|=~|~|!~|>|<|==|=)(.*)/) {
        if ($1 eq '~') {
            $cmp_op = '=~';
        } elsif ($1 eq '=') {
            $cmp_op = '==';
        } else {
            $cmp_op = $1;
        }
        $new_value = $2;
    } else {
        $cmp_op = $DEFAULT_FILTERS{$key};
        $new_value = $value;
    }
    return($new_value,$cmp_op);
}

sub _parse_report {
    my $self = shift;
    my $report_file = $self->{report_file};
    $self->{been_parsed} = 1;
    $self->{results} = [];
    unless ($self->{report_file_handle}) {
        open REPORT, $report_file or carp "cannot open $report_file:$!";
        $self->{report_file_handle} = \*REPORT;
    }
    my $hit_new_seq = 0;
    my $hit_a_seq = 0;
    my $hit_first_seq_line = 0;
    my $hit_first_hsp = 0;
    my $hit_summaries = 0;
    my $hit_query = 0;
    my $hit_database = 0;
    my ($id,$defline,$subject_length,@scores,@expects,@identities,
        @match_lengths,@query_starts,@subject_starts,$strand,$q_strand,
        $last_end_point,$q_last_end_point,@subject_strands,
        @query_strands,$query,$query_length,$e,@query_seqs,
        @subject_seqs,$query_seq,$subject_seq,@query_frames,@subject_frames,
        @query_ends, @subject_ends);
    my $id_regex = $self->{-DB_ID_REGEX} || '[^\|]+(?:\|[^\|,\s]*){1,10}';
    $id_regex = "($id_regex)"; # add capturing parens

    while (<REPORT>) {
        $hit_summaries = 1 if (/^Sequences producing significant alignments/);
        $hit_a_seq = 1 if (/^>/);

        # Gather Query defline and Length
        if ($hit_query == 0
            && /^Query=\s+(.*)/) {
            $query = $1;
            $hit_query = 1;
        } elsif ($hit_query == 1) {
            if (/\(([0-9,]+) letters\)/) {
                $query_length = $1;
                $query_length =~ s/,//g;
                $hit_query = 0;
            } else {
                chomp;
                $query .= $_;
            }
        }

        # Gather Database information
        if (/^Database:\s+(.*)/) {
            $self->{database_description} = $1;
            $hit_database = 1;
        } elsif ($hit_database &&
                 /^\s+([\d,]+) sequences; ([\d,]+) total letters/) {
            $self->{database_sequence_count} = $1;
            $self->{database_letter_count} = $2;
            $hit_database = 0;
        } elsif ($hit_database && /^\s*(.*)$/) {
            $self->{database_description} .= $1;
        }

        if (!$self->{program} && $_ =~ /^(T?BLAST[XNP]) ([\d.]+)?/) {
            $self->{program} = $1;
            $self->{blast_version} = $2;
        }
        $self->{options}->{p} = lc($1) if (/^(T?BLAST[XNP]?)/);
        next unless ($hit_a_seq);

        if (/^>$id_regex\s+(.*)/) {
            if ($id) {
                # catches the starting point for the previous reverse strand
                push @query_starts,$q_last_end_point if ($q_strand eq 'minus');
                push @query_ends,$q_last_end_point if ($q_strand eq 'plus');
                push @subject_starts,$last_end_point if ($strand eq 'minus');
                push @subject_ends,$last_end_point if ($strand eq 'plus');
                push @subject_seqs,$subject_seq if ($subject_seq);
                push @query_seqs,$query_seq if ($query_seq);
                $self->_push_new_results($id,$defline,$subject_length,
                    $query,$query_length,
                    \@scores,\@expects,\@identities,\@match_lengths,
                    \@query_starts,\@query_ends,\@subject_starts,\@subject_ends,
                    \@subject_strands,\@query_strands,\@subject_seqs,
                    \@query_seqs,\@query_frames,\@subject_frames);
            }
            # initialize arrays for new sample
            @scores = ();
            @expects = ();
            @identities = ();
            @match_lengths = ();
            @query_starts = ();
            @query_ends = ();
            @subject_strands = ();
            @query_strands = ();
            @subject_starts = ();
            @subject_ends = ();
            $strand = '';
            $q_strand = '';
            $last_end_point = '';
            $q_last_end_point = '';
            $subject_seq = '';
            $query_seq = '';
            @subject_seqs = ();
            @query_seqs = ();
            @query_frames = ();
            @subject_frames = ();
            $hit_new_seq = 1;
            $id = $1;
            $defline = $2;
        } elsif (/\s+Length = (\d+)/) {
            $hit_new_seq = 0;
            $subject_length = $1;
        } elsif ($hit_new_seq == 1) {
            chomp();
            $defline .= $_;
        } elsif (/^\s*Score[^=]*=\s*([0-9\.-]*)[^E]*Expect[^=]*=\s*([0-9\.e-]*)/) {
            # catches the starting point for the previous reverse strand
            push @query_starts,$q_last_end_point if ($q_strand eq 'minus');
            push @query_ends,$q_last_end_point if ($q_strand eq 'plus');
            push @subject_starts,$last_end_point if ($strand eq 'minus');
            push @subject_ends,$last_end_point if ($strand eq 'plus');
            push @subject_seqs,$subject_seq if ($subject_seq);
            push @query_seqs,$query_seq if ($query_seq);
            $subject_seq = '';
            $query_seq = '';
            push @scores,$1;
            push @expects,$2;
            $hit_first_seq_line = 0;
        } elsif (/^\s*Identities[^=]*=\s*(\d+)\/(\d+)/) {
            my $identity = $1 / $2;
            push @identities,$identity;
            push @match_lengths,$2;
            $hit_first_seq_line = 0;
        } elsif (/^\s*Frame = ([-,+]\d)(?: \/ ([-,+]\d))?/) {
            push @query_frames, $1;
            push @subject_frames, $2 if ($2);
        } elsif (/Query:\s+(\d+)\s+(\S+)\s+(\d+)/) {
            if (($1 > $3) && $hit_first_seq_line == 0) {
                $q_strand = 'minus';
                push @query_ends,$1;
            } elsif (($1 < $3) && $hit_first_seq_line == 0) {
                $q_strand = 'plus';
                push @query_starts,$1;
            }
            push @query_strands,$q_strand if ($hit_first_seq_line == 0);
            $q_last_end_point = $3;
            $query_seq .= $2;
            $hit_first_seq_line++;
        } elsif (/^Sbjct:\s+(\d+)\s+(\S+)\s+(\d+)/) {
            if (($1 > $3) && $hit_first_seq_line == 1) {
                $strand = 'minus';
                push @subject_ends,$1;
            } elsif (($1 < $3) && $hit_first_seq_line == 1) {
                $strand = 'plus';
                push @subject_starts,$1;
            }
            push @subject_strands,$strand if ($hit_first_seq_line == 1);
            $last_end_point = $3;
            $subject_seq .= $2;
            $hit_first_seq_line++;
        }
    }
    # catch the last one.
    my $ra_safe_expects = _check_for_unsafe_e(\@expects);
    if ($id) {
        # catches the starting point for the previous reverse strand
        push @query_starts,$q_last_end_point if ($q_strand eq 'minus');
        push @query_ends,$q_last_end_point if ($q_strand eq 'plus');
        push @subject_starts,$last_end_point if ($strand eq 'minus');
        push @subject_ends,$last_end_point if ($strand eq 'plus');
        push @subject_seqs,$subject_seq if ($subject_seq);
        push @query_seqs,$query_seq if ($query_seq);
        $self->_push_new_results($id,$defline,$subject_length,
            $query,$query_length,\@scores,$ra_safe_expects,\@identities,
            \@match_lengths,\@query_starts,\@query_ends,\@subject_starts,
            \@subject_ends,\@subject_strands,\@query_strands,
            \@subject_seqs,\@query_seqs,\@query_frames,\@subject_frames);
    }
}

sub _push_new_results {
    my $self = shift;
    my ($id,$defline,$subject_length,$query,$query_length,$ra_scores,
        $ra_expects,$ra_identities,$ra_match_lengths,$ra_query_starts,
        $ra_query_ends,$ra_subject_starts,$ra_subject_ends,$ra_subject_strands,
        $ra_query_strands,$ra_subject_seqs,$ra_query_seqs,$ra_query_frames,
        $ra_subject_frames) = @_;
    my $rh_hash = {
        'id'              => $id,
        'defline'         => $defline,
        'subject_length'  => $subject_length,
        'query'           => $query,
        'query_length'    => $query_length,
        'scores'          => [ @{$ra_scores} ],
        'expects'         => [ @{$ra_expects} ],
        'identities'      => [ @{$ra_identities} ],
        'match_lengths'   => [ @{$ra_match_lengths} ],
        'query_starts'    => [ @{$ra_query_starts} ],
        'query_ends'      => [ @{$ra_query_ends} ],
        'subject_starts'  => [ @{$ra_subject_starts} ],
        'subject_ends'    => [ @{$ra_subject_ends} ],
        'subject_strands' => [ @{$ra_subject_strands} ],
        'query_strands'   => [ @{$ra_query_strands} ],
        'subject_seqs'    => [ @{$ra_subject_seqs} ],
        'query_seqs'      => [ @{$ra_query_seqs} ],
        'query_frames'    => [],
        'subject_frames'  => []
    };
    # only true in blastx or tblastx
    if ($ra_query_frames && $self->{options}->{p} =~ /blastx/i) {
        push @{$rh_hash->{query_frames}}, @{$ra_query_frames};
    }
    # only true in tblastn
    if ($ra_query_frames &&
        $self->{options}->{p} =~ /tblastn/i) {
        push @{$rh_hash->{subject_frames}}, @{$ra_query_frames};
    }
    # only true in tblastx
    if ($self->{options}->{p} =~ /tblastx/i) {
        push @{$rh_hash->{subject_frames}}, @{$ra_subject_frames};
    }
    push @{$self->{results}}, $rh_hash;
}

sub _get_args {
    my $rh_opts = shift;
    my $tmp_dir = _get_temp_directory();
    my ($key,$value,$outfile);
    my @args = ();
    my $tmp_file = "";
    while (($key,$value) = each %$rh_opts) {
        push @args, "-$key$value";
        $outfile = $value if ($key eq "o");
    }
    unless ($outfile) {
        my $base = sprintf("%s/%s-%d-%d-0", $tmp_dir, "Blastall", $$, time());
        $tmp_file = _generate_temp_file($base);
        push @args, "-o$tmp_file";
    }
    return(\@args,$tmp_file);
}

sub _get_formatdb_args {
    my $rh_opts = shift;
    my @args = ();
    foreach my $k (keys %{$rh_opts}) {
        push @args, "-${k}$rh_opts->{$k}";
    }
    return \@args;
}

sub _get_wu_args {
    my $rh_opts = shift;
    my ($key,$value);
    my $tmp_dir = _get_temp_directory();
    my $base_name = sprintf("%s/%s-%d-%d-0", $tmp_dir, "Blastall", $$, time());
    my $tmp_file = _generate_temp_file($base_name);
    my $program = $WU_BLAST{$rh_opts->{p}} || $WU_BLAST{wu_blastn};
    my $database = $rh_opts->{d} || 'nr';
    my $infile = $rh_opts->{i};
    my $outfile = $rh_opts->{o};
    my @args = ($program, $database, $infile);
    while (($key,$value) = each %{$rh_opts}) {
        next if ($key eq 'p' || $key eq 'd' || $key eq 'i' || $key eq 'o');
        if ($value eq '!' || $value eq '') {
            push @args, "-$key";
        } else {
            push @args, "-$key=$value";
        }
    }
    # $rh_opts->{o} ? push @args, "> $rh_opts->{o}" : push @args, "> $tmp_file";
    return(\@args,$tmp_file);
}

sub _get_temp_directory {
    return($ENV{TMPDIR}) if ($ENV{TMPDIR});
    return($ENV{TEMP}) if ($ENV{TEMP});
    return($ENV{TMP}) if ($ENV{TMP});
    return('/tmp') if (-d '/tmp' && -w '/tmp');
    return('/var/tmp') if (-d '/var/tmp' && -w '/var/tmp');
    return('/usr/tmp') if (-d '/usr/tmp' && -w '/usr/tmp');
    return('/temp') if (-d '/temp' && -w '/temp');
    return('.') if (-w '.');
    $_ = 'Cannot find a place to write tempfiles.
';
    $_ .= 'Please set the TMPDIR environmental variable. ';
    return undef;
}

sub _generate_temp_file {
    my $base_name = shift;
    my $count = 0;
    while ($count < 1000) {
        return ($base_name) unless (-e $base_name); # don't clobber existing
        $base_name =~ s/-?(\d+)?$/"-" . (1 + $1)/e;
        $count++;
    }
    return undef;
}

1;
__END__

=head1 NAME

NHGRI::Blastall - Perl extension for running and parsing NCBI's BLAST 2.x

=head1 SYNOPSIS

=head1 DESCRIPTION

If you have NCBI's BLAST2 or WU-BLAST installed locally and your
environment is already set up, you can use Perl's object-oriented
capabilities to run your BLASTs. If you also have a blastcl3 binary from
the toolkit (or binaries from our FTP site), you can run BLAST over the
network. There are also methods to blast single sequences against each
other using the bl2seq binaries (also in the toolkit and binaries). You
can blast one sequence against a library of sequences using the
blast_one_to_many method. You can format databases with the formatdb
method. You can also have NHGRI::Blastall read existing BLAST reports. If
you have a database of repetitive DNA or other DNA you would like to mask
out, you can use the mask method to mask the data against these
databases. You can then use either the filter or result methods to parse
the report and access the various elements of the data.

=over 4

=item RUNNING NEW BLASTS

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;

    # If you are running NCBI's Local BLAST
    $b->blastall( p => 'blastn',
                  d => 'nr',
                  i => 'infile',
                  o => 'outfile'
                );

    # If you are running NCBI's blastcl3 network client
    $b->blastcl3( p => 'blastn',
                  d => 'nr',
                  i => 'infile',
                  o => 'outfile'
                );

    # If you are running WU-BLAST locally
    $b->wu_blastall( p => 'blastn',
                     d => 'nr',
                     nogap => '!',   #use !
                                     # for arguments w/o parameter
                     i => 'infile',
                     o => 'outfile'
                   );

See BLASTALL for more info

=item BLASTING 2 SEQUENCES

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $b->bl2seq(i => 'seq1',
               j => 'seq2',
               p => 'tblastx'
              );

See BL2SEQ for more info

=item BLASTING 1 SEQUENCE AGAINST A FASTA LIBRARY OF SEQUENCES

    # a library is a FASTA file with multiple FASTA formatted sequences
    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $b->blast_one_to_many(i => 'seq1',
                          d => 'seq2.lib',
                          p => 'tblastx',
                         );

See BLAST_ONE_TO_MANY for more info

=item INITIALIZING EXISTING BLAST REPORTS

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $b->read_report('/path/to/report');

=item MASKING SEQUENCES

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $masked_seq = $b->mask( type => 'wu_blastall',
                            p    => 'blastn',
                            d    => 'alu',
                            i    => 'infile'
                          );

See MASKING for more info

=item CREATING BLAST INDEXES

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $b->formatdb( i => 'est',
                  p => 'F',
                  o => 'T',
                );

See FORMATDB for more info

=item PRINTING REPORTS

    $b->print_report();
    # this method only opens the report and prints. It does not print
    # summary reports

=item FILTERING BLAST RESULTS

    @hits = $b->filter( scores     => '38.2',
                        identities => '.98'
                      );
    # returns an array of hash references.

See HASHREF for more info on manipulating the results.
See FILTERING for more info on using the filter method

=item GETTING AT ELEMENTS

    @ids = $b->result('id');
    @scores = $b->result('scores',$ids[0]); # second param must be an id

See RESULT for more info on using the result method
See ELEMENTS for element names

=item GETTING AT ALL THE DATA

    @results = $b->result(); # returns an array of hashes

See HASHREF for information on the array of hashes that is returned.
See DUMP RESULTS to see how to work with the array of hashes

=item ADJUSTING THE DEFLINE REGEX

    $b = new NHGRI::Blastall (-DB_ID_REGEX => '[^ ]+');

See DB_ID_REGEX for more info

=back

=head1 BLASTALL

This method provides a simple object oriented frontend to BLAST.
This module works with either NCBI's blastall binary distributed with
BLAST 2.x, WU-BLAST, or over the network through NCBI's blastcl3 client.
The blastall function accepts as a parameter an anonymous hash with keys
that are the command line options (See BLASTALL OPTIONS) and values which
are the corresponding values to those options. You may want to set the
BLASTALL variable in Blastall.pm to the full path of your `blastall'
binary, especially if you will be running scripts as cron jobs or if
blastall is not in the system path.

=head1 BLASTALL OPTIONS

For wu_blastall you need to use NCBI type switches for the following:

    -i  for infile
    -o  for outfile
    -p  for program
    -d  for database

The rest of the parameters MUST be the parameters available through
WU-BLAST (e.g. -sump, -nogap, -compat1.4, etc.). Use a `!' to specify
that an argument has no parameters. See the example at the top of the
manpage.

These are the options that NCBI's blastall binary accepts, and the same
options are accepted by the blastall and blastcl3 methods. NOTE: You must
set the proper environmental variables for the blastall method to work
(BLASTDB, BLASTMAT).

=over 4

=item B<p> => Program Name

=item B<d> => Database default=nr

=item B<i> => QueryFile

=item B<e> => Expectation value (E) default=10.0

=item B<m> => alignment view default=0

    0 = pairwise,
    1 = master-slave showing identities,
    2 = master-slave no identities,
    3 = flat master-slave, show identities,
    4 = flat master-slave, no identities,
    5 = master-slave no identities and blunt ends,
    6 = flat master-slave, no identities and blunt ends

=item B<o> => BLAST report Output File default=stdout

=item B<F> => Filter query sequence default=T
(DUST with blastn, SEG with others)

=item B<G> => Cost to open a gap default=0
(zero invokes default behavior)

=item B<E> => Cost to extend a gap default=0
(zero invokes default behavior)

=item B<X> => X dropoff value for gapped alignment (in bits) default=0
(zero invokes default behavior)

=item B<I> => Show GI's in deflines default=F

=item B<q> => Penalty for a nucleotide mismatch (blastn only) default=-3

=item B<r> => Reward for a nucleotide match (blastn only) default=1

=item B<v> => Number of one-line descriptions (V) default=500

=item B<b> => Number of alignments to show (B) default=250

=item B<f> => Threshold for extending hits, default if zero default=0

=item B<g> => Perform gapped alignment (NA with tblastx) default=T

=item B<Q> => Query Genetic code to use default=1

=item B<D> => DB Genetic code (for tblast[nx] only) default=1

=item B<a> => Number of processors to use default=1

=item B<O> => SeqAlign file Optional

=item B<J> => Believe the query defline default=F

=item B<M> => Matrix default=BLOSUM62

=item B<W> => Word size, default if zero default=0

=item B<z> => Effective length of the database default=0
(use zero for the real size)

=item B<K> => Number of best hits from a region to keep default=100

=item B<L> => Length of region used to judge hits default=20

=item B<Y> => Effective length of the search space default=0
(use zero for the real size)

=item B<S> => Query strands to search against database default=3
(for blast[nx], and tblastx).
3 is both, 1 is top, 2 is bottom

=item B<T> => Produce HTML output [T/F] default=F

=item B<l> => Restrict search of database to list of GI's [String]

=back

NOTE: If you do not supply an `o' option (outfile), the following
environment variables are checked in order: `TMPDIR', `TEMP', and `TMP'.
If one of them is set, outfiles are created relative to the directory it
specifies. If none of them are set, the first of the following
directories that exists and is writable is used: /tmp, /var/tmp,
/usr/tmp, /temp, and finally the current directory. The temporary file is
deleted when the NHGRI::Blastall object is destroyed. It is recommended
that you create a tmp directory in your home directory, set one of the
above environmental variables to point to it, and set the permissions on
this directory to 0700. Writing to a "public" tmp directory can have
security ramifications.

=head1 BL2SEQ

This method uses the bl2seq binary (distributed with BLAST executables
and source) to BLAST one sequence against another sequence. The bl2seq
method accepts the same options that the bl2seq binary accepts. Run
bl2seq without options from the command line to get a full list of
options. An important note about the options: when running blastx the
1st sequence should be nucleotide; when running tblastn the 2nd sequence
should be nucleotide.

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $b->bl2seq(i => 'seq1.nt',
               j => 'seq2.aa',
               p => 'blastx'
              );

=head1 BLAST_ONE_TO_MANY

This method allows for blasting one sequence against a FASTA library of
sequences. Behind the scenes, BLAST indexes are created (in the same
directory as the FASTA library) using the provided FASTA library, and the
one sequence is used to search against this database. When the search
finishes, the indexes are removed. To compare two sequences, use the
bl2seq method, which is faster and less messy (no tmp indexes). This
method accepts the same options as the blastall binary, with the d option
corresponding to the FASTA library.

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $b->blast_one_to_many(i => 'seq.aa',
                          d => 'seq.nt.lib',
                          e => '0.001',
                          p => 'tblastn',
                         );

=head1 MASKING

Screens DNA sequences in FASTA format against the database specified in
the blastall 'd' option. The mask method accepts the same parameters as
the blastall method. Any matches to the masking database will be
substituted with "N"s. The mask method returns the masked sequence. It
performs a similar function to xblast, an old NCBI program written in C.
Set the type parameter to wu_blastall, blastcl3 or blastall depending on
your configuration.

    $masked_seq = $b->mask( type => 'blastcl3', # defaults to blastall
                            p    => 'blastn',
                            d    => 'alu',
                            i    => 'infile'
                          );

To get the mask coordinates back, call the mask method in an array
context.

    @mask = $b->mask(p => 'blastn',
                     d => 'alu',
                     i => 'infile'
                    );
    $masked_seq = $mask[0];       # same as above masked seq
    $ra_masked_coords = $mask[1]; # reference to array of mask coordinates

=head1 FORMATDB

This method creates BLAST indexes using the formatdb binary which is
distributed with BLAST. It accepts the same parameters as formatdb. The
remove_formatdb_indexes method will remove indexes created using the
formatdb method (if called by the same object). formatdb leaves a file
called formatdb.log by default in the current working directory (if it
has permission). To change this behavior, use the l option to direct the
log to /dev/null or somewhere else.

    use NHGRI::Blastall;
    my $b = new NHGRI::Blastall;
    $b->formatdb( i => 'swissprot',
                  p => 'T',
                  l => '/dev/null',
                  o => 'T',
                );

=head1 DB_ID_REGEX

By default Blastall.pm expects FASTA deflines of BLAST databases to be
formatted like GenBank databases
(gi|GINUMBER|DATABASE|ACCESSION|SUMMARY). The default regular expression
is

    [^\|]+(?:\|[^\|,\s]*){1,10}

When using non-GenBank-formatted deflines, it may become necessary to
adjust the regular expression that identifies the unique identifier in a
defline.
This can be done with the -DB_ID_REGEX parameter to the new method.
For example

    $b = new NHGRI::Blastall( -DB_ID_REGEX => '[^ ]+' );

=head1 FILTERING

The filter method accepts an anonymous hash in which the keys are
elements of the blast report and the values are limits that are put on
the result set. The following are the filter elements and their default
operations.

    id             => regular expression match
    defline        => regular expression match
    subject_length => greater than
    scores         => greater than
    expects        => less than
    identities     => greater than
    match_length   => greater than
    subject_strand => equals
    query_frames   => equals
    subject_frames => equals

So if you would like to limit your results to entries that have scores
greater than 38.2 and identities greater than 98% you would say...

    @hits = $b->filter( scores     => '38.2',
                        identities => '.98'
                      );

You can also override the defaults. If you would like only scores that
are less than 38.2 you could say...

    @hits = $b->filter( scores => '<38.2' );

or if you wanted only identities that were equal to 1 and you didn't care
about the hits array you could say...

    $b->filter( identities => '=1' );

Regular expression matches are case insensitive. If you wanted only
records with the word "human" in the defline you could say...

    @hits = $b->filter( defline => 'HuMaN' );

After you run the filter method on an object, the object only contains
those results which passed the filter. This will affect additional calls
to the filter method as well as calls to other methods (e.g. result). To
reset the NHGRI::Blastall object you can use the unfilter method.

    $b->unfilter();

See DUMP RESULTS for info on how to manipulate the array of hash refs.

=head1 RESULT

The result method has 3 possible invocations. The first invocation is
when it is called without parameters.

    @results = $b->result();

This invocation returns an array of hash references. See HASHREF for
further explanation of this structure.

To get a list of all the ids do...
    @ids = $b->result('id');

These ids can be used to get at specific elements. If 2 parameters are
present and the first one is an element (see ELEMENTS for a list) and
the second one is an id, then the routine returns a list of elements
corresponding to the id.

    @scores = $b->result('scores', $ids[0]);  # second param must be an id

If more than 2 parameters are passed, the function returns undef.

=head1 ACCESSOR METHODS

=over 4

=item B

Returns the filename of the BLAST report.

=item B

Returns the description given to the database when it was formatted,
e.g. All non-redundant GenBank CDStranslations+PDB+SwissProt+PIR

=item B

Returns the number of sequences in the database.

=item B

Returns the total number of letters in the database.

=item B

Returns the BLAST program name that appears at the top of the report:
either BLASTN, BLASTP, BLASTX, TBLASTN or TBLASTX.

=item B

Returns the version of the BLAST program that was used.

=back

=head1 ELEMENTS

=over 4

=item B<id>

An example of an id is `>gb|U19386|MMU19386'. The initial `>' is just a
flag. The characters up to the first pipe name the database the subject
was taken from. The characters up to the next pipe are the GenBank
accession number. The last characters are the locus. This element is
used as a unique identifier by the NHGRI::Blastall module. (SCALAR)

=item B<defline>

The definition line taken from the subject. (SCALAR)

=item B<subject_length>

This is the length of the full subject, not the length of the match.
(SCALAR)

=item B<scores>

This is the score (in bits) of the match. (ARRAY)

=item B<expects>

This is the statistical significance (`E value') for the match. (ARRAY)

=item B<identities>

This is the number of identities divided by the match length, in
decimal format. (Listed as a fraction and a percentage in a BLAST
report.) (ARRAY)

=item B<match_lengths>

This is the number of base pairs that match up. (ARRAY)

=item B<subject_starts>

This is the number of the first base which matched with the subject.
(ARRAY)

=item B<subject_ends>

This is the number of the last base which matched with the subject.
(ARRAY)

=item B<query_starts>

This is the number of the first base which matched with the query.
(ARRAY)

=item B<query_ends>

This is the number of the last base which matched with the query.
(ARRAY)

=item B<subject_strands>

This is either plus or minus depending on the orientation of the
subject sequence in the match. (ARRAY)

=item B<query_strands>

This is either plus or minus depending on the orientation of the query
sequence in the match. (ARRAY)

=item B<query_frames>

If you are running a blastx or tblastx search, in which the query
sequence is translated, this is the frame in which the query sequence
matched. (ARRAY)

=item B<subject_frames>

If you are running a tblastn or tblastx search, in which the subject
sequence is translated, this is the frame in which the subject sequence
matched. (ARRAY)

=back

=head1 HASHREF

Each hash ref contains an id, a defline and a subject_length. Because
there can be multiple scores, expect values, identities, match_lengths,
query_starts, query_strands and subject_starts, these are stored as
array references. The following is an array containing two hash refs.
    @hits = (
        {'id'              => '>gb|U79716|HSU79716',
         'defline'         => 'Human reelin (RELN) mRNA, complete cds',
         'subject_length'  => '11580',
         'scores'          => [ 684, 123 ],
         'expects'         => [ 0.0, 3e-26 ],
         'identities'      => [ .99430199, .89256198 ],
         'match_lengths'   => [ 351, 121 ],
         'query_starts'    => [ 3, 404 ],
         'query_ends'      => [ 303, 704 ],
         'subject_starts'  => [ 5858, 6259 ],
         'subject_ends'    => [ 6158, 6559 ],
         'subject_strands' => [ 'plus', 'minus' ],
         'query_strands'   => [ 'plus', 'plus' ],
         'query_frames'    => [ '+1', '-3' ],
         'subject_frames'  => [ '+2', '-1' ],
        },
        {'id'              => '>gb|U24703|MMU24703',
         'defline'         => 'Mus musculus reelin mRNA, complete cds',
         'subject_length'  => '11673',
         'scores'          => [ 319, 38.2 ],
         'expects'         => [ 2e-85, 1.2 ],
         'identities'      => [ .86455331, 1 ],
         'match_lengths'   => [ 347, 19 ],
         'query_starts'    => [ 3, 493 ],
         'query_ends'      => [ 303, 793 ],
         'subject_starts'  => [ 5968, 6457 ],
         'subject_ends'    => [ 6268, 6757 ],
         'subject_strands' => [ 'plus', 'minus' ],
         'query_strands'   => [ 'plus', 'plus' ],
         'query_frames'    => [ '+3', '-3' ],
         'subject_frames'  => [ '+1', '-2' ],
        }
    );

See ELEMENTS for an explanation of each element. See DUMP RESULTS
and/or the perlref(1) manpage for clues on working with this structure.

=head1 DUMP RESULTS

When the result function is called with no parameters, or the filter
function is called, an array of references to hashes is returned. Each
element of the array is a reference to a hash containing 1 record. See
HASHREF for details on this structure. The following routine will go
through each element of the array of hashes and print out each element
and its corresponding value or values. See perlref(1) for more info on
references.
    sub dump_results {
        # @results holds the hash refs returned by result() or filter()
        foreach my $rh_r (@results) {
            while (my ($key, $value) = each %$rh_r) {
                if (ref($value) eq "ARRAY") {
                    # array elements (scores, expects, ...): print all values
                    print "$key: ";
                    foreach my $v (@$value) {
                        print "$v ";
                    }
                    print "\n";
                } else {
                    # scalar elements (id, defline, subject_length)
                    print "$key: $value\n";
                }
            }
        }
    }

=head1 AUTHOR

=over 4

=item Joseph Ryan (jfryan@nhgri.nih.gov)

=back

=head1 CONTACT ADDRESS

If you have problems, questions, comments send to
webblaster@nhgri.nih.gov

=head1 COPYRIGHT INFORMATION

This software/database is "United States Government Work" under the
terms of the United States Copyright Act. It was written as part of the
authors' official duties for the United States Government and thus
cannot be copyrighted. This software/database is freely available to
the public for use without a copyright notice. Restrictions cannot be
placed on its present or future use.

Although all reasonable efforts have been taken to ensure the accuracy
and reliability of the software and data, the National Human Genome
Research Institute (NHGRI) and the U.S. Government does not and cannot
warrant the performance or results that may be obtained by using this
software or data. NHGRI and the U.S. Government disclaims all
warranties as to performance, merchantability or fitness for any
particular purpose.

In any work or product derived from this material, proper attribution
of the authors as the source of the software or data should be made,
using http://genome.nhgri.nih.gov/blastall as the citation.

=head1 ENVIRONMENT VARIABLES

=over 4

=item B<BLASTDB>

location of BLAST formatted databases

=item B

location of BLAST matrices

=item B B B

If the `o' option is not passed to the blastall method, then
NHGRI::Blastall looks for one of these vars (in order) to decide where
to store the BLAST report. This report is destroyed after the
NHGRI::Blastall.pm object is destroyed.
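The ordered lookup described above can be sketched as follows. This is only an illustration: the helper name first_writable_dir is hypothetical, the variable names TMPDIR, TEMP and TMP are placeholders because the actual names are not given in this text, and the writability check is borrowed from the note in the Changes file about temporary directories. The module's own logic may differ.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of NHGRI::Blastall): returns the value of
# the first variable in @names that is set in %$env and names a writable
# directory, or undef if none qualifies.
sub first_writable_dir {
    my ($env, @names) = @_;
    for my $name (@names) {
        my $dir = $env->{$name};
        next unless defined $dir && length $dir;
        return $dir if -d $dir && -w $dir;
    }
    return;
}

# Placeholder variable names; substitute the ones your installation uses.
my $dir = first_writable_dir(\%ENV, qw(TMPDIR TEMP TMP));
print defined $dir ? "reports would go to $dir\n"
                   : "no usable temporary directory found\n";
```

Passing the environment as a hash reference keeps the helper easy to try with made-up values instead of the real %ENV.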
=back

=head1 SEE ALSO

L L

http://www.ncbi.nlm.nih.gov/BLAST/newblast.html

ftp://ncbi.nlm.nih.gov/blast/db/README

http://www.ncbi.nlm.nih.gov/BLAST/tutorial/Altschul-1.html

=cut

NHGRI-Blastall-0.66/Changes

VERSION 0.66
---------------------------------------------------------------------
fixed a section of code in the masking logic which would cause a
warning if the user got no hits to their masking database.

VERSION 0.65
---------------------------------------------------------------------
added the ability to get the coordinates back from a mask. usually mask
is called in a scalar context, but if you call it in an array context
you will receive a masked sequence and a reference to an array of mask
coordinates.

VERSION 0.64 (never released)
---------------------------------------------------------------------
The version .63 fix fixed a masking problem but introduced a bug in
blasting which would delete all BLAST reports after BLASTing. (version
.63 was never released). This version re-fixes the masking problem
without breaking anything else.

VERSION 0.63 (never released)
---------------------------------------------------------------------
fixed problem with masking in which masking would fail if you provided
the -o opt
fixed typo in pod documentation.

VERSION 0.61
---------------------------------------------------------------------
fixed pod which caused make to gripe
fixed reference in README

VERSION 0.60
---------------------------------------------------------------------
added bl2seq method which BLASTs 2 sequences against each other
added blast_one_to_many method which BLASTs 1 sequence against a
library of sequences.
added formatdb method which creates BLASTable databases
added remove_formatdb_indexes method which removes the BLASTable
databases made by the formatdb method.
cleaned up some of the code
fixed a couple methods so they would work with other methods besides
blastall
updated documentation

VERSION 0.56
---------------------------------------------------------------------
added get_report method which returns the BLAST report file.

VERSION 0.55
---------------------------------------------------------------------
fixes a bug which would turn some 0.0 expect values to 1000 if they did
not appear in the summaries. This was to correct an old bug in NCBI
BLAST but seems to not exist any longer.

VERSION 0.54
---------------------------------------------------------------------
added the blastcl3 method which uses the blastcl3 binary to make
network BLASTs. This became necessary when NCBI changed their web
interface (QBLAST) and broke my LWP routines.

VERSION 0.53
---------------------------------------------------------------------
fixed bug in wu-blast option processing.

VERSION 0.52
---------------------------------------------------------------------
Added check to make sure temporary directories are writable.

VERSION 0.51
---------------------------------------------------------------------
Added query_ends and subject_ends to the data parsed from the BLAST
report instead of relying on query_starts and match_lengths to get the
end.

VERSION 0.50
---------------------------------------------------------------------
fixed a number of bugs.
Added the DB_ID_REGEX
added some accessor methods to get the database description, number of
sequences in the database, number of letters in the database, BLAST
program and program version.

VERSION 0.44
---------------------------------------------------------------------
Blastall now parses and stores "Frames" from blastx, tblastn and
tblastx runs.
Updated documentation.

VERSION 0.43
---------------------------------------------------------------------
fixed some misleading documentation

VERSION 0.42
----------------------------------------------------------------------
methods that required a hash_ref (wu-blastall, blastall, net-blastall
and mask) will now accept a hash. so you can either say...

    blastall({'p'=>'blastn', 'd'=>'nr', 'i'=>'infile'}); #with ref
    blastall('p'=>'blastn', 'd'=>'nr', 'i'=>'infile');   #without ref

Fixed the test suite so it tests for LWP modules before trying to run
net-blastall. If it doesn't find LWP modules it looks for the $BLASTDB
environmental variable. If your BLASTDB environmental variable =~ /wu/
then it will run wu-blastall. If the BLASTDB env var does exist but
does not have the 'wu' pattern it tries to run NCBI's BLAST via the
blastall method. If all else fails it gives you a warning and fails the
test.

VERSION 0.41
----------------------------------------------------------------------
Filter method used to remove sub entries but now does not. Now if a
query sequence has one significant hit to a subject sequence, all hits
to that sequence are shown. There is a bug (feature?). If one of the
hits passes on Identity but not E-value and another hit does the
opposite, all hits are reported. Haven't decided what to do about this.

NEW IN VERSION 0.40
----------------------------------------------------------------------
* support for WU BLAST
* support for WWW BLAST through NCBI
  HOWEVER: NCBI has recently changed their web protocol to QBLAST and
  this may present future problems.
* added sample scripts in scripts directory for WU, WWW and NCBI BLASTs
* documentation may be slightly behind. Sorry. Working on this...

NHGRI-Blastall-0.66/INSTALL

To install this module, cd to the directory that contains this INSTALL
file and type the following:

    perl Makefile.PL
    make
    make test
    make install

----------------------------------------------------------------------------
NHGRI::Blastall.pm expects that you have NCBI's blastall binary,
wu-blast binaries, or NCBI's blastcl3 binary installed. If you do not
have blastall you can download the NCBI binary for several platforms
from ftp://ftp.ncbi.nlm.nih.gov/blast/executables/

If you do not have the blastcl3 binary you can either compile the NCBI
toolkit (ftp://ftp.ncbi.nlm.nih.gov/toolbox/) or download the Entrez
binaries from (ftp://ncbi.nlm.nih.gov/entrez/CURRENT/).

WU-BLAST is available from (http://blast.wustl.edu/).

----------------------------------------------------------------------------
Certain methods in NHGRI::Blastall expect to find binaries in your path.

    METHOD        EXPECTED BINARIES
    ----------    ---------------------
    blastall()    blastall
    blastcl3()    blastcl3
    wu_blastall   wu-blastn, wu-blastp, wu-blastx, wu-tblastn, wu-tblastx

----------------------------------------------------------------------------
If you are using the blastall or wu_blastall methods you need to make
sure the BLASTDB environmental variable is set. If you stored your
BLAST formatted databases in /usr/local/ncbi/blast/db the command under
bash, sh, or ksh would be...

    BLASTDB=/usr/local/ncbi/blast/db; export BLASTDB;

under csh or tcsh it would be

    setenv BLASTDB /usr/local/ncbi/blast/db

Also make sure you have a .ncbirc file in your home directory or in
$NCBI/.ncbirc.

---------------------------------------------------------------------------
If you are having problems using the blastall method you may want to
test BLAST from the command line.

    cd NHGRI-Blastall-0.xx
    blastall -d vector -p blastn -i t/fasta.seq

If this fails then the problem is probably one of the following:

a.
you don't have the vector database installed
b. you don't have the blastall binary in your path
c. you don't have NCBI's BLAST2.xx installed on your machine
d. the user that you are running the test under does not have the
   correct environment settings (see above)

------------------
If you have trouble installing NHGRI::Blastall.pm because you have
insufficient access privileges to add to the perl library directory,
you can still use NHGRI::Blastall.pm. You would want to do something
like:

    perl Makefile.PL LIB=~/lib

See perldoc ExtUtils::MakeMaker for more details on this.

If you have problems send mail to: webblaster@nhgri.nih.gov

Read the POD docs in Blastall.pm

try perldoc NHGRI::Blastall

NHGRI-Blastall-0.66/Makefile.PL

use ExtUtils::MakeMaker;
# See lib/ExtUtils/MakeMaker.pm for details of how to influence
# the contents of the Makefile that is written.
WriteMakefile(
    'NAME'         => 'NHGRI::Blastall',
    'VERSION_FROM' => 'Blastall.pm', # finds $VERSION
    'dist'         => { COMPRESS => 'gzip -9f', SUFFIX => 'gz', }
);