Grinder-0.5.3/ 0000755 0001750 0001750 00000000000 12151575606 013341 5 ustar floflooo floflooo Grinder-0.5.3/galaxy/ 0000755 0001750 0001750 00000000000 12151575606 014626 5 ustar floflooo floflooo Grinder-0.5.3/galaxy/Galaxy_readme.txt 0000644 0001750 0001750 00000000440 11652114025 020114 0 ustar floflooo floflooo This is an XML wrapper that provides a GUI for Grinder in Galaxy (http://galaxy.psu.edu/).
Place these files in your Galaxy directory. More information at http://wiki.g2.bx.psu.edu/FrontPage.
Note: The Grinder wrapper uses Galaxy builtin datasets located in the 'all_fasta' data table.
Grinder-0.5.3/galaxy/grinder.xml 0000644 0001750 0001750 00000074617 12052630705 017011 0 ustar floflooo floflooo
versatile omic shotgun and amplicon read simulator
grinder
grinder --version
stderr_wrapper.py
grinder
#if $reference_file.specify == "builtin":
-reference_file ${ filter( lambda x: str( x[0] ) == str( $reference_file.value ), $__app__.tool_data_tables[ 'all_fasta' ].get_fields() )[0][-1] }
#else if $reference_file.specify == "uploaded":
-reference_file $reference_file.value
#end if
#if str($coverage_fold):
-coverage_fold $coverage_fold
#end if
#if str($total_reads):
-total_reads $total_reads
#end if
#if str($read_dist):
-read_dist $read_dist
#end if
#if str($insert_dist):
-insert_dist $insert_dist
#end if
#if str($mate_orientation):
-mate_orientation $mate_orientation
#end if
#if str($exclude_chars):
-exclude_chars $exclude_chars
#end if
#if str($delete_chars):
-delete_chars $delete_chars
#end if
#if str($forward_reverse) != "None":
-forward_reverse $forward_reverse
#end if
#if str($unidirectional):
-unidirectional $unidirectional
#end if
#if str($length_bias):
-length_bias $length_bias
#end if
#if str($copy_bias):
-copy_bias $copy_bias
#end if
#if str($mutation_dist):
-mutation_dist $mutation_dist
#end if
#if str($mutation_ratio):
-mutation_ratio $mutation_ratio
#end if
#if str($homopolymer_dist):
-homopolymer_dist $homopolymer_dist
#end if
#if str($chimera_perc):
-chimera_perc $chimera_perc
#end if
#if str($chimera_dist):
-chimera_dist $chimera_dist
#end if
#if str($chimera_kmer):
-chimera_kmer $chimera_kmer
#end if
#if str($abundance_file) != "None":
-abundance_file $abundance_file
#end if
#if str($abundance_model):
-abundance_model $abundance_model
#end if
#if str($num_libraries):
-num_libraries $num_libraries
#end if
#if str($multiplex_ids) != "None":
-multiplex_ids $multiplex_ids
#end if
#if str($diversity):
-diversity $diversity
#end if
#if str($shared_perc):
-shared_perc $shared_perc
#end if
#if str($permuted_perc):
-permuted_perc $permuted_perc
#end if
#if str($random_seed):
-random_seed $random_seed
#end if
#if str($desc_track):
-desc_track $desc_track
#end if
#if str($qual_levels):
-qual_levels $qual_levels
#end if
#if str($fastq_output) == '1':
-fastq_output $fastq_output
#end if
#if str($profile_file) != "None":
-profile_file $profile_file.value
#end if
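The template above appends each option to the command line only when the corresponding Galaxy parameter is set. The same pattern can be sketched in plain Python. The helper name `build_grinder_args`, the dict of parameters and the file name `genomes.fna` are illustrative assumptions, not part of the wrapper:

```python
def build_grinder_args(params):
    """Build a Grinder argument list, skipping options that are unset.

    `params` is a hypothetical dict standing in for the Galaxy form values.
    Empty strings and the string "None" are treated as "not provided",
    mirroring the `#if str(...)` tests in the template.
    """
    args = ["grinder"]
    for name, value in params.items():
        text = str(value)
        if text and text != "None":
            args.extend(["-" + name, text])
    return args


example = build_grinder_args({
    "reference_file": "genomes.fna",  # hypothetical input file name
    "total_reads": 100,
    "coverage_fold": "",              # unset: skipped
    "abundance_file": None,           # unset: skipped
})
print(" ".join(example))
```

Leaving unset options off the command line lets Grinder fall back to its own defaults instead of receiving empty values.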
int(str(num_libraries)) == 1
int(str(num_libraries)) == 1 and fastq_output == 0
int(str(num_libraries)) == 1 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) == 1 and fastq_output == 1
int(str(num_libraries)) >= 2
int(str(num_libraries)) >= 2 and fastq_output == 0
int(str(num_libraries)) >= 2 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 2 and fastq_output == 1
int(str(num_libraries)) >= 2
int(str(num_libraries)) >= 2 and fastq_output == 0
int(str(num_libraries)) >= 2 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 2 and fastq_output == 1
int(str(num_libraries)) >= 3
int(str(num_libraries)) >= 3 and fastq_output == 0
int(str(num_libraries)) >= 3 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 3 and fastq_output == 1
int(str(num_libraries)) >= 4
int(str(num_libraries)) >= 4 and fastq_output == 0
int(str(num_libraries)) >= 4 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 4 and fastq_output == 1
int(str(num_libraries)) >= 5
int(str(num_libraries)) >= 5 and fastq_output == 0
int(str(num_libraries)) >= 5 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 5 and fastq_output == 1
int(str(num_libraries)) >= 6
int(str(num_libraries)) >= 6 and fastq_output == 0
int(str(num_libraries)) >= 6 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 6 and fastq_output == 1
int(str(num_libraries)) >= 7
int(str(num_libraries)) >= 7 and fastq_output == 0
int(str(num_libraries)) >= 7 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 7 and fastq_output == 1
int(str(num_libraries)) >= 8
int(str(num_libraries)) >= 8 and fastq_output == 0
int(str(num_libraries)) >= 8 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 8 and fastq_output == 1
int(str(num_libraries)) >= 9
int(str(num_libraries)) >= 9 and fastq_output == 0
int(str(num_libraries)) >= 9 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 9 and fastq_output == 1
int(str(num_libraries)) >= 10
int(str(num_libraries)) >= 10 and fastq_output == 0
int(str(num_libraries)) >= 10 and str(qual_levels) and fastq_output == 0
int(str(num_libraries)) >= 10 and fastq_output == 1
**What it does**
Grinder is a program to create random shotgun and amplicon sequence libraries
based on reference sequences in a FASTA file. Features include:
* omic support: genomic, metagenomic, transcriptomic, metatranscriptomic,
proteomic and metaproteomic
* shotgun library or amplicon library
* arbitrary read length distribution and number of reads
* simulation of PCR and sequencing errors (chimeras, point mutations, homopolymers)
* support for creating paired-end (mate pair) datasets
* specific rank-abundance settings or manually given abundance for each genome
* creation of datasets with a given richness (alpha diversity)
* independent datasets can share a variable number of genomes (beta diversity)
* modeling of the bias created by varying genome lengths or gene copy number
* profile mechanism to store preferred options
* API to automate the creation of a large number of simulated datasets
**Input**
A variety of FASTA databases containing genes or genomes can be used as input
for Grinder, such as the NCBI RefSeq collection (ftp://ftp.ncbi.nih.gov/refseq/release/microbial/),
the GreenGenes 16S rRNA database (http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Isolated_named_strains_16S_aligned.fasta), the human genome and transcriptome (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/, ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz), ...
These input files can either be provided as a Galaxy dataset, or can be uploaded
by Galaxy users in their history.
**Output**
For each library requested, a first file contains the abundance of the species
in the simulated community created, e.g.::
# rank seqID rel. abundance
1 86715_Lachnospiraceae 0.367936925098555
2 6439_Neisseria_polysaccharea 0.183968462549277
3 103712_Fusobacterium_nucleatum 0.122645641699518
4 103024_Frigoribacterium 0.0919842312746386
5 129066_Streptococcus_pyogenes 0.0735873850197109
6 106485_Pseudomonas_aeruginosa 0.0613228208497591
7 13824_Veillonella_criceti 0.0525624178712221
8 28044_Lactosphaera 0.0459921156373193
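As a rough illustration of how such a rank file could be read back, here is a minimal Python sketch. It assumes whitespace-separated columns exactly as in the listing above. The function name `parse_rank_file` is hypothetical and not part of Grinder:

```python
def parse_rank_file(lines):
    """Return a list of (rank, seq_id, abundance) tuples.

    Assumes three whitespace-separated columns as in the example listing
    and skips blank lines and the '#' header line.
    """
    records = []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        rank, seq_id, abundance = line.split()
        records.append((int(rank), seq_id, float(abundance)))
    return records


sample = """\
# rank seqID rel. abundance
1 86715_Lachnospiraceae 0.367936925098555
2 6439_Neisseria_polysaccharea 0.183968462549277
"""
records = parse_rank_file(sample.splitlines())
for rank, seq_id, abundance in records:
    print(rank, seq_id, abundance)
```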
The second file is a FASTA file containing shotgun or amplicon reads, e.g.::
>1 reference=13824_Veillonella_criceti position=89-1088 strand=+
ACCAACCTGCCCTTCAGAGGGGGATAACAACGGGAAACCGTTGCTAATACCGCGTACGAA
TGGACTTCGGCATCGGAGTTCATTGAAAGGTGGCCTCTATTTATAAGCTATCGCTGAAGG
AGGGGGTTGCGTCTGATTAGCTAGTTGGAGGGGTAATGGCCCACCAAGGCAA
>2 reference=103712_Fusobacterium_nucleatum position=2-1001 strand=+
TGAACGAAGAGTTTGATCCTGGCTCAGGATGAACGCTGACAGAATGCTTAACACATGCAA
GTCAACTTGAATTTGGGTTTTTAACTTAGGTTTGGG
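The description line of each read carries `key=value` fields that record where the read came from. The sketch below shows one way to pull them apart in Python. It handles only the header layout shown above (later Grinder versions report coordinates differently, per the revision history), and the helper name `parse_read_header` is an assumption made for illustration:

```python
def parse_read_header(header):
    """Split a read header into its read ID and a dict of key=value fields.

    Assumes fields are space-separated and values contain no spaces,
    which holds for the example above but is not guaranteed in general.
    """
    read_id, _, rest = header.lstrip(">").partition(" ")
    fields = dict(item.split("=", 1) for item in rest.split() if "=" in item)
    return read_id, fields


read_id, fields = parse_read_header(
    ">1 reference=13824_Veillonella_criceti position=89-1088 strand=+"
)
# Coordinates are inclusive, so the span covers end - start + 1 positions.
start, end = (int(x) for x in fields["position"].split("-"))
print(read_id, fields["reference"], end - start + 1)
```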
If you specify the quality score levels option, a third file representing the
quality scores of the reads is created::
>1 reference=103712_Fusobacterium_nucleatum position=2-1001 strand=+
30 30 30 10 30 30 ...
Grinder-0.5.3/galaxy/all_fasta.loc.sample 0000644 0001750 0001750 00000001523 11642517353 020533 0 ustar floflooo floflooo #This file lists the locations and dbkeys of all the fasta files
#under the "genome" directory (a directory that contains a directory
#for each build). The script extract_fasta.py will generate the file
#all_fasta.loc.
#IMPORTANT: EACH LINE OF THIS FILE HAS TO BE TAB-DELIMITED!
#
#
#
#So, all_fasta.loc could look something like this:
#
#ncbi_refseq_complete_viruses ncbi_refseq_complete_viruses RefSeq complete viruses /path/to/ncbi_refseq_complete_viruses.fna
#ncbi_refseq_complete_microbes ncbi_refseq_complete_microbes RefSeq complete microbes /path/to/ncbi_refseq_complete_microbes.fna
#homo_sapiens_GRCh37 homo_sapiens_GRCh37 Homo sapiens genome /path/to/Homo_sapiens_GRCh37_reference.fna
#gg_named_16S gg_named_16S GreenGenes named 16S strains /path/to/Isolated_named_strains_16S.fna
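To make the tab-delimited layout concrete, the following Python sketch parses uncommented lines of such a file into records and looks up a path by its value, roughly as the wrapper's builtin-reference lookup does. The column names follow the data table definition (value, dbkey, name, path). The function name `parse_loc` is hypothetical:

```python
COLUMNS = ("value", "dbkey", "name", "path")


def parse_loc(text):
    """Parse tab-delimited .loc content into a list of dicts.

    Comment lines (starting with '#') and blank lines are ignored.
    A line without exactly four tab-separated fields raises ValueError,
    which is the usual symptom of spaces being used instead of tabs.
    """
    rows = []
    for raw in text.splitlines():
        if not raw.strip() or raw.startswith("#"):
            continue
        fields = raw.split("\t")
        if len(fields) != len(COLUMNS):
            raise ValueError("expected %d tab-delimited fields: %r"
                             % (len(COLUMNS), raw))
        rows.append(dict(zip(COLUMNS, fields)))
    return rows


sample = (
    "#comment lines are skipped\n"
    "gg_named_16S\tgg_named_16S\tGreenGenes named 16S strains"
    "\t/path/to/Isolated_named_strains_16S.fna\n"
)
rows = parse_loc(sample)
# Find the FASTA path for a given value, as the wrapper does for builtin data.
path = [row["path"] for row in rows if row["value"] == "gg_named_16S"][0]
print(path)
```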
Grinder-0.5.3/galaxy/stderr_wrapper.py 0000755 0001750 0001750 00000003222 11635507757 020255 0 ustar floflooo floflooo #!/usr/bin/env python
"""
Wrapper that executes a program with its arguments but reports standard error
messages only if the program exit status was not 0. This is useful to prevent
Galaxy from interpreting any output on stderr as an error,
e.g. if this was simply a warning.
Example: ./stderr_wrapper.py myprog arg1 -f arg2
Author: Florent Angly
"""
import sys, subprocess

assert sys.version_info[:2] >= ( 2, 4 )

def stop_err( msg ):
    sys.stderr.write( "%s\n" % msg )
    sys.exit()

def __main__():
    # Get command-line arguments
    args = sys.argv
    # Remove name of calling program, i.e. ./stderr_wrapper.py
    args.pop(0)
    # If there are no arguments left, we're done
    if len(args) == 0:
        return
    # If one needs to silence stdout
    #args.append( ">" )
    #args.append( "/dev/null" )
    #cmdline = " ".join(args)
    #print cmdline
    try:
        # Run program
        proc = subprocess.Popen( args=args, shell=False, stderr=subprocess.PIPE )
        returncode = proc.wait()
        # Capture stderr, allowing for case where it's very large
        stderr = ''
        buffsize = 1048576
        try:
            while True:
                stderr += proc.stderr.read( buffsize )
                if not stderr or len( stderr ) % buffsize != 0:
                    break
        except OverflowError:
            pass
        # Running Grinder failed: write error message to stderr
        if returncode != 0:
            raise Exception, stderr
    except Exception, e:
        # Running Grinder failed: write error message to stderr
        stop_err( 'Error: ' + str( e ) )

if __name__ == "__main__": __main__()
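For comparison, here is a hedged Python 3 sketch of the same idea: run a program, keep its stderr quiet on success, and report it only when the exit status is non-zero. It is not the wrapper shipped with Grinder. The names `run_quietly` and `main` are illustrative, and unlike the script above it propagates the child's exit status instead of exiting with 0:

```python
import subprocess
import sys


def run_quietly(args):
    """Run `args` and return (exit_status, stderr_text).

    communicate() drains stderr while the child is running, so a program
    that writes a lot to stderr cannot fill the pipe and block.
    """
    proc = subprocess.Popen(args, shell=False, stderr=subprocess.PIPE)
    _, err = proc.communicate()
    return proc.returncode, err.decode(errors="replace")


def main(argv):
    """Report the child's stderr only if it failed; return its status."""
    if not argv:
        return 0
    status, err = run_quietly(argv)
    if status != 0:
        sys.stderr.write("Error: %s\n" % err)
    return status


# Demonstration with a child process that only prints a warning on stderr.
child = "import sys; sys.stderr.write('just a warning')"
status, err = run_quietly([sys.executable, "-c", child])
print(status, repr(err))
```

Reading the pipe with `communicate()` rather than calling `wait()` first is a deliberate choice in this sketch, since waiting before reading can stall when the child produces more stderr output than the pipe buffer holds.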
Grinder-0.5.3/galaxy/tool_data_table_conf.xml.sample 0000644 0001750 0001750 00000000363 11642516504 022750 0 ustar floflooo floflooo
value, dbkey, name, path
Grinder-0.5.3/CHANGES 0000644 0001750 0001750 00000034741 12151574445 014345 0 ustar floflooo floflooo Revision history for Grinder
0.5.3 30-May-2013
Completed fix for bug #6, multiplexed read close to length of reference
(reported by Ali May).
When generating multiple libraries, default is now to use 100% permuted
to have dissimilar communities (consistent with 0% shared as default).
0.5.2 26-Apr-2013
Fixed bug causing reads to be too short when using MIDs and asking for a
read length close to that of their reference (bug #6, reported by Ali May).
0.5.1 19-Apr-2013
Fixed bug preventing the insertion of very low frequency sequencing
errors (bug #5).
Updated average_genome_size script to use percentage in Grinder rank
file instead of fractional numbers.
0.5.0 14-Jan-2013
Removed the =encoding statement which was breaking Pod::PlainText
(reported by Lauren Bragg)
Precompile regular expression
0.4.9 20-Nov-2012
Significant speedup by using improved version of Bioperl modules
(reported by Ben Woodcroft).
Fixed bug in RF and FR -oriented mates produced from the reverse-
complement of the reference sequence (reported by Mike Imelfort).
Mate orientation documented for IonTorrent (reported by Mike Imelfort).
The relative abundances reported by Grinder in the rank file are now
expressed as percentage instead of fractional for consistency.
Updated dependencies to satisfy older Perl (reported by Stephen Turner).
Build the documentation on author-side, not user side (reported by
Stephen Turner).
0.4.8 10-Oct-2012
Fixed bug when making amplicon reads using specified relative abundances
based on genomes with multiple amplicons (reported by Bertrand Bonnaud).
Usage message improvements (reported by Xiao Yang).
Delegated some operations to dedicated modules.
0.4.7 27-May-2012
Requiring Math::Random::MT version 1.14 should fix issues that Windows
users are having (reported by David Koslicki).
0.4.6 27-May-2012
When generating kmer-based chimeras, save resources by only calculating
the kmers of the reference sequences that are going to be used
(improvement suggested by David Koslicki).
Fixed an "undefined value" error when using kmer-based chimeras
(reported by David Koslicki).
Fixed an error when using kmer-based chimeras but not using all the
reference sequences (reported by David Koslicki).
0.4.5 27-Jan-2012
Fixed bug when adding mutations linearly to a 1 bp read (reported by
Robert Schmieder).
Better handling of 0 bp reference sequences.
Fixed bug when looking for amplicons on the reverse complement of a
reference sequence.
Properly remove the shortest of two amplicons, even if they are on
different strands.
0.4.4 20-Jan-2012
Dependencies update: no need for Math::Random::MT::Perl anymore.
0.4.3 18-Jan-2012
Implemented multimeras, i.e. chimeras from more than two reference
sequences (suggested by anonymous reviewer). See .
Implemented chimeras where the breakpoints correspond to k-mers shared
by the reference sequences (suggested by anonymous reviewer). See
.
0.4.2 15-Dec-2011
Fixed incorrectly calculated relative abundances when using length bias
(reported by Mike Imelfort and Mohamed Fauzi Haroon).
0.4.1 25-Nov-2011
The keyword 'strand' is not used anymore in the description of reads.
Read coordinates are now reported like in the Genbank format:
"position=complement(1..20)" instead of "position=1-20 strand=-1"
Fixed bug reported by Dana Willner: when looking for full-length amplicon
matches based on PCR primers, matches are now sought in the reference
sequences but also in their reverse-complement
Better handling of discrepancies between the number of libraries specified
with the num_libraries option and in the abundance_file (reported by
Dana Willner).
0.4.0 04-Nov-2011
Support for DNA, RNA and protein reference sequences to produce genomic,
metagenomic, transcriptomic, metatranscriptomic, proteomic and
metaproteomic datasets
New error model suitable to simulate Illumina reads: 4th degree polynomial
Change in error model (mutation_distribution) parameter:
- general syntax is now model_name, model_parameters...
- the first parameter for the linear model is now the error rate at the
3' end of the reads, not the average error rate
Speed improvement for position-specific error models
Galaxy GUI fix so that the output is fastqsanger, not just fastq
The reference_file parameter is now a required argument, so that running
grinder without arguments displays the help (reported by Robert Schmieder)
Fixed a bug that caused a crash when using an indel model and a homopolymer
model simultaneously (reported by Robert Schmieder)
Information displayed on screen now reports whether the library is a
shotgun or amplicon library
0.3.9 18-Oct-2011
New option to select orientation of mate pairs
New default for mate orientation: forward-reverse instead of forward-forward
Handle empty reference sequence description more gracefully
Galaxy GUI compatible with workflows and new tool shed
0.3.8 04-Oct-2011
Graphical interface for the Galaxy project
Support for writing the output reads in FASTQ format (Sanger variant)
Support for nested and overlapping amplicons
Tests do not fail if the optional dependency Statistics::R is not installed
Tested that Grinder works 100% on Windows
Generating 100 reads by default instead of coverage 0.1x
Fixed bug where read description was not created if unidirectional was set to -1
0.3.7 13-Sep-2011
Fixed bug in richter and margulies homopolymer error models
Fixed bug so that output rank file now collapses amplicon by species
The Grinder CLI script is now called 'grinder' (all lowercase)
Option mutation_ratio has changed so that it is possible to specify indels without substitutions
Location of amplicon relative to the reference sequence is now recorded
in the read description using the 'amplicon' field
Better reporting of chimeras in read descriptions using a comma-separated
list for the 'amplicon' and 'reference' field
Redundant sequencing errors (multiple errors at the same position) are
now tracked in read descriptions
New dependency: using Math::Random::MT Perl module for added speed
Improved build and test mechanics
Added tests for chimeras, indels, substitutions and homopolymers
More comprehensive tests for seeding and random number generation
0.3.6 03-Aug-2011
Support for reference sequences that contain several amplicons
Implemented a gene copy bias option for amplicon libraries
Primers can now match RNA sequences or ambiguous residues of the reference
sequence
Automatic community structure parameter value picking when none is provided
Fixed uniform insert and read length distribution
Fixed quality scores, which were generated but never written to disk
Write on screen when QUAL files are generated
Added links to example databases that users can use as Grinder input
Specified the URL where to report bugs
More unit tests: community structure, read and insert length distributions,
amplicons with specified genome abundance
0.3.5 21-Jul-2011
Implemented a profile mechanism to store user's preferred options
Added a script to reverse the orientation of right-hand mates
Fixed issues with reads with MIDs (in Bio::Seq::SimulatedRead)
Library number in ID of first sequence in libraries with even number was
wrong when mate pair was used
Number of the pair in mate pair IDs was wrong
Grinder development put under Git versioning control on SourceForge
More unit tests
Versioning fix
0.3.4 23-Jun-2011
New option to generate basic quality scores if desired (-qual_levels)
New option to not track the read info in the read description (-desc_track)
Objects returned by Grinder are now Bio::Seq::SimulatedRead Bioperl objects
Double-quotes in read description are now escaped, i.e. '"' becomes '\"'
Now using 'reference' instead of 'source' in read tracking description
Changes in the defaults:
uniform community structure instead of power law
uniform read distribution instead of normally distributed
0.3.3 03-Mar-2011
New option to sequence from the reverse strand: see
(suggested by Barry Cayford).
Output FASTA files now named *reads* instead of *shotgun* because
libraries can be amplicon too.
Output file names now use numbers padded with zeroes so that, e.g. if
123 libraries were requested, their name is in 001, 002 ... 123.
Output folder is now created automatically if it does not already exist.
The next_read() method now returns only one read, even for mate pairs.
Force the alphabet to DNA when reading the primer sequence file since
degenerate primers can look like protein sequences.
Fixed bug where Grinder sometimes created libraries even though there
were not enough sequences to do it safely (reported by Dana Willner).
When the number of reads to generate is smaller than the required
diversity, the actual diversity reported reflects this now.
Not reporting errors "Not enough sequences for chimera..." when there are
fewer than 2 reads and chimera_perc is 0.
Fixed bug in argument processing by Getopt::Euclid that affected
repeated calls to the new() method.
Fixed calculation of number of genomes shared. Clearly specified in the
documentation that the percent shared is relative to the diversity of
the least abundant library (reported by Dana Willner).
Fixed calculation of the total library diversity.
Many more Grinder test cases.
0.3.2 11-Feb-2011
New feature to specify specific characters to delete (N, -, ...) (suggested by Mike Imelfort)
New method to retrieve the seed number used for the computation: $factory->get_random_seed
When excluding specific characters, an amplicon read is attempted only once now
More robust parsing of abundance file
It is now a fatal error if sequences requested in an abundance file are
not found in the genome file
Small optimizations
0.3.1 08-Feb-2011
Support for making multiple libraries with different richness (diversity) values
Fixed bug for communities with specified relative abundances (reported by Mike Imelfort)
Better error messages for sequences that have a specified abundance
0.3.0 12-Jan-2011
Command-line arguments have changed; all have a short and long version
Grinder API to run Grinder inside Perl programs
Support for amplicon sequencing
For amplicon simulation, a forward and optional reverse primer (in IUPAC) can be specified
Amplicon can be given multiplex identifiers (MIDs)
Support for generating chimeras
Homopolymer error simulation
More error models for point mutations (uniform and linear)
Read error tracking in the sequence description
New default is to produce reads with no errors
New FASTA read description that specifies its source, position, strand, description and errors
Option to take shotgun reads from reverse complement
Support for specifying the structure of several communities manually
Speed improvements
0.2.0 22-Sep-2010
New options available when generating multiple shotgun libraries. Alpha
and beta diversity can be specified:
* richness
* percentage of genomes shared between libraries
* percentage of the top genomes with a different abundance rank
Revised way that mate pair reads are named. Example:
>1000/1 seq3|31-60
>1000/2 seq3|41-70
Added utility to calculate average genome length from Grinder rank file
0.1.9 24-Jun-2010
Thanks to Ramsi Temanni for his suggestions and feedback regarding forbidden characters.
Support for characters forbidden in the shotgun reads
Little bugfix regarding default values for arguments that take a list of values
0.1.8 22-Apr-2010
Thanks to Albert Villela for his suggestions and feedback regarding paired reads.
Changes in command-line options to accommodate new features
Support for inputting a file specifying the abundance of the different genomes
Support for mate pairs / paired end reads
Support for uniform or normal distribution of read lengths and mate pair
insert lengths
Fixed bug causing an error when the number of reads in the input file
cannot be divided by the number of independent libraries required
Changed output sequence ID to a more consistent scheme
0.1.7 15-Feb-2010
Not keeping the sequences in memory anymore to preserve resources
Really using the Math::Random::MT::Perl seeding facility
0.1.6 07-Dec-2009
Now using the Math::Random::MT::Perl seeding facility
0.1.5 24-Feb-2009
Grinder now has a proper installer (Perl module style)
0.1.4
Added basic report on libraries produced
Fixed bug in number of sequences created when using independent libraries
0.1.3
Ability to generate several random shotgun libraries at once that do not
contain any genome in common
0.1.2
Correction in the code to generate mutations
Changed the defaults to use a powerlaw model and the size-dependent option
The main module function now returns a hashref of rank-abundances
0.1.1
Introduction of the simulation of sequencing errors (substitutions and indels)
Modified the way the random number generation is handled
The main module function now returns an arrayref of Bio::Seq objects
0.1.0
Initial release
Grinder-0.5.3/man/ 0000755 0001750 0001750 00000000000 12151575606 014114 5 ustar floflooo floflooo Grinder-0.5.3/man/average_genome_size.1 0000644 0001750 0001750 00000013201 12151575456 020174 0 ustar floflooo floflooo .\" Automatically generated by Pod::Man 2.25 (Pod::Simple 3.26)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. nr % 0
. rr F
.\}
.el \{\
. de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
. \" fudge factors for nroff and troff
.if n \{\
. ds #H 0
. ds #V .8m
. ds #F .3m
. ds #[ \f1
. ds #] \fP
.\}
.if t \{\
. ds #H ((1u-(\\\\n(.fu%2u))*.13m)
. ds #V .6m
. ds #F 0
. ds #[ \&
. ds #] \&
.\}
. \" simple accents for nroff and troff
.if n \{\
. ds ' \&
. ds ` \&
. ds ^ \&
. ds , \&
. ds ~ ~
. ds /
.\}
.if t \{\
. ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
. ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
. ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
. ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
. ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
. ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
. \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
. \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
. \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
. ds : e
. ds 8 ss
. ds o a
. ds d- d\h'-1'\(ga
. ds D- D\h'-1'\(hy
. ds th \o'bp'
. ds Th \o'LP'
. ds ae ae
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "AVERAGE_GENOME_SIZE 1"
.TH AVERAGE_GENOME_SIZE 1 "2013-03-20" "perl v5.14.2" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
average_genome_size \- Calculate the average genome size (in bp) of species in a Grinder library
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Calculate the average genome size (in bp) of species in a Grinder library given
the library composition and the full genomes used to produce it.
.SH "REQUIRED ARGUMENTS"
.IX Header "REQUIRED ARGUMENTS"
.IP "" 4
.IX Item ""
\&\s-1FASTA\s0 file containing the full genomes used to produce the Grinder library.
.IP "" 4
.IX Item ""
Grinder rank file that describes the library composition.
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 2009\-2012 Florent \s-1ANGLY\s0
.PP
Grinder is free software: you can redistribute it and/or modify
it under the terms of the \s-1GNU\s0 General Public License (\s-1GPL\s0) as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
Grinder is distributed in the hope that it will be useful,
but \s-1WITHOUT\s0 \s-1ANY\s0 \s-1WARRANTY\s0; without even the implied warranty of
\&\s-1MERCHANTABILITY\s0 or \s-1FITNESS\s0 \s-1FOR\s0 A \s-1PARTICULAR\s0 \s-1PURPOSE\s0. See the
\&\s-1GNU\s0 General Public License for more details.
You should have received a copy of the \s-1GNU\s0 General Public License
along with Grinder. If not, see .
.SH "BUGS"
.IX Header "BUGS"
All complex software has bugs lurking in it, and this program is no exception.
If you find a bug, please report it on the SourceForge Tracker for Grinder:
.PP
Bug reports, suggestions and patches are welcome. Grinder's code is developed
on Sourceforge () and is
under Git revision control. To get started with a patch, do:
.PP
.Vb 1
\& git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder
.Ve
Grinder-0.5.3/man/grinder.1 0000644 0001750 0001750 00000114016 12151575456 015636 0 ustar floflooo floflooo .\" Automatically generated by Pod::Man 2.25 (Pod::Simple 3.26)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. nr % 0
. rr F
.\}
.el \{\
. de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
. \" fudge factors for nroff and troff
.if n \{\
. ds #H 0
. ds #V .8m
. ds #F .3m
. ds #[ \f1
. ds #] \fP
.\}
.if t \{\
. ds #H ((1u-(\\\\n(.fu%2u))*.13m)
. ds #V .6m
. ds #F 0
. ds #[ \&
. ds #] \&
.\}
. \" simple accents for nroff and troff
.if n \{\
. ds ' \&
. ds ` \&
. ds ^ \&
. ds , \&
. ds ~ ~
. ds /
.\}
.if t \{\
. ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
. ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
. ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
. ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
. ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
. ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
. \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
. \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
. \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
. ds : e
. ds 8 ss
. ds o a
. ds d- d\h'-1'\(ga
. ds D- D\h'-1'\(hy
. ds th \o'bp'
. ds Th \o'LP'
. ds ae ae
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "GRINDER 1"
.TH GRINDER 1 "2013-05-30" "perl v5.14.2" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
grinder \- A versatile omics shotgun and amplicon sequencing read simulator
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Grinder is a versatile program to create random shotgun and amplicon sequence
libraries based on \s-1DNA\s0, \s-1RNA\s0 or protein reference sequences provided in a \s-1FASTA\s0
file.
.PP
Grinder can produce genomic, metagenomic, transcriptomic, metatranscriptomic,
proteomic, metaproteomic shotgun and amplicon datasets from current sequencing
technologies such as Sanger, 454 and Illumina. These simulated datasets can be used
to test the accuracy of bioinformatic tools under specific hypotheses, e.g. with
or without sequencing errors, or with low or high community diversity. Grinder
may also be used to help decide between alternative sequencing methods for a
sequence-based project, e.g. whether the library should be paired-end, or how
many reads should be sequenced.
.PP
Grinder features include:
.IP "\(bu" 4
shotgun or amplicon read libraries
.IP "\(bu" 4
omics support to generate genomic, transcriptomic, proteomic,
metagenomic, metatranscriptomic or metaproteomic datasets
.IP "\(bu" 4
arbitrary read length distribution and number of reads
.IP "\(bu" 4
simulation of \s-1PCR\s0 and sequencing errors (chimeras, point mutations, homopolymers)
.IP "\(bu" 4
support for paired-end (mate pair) datasets
.IP "\(bu" 4
specific rank-abundance settings or manually given abundance for each genome, gene or protein
.IP "\(bu" 4
creation of datasets with a given richness (alpha diversity)
.IP "\(bu" 4
independent datasets can share a variable number of genomes (beta diversity)
.IP "\(bu" 4
modeling of the bias created by varying genome lengths or gene copy number
.IP "\(bu" 4
profile mechanism to store preferred options
.IP "\(bu" 4
available to biologists or power users through multiple interfaces: \s-1GUI\s0, \s-1CLI\s0 and \s-1API\s0
.PP
Briefly, given a \s-1FASTA\s0 file containing reference sequences (genomes, genes,
transcripts or proteins), Grinder performs the following steps:
.IP "1." 4
Read the reference sequences and, for amplicon datasets, extract full-length
reference \s-1PCR\s0 amplicons using the provided degenerate \s-1PCR\s0 primers.
.IP "2." 4
Determine the community structure based on the provided alpha diversity (number
of reference sequences in the library), beta diversity (number of reference
sequences in common between several independent libraries) and specified
rank-abundance model.
.IP "3." 4
Take shotgun reads from the reference sequences or amplicon reads from the
full-length reference \s-1PCR\s0 amplicons. The reads may be paired-end reads when an insert
size distribution is specified. The length of the reads depends on the provided
read length distribution and their abundance depends on the relative abundance
in the community structure. Genome length may also bias the number of reads to
take for shotgun datasets at this step. Similarly, for amplicon datasets, the
number of copies of the target gene in the reference genomes may bias the number
of reads to take.
.IP "4." 4
Alter reads by inserting sequencing errors (indels, substitutions and homopolymer
errors) following a position-specific model to simulate reads created by current
sequencing technologies (Sanger, 454, Illumina). Write the reads and their
quality scores in \s-1FASTA\s0, \s-1QUAL\s0 and \s-1FASTQ\s0 files.
.SH "CITATION"
.IX Header "CITATION"
If you use Grinder in your research, please cite:
.PP
.Vb 2
\& Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW (2012), Grinder: a
\& versatile amplicon and shotgun sequence simulator, Nucleic Acids Research
.Ve
.PP
Available from .
.SH "VERSION"
.IX Header "VERSION"
This document refers to grinder version 0.5.2
.SH "AUTHOR"
.IX Header "AUTHOR"
Florent Angly
.SH "INSTALLATION"
.IX Header "INSTALLATION"
.SS "Dependencies"
.IX Subsection "Dependencies"
You need to install these dependencies first:
.IP "\(bu" 4
Perl (>= 5.6)
.Sp
.IP "\(bu" 4
make
.Sp
Many systems have make installed by default. If your system does not, you should
install the implementation of make of your choice, e.g. \s-1GNU\s0 make:
.PP
The following \s-1CPAN\s0 Perl modules are dependencies that will be installed automatically
for you:
.IP "\(bu" 4
Bioperl modules (>=1.6.901).
.Sp
Note that some unreleased Bioperl modules have been included in Grinder.
.IP "\(bu" 4
Getopt::Euclid (>= 0.3.4)
.IP "\(bu" 4
List::Util
.Sp
First released with Perl v5.7.3
.IP "\(bu" 4
Math::Random::MT (>= 1.13)
.IP "\(bu" 4
version (>= 0.77)
.Sp
First released with Perl v5.9.0
.SS "Procedure"
.IX Subsection "Procedure"
To install Grinder globally on your system, run the following commands in a
terminal or command prompt:
.PP
On Linux, Unix, MacOS:
.PP
.Vb 2
\& perl Makefile.PL
\& make
.Ve
.PP
And finally, with administrator privileges:
.PP
.Vb 1
\& make install
.Ve
.PP
On Windows, run the same commands but with nmake instead of make.
.SS "No administrator privileges?"
.IX Subsection "No administrator privileges?"
If you do not have administrator privileges, Grinder needs to be installed in
your home directory.
.PP
First, follow the instructions to install local::lib
at http://search.cpan.org/~apeiron/local\-lib\-1.008004/lib/local/lib.pm#The_bootstrapping_technique . After local::lib is installed, every Perl
module that you install manually or through the \s-1CPAN\s0 command-line application
will be installed in your home directory.
.PP
Then, install Grinder by following the instructions detailed in the \*(L"Procedure\*(R"
section.
.SH "RUNNING GRINDER"
.IX Header "RUNNING GRINDER"
After installation, you can run Grinder using a command-line interface (\s-1CLI\s0),
an application programming interface (\s-1API\s0) or a graphical user interface (\s-1GUI\s0)
in Galaxy.
.PP
To get the usage of the \s-1CLI\s0, type:
.PP
.Vb 1
\& grinder \-\-help
.Ve
.PP
More information, including the documentation of the Grinder \s-1API\s0, which allows
you to run Grinder from within other Perl programs, is available by typing:
.PP
.Vb 1
\& perldoc Grinder
.Ve
.PP
To run the \s-1GUI\s0, refer to the Galaxy documentation at .
.PP
The 'utils' folder included in the Grinder package contains some utilities:
.IP "average genome size:" 4
.IX Item "average genome size:"
This calculates the average genome size (in bp) of a simulated random library
produced by Grinder.
.IP "change_paired_read_orientation:" 4
.IX Item "change_paired_read_orientation:"
This reverses the orientation of each second mate-pair read (\s-1ID\s0 ending in /2)
in a \s-1FASTA\s0 file.
.SH "REFERENCE SEQUENCE DATABASE"
.IX Header "REFERENCE SEQUENCE DATABASE"
A variety of \s-1FASTA\s0 databases can be used as input for Grinder. For example, the
GreenGenes database ()
contains over 180,000 16S rRNA clone sequences from various species which would
be appropriate to produce a 16S rRNA amplicon dataset. A set of over 41,000 \s-1OTU\s0
representative sequences and their affiliation in seven different taxonomic
systems can also be used for the same purpose (
and ). The
\&\s-1RDP\s0 () and Silva
(http://www.arb\-silva.de/no_cache/download/archive/release_108/Exports/ )
databases also provide many 16S rRNA sequences and Silva includes eukaryotic
sequences. While 16S rRNA is a popular gene, datasets containing any type of gene
could be used in the same fashion to generate simulated amplicon datasets, provided
appropriate primers are used.
.PP
The >2,400 curated microbial genome sequences in the \s-1NCBI\s0 RefSeq collection
() would also be suitable for
producing 16S rRNA simulated datasets (using appropriate primers). However, the
lower diversity of this database compared to the previous two makes it more
appropriate for producing artificial microbial metagenomes. Individual genomes
from this database are also very suitable for the simulation of single or
double-barreled shotgun libraries. Similarly, the RefSeq database contains
over 3,100 curated viral sequences ()
which can be used to produce artificial viral metagenomes.
.PP
Quite a few eukaryotic organisms have been sequenced and their genome or genes
can be the basis for simulating genomic, transcriptomic (RNA-seq) or proteomic
datasets. For example, you can use the human genome available at
, the human transcripts
downloadable from
or the human proteome at .
.SH "CLI EXAMPLES"
.IX Header "CLI EXAMPLES"
Here are a few examples that illustrate the use of Grinder in a terminal:
.IP "1." 4
A shotgun \s-1DNA\s0 library with a coverage of 0.1X
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-coverage_fold 0.1
.Ve
.IP "2." 4
Same thing but save the result files in a specific folder and with a specific name
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-coverage_fold 0.1 \-base_name my_name \-output_dir my_dir
.Ve
.IP "3." 4
A \s-1DNA\s0 shotgun library with 1000 reads
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-total_reads 1000
.Ve
.IP "4." 4
A \s-1DNA\s0 shotgun library where species are distributed according to a power law
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-abundance_model powerlaw 0.1
.Ve
.IP "5." 4
A \s-1DNA\s0 shotgun library with 123 genomes taken randomly from the given genomes
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-diversity 123
.Ve
.IP "6." 4
Two \s-1DNA\s0 shotgun libraries that have 50% of the species in common
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-num_libraries 2 \-shared_perc 50
.Ve
.IP "7." 4
Two \s-1DNA\s0 shotgun libraries with no species in common and distributed according to an
exponential rank-abundance model. Note that because the parameter value for the
exponential model is omitted, each library uses a different randomly chosen value:
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-num_libraries 2 \-abundance_model exponential
.Ve
.IP "8." 4
A \s-1DNA\s0 shotgun library where species relative abundances are manually specified
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-abundance_file my_abundances.txt
.Ve
.IP "9." 4
A \s-1DNA\s0 shotgun library with Sanger reads
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-read_dist 800 \-mutation_dist linear 1 2 \-mutation_ratio 80 20
.Ve
.IP "10." 4
A \s-1DNA\s0 shotgun library with first-generation 454 reads
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-read_dist 100 normal 10 \-homopolymer_dist balzer
.Ve
.IP "11." 4
A paired-end \s-1DNA\s0 shotgun library, where the insert size is normally distributed
around 2.5 kbp and has 0.2 kbp standard deviation
.Sp
.Vb 1
\& grinder \-reference_file genomes.fna \-insert_dist 2500 normal 200
.Ve
.IP "12." 4
A transcriptomic dataset
.Sp
.Vb 1
\& grinder \-reference_file transcripts.fna
.Ve
.IP "13." 4
A unidirectional transcriptomic dataset
.Sp
.Vb 1
\& grinder \-reference_file transcripts.fna \-unidirectional 1
.Ve
.Sp
Note the use of \-unidirectional 1 to prevent reads from being taken from the
reverse-complement of the reference sequences.
.IP "14." 4
A proteomic dataset
.Sp
.Vb 1
\& grinder \-reference_file proteins.faa \-unidirectional 1
.Ve
.IP "15." 4
A 16S rRNA amplicon library
.Sp
.Vb 1
\& grinder \-reference_file 16Sgenes.fna \-forward_reverse 16Sprimers.fna \-length_bias 0 \-unidirectional 1
.Ve
.Sp
Note the use of \-length_bias 0 because reference sequence length should not affect
the relative abundance of amplicons.
.IP "16." 4
The same amplicon library with 20% chimeric reads (90% bimeras, 10% trimeras)
.Sp
.Vb 1
\& grinder \-reference_file 16Sgenes.fna \-forward_reverse 16Sprimers.fna \-length_bias 0 \-unidirectional 1 \-chimera_perc 20 \-chimera_dist 90 10
.Ve
.IP "17." 4
Three 16S rRNA amplicon libraries with specified MIDs and no reference sequences
in common
.Sp
.Vb 1
\& grinder \-reference_file 16Sgenes.fna \-forward_reverse 16Sprimers.fna \-length_bias 0 \-unidirectional 1 \-num_libraries 3 \-multiplex_ids MIDs.fna
.Ve
.IP "18." 4
Reading reference sequences from the standard input, which allows you to
decompress \s-1FASTA\s0 files on the fly:
.Sp
.Vb 1
\& zcat microbial_db.fna.gz | grinder \-reference_file \- \-total_reads 100
.Ve
.SH "CLI REQUIRED ARGUMENTS"
.IX Header "CLI REQUIRED ARGUMENTS"
.IP "\-rf | \-reference_file | \-gf | \-genome_file " 4
.IX Item "-rf | -reference_file | -gf | -genome_file "
\&\s-1FASTA\s0 file that contains the input reference sequences (full genomes, 16S rRNA
genes, transcripts, proteins...) or '\-' to read them from the standard input. See the
\&\s-1README\s0 file for examples of databases you can use and where to get them from.
Default: \-
.SH "CLI OPTIONAL ARGUMENTS"
.IX Header "CLI OPTIONAL ARGUMENTS"
.IP "\-tr | \-total_reads " 4
.IX Item "-tr | -total_reads "
Number of shotgun or amplicon reads to generate for each library. Do not specify
this if you specify the fold coverage. Default: 100
.IP "\-cf | \-coverage_fold " 4
.IX Item "-cf | -coverage_fold "
Desired fold coverage of the input reference sequences (the output \s-1FASTA\s0 length
divided by the input \s-1FASTA\s0 length). Do not specify this if you specify the number
of reads directly.
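.Sp
As a rough worked example of this definition (the 5 Mbp reference size is a
hypothetical figure, and rounding or unused short reference sequences may change
the exact count), the number of reads implied by a fold coverage is
approximately:
.Sp
.Vb 3
\&   reads = coverage_fold * total reference length / average read length
\&
\&   0.1 * 5,000,000 bp / 100 bp = 5,000 reads
.Ve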
.IP "\-rd ... | \-read_dist ..." 4
.IX Item "-rd ... | -read_dist ..."
Desired shotgun or amplicon read length distribution specified as:
average length, distribution ('uniform' or 'normal') and standard deviation.
.Sp
Only the first element is required. Examples:
.Sp
.Vb 6
\& All reads exactly 101 bp long (Illumina GA 2x): 101
\& Uniform read distribution around 100+\-10 bp: 100 uniform 10
\& Reads normally distributed with an average of 800 and a standard deviation of 100
\& bp (Sanger reads): 800 normal 100
\& Reads normally distributed with an average of 450 and a standard deviation of 50
\& bp (454 GS\-FLX Ti): 450 normal 50
.Ve
.Sp
Reference sequences smaller than the specified read length are not used. Default:
100
.IP "\-id ... | \-insert_dist ..." 4
.IX Item "-id ... | -insert_dist ..."
Create paired-end or mate-pair reads spanning the given insert length.
Important: the insert is defined in the biological sense, i.e. its length includes
the length of both reads and of the stretch of \s-1DNA\s0 between them.
Use 0 to turn this option off, or give the insert size distribution in bp, in the
same format as the read length distribution (a typical value is 2,500 bp for
mate pairs).
Two distinct reads are generated whether or not the mate pair overlaps. Default:
0
.IP "\-mo | \-mate_orientation " 4
.IX Item "-mo | -mate_orientation "
When generating paired-end or mate-pair reads (see ), specify the
orientation of the reads (F: forward, R: reverse):
.Sp
.Vb 4
\& FR: \-\-\-> <\-\-\- e.g. Sanger, Illumina paired\-end, IonTorrent mate\-pair
\& FF: \-\-\-> \-\-\-> e.g. 454
\& RF: <\-\-\- \-\-\-> e.g. Illumina mate\-pair
\& RR: <\-\-\- <\-\-\-
.Ve
.Sp
Default: \s-1FR\s0
.IP "\-ec | \-exclude_chars " 4
.IX Item "-ec | -exclude_chars "
Do not create reads containing any of the specified characters (case insensitive).
For example, use '\s-1NX\s0' to prevent reads with ambiguities (N or X). Grinder will
exit with an error if it fails to find a suitable read (or pair of reads) after 10 attempts.
Consider using , which may be more appropriate for your case.
Default: ''
.IP "\-dc | \-delete_chars " 4
.IX Item "-dc | -delete_chars "
Remove the specified characters from the reference sequences (case-insensitive),
e.g. '\-~*' to remove gaps (\- or ~) or terminator (*). Removing these characters
is done once, when reading the reference sequences, prior to taking reads. Hence
it is more efficient than . Default:
.IP "\-fr | \-forward_reverse " 4
.IX Item "-fr | -forward_reverse "
Simulate \s-1DNA\s0 amplicon sequencing using forward and reverse \s-1PCR\s0 primer sequences
provided in a \s-1FASTA\s0 file. The reference sequences and their reverse complement
will be searched for \s-1PCR\s0 primer matches. The primer sequences should use the
\&\s-1IUPAC\s0 convention for degenerate residues. Reference sequences that
do not match the specified primers are excluded. If your reference sequences are
full genomes, it is recommended to use = 1 and = 0 to
generate amplicon reads. To sequence from the forward strand, set
to 1 and put the forward primer first and reverse primer second in the \s-1FASTA\s0
file. To sequence from the reverse strand, invert the primers in the \s-1FASTA\s0 file
and use = \-1. The second primer sequence in the \s-1FASTA\s0 file is
always optional. Example: \s-1AAACTYAAAKGAATTGRCGG\s0 and \s-1ACGGGCGGTGTGTRC\s0 for the 926F
and 1392R primers that target the V6 to V9 region of the 16S rRNA gene.
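.Sp
For illustration, a primer file for this example could look like the sketch
below. The primer sequences are the ones given above and the file name
16Sprimers.fna follows the \s-1CLI\s0 examples; using the primer names as sequence
identifiers is an assumption.
.Sp
.Vb 4
\&   >926F
\&   AAACTYAAAKGAATTGRCGG
\&   >1392R
\&   ACGGGCGGTGTGTRC
.Ve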
.IP "\-un | \-unidirectional " 4
.IX Item "-un | -unidirectional "
Instead of producing reads bidirectionally, from the reference strand and its
reverse complement, proceed unidirectionally, from one strand only (forward or
reverse). Values: 0 (off, i.e. bidirectional), 1 (forward), \-1 (reverse). Use
= 1 for amplicon and strand-specific transcriptomic or
proteomic datasets. Default: 0
.IP "\-lb | \-length_bias " 4
.IX Item "-lb | -length_bias "
In shotgun libraries, sample reference sequences proportionally to their length.
For example, in simulated microbial datasets, this means that at the same
relative abundance, larger genomes contribute more reads than smaller genomes
(and all genomes have the same fold coverage).
0 = no, 1 = yes. Default: 1
.IP "\-cb | \-copy_bias " 4
.IX Item "-cb | -copy_bias "
In amplicon libraries where full genomes are used as input, sample species
proportionally to the number of copies of the target gene: at equal relative
abundance, genomes that have multiple copies of the target gene contribute more
amplicon reads than genomes that have a single copy. 0 = no, 1 = yes. Default:
1
.IP "\-md ... | \-mutation_dist ..." 4
.IX Item "-md ... | -mutation_dist ..."
Introduce sequencing errors in the reads, in the form of mutations
(substitutions, insertions and deletions) at positions that follow a specified
distribution (with replacement): model (uniform, linear, poly4), model parameters.
For example, for a uniform 0.1% error rate, use: uniform 0.1. To simulate Sanger
errors, use a linear model where the error rate is 1% at the 5' end of reads and
2% at the 3' end: linear 1 2. To model Illumina errors using the 4th degree
polynomial 3e\-3 + 3.3e\-8 * i^4 (Korbel et al 2009), use: poly4 3e\-3 3.3e\-8.
Use the option to alter how many of these mutations are
substitutions or indels. Default: uniform 0 0
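.Sp
To get a feel for the poly4 model, the 4th degree expression above can be
evaluated by hand at a few read positions i. This sketch assumes that the result
is an error rate in percent, like the values given for the uniform and linear
models:
.Sp
.Vb 3
\&   i = 1  : 3e\-3 + 3.3e\-8 * 1^4   = 0.003 % (approximately)
\&   i = 50 : 3e\-3 + 3.3e\-8 * 50^4  = 0.209 % (approximately)
\&   i = 100: 3e\-3 + 3.3e\-8 * 100^4 = 3.303 %
.Ve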
.IP "\-mr ... | \-mutation_ratio ..." 4
.IX Item "-mr ... | -mutation_ratio ..."
Indicate the percentage of substitutions and the percentage of indels (insertions
and deletions). For example, use '80 20' (4 substitutions for each indel) for
Sanger reads. Note that this parameter has no effect unless you specify the
option. Default: 80 20
.IP "\-hd | \-homopolymer_dist " 4
.IX Item "-hd | -homopolymer_dist "
Introduce sequencing errors in the reads in the form of homopolymeric
stretches (e.g. \s-1AAA\s0, \s-1CCCCC\s0) using a specified model where the homopolymer length
follows a normal distribution N(mean, standard deviation) that is a function of
the homopolymer length n:
.Sp
.Vb 3
\& Margulies: N(n, 0.15 * n) , Margulies et al. 2005.
\& Richter : N(n, 0.15 * sqrt(n)) , Richter et al. 2008.
\& Balzer : N(n, 0.03494 + n * 0.06856) , Balzer et al. 2010.
.Ve
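.Sp
As a worked example (plain arithmetic on the formulas above), a homopolymer of
length n = 4 gets the following standard deviations:
.Sp
.Vb 3
\&   Margulies: 0.15 * 4              = 0.6
\&   Richter  : 0.15 * sqrt(4)        = 0.3
\&   Balzer   : 0.03494 + 4 * 0.06856 = 0.30918
.Ve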
.Sp
Default: 0
.IP "\-cp | \-chimera_perc " 4
.IX Item "-cp | -chimera_perc "
Specify the percent of reads in amplicon libraries that should be chimeric
sequences. The 'reference' field in the description of chimeric reads will
contain the \s-1ID\s0 of all the reference sequences forming the chimeric template.
A typical value is 10% for amplicons. This option can be used to generate
chimeric shotgun reads as well. Default: 0 %
.IP "\-cd ... | \-chimera_dist ..." 4
.IX Item "-cd ... | -chimera_dist ..."
Specify the distribution of chimeras: bimeras, trimeras, quadrameras and
multimeras of higher order. The default is the average values from Quince et al.
2011: '314 38 1', which corresponds to 89% of bimeras, 11% of trimeras and 0.3%
of quadrameras. Note that this option only takes effect when you request the
generation of chimeras with the option. Default: 314 38 1
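.Sp
The default values can be read as relative weights that are normalized by their
sum. This reading is an assumption, but it reproduces the percentages quoted
above:
.Sp
.Vb 4
\&   314 + 38 + 1 = 353
\&   bimeras    : 314 / 353 = 89.0 %
\&   trimeras   :  38 / 353 = 10.8 %
\&   quadrameras:   1 / 353 =  0.3 %
.Ve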
.IP "\-ck | \-chimera_kmer " 4
.IX Item "-ck | -chimera_kmer "
Activate a method to form chimeras by picking breakpoints at places where k\-mers
are shared between sequences. represents k, the length of the
k\-mers (in bp). The longer the k\-mer, the more similar the sequences have to be
to be eligible to form chimeras. The more frequent a k\-mer is in the pool of
reference sequences (taking into account their relative abundance), the more
often this k\-mer will be chosen. For example, \s-1CHSIM\s0 (Edgar et al. 2011) uses this
method with a k\-mer length of 10 bp. If you do not want to use k\-mer information
to form chimeras, use 0, which will result in the reference sequences and
breakpoints being taken randomly on the \*(L"aligned\*(R" reference sequences. Note that
this option only takes effect when you request the generation of chimeras with
the option. Also, this option is quite memory intensive, so you
should probably limit yourself to a relatively small number of reference sequences
if you want to use it. Default: 10 bp
.IP "\-af | \-abundance_file " 4
.IX Item "-af | -abundance_file "
Specify the relative abundance of the reference sequences manually in an input
file. Each line of the file should contain a sequence name and its relative
abundance (%), e.g. 'seqABC 82.1' or 'seqABC 82.1 10.2' if you are specifying two
different libraries.
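.Sp
A minimal sketch of such a file for two libraries is shown below. The sequence
names are hypothetical and presumably have to match identifiers in the reference
file. The values were chosen to sum to 100 in each column, although this
description does not say whether that is required.
.Sp
.Vb 3
\&   seqABC 82.1 10.2
\&   seqDEF 12.9 60.3
\&   seqGHI 5.0 29.5
.Ve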
.IP "\-am ... | \-abundance_model ..." 4
.IX Item "-am ... | -abundance_model ..."
Relative abundance model for the input reference sequences: uniform, linear, powerlaw,
logarithmic or exponential. The uniform and linear models do not require a
parameter, but the other models take a parameter in the range [0, infinity). If
this parameter is not specified, then it is randomly chosen. Examples:
.Sp
.Vb 3
\& uniform distribution: uniform
\& powerlaw distribution with parameter 0.1: powerlaw 0.1
\& exponential distribution with automatically chosen parameter: exponential
.Ve
.Sp
Default: uniform 1
.IP "\-nl | \-num_libraries " 4
.IX Item "-nl | -num_libraries "
Number of independent libraries to create. Specify how diverse and similar they
should be with , and . Assign them
different \s-1MID\s0 tags with . Default: 1
.IP "\-mi | \-multiplex_ids " 4
.IX Item "-mi | -multiplex_ids "
Specify an optional \s-1FASTA\s0 file that contains multiplex sequence identifiers
(a.k.a MIDs or barcodes) to add to the sequences (one sequence per library). The
MIDs are included in the length specified with the \-read_dist option and can be
altered by sequencing errors. See the MIDesigner or BarCrawl programs to
generate \s-1MID\s0 sequences.
.IP "\-di ... | \-diversity ..." 4
.IX Item "-di ... | -diversity ..."
This option specifies alpha diversity, specifically the richness, i.e. number of
reference sequences to take randomly and include in each library. Use 0 for the
maximum richness possible (based on the number of reference sequences available).
Provide one value to make all libraries have the same diversity, or one richness
value per library otherwise. Default: 0
.IP "\-sp | \-shared_perc " 4
.IX Item "-sp | -shared_perc "
This option controls an aspect of beta-diversity. When creating multiple
libraries, specify the percent of reference sequences they should have in common
(relative to the diversity of the least diverse library). Default: 0 %
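.Sp
As a worked example (a sketch that ignores any rounding Grinder may apply, and
assumes that genomes.fna holds enough sequences), take two libraries with a
richness of 100 and 50 that share 50% of their reference sequences:
.Sp
.Vb 3
\&   grinder \-reference_file genomes.fna \-num_libraries 2 \-diversity 100 50 \-shared_perc 50
\&
\&   shared reference sequences = 50% * min(100, 50) = 25
.Ve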
.IP "\-pp | \-permuted_perc " 4
.IX Item "-pp | -permuted_perc "
This option controls another aspect of beta-diversity. For multiple libraries,
choose the percentage of the most abundant reference sequences whose
rank-abundance should be permuted (randomly shuffled). Default: 0 %
.IP "\-rs | \-random_seed " 4
.IX Item "-rs | -random_seed "
Seed number to use for the pseudo-random number generator.
.IP "\-dt | \-desc_track " 4
.IX Item "-dt | -desc_track "
Track read information (reference sequence, position, errors, ...) by writing
it in the read description. Default: 1
.IP "\-ql ... | \-qual_levels ..." 4
.IX Item "-ql ... | -qual_levels ..."
Generate basic quality scores for the simulated reads. Good residues are given a
specified good score (e.g. 30) and residues that are the result of an insertion
or substitution are given a specified bad score (e.g. 10). Specify first the
good score and then the bad score on the command-line, e.g.: 30 10. Default:
.IP "\-fq | \-fastq_output " 4
.IX Item "-fq | -fastq_output "
Write the generated reads in \s-1FASTQ\s0 format (with Sanger-encoded
quality scores) instead of \s-1FASTA\s0 and \s-1QUAL\s0 (1: yes, 0: no).
need to be specified for this option to be effective. Default: 0
.IP "\-bn | \-base_name " 4
.IX Item "-bn | -base_name "
Prefix of the output files. Default: grinder
.IP "\-od | \-output_dir " 4
.IX Item "-od | -output_dir "
Directory where the results should be written. This folder will be created if
needed. Default: .
.IP "\-pf | \-profile_file " 4
.IX Item "-pf | -profile_file "
A file that contains Grinder arguments. This is useful if you use many options
or often use the same options. Lines with comments (#) are ignored. Consider the
profile file, 'simple_profile.txt':
.Sp
.Vb 3
\& # A simple Grinder profile
\& \-read_dist 105 normal 12
\& \-total_reads 1000
.Ve
.Sp
Running: grinder \-reference_file viral_genomes.fa \-profile_file simple_profile.txt
.Sp
Translates into: grinder \-reference_file viral_genomes.fa \-read_dist 105 normal 12 \-total_reads 1000
.Sp
Note that the arguments specified in the profile should not be specified again on the command line.
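.Sp
As a further sketch, the options used in the amplicon examples could be collected
in a profile. The file name 'amplicon_profile.txt' is hypothetical; the options
and values are taken from this documentation:
.Sp
.Vb 5
\&   # A Grinder profile for 16S rRNA amplicon libraries
\&   \-forward_reverse 16Sprimers.fna
\&   \-length_bias 0
\&   \-unidirectional 1
\&   \-chimera_perc 10
.Ve
.Sp
Running: grinder \-reference_file 16Sgenes.fna \-profile_file amplicon_profile.txt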
.SH "CLI OUTPUT"
.IX Header "CLI OUTPUT"
For each shotgun or amplicon read library requested, the following files are
generated:
.IP "\(bu" 4
A rank-abundance file, tab-delimited, that shows the relative abundance of the
different reference sequences
.IP "\(bu" 4
A file containing the read sequences in \s-1FASTA\s0 format. The read headers
contain information necessary to track from which reference sequence each read
was taken and what errors it contains. This file is not generated if
option was provided.
.IP "\(bu" 4
If the option was specified, a file containing the quality scores
of the reads (in \s-1QUAL\s0 format).
.IP "\(bu" 4
If the option was provided, a file containing the read sequences
in \s-1FASTQ\s0 format.
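.PP
For instance, combining options documented above, the following command should
request \s-1FASTQ\s0 output with a quality score of 30 for good residues and 10 for
inserted or substituted ones. This is a sketch only; the names of the files it
produces are not covered here.
.PP
.Vb 1
\&   grinder \-reference_file genomes.fna \-qual_levels 30 10 \-fastq_output 1
.Ve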
.SH "API EXAMPLES"
.IX Header "API EXAMPLES"
The Grinder \s-1API\s0 allows you to conveniently use Grinder within Perl scripts. Here is
a synopsis:
.PP
.Vb 1
\& use Grinder;
\&
\& # Set up a new factory (see the OPTIONS section for a complete list of parameters)
\& my $factory = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq );
\&
\& # Process all shotgun libraries requested
\& while ( my $struct = $factory\->next_lib ) {
\&
\& # The ID and abundance of the 3rd most abundant genome in this community
\& my $id = $struct\->{ids}\->[2];
\& my $ab = $struct\->{abs}\->[2];
\&
\& # Create shotgun reads
\& while ( my $read = $factory\->next_read) {
\&
\& # The read is a Bioperl sequence object with these properties:
\& my $read_id = $read\->id; # read ID given by Grinder
\& my $read_seq = $read\->seq; # nucleotide sequence
\& my $read_mid = $read\->mid; # MID or tag attached to the read
\& my $read_errors = $read\->errors; # errors that the read contains
\&
\& # Where was the read taken from? The reference sequence refers to the
\& # database sequence for shotgun libraries, amplicon obtained from the
\& # database sequence, or could even be a chimeric sequence
\& my $ref_id = $read\->reference\->id; # ID of the reference sequence
\& my $ref_start = $read\->start; # start of the read on the reference
\& my $ref_end = $read\->end; # end of the read on the reference
\& my $ref_strand = $read\->strand; # strand of the reference
\&
\& }
\& }
\&
\& # Similarly, for shotgun mate pairs
\& my $factory = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq,
\& \-insert_dist => 250 );
\& while ( $factory\->next_lib ) {
\& while ( my $read = $factory\->next_read ) {
\& # The first read is the first mate of the mate pair
\& # The second read is the second mate of the mate pair
\& # The third read is the first mate of the next mate pair
\& # ...
\& }
\& }
\&
\& # To generate an amplicon library
\& my $factory = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq,
\& \-forward_reverse => \*(Aq16Sprimers.fna\*(Aq,
\& \-length_bias => 0,
\& \-unidirectional => 1 );
\& while ( $factory\->next_lib ) {
\& while ( my $read = $factory\->next_read) {
\& # ...
\& }
\& }
.Ve
.SH "API METHODS"
.IX Header "API METHODS"
The rest of the documentation details the available Grinder \s-1API\s0 methods.
.SS "new"
.IX Subsection "new"
Title : new
.PP
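Function: Create a new Grinder factory initialized with the passed arguments.
The available parameters are described in the \*(L"\s-1CLI\s0 \s-1REQUIRED\s0 \s-1ARGUMENTS\s0\*(R" and
\&\*(L"\s-1CLI\s0 \s-1OPTIONAL\s0 \s-1ARGUMENTS\s0\*(R" sections.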
.PP
Usage : my \f(CW$factory\fR = Grinder\->new( \-reference_file => 'genomes.fna' );
.PP
Returns : a new Grinder object
.SS "next_lib"
.IX Subsection "next_lib"
Title : next_lib
.PP
Function: Go to the next shotgun library to process.
.PP
Usage : my \f(CW$struct\fR = \f(CW$factory\fR\->next_lib;
.PP
Returns : Community structure to be used for this library, where \f(CW$struct\fR\->{ids}
is an array reference containing the IDs of the genomes making up the
community (sorted by decreasing relative abundance) and \f(CW$struct\fR\->{abs}
is an array reference of the genome abundances (in the same order as
the IDs).
.SS "next_read"
.IX Subsection "next_read"
Title : next_read
.PP
Function: Create an amplicon or shotgun read for the current library.
.PP
Usage : my \f(CW$read\fR = \f(CW$factory\fR\->next_read; # for single read
my \f(CW$mate1\fR = \f(CW$factory\fR\->next_read; # for mate pairs
my \f(CW$mate2\fR = \f(CW$factory\fR\->next_read;
.PP
Returns : A sequence represented as a Bio::Seq::SimulatedRead object
.SS "get_random_seed"
.IX Subsection "get_random_seed"
Title : get_random_seed
.PP
Function: Return the number used to seed the pseudo-random number generator
.PP
Usage : my \f(CW$seed\fR = \f(CW$factory\fR\->get_random_seed;
.PP
Returns : seed number
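.PP
As a sketch of how this method could be combined with the \-random_seed option
to repeat a simulation: it assumes that \-random_seed and \-total_reads are
accepted by new() like the other command-line options, that the seed can be
queried right after new(), and that an identical seed with identical options
gives identical reads. None of this is confirmed here.
.PP
.Vb 1
\&   use strict;
\&   use warnings;
\&   use Grinder;
\&
\&   # First run: let Grinder choose a seed, then record it
\&   my $factory = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq,
\&                               \-total_reads    => 10 );
\&   my $seed = $factory\->get_random_seed;
\&   print "Seed used: $seed\en";
\&
\&   # Later run: pass the recorded seed back (assumed to replay the same reads)
\&   my $replay = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq,
\&                              \-total_reads    => 10,
\&                              \-random_seed    => $seed );
\&   while ( $replay\->next_lib ) {
\&      while ( my $read = $replay\->next_read ) {
\&         print $read\->id, "\en";   # print the ID of each simulated read
\&      }
\&   }
.Ve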
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 2009\-2012 Florent \s-1ANGLY\s0
.PP
Grinder is free software: you can redistribute it and/or modify
it under the terms of the \s-1GNU\s0 General Public License (\s-1GPL\s0) as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
Grinder is distributed in the hope that it will be useful,
but \s-1WITHOUT\s0 \s-1ANY\s0 \s-1WARRANTY\s0; without even the implied warranty of
\&\s-1MERCHANTABILITY\s0 or \s-1FITNESS\s0 \s-1FOR\s0 A \s-1PARTICULAR\s0 \s-1PURPOSE\s0. See the
\&\s-1GNU\s0 General Public License for more details.
You should have received a copy of the \s-1GNU\s0 General Public License
along with Grinder. If not, see .
.SH "BUGS"
.IX Header "BUGS"
All complex software has bugs lurking in it, and this program is no exception.
If you find a bug, please report it on the SourceForge Tracker for Grinder:
.PP
Bug reports, suggestions and patches are welcome. Grinder's code is developed
on Sourceforge () and is
under Git revision control. To get started with a patch, do:
.PP
.Vb 1
\& git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder
.Ve
Grinder-0.5.3/man/change_paired_read_orientation.1 0000644 0001750 0001750 00000013140 12151575456 022357 0 ustar floflooo floflooo
.\" Automatically generated by Pod::Man 2.25 (Pod::Simple 3.26)
.\"
.\" Standard preamble:
.\" ========================================================================
.de Sp \" Vertical space (when we can't use .PP)
.if t .sp .5v
.if n .sp
..
.de Vb \" Begin verbatim text
.ft CW
.nf
.ne \\$1
..
.de Ve \" End verbatim text
.ft R
.fi
..
.\" Set up some character translations and predefined strings. \*(-- will
.\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left
.\" double quote, and \*(R" will give a right double quote. \*(C+ will
.\" give a nicer C++. Capital omega is used to do unbreakable dashes and
.\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff,
.\" nothing in troff, for use with C<>.
.tr \(*W-
.ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p'
.ie n \{\
. ds -- \(*W-
. ds PI pi
. if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch
. if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch
. ds L" ""
. ds R" ""
. ds C` ""
. ds C' ""
'br\}
.el\{\
. ds -- \|\(em\|
. ds PI \(*p
. ds L" ``
. ds R" ''
'br\}
.\"
.\" Escape single quotes in literal strings from groff's Unicode transform.
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\"
.\" If the F register is turned on, we'll generate index entries on stderr for
.\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index
.\" entries marked with X<> in POD. Of course, you'll have to process the
.\" output yourself in some meaningful fashion.
.ie \nF \{\
. de IX
. tm Index:\\$1\t\\n%\t"\\$2"
..
. nr % 0
. rr F
.\}
.el \{\
. de IX
..
.\}
.\"
.\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2).
.\" Fear. Run. Save yourself. No user-serviceable parts.
. \" fudge factors for nroff and troff
.if n \{\
. ds #H 0
. ds #V .8m
. ds #F .3m
. ds #[ \f1
. ds #] \fP
.\}
.if t \{\
. ds #H ((1u-(\\\\n(.fu%2u))*.13m)
. ds #V .6m
. ds #F 0
. ds #[ \&
. ds #] \&
.\}
. \" simple accents for nroff and troff
.if n \{\
. ds ' \&
. ds ` \&
. ds ^ \&
. ds , \&
. ds ~ ~
. ds /
.\}
.if t \{\
. ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u"
. ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u'
. ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u'
. ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u'
. ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u'
. ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u'
.\}
. \" troff and (daisy-wheel) nroff accents
.ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V'
.ds 8 \h'\*(#H'\(*b\h'-\*(#H'
.ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#]
.ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H'
.ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u'
.ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#]
.ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#]
.ds ae a\h'-(\w'a'u*4/10)'e
.ds Ae A\h'-(\w'A'u*4/10)'E
. \" corrections for vroff
.if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u'
.if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u'
. \" for low resolution devices (crt and lpr)
.if \n(.H>23 .if \n(.V>19 \
\{\
. ds : e
. ds 8 ss
. ds o a
. ds d- d\h'-1'\(ga
. ds D- D\h'-1'\(hy
. ds th \o'bp'
. ds Th \o'LP'
. ds ae ae
. ds Ae AE
.\}
.rm #[ #] #H #V #F C
.\" ========================================================================
.\"
.IX Title "CHANGE_PAIRED_READ_ORIENTATION 1"
.TH CHANGE_PAIRED_READ_ORIENTATION 1 "2012-11-20" "perl v5.14.2" "User Contributed Perl Documentation"
.\" For nroff, turn off justification. Always turn off hyphenation; it makes
.\" way too many mistakes in technical documents.
.if n .ad l
.nh
.SH "NAME"
change_paired_read_orientation \- Change the orientation of paired\-end reads in a
FASTA file
.SH "DESCRIPTION"
.IX Header "DESCRIPTION"
Reverse the orientation, i.e. reverse-complement each right-hand paired-end read
(\s-1ID\s0 ending in /2) in a \s-1FASTA\s0 file.
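.PP
As an illustration only, the re-orientation described above can be sketched in
Python. This is not the Perl code shipped with Grinder: the names
reverse_complement and reorient are hypothetical, the read \s-1ID\s0 is assumed to be
the first whitespace-delimited token of the header line, and each sequence is
written back on a single line.
.PP
.Vb 6
```python
# Hypothetical sketch; not the implementation shipped with Grinder.
# Complement table for DNA, including IUPAC ambiguity codes.
COMPLEMENT = str.maketrans(
    "ACGTUMRWSYKVHDBNacgtumrwsykvhdbn",
    "TGCAAKYWSRMBDHVNtgcaakywsrmbdhvn",
)


def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def reorient(fasta_lines):
    """Return FASTA lines with every read whose ID ends in /2 reverse-complemented."""
    records = []  # list of [header, sequence] pairs
    for line in fasta_lines:
        line = line.strip()
        if line.startswith(">"):
            records.append([line, ""])
        elif line and records:
            # Multi-line sequences are joined before being reversed.
            records[-1][1] += line
    out = []
    for header, seq in records:
        fields = header[1:].split()
        read_id = fields[0] if fields else ""
        if read_id.endswith("/2"):
            seq = reverse_complement(seq)
        out.extend([header, seq])
    return out
```
.Ve
.PP
Passing the lines of the input file to reorient and writing the returned lines
to the output file would mirror the two required arguments.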
.SH "REQUIRED ARGUMENTS"
.IX Header "REQUIRED ARGUMENTS"
.IP "" 4
.IX Item ""
\&\s-1FASTA\s0 file containing the reads to re-orient.
.IP "" 4
.IX Item ""
Output \s-1FASTA\s0 file to which the re-oriented reads are written.
.SH "COPYRIGHT"
.IX Header "COPYRIGHT"
Copyright 2009\-2012 Florent \s-1ANGLY\s0
.PP
Grinder is free software: you can redistribute it and/or modify
it under the terms of the \s-1GNU\s0 General Public License (\s-1GPL\s0) as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
Grinder is distributed in the hope that it will be useful,
but \s-1WITHOUT\s0 \s-1ANY\s0 \s-1WARRANTY\s0; without even the implied warranty of
\&\s-1MERCHANTABILITY\s0 or \s-1FITNESS\s0 \s-1FOR\s0 A \s-1PARTICULAR\s0 \s-1PURPOSE\s0. See the
\&\s-1GNU\s0 General Public License for more details.
You should have received a copy of the \s-1GNU\s0 General Public License
along with Grinder. If not, see http://www.gnu.org/licenses/.
.SH "BUGS"
.IX Header "BUGS"
All complex software has bugs lurking in it, and this program is no exception.
If you find a bug, please report it on the SourceForge Tracker for Grinder.
.PP
Bug reports, suggestions and patches are welcome. Grinder's code is developed
on SourceForge and is
under Git revision control. To get started with a patch, do:
.PP
.Vb 1
\& git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder
.Ve
Grinder-0.5.3/LICENSE 0000644 0001750 0001750 00000104774 12151575455 014365 0 ustar floflooo floflooo This software is Copyright (c) 2013 by Florent Angly.
This is free software, licensed under:
The GNU General Public License, Version 3, June 2007
GNU GENERAL PUBLIC LICENSE
Version 3, 29 June 2007
Copyright (C) 2007 Free Software Foundation, Inc.
Everyone is permitted to copy and distribute verbatim copies
of this license document, but changing it is not allowed.
Preamble
The GNU General Public License is a free, copyleft license for
software and other kinds of works.
The licenses for most software and other practical works are designed
to take away your freedom to share and change the works. By contrast,
the GNU General Public License is intended to guarantee your freedom to
share and change all versions of a program--to make sure it remains free
software for all its users. We, the Free Software Foundation, use the
GNU General Public License for most of our software; it applies also to
any other work released this way by its authors. You can apply it to
your programs, too.
When we speak of free software, we are referring to freedom, not
price. Our General Public Licenses are designed to make sure that you
have the freedom to distribute copies of free software (and charge for
them if you wish), that you receive source code or can get it if you
want it, that you can change the software or use pieces of it in new
free programs, and that you know you can do these things.
To protect your rights, we need to prevent others from denying you
these rights or asking you to surrender the rights. Therefore, you have
certain responsibilities if you distribute copies of the software, or if
you modify it: responsibilities to respect the freedom of others.
For example, if you distribute copies of such a program, whether
gratis or for a fee, you must pass on to the recipients the same
freedoms that you received. You must make sure that they, too, receive
or can get the source code. And you must show them these terms so they
know their rights.
Developers that use the GNU GPL protect your rights with two steps:
(1) assert copyright on the software, and (2) offer you this License
giving you legal permission to copy, distribute and/or modify it.
For the developers' and authors' protection, the GPL clearly explains
that there is no warranty for this free software. For both users' and
authors' sake, the GPL requires that modified versions be marked as
changed, so that their problems will not be attributed erroneously to
authors of previous versions.
Some devices are designed to deny users access to install or run
modified versions of the software inside them, although the manufacturer
can do so. This is fundamentally incompatible with the aim of
protecting users' freedom to change the software. The systematic
pattern of such abuse occurs in the area of products for individuals to
use, which is precisely where it is most unacceptable. Therefore, we
have designed this version of the GPL to prohibit the practice for those
products. If such problems arise substantially in other domains, we
stand ready to extend this provision to those domains in future versions
of the GPL, as needed to protect the freedom of users.
Finally, every program is threatened constantly by software patents.
States should not allow patents to restrict development and use of
software on general-purpose computers, but in those that do, we wish to
avoid the special danger that patents applied to a free program could
make it effectively proprietary. To prevent this, the GPL assures that
patents cannot be used to render the program non-free.
The precise terms and conditions for copying, distribution and
modification follow.
TERMS AND CONDITIONS
0. Definitions.
"This License" refers to version 3 of the GNU General Public License.
"Copyright" also means copyright-like laws that apply to other kinds of
works, such as semiconductor masks.
"The Program" refers to any copyrightable work licensed under this
License. Each licensee is addressed as "you". "Licensees" and
"recipients" may be individuals or organizations.
To "modify" a work means to copy from or adapt all or part of the work
in a fashion requiring copyright permission, other than the making of an
exact copy. The resulting work is called a "modified version" of the
earlier work or a work "based on" the earlier work.
A "covered work" means either the unmodified Program or a work based
on the Program.
To "propagate" a work means to do anything with it that, without
permission, would make you directly or secondarily liable for
infringement under applicable copyright law, except executing it on a
computer or modifying a private copy. Propagation includes copying,
distribution (with or without modification), making available to the
public, and in some countries other activities as well.
To "convey" a work means any kind of propagation that enables other
parties to make or receive copies. Mere interaction with a user through
a computer network, with no transfer of a copy, is not conveying.
An interactive user interface displays "Appropriate Legal Notices"
to the extent that it includes a convenient and prominently visible
feature that (1) displays an appropriate copyright notice, and (2)
tells the user that there is no warranty for the work (except to the
extent that warranties are provided), that licensees may convey the
work under this License, and how to view a copy of this License. If
the interface presents a list of user commands or options, such as a
menu, a prominent item in the list meets this criterion.
1. Source Code.
The "source code" for a work means the preferred form of the work
for making modifications to it. "Object code" means any non-source
form of a work.
A "Standard Interface" means an interface that either is an official
standard defined by a recognized standards body, or, in the case of
interfaces specified for a particular programming language, one that
is widely used among developers working in that language.
The "System Libraries" of an executable work include anything, other
than the work as a whole, that (a) is included in the normal form of
packaging a Major Component, but which is not part of that Major
Component, and (b) serves only to enable use of the work with that
Major Component, or to implement a Standard Interface for which an
implementation is available to the public in source code form. A
"Major Component", in this context, means a major essential component
(kernel, window system, and so on) of the specific operating system
(if any) on which the executable work runs, or a compiler used to
produce the work, or an object code interpreter used to run it.
The "Corresponding Source" for a work in object code form means all
the source code needed to generate, install, and (for an executable
work) run the object code and to modify the work, including scripts to
control those activities. However, it does not include the work's
System Libraries, or general-purpose tools or generally available free
programs which are used unmodified in performing those activities but
which are not part of the work. For example, Corresponding Source
includes interface definition files associated with source files for
the work, and the source code for shared libraries and dynamically
linked subprograms that the work is specifically designed to require,
such as by intimate data communication or control flow between those
subprograms and other parts of the work.
The Corresponding Source need not include anything that users
can regenerate automatically from other parts of the Corresponding
Source.
The Corresponding Source for a work in source code form is that
same work.
2. Basic Permissions.
All rights granted under this License are granted for the term of
copyright on the Program, and are irrevocable provided the stated
conditions are met. This License explicitly affirms your unlimited
permission to run the unmodified Program. The output from running a
covered work is covered by this License only if the output, given its
content, constitutes a covered work. This License acknowledges your
rights of fair use or other equivalent, as provided by copyright law.
You may make, run and propagate covered works that you do not
convey, without conditions so long as your license otherwise remains
in force. You may convey covered works to others for the sole purpose
of having them make modifications exclusively for you, or provide you
with facilities for running those works, provided that you comply with
the terms of this License in conveying all material for which you do
not control copyright. Those thus making or running the covered works
for you must do so exclusively on your behalf, under your direction
and control, on terms that prohibit them from making any copies of
your copyrighted material outside their relationship with you.
Conveying under any other circumstances is permitted solely under
the conditions stated below. Sublicensing is not allowed; section 10
makes it unnecessary.
3. Protecting Users' Legal Rights From Anti-Circumvention Law.
No covered work shall be deemed part of an effective technological
measure under any applicable law fulfilling obligations under article
11 of the WIPO copyright treaty adopted on 20 December 1996, or
similar laws prohibiting or restricting circumvention of such
measures.
When you convey a covered work, you waive any legal power to forbid
circumvention of technological measures to the extent such circumvention
is effected by exercising rights under this License with respect to
the covered work, and you disclaim any intention to limit operation or
modification of the work as a means of enforcing, against the work's
users, your or third parties' legal rights to forbid circumvention of
technological measures.
4. Conveying Verbatim Copies.
You may convey verbatim copies of the Program's source code as you
receive it, in any medium, provided that you conspicuously and
appropriately publish on each copy an appropriate copyright notice;
keep intact all notices stating that this License and any
non-permissive terms added in accord with section 7 apply to the code;
keep intact all notices of the absence of any warranty; and give all
recipients a copy of this License along with the Program.
You may charge any price or no price for each copy that you convey,
and you may offer support or warranty protection for a fee.
5. Conveying Modified Source Versions.
You may convey a work based on the Program, or the modifications to
produce it from the Program, in the form of source code under the
terms of section 4, provided that you also meet all of these conditions:
a) The work must carry prominent notices stating that you modified
it, and giving a relevant date.
b) The work must carry prominent notices stating that it is
released under this License and any conditions added under section
7. This requirement modifies the requirement in section 4 to
"keep intact all notices".
c) You must license the entire work, as a whole, under this
License to anyone who comes into possession of a copy. This
License will therefore apply, along with any applicable section 7
additional terms, to the whole of the work, and all its parts,
regardless of how they are packaged. This License gives no
permission to license the work in any other way, but it does not
invalidate such permission if you have separately received it.
d) If the work has interactive user interfaces, each must display
Appropriate Legal Notices; however, if the Program has interactive
interfaces that do not display Appropriate Legal Notices, your
work need not make them do so.
A compilation of a covered work with other separate and independent
works, which are not by their nature extensions of the covered work,
and which are not combined with it such as to form a larger program,
in or on a volume of a storage or distribution medium, is called an
"aggregate" if the compilation and its resulting copyright are not
used to limit the access or legal rights of the compilation's users
beyond what the individual works permit. Inclusion of a covered work
in an aggregate does not cause this License to apply to the other
parts of the aggregate.
6. Conveying Non-Source Forms.
You may convey a covered work in object code form under the terms
of sections 4 and 5, provided that you also convey the
machine-readable Corresponding Source under the terms of this License,
in one of these ways:
a) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by the
Corresponding Source fixed on a durable physical medium
customarily used for software interchange.
b) Convey the object code in, or embodied in, a physical product
(including a physical distribution medium), accompanied by a
written offer, valid for at least three years and valid for as
long as you offer spare parts or customer support for that product
model, to give anyone who possesses the object code either (1) a
copy of the Corresponding Source for all the software in the
product that is covered by this License, on a durable physical
medium customarily used for software interchange, for a price no
more than your reasonable cost of physically performing this
conveying of source, or (2) access to copy the
Corresponding Source from a network server at no charge.
c) Convey individual copies of the object code with a copy of the
written offer to provide the Corresponding Source. This
alternative is allowed only occasionally and noncommercially, and
only if you received the object code with such an offer, in accord
with subsection 6b.
d) Convey the object code by offering access from a designated
place (gratis or for a charge), and offer equivalent access to the
Corresponding Source in the same way through the same place at no
further charge. You need not require recipients to copy the
Corresponding Source along with the object code. If the place to
copy the object code is a network server, the Corresponding Source
may be on a different server (operated by you or a third party)
that supports equivalent copying facilities, provided you maintain
clear directions next to the object code saying where to find the
Corresponding Source. Regardless of what server hosts the
Corresponding Source, you remain obligated to ensure that it is
available for as long as needed to satisfy these requirements.
e) Convey the object code using peer-to-peer transmission, provided
you inform other peers where the object code and Corresponding
Source of the work are being offered to the general public at no
charge under subsection 6d.
A separable portion of the object code, whose source code is excluded
from the Corresponding Source as a System Library, need not be
included in conveying the object code work.
A "User Product" is either (1) a "consumer product", which means any
tangible personal property which is normally used for personal, family,
or household purposes, or (2) anything designed or sold for incorporation
into a dwelling. In determining whether a product is a consumer product,
doubtful cases shall be resolved in favor of coverage. For a particular
product received by a particular user, "normally used" refers to a
typical or common use of that class of product, regardless of the status
of the particular user or of the way in which the particular user
actually uses, or expects or is expected to use, the product. A product
is a consumer product regardless of whether the product has substantial
commercial, industrial or non-consumer uses, unless such uses represent
the only significant mode of use of the product.
"Installation Information" for a User Product means any methods,
procedures, authorization keys, or other information required to install
and execute modified versions of a covered work in that User Product from
a modified version of its Corresponding Source. The information must
suffice to ensure that the continued functioning of the modified object
code is in no case prevented or interfered with solely because
modification has been made.
If you convey an object code work under this section in, or with, or
specifically for use in, a User Product, and the conveying occurs as
part of a transaction in which the right of possession and use of the
User Product is transferred to the recipient in perpetuity or for a
fixed term (regardless of how the transaction is characterized), the
Corresponding Source conveyed under this section must be accompanied
by the Installation Information. But this requirement does not apply
if neither you nor any third party retains the ability to install
modified object code on the User Product (for example, the work has
been installed in ROM).
The requirement to provide Installation Information does not include a
requirement to continue to provide support service, warranty, or updates
for a work that has been modified or installed by the recipient, or for
the User Product in which it has been modified or installed. Access to a
network may be denied when the modification itself materially and
adversely affects the operation of the network or violates the rules and
protocols for communication across the network.
Corresponding Source conveyed, and Installation Information provided,
in accord with this section must be in a format that is publicly
documented (and with an implementation available to the public in
source code form), and must require no special password or key for
unpacking, reading or copying.
7. Additional Terms.
"Additional permissions" are terms that supplement the terms of this
License by making exceptions from one or more of its conditions.
Additional permissions that are applicable to the entire Program shall
be treated as though they were included in this License, to the extent
that they are valid under applicable law. If additional permissions
apply only to part of the Program, that part may be used separately
under those permissions, but the entire Program remains governed by
this License without regard to the additional permissions.
When you convey a copy of a covered work, you may at your option
remove any additional permissions from that copy, or from any part of
it. (Additional permissions may be written to require their own
removal in certain cases when you modify the work.) You may place
additional permissions on material, added by you to a covered work,
for which you have or can give appropriate copyright permission.
Notwithstanding any other provision of this License, for material you
add to a covered work, you may (if authorized by the copyright holders of
that material) supplement the terms of this License with terms:
a) Disclaiming warranty or limiting liability differently from the
terms of sections 15 and 16 of this License; or
b) Requiring preservation of specified reasonable legal notices or
author attributions in that material or in the Appropriate Legal
Notices displayed by works containing it; or
c) Prohibiting misrepresentation of the origin of that material, or
requiring that modified versions of such material be marked in
reasonable ways as different from the original version; or
d) Limiting the use for publicity purposes of names of licensors or
authors of the material; or
e) Declining to grant rights under trademark law for use of some
trade names, trademarks, or service marks; or
f) Requiring indemnification of licensors and authors of that
material by anyone who conveys the material (or modified versions of
it) with contractual assumptions of liability to the recipient, for
any liability that these contractual assumptions directly impose on
those licensors and authors.
All other non-permissive additional terms are considered "further
restrictions" within the meaning of section 10. If the Program as you
received it, or any part of it, contains a notice stating that it is
governed by this License along with a term that is a further
restriction, you may remove that term. If a license document contains
a further restriction but permits relicensing or conveying under this
License, you may add to a covered work material governed by the terms
of that license document, provided that the further restriction does
not survive such relicensing or conveying.
If you add terms to a covered work in accord with this section, you
must place, in the relevant source files, a statement of the
additional terms that apply to those files, or a notice indicating
where to find the applicable terms.
Additional terms, permissive or non-permissive, may be stated in the
form of a separately written license, or stated as exceptions;
the above requirements apply either way.
8. Termination.
You may not propagate or modify a covered work except as expressly
provided under this License. Any attempt otherwise to propagate or
modify it is void, and will automatically terminate your rights under
this License (including any patent licenses granted under the third
paragraph of section 11).
However, if you cease all violation of this License, then your
license from a particular copyright holder is reinstated (a)
provisionally, unless and until the copyright holder explicitly and
finally terminates your license, and (b) permanently, if the copyright
holder fails to notify you of the violation by some reasonable means
prior to 60 days after the cessation.
Moreover, your license from a particular copyright holder is
reinstated permanently if the copyright holder notifies you of the
violation by some reasonable means, this is the first time you have
received notice of violation of this License (for any work) from that
copyright holder, and you cure the violation prior to 30 days after
your receipt of the notice.
Termination of your rights under this section does not terminate the
licenses of parties who have received copies or rights from you under
this License. If your rights have been terminated and not permanently
reinstated, you do not qualify to receive new licenses for the same
material under section 10.
9. Acceptance Not Required for Having Copies.
You are not required to accept this License in order to receive or
run a copy of the Program. Ancillary propagation of a covered work
occurring solely as a consequence of using peer-to-peer transmission
to receive a copy likewise does not require acceptance. However,
nothing other than this License grants you permission to propagate or
modify any covered work. These actions infringe copyright if you do
not accept this License. Therefore, by modifying or propagating a
covered work, you indicate your acceptance of this License to do so.
10. Automatic Licensing of Downstream Recipients.
Each time you convey a covered work, the recipient automatically
receives a license from the original licensors, to run, modify and
propagate that work, subject to this License. You are not responsible
for enforcing compliance by third parties with this License.
An "entity transaction" is a transaction transferring control of an
organization, or substantially all assets of one, or subdividing an
organization, or merging organizations. If propagation of a covered
work results from an entity transaction, each party to that
transaction who receives a copy of the work also receives whatever
licenses to the work the party's predecessor in interest had or could
give under the previous paragraph, plus a right to possession of the
Corresponding Source of the work from the predecessor in interest, if
the predecessor has it or can get it with reasonable efforts.
You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License, and you may not initiate litigation
(including a cross-claim or counterclaim in a lawsuit) alleging that
any patent claim is infringed by making, using, selling, offering for
sale, or importing the Program or any portion of it.
11. Patents.
A "contributor" is a copyright holder who authorizes use under this
License of the Program or a work on which the Program is based. The
work thus licensed is called the contributor's "contributor version".
A contributor's "essential patent claims" are all patent claims
owned or controlled by the contributor, whether already acquired or
hereafter acquired, that would be infringed by some manner, permitted
by this License, of making, using, or selling its contributor version,
but do not include claims that would be infringed only as a
consequence of further modification of the contributor version. For
purposes of this definition, "control" includes the right to grant
patent sublicenses in a manner consistent with the requirements of
this License.
Each contributor grants you a non-exclusive, worldwide, royalty-free
patent license under the contributor's essential patent claims, to
make, use, sell, offer for sale, import and otherwise run, modify and
propagate the contents of its contributor version.
In the following three paragraphs, a "patent license" is any express
agreement or commitment, however denominated, not to enforce a patent
(such as an express permission to practice a patent or covenant not to
sue for patent infringement). To "grant" such a patent license to a
party means to make such an agreement or commitment not to enforce a
patent against the party.
If you convey a covered work, knowingly relying on a patent license,
and the Corresponding Source of the work is not available for anyone
to copy, free of charge and under the terms of this License, through a
publicly available network server or other readily accessible means,
then you must either (1) cause the Corresponding Source to be so
available, or (2) arrange to deprive yourself of the benefit of the
patent license for this particular work, or (3) arrange, in a manner
consistent with the requirements of this License, to extend the patent
license to downstream recipients. "Knowingly relying" means you have
actual knowledge that, but for the patent license, your conveying the
covered work in a country, or your recipient's use of the covered work
in a country, would infringe one or more identifiable patents in that
country that you have reason to believe are valid.
If, pursuant to or in connection with a single transaction or
arrangement, you convey, or propagate by procuring conveyance of, a
covered work, and grant a patent license to some of the parties
receiving the covered work authorizing them to use, propagate, modify
or convey a specific copy of the covered work, then the patent license
you grant is automatically extended to all recipients of the covered
work and works based on it.
A patent license is "discriminatory" if it does not include within
the scope of its coverage, prohibits the exercise of, or is
conditioned on the non-exercise of one or more of the rights that are
specifically granted under this License. You may not convey a covered
work if you are a party to an arrangement with a third party that is
in the business of distributing software, under which you make payment
to the third party based on the extent of your activity of conveying
the work, and under which the third party grants, to any of the
parties who would receive the covered work from you, a discriminatory
patent license (a) in connection with copies of the covered work
conveyed by you (or copies made from those copies), or (b) primarily
for and in connection with specific products or compilations that
contain the covered work, unless you entered into that arrangement,
or that patent license was granted, prior to 28 March 2007.
Nothing in this License shall be construed as excluding or limiting
any implied license or other defenses to infringement that may
otherwise be available to you under applicable patent law.
12. No Surrender of Others' Freedom.
If conditions are imposed on you (whether by court order, agreement or
otherwise) that contradict the conditions of this License, they do not
excuse you from the conditions of this License. If you cannot convey a
covered work so as to satisfy simultaneously your obligations under this
License and any other pertinent obligations, then as a consequence you may
not convey it at all. For example, if you agree to terms that obligate you
to collect a royalty for further conveying from those to whom you convey
the Program, the only way you could satisfy both those terms and this
License would be to refrain entirely from conveying the Program.
13. Use with the GNU Affero General Public License.
Notwithstanding any other provision of this License, you have
permission to link or combine any covered work with a work licensed
under version 3 of the GNU Affero General Public License into a single
combined work, and to convey the resulting work. The terms of this
License will continue to apply to the part which is the covered work,
but the special requirements of the GNU Affero General Public License,
section 13, concerning interaction through a network will apply to the
combination as such.
14. Revised Versions of this License.
The Free Software Foundation may publish revised and/or new versions of
the GNU General Public License from time to time. Such new versions will
be similar in spirit to the present version, but may differ in detail to
address new problems or concerns.
Each version is given a distinguishing version number. If the
Program specifies that a certain numbered version of the GNU General
Public License "or any later version" applies to it, you have the
option of following the terms and conditions either of that numbered
version or of any later version published by the Free Software
Foundation. If the Program does not specify a version number of the
GNU General Public License, you may choose any version ever published
by the Free Software Foundation.
If the Program specifies that a proxy can decide which future
versions of the GNU General Public License can be used, that proxy's
public statement of acceptance of a version permanently authorizes you
to choose that version for the Program.
Later license versions may give you additional or different
permissions. However, no additional obligations are imposed on any
author or copyright holder as a result of your choosing to follow a
later version.
15. Disclaimer of Warranty.
THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY
APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT
HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY
OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO,
THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM
IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF
ALL NECESSARY SERVICING, REPAIR OR CORRECTION.
16. Limitation of Liability.
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING
WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS
THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY
GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE
USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF
DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD
PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS),
EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF
SUCH DAMAGES.
17. Interpretation of Sections 15 and 16.
If the disclaimer of warranty and limitation of liability provided
above cannot be given local legal effect according to their terms,
reviewing courts shall apply local law that most closely approximates
an absolute waiver of all civil liability in connection with the
Program, unless a warranty or assumption of liability accompanies a
copy of the Program in return for a fee.
END OF TERMS AND CONDITIONS
How to Apply These Terms to Your New Programs
If you develop a new program, and you want it to be of the greatest
possible use to the public, the best way to achieve this is to make it
free software which everyone can redistribute and change under these terms.
To do so, attach the following notices to the program. It is safest
to attach them to the start of each source file to most effectively
state the exclusion of warranty; and each file should have at least
the "copyright" line and a pointer to where the full notice is found.
Copyright (C)
This program is free software: you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation, either version 3 of the License, or
(at your option) any later version.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program. If not, see .
Also add information on how to contact you by electronic and paper mail.
If the program does terminal interaction, make it output a short
notice like this when it starts in an interactive mode:
Copyright (C)
This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'.
This is free software, and you are welcome to redistribute it
under certain conditions; type `show c' for details.
The hypothetical commands `show w' and `show c' should show the appropriate
parts of the General Public License. Of course, your program's commands
might be different; for a GUI interface, you would use an "about box".
You should also get your employer (if you work as a programmer) or school,
if any, to sign a "copyright disclaimer" for the program, if necessary.
For more information on this, and how to apply and follow the GNU GPL, see
.
The GNU General Public License does not permit incorporating your program
into proprietary programs. If your program is a subroutine library, you
may consider it more useful to permit linking proprietary applications with
the library. If this is what you want to do, use the GNU Lesser General
Public License instead of this License. But first, please read
.
Grinder-0.5.3/lib/ 0000755 0001750 0001750 00000000000 12151575606 014107 5 ustar floflooo floflooo Grinder-0.5.3/lib/Bio/ 0000755 0001750 0001750 00000000000 12151575606 014620 5 ustar floflooo floflooo Grinder-0.5.3/lib/Bio/Seq/ 0000755 0001750 0001750 00000000000 12151575606 015350 5 ustar floflooo floflooo Grinder-0.5.3/lib/Bio/Seq/SeqFastaSpeedFactory.pm 0000644 0001750 0001750 00000007566 12052601553 021733 0 ustar floflooo floflooo
#
# BioPerl module for Bio::Seq::SeqFastaSpeedFactory
#
# Please direct questions and support issues to
#
# Cared for by Jason Stajich
#
# Copyright Jason Stajich
#
# You may distribute this module under the same terms as perl itself
# POD documentation - main docs before the code
=head1 NAME
Bio::Seq::SeqFastaSpeedFactory - Rapid creation of Bio::Seq objects through a factory
=head1 SYNOPSIS
use Bio::Seq::SeqFastaSpeedFactory;
my $factory = Bio::Seq::SeqFastaSpeedFactory->new();
my $seq = $factory->create( -seq => 'WYRAVLC',
-id => 'name' );
=head1 DESCRIPTION
This factory was designed to build Bio::Seq objects as quickly as possible, but
is not as generic as L. It can be used to create sequences
from non-rich file formats. The L sequence parser uses this
factory.
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to
the Bioperl mailing list. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:
https://redmine.open-bio.org/projects/bioperl/
=head1 AUTHOR - Jason Stajich
Email jason@bioperl.org
=head1 APPENDIX
The rest of the documentation details each of the object methods.
Internal methods are usually preceded with a _
=cut
# Let the code begin...
package Bio::Seq::SeqFastaSpeedFactory;
use strict;
use Bio::Seq;
use Bio::PrimarySeq;
use base qw(Bio::Root::Root Bio::Factory::SequenceFactoryI);
=head2 new
Title : new
Usage : my $obj = Bio::Seq::SeqFastaSpeedFactory->new();
Function: Builds a new Bio::Seq::SeqFastaSpeedFactory object
Returns : Bio::Seq::SeqFastaSpeedFactory
Args : None
=cut
sub new {
my($class,@args) = @_;
my $self = $class->SUPER::new(@args);
return $self;
}
=head2 create
Title : create
Usage : my $seq = $seqbuilder->create(-seq => 'CAGT', -id => 'name');
Function: Instantiates a new, correctly built Bio::Seq object very quickly by
relying on knowledge of the internals of Bio::PrimarySeq and Bio::Seq
Returns : A Bio::Seq object
Args : Initialization parameters for the sequence object we want:
-id
-primary_id
-display_id
-desc
-seq
-alphabet
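The key handling in create() can be sketched in plain Perl, with no BioPerl dependency. This is only an illustration: the helper name normalize_params is hypothetical and not part of the module. It shows how adding a lowercase copy of every key makes -SEQ and -seq interchangeable, and how -id falls back to -primary_id.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the module): add a lowercase copy of
# every named argument, as create() does before reading its parameters.
sub normalize_params {
    my %param = @_;
    @param{ map { lc $_ } keys %param } = values %param;
    return %param;
}

my %p = normalize_params( -SEQ => 'WYRAVLC', -Id => 'name' );

# Same fallback as in create(): prefer -id, otherwise use -primary_id
my $id = defined $p{'-id'} ? $p{'-id'} : $p{'-primary_id'};
print "$id $p{'-seq'}\n";
```

The original mixed-case keys remain in the hash next to their lowercase copies; only the lowercase ones are read afterwards.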
=cut
sub create {
my ($self,@args) = @_;
my %param = @args;
@param{ map { lc $_ } keys %param } = values %param; # lowercase keys
my $sequence = $param{'-seq'};
my $fulldesc = $param{'-desc'};
my $id = defined $param{'-id'} ? $param{'-id'} : $param{'-primary_id'};
my $alphabet = $param{'-alphabet'};
my $seq = bless {}, 'Bio::Seq';
my $t_pseq = $seq->{'primary_seq'} = bless {}, 'Bio::PrimarySeq';
$t_pseq->{'seq'} = $sequence;
$t_pseq->{'length'} = CORE::length($sequence);
$t_pseq->{'desc'} = $fulldesc;
$t_pseq->{'display_id'} = $id;
$t_pseq->{'primary_id'} = $id;
$seq->{'primary_id'} = $id; # currently Bio::Seq does not delegate this
if( $sequence and !$alphabet ) {
$t_pseq->_guess_alphabet();
} elsif ( $sequence and $alphabet ) {
$t_pseq->{'alphabet'} = $alphabet;
}
return $seq;
}
1;
Grinder-0.5.3/lib/Bio/Seq/SimulatedRead.pm 0000644 0001750 0001750 00000054076 12052037144 020433 0 ustar floflooo floflooo
package Bio::Seq::SimulatedRead;
=head1 NAME
Bio::Seq::SimulatedRead - Read with sequencing errors taken from a reference sequence
=head1 SYNOPSIS
use Bio::Seq::SimulatedRead;
use Bio::PrimarySeq;
# Create a reference sequence
my $genome = Bio::PrimarySeq->new( -id => 'human_chr2',
-seq => 'TAAAAAAACCCCTG',
-desc => 'The human genome' );
# A 10-bp error-free read taken from a genome
my $read = Bio::Seq::SimulatedRead->new(
-reference => $genome , # sequence to generate the read from
-id => 'read001', # read ID
-start => 3 , # start of the read on the genome forward strand
-end => 12 , # end of the read on the genome forward strand
-strand => 1 , # genome strand that the read is on
);
# Display the sequence of the read
print $read->seq."\n";
# Add a tag or MID to the beginning of the read
$read->mid('ACGT');
# Add sequencing errors (error positions are 1-based and relative to the
# error-free MID-containing read)
my $errors = {};
$errors->{'8'}->{'+'} = 'AAA'; # insertion of AAA after residue 8
$errors->{'1'}->{'%'} = 'G'; # substitution of residue 1 by a G
$errors->{'4'}->{'-'} = undef; # deletion of residue 4
$read->errors($errors);
# Display the sequence of the read with errors
print $read->seq."\n";
# String representation of where the read came from and its errors
print $read->desc."\n";
=head1 DESCRIPTION
This object is a simulated read with sequencing errors. The user can provide a
reference sequence to take a read from, the position and orientation of the
read on the reference sequence, and the sequencing errors to generate.
The sequence of the read is automatically calculated based on this information.
By default, the description of a read contains tracking information and will
look like this (Bioperl-style):
reference=human_chr2 start=3 end=12 strand=-1 mid=ACGT errors=1%G,4-,8+AAA description="The human genome"
or Genbank-style:
reference=human_chr2 position=complement(3..12) mid=ACGT errors=1%G,4-,8+AAA description="The human genome"
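Because the tracking description is a list of key=value pairs, it can be read back with a small parser. The following is a minimal sketch in plain Perl, assuming that values are either a double-quoted string (with \" escapes) or a run of non-space characters. The function name parse_read_desc is hypothetical and not provided by this module.

```perl
use strict;
use warnings;

# Hypothetical helper: split a tracking description into key/value pairs.
# A value is either a double-quoted string or a run of non-space characters.
sub parse_read_desc {
    my ($desc) = @_;
    my %field;
    while ( $desc =~ m/(\w+)=(?:"((?:[^"\\]|\\.)*)"|(\S+))/g ) {
        my ($key, $quoted, $bare) = ($1, $2, $3);
        if (defined $quoted) {
            $quoted =~ s/\\"/"/g;    # undo the \" escaping
            $field{$key} = $quoted;
        } else {
            $field{$key} = $bare;
        }
    }
    return \%field;
}

my $desc = 'reference=human_chr2 start=3 end=12 strand=-1 mid=ACGT '
         . 'errors=1%G,4-,8+AAA description="The human genome"';
my $f = parse_read_desc($desc);
print "$f->{reference} $f->{start}..$f->{end} ($f->{strand})\n";
print "$f->{description}\n";
```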
Creating a simulated read follows these steps:
1/ Define the read start(), end(), strand() and qual_levels() if you want
quality scores to be generated. Do not change these values once set because
the read will not be updated.
2/ Specify the reference sequence that the read should be taken from. Once
this is done, you have a fully functional read. Do not use the reference()
method again after you have gone to the next step.
3/ Use mid() to input a MID (or tag or barcode) to add to the beginning of the
read. You can change the MID until you go to the next step.
4/ Give sequencing error specifications using errors() as the last step. You
can do that as many times as you like, and the read will be updated.
=head1 AUTHOR
Florent E Angly E<lt>florent . angly @ gmail-dot-comE<gt>.
Copyright (c) 2011 Florent E Angly.
This library is free software; you can redistribute it under the GNU General
Public License version 3.
=cut
use strict;
use warnings;
use Bio::LocatableSeq;
use base qw( Bio::Seq::Quality Bio::LocatableSeq );
=head2 new
Title : new
Function : Create a new simulated read object
Usage : my $read = Bio::Seq::SimulatedRead->new(
-id => 'read001',
-reference => $seq_obj ,
-errors => $errors ,
-start => 10 ,
-end => 135 ,
-strand => 1 ,
);
Arguments: -reference => Bio::SeqI, Bio::PrimarySeqI object representing the
reference sequence to take the read from. See
reference().
-errors => Hashref representing the position of errors in the read
See errors().
-mid => String of a MID to prepend to the read. See mid().
-track => Track where the read came from in the read description?
See track().
-coord_style => Define what coordinate system to use. See coord_style().
All other methods from Bio::LocatableSeq are available.
Returns : new Bio::Seq::SimulatedRead object
=cut
sub new {
my ($class, @args) = @_;
my $self = $class->SUPER::new(@args);
my ($qual_levels, $reference, $mid, $errors, $track, $coord_style) =
$self->_rearrange([qw(QUAL_LEVELS REFERENCE MID ERRORS TRACK COORD_STYLE)], @args);
$coord_style = defined $coord_style ? $coord_style : 'bioperl';
$self->coord_style($coord_style);
$track = defined $track ? $track : 1;
$self->track($track);
$qual_levels = defined $qual_levels ? $qual_levels : [];
$self->qual_levels($qual_levels) if defined $qual_levels;
$self->reference($reference) if defined $reference;
$self->mid($mid) if defined $mid;
$self->{_mutated} = 0;
$self->errors($errors) if defined $errors;
return $self;
}
=head2 qual_levels
Title : qual_levels
Function : Get or set the quality scores to give to the read. By default, if your
reference sequence does not have quality scores, no quality scores
are generated for the simulated read. The generated quality scores
are very basic. If a residue is error-free, it gets the quality score
defined for good residues. If the residue has an error (is an
addition or a mutation), the residue gets the quality score specified
for bad residues. Call the qual_levels() method before using the
reference() method.
Usage : my $qual_levels = $read->qual_levels( );
Arguments: Array reference containing the quality scores to use for:
1/ good residues (e.g. 30)
2/ bad residues (e.g. 10)
Returns : Array reference containing the quality scores to use.
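How the two quality levels end up in a read can be sketched with core Perl array operations. The values 30 and 10 are the example levels mentioned above; the read length and error positions are made up for illustration.

```perl
use strict;
use warnings;

my ($good, $bad) = (30, 10);     # example qual_levels
my @qual = ($good) x 6;          # error-free 6-residue read: all "good"

# One substitution at residue 2 and a 2-residue insertion after residue 4
# (1-based positions, as in the errors() specification)
splice @qual, 2 - 1, 1, $bad;    # the substituted residue becomes "bad"
splice @qual, 4, 0, ($bad) x 2;  # inserted residues are "bad"

print "@qual\n";
```

A deleted residue simply loses its score, so a deletion shortens the quality array by one element.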
=cut
sub qual_levels {
my ($self, $qual_levels) = @_;
if (defined $qual_levels) {
if ( (scalar @$qual_levels != 0) && (scalar @$qual_levels != 2) ) {
$self->throw("The quality score specification must define the score".
" to use for good and for bad residues\n");
}
$self->{qual_levels} = $qual_levels;
}
return $self->{qual_levels};
}
=head2 reference
Title : reference
Function : Get or set the reference sequence that the read comes from. Once the
reference has been set, you have a functional simulated read which
supports all the Bio::LocatableSeq methods. This method must be
called after qual_levels() but before mid() or errors().
Usage : my $seq_obj = $read->reference();
Arguments: Bio::SeqI or Bio::PrimarySeqI object
Returns : Bio::SeqI or Bio::PrimarySeqI object
=cut
sub reference {
my ($self, $reference) = @_;
if (defined $reference) {
# Sanity check 1
if ( (not $reference->isa('Bio::SeqI')) && (not $reference->isa('Bio::PrimarySeqI')) ) {
$self->throw("Expected a Bio::SeqI object as reference, but got: $reference\n");
}
# Sanity check 2
if ($self->{mid} || $self->{errors}) {
$self->throw("Cannot change the reference sequence after an MID or ".
"sequencing errors have been added to the read\n");
}
# Use beginning of reference sequence as start default
if (not defined $self->start) {
$self->start(1);
}
# Use end of reference sequence as end default
if (not defined $self->end) {
$self->end($reference->length);
}
# Use strand 1 as strand default
if (not defined $self->strand) {
$self->strand(1);
}
# Set the reference sequence object
$self->{reference} = $reference;
# Create a sequence, quality scores and description from the reference
$self->_create_seq;
$self->_create_qual if scalar @{$self->qual_levels};
$self->_create_desc if $self->track;
}
return $self->{reference};
}
sub _create_seq {
my $self = shift;
# Get a truncation of the reference sequence
my $reference = $self->reference;
my $read_obj = $reference->trunc( $self->start, $self->end );
# Reverse complement the read if needed
if ($self->strand == -1) {
$read_obj = $read_obj->revcom();
}
$self->seq($read_obj->seq);
return 1;
}
sub _create_qual {
my $self = shift;
$self->qual([ ($self->qual_levels->[0]) x ($self->end - $self->start + 1) ]);
return 1;
}
sub _create_desc {
# Create the read description of the error-free read
my $self = shift;
# Reference sequence ID
my $desc_str = '';
my $ref_id = $self->reference->id;
if (defined $ref_id) {
$desc_str .= 'reference='.$ref_id.' ';
}
# Position of read on reference sequence: start, end and strand
my $strand = $self->strand;
if ($self->coord_style eq 'bioperl') {
$desc_str .= 'start='.$self->start.' end='.$self->end.' ';
if (defined $strand) {
# Strand of the reference sequence that the read is on
$strand = '+1' if $strand == 1;
$desc_str .= 'strand='.$strand.' ';
}
} else {
if ( (defined $strand) && ($strand == -1) ) {
# Reverse complemented
$desc_str .= 'position=complement('.$self->start.'..'.$self->end.') ';
} else {
# Regular (forward) orientation
$desc_str .= 'position='.$self->start.'..'.$self->end.' ';
}
}
# Description of the original sequence
my $ref_desc = $self->reference->desc;
if ( (defined $self->reference->desc) && ($self->reference->desc !~ m/^\s*$/) ) {
$ref_desc =~ s/"/\\"/g; # escape double-quotes to \"
$desc_str .= 'description="'.$ref_desc.'" ';
}
$desc_str =~ s/\s$//g;
# Record new description
$self->desc($desc_str);
return 1;
}
=head2 mid
Title : mid
Function : Get or set a multiplex identifier (or MID, or tag, or barcode) to
add to the read. By default, no MID is used. This method must be
called after reference() but before errors().
Usage : my $mid = $read->mid();
Arguments: MID sequence string (e.g. 'ACGT')
Returns : MID sequence string
=cut
sub mid {
my ($self, $mid) = @_;
if (defined $mid) {
# Sanity check 1
if (not defined $self->reference) {
$self->throw("Cannot add MID because the reference sequence was not ".
"set\n");
}
# Sanity check 2
if ($self->{errors}) {
$self->throw("Cannot add an MID after sequencing errors have been ".
"introduced in the read\n");
}
# Sanity check 3
if (not $self->validate_seq($mid)) {
$self->throw("MID is not a valid DNA sequence\n");
}
# Update sequence, quality scores and description with the MID
$self->_update_seq_mid($mid);
$self->_update_qual_mid($mid) if scalar @{$self->qual_levels};
$self->_update_desc_mid($mid) if $self->track;
# Set the MID value
$self->{mid} = $mid;
}
return $self->{mid}
}
sub _update_seq_mid {
# Update the MID of a sequence
my ($self, $mid) = @_;
# Remove old MID
my $seq = $self->seq;
my $old_mid = $self->{mid};
if (defined $old_mid) {
$seq =~ s/^$old_mid//;
}
# Add new MID
$seq = $mid . $seq;
$self->seq( $seq );
return 1;
}
sub _update_qual_mid {
# Update the quality scores to account for the MID
my ($self, $mid) = @_;
# Remove old MID
my $qual = $self->qual;
my $old_mid = $self->{mid};
if (defined $old_mid) {
splice @$qual, 0, length($old_mid);
}
$qual = [($self->qual_levels->[0]) x length($mid), @$qual];
$self->qual( $qual );
return 1;
}
sub _update_desc_mid {
# Update MID specifications in the read description
my ($self, $mid) = @_;
if ($mid) {
# Add the MID to the read description, or replace an existing one
my $mid_str = "mid=".$mid;
my $desc_str = $self->desc;
$desc_str =~ s/((position|strand)=\S+)( mid=\S+)?/$1 $mid_str/g;
$self->desc( $desc_str );
}
return 1;
}
=head2 errors
Title : errors
Function : Get or set the sequencing errors and update the read. By default, no
errors are made. This method must be called after the mid() method.
Usage : my $errors = $read->errors();
Arguments: Reference to a hash of the position and nature of sequencing errors.
The positions are 1-based relative to the error-free MID-containing
read (not relative to the reference sequence). For example:
$errors->{34}->{'%'} = 'T' ; # substitution of residue 34 by a T
$errors->{23}->{'+'} = 'GG' ; # insertion of GG after residue 23
$errors->{45}->{'-'} = undef; # deletion of residue 45
Substitutions and deletions are for a single residue, but additions
can be additions of several residues.
An alternative way to specify errors is by using array references
instead of scalars for the hash values. This allows specifying
redundant mutations. For example, the case presented above would
result in the same read sequence as the example below:
$errors->{34}->{'%'} = ['C', 'T'] ; # substitution by a C and then a T
$errors->{23}->{'+'} = ['G', 'G'] ; # insertion of G and then a G
$errors->{45}->{'-'} = [undef, undef]; # deletion of residue, and again
Returns : Reference to a hash of the position and nature of sequencing errors.
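The effect of an error specification on the read sequence can be sketched without BioPerl. The function below is a hypothetical stand-alone illustration, not the module's own code: it handles only the scalar form, processes positions in ascending order, and applies substitution, deletion and insertion in that order at each position. The variable $off records how far earlier deletions and insertions have shifted the following residues.

```perl
use strict;
use warnings;

# Hypothetical stand-alone function: apply a scalar-form error specification
# to a sequence string. Positions are 1-based.
sub apply_errors {
    my ($seq, $errors) = @_;
    my $off = 0;
    for my $pos ( sort { $a <=> $b } keys %$errors ) {
        my $ops = $errors->{$pos};
        if (exists $ops->{'%'}) {    # substitution of residue $pos
            substr $seq, $pos - 1 + $off, 1, $ops->{'%'};
        }
        if (exists $ops->{'-'}) {    # deletion of residue $pos
            substr $seq, $pos - 1 + $off, 1, '';
            $off--;
        }
        if (exists $ops->{'+'}) {    # insertion after residue $pos
            substr $seq, $pos + $off, 0, $ops->{'+'};
            $off += length $ops->{'+'};
        }
    }
    return $seq;
}

# Error-free read from the SYNOPSIS, including its ACGT MID
my %errors = (
    8 => { '+' => 'AAA' },
    1 => { '%' => 'G'   },
    4 => { '-' => undef },
);
print apply_errors('ACGTAAAAAACCCC', \%errors), "\n";
```

Because all positions refer to the error-free read, the order in which the hash entries are written does not matter.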
=cut
sub errors {
my ($self, $errors) = @_;
if (defined $errors) {
# Verify that we have a hashref
if ( (not defined ref $errors) || (not ref $errors eq 'HASH') ) {
$self->throw("Error specification has to be a hashref. Got: $errors\n");
}
# Verify that we have a reference sequence
if (not defined $self->reference) {
$self->throw("Cannot add errors because the reference sequence was not set\n");
}
# Convert scalar error specs to arrayref specs
$errors = $self->_scalar_to_arrayref($errors);
# Check validity of error specifications
$errors = $self->_validate_error_specs($errors);
# Set the error specifications
$self->{errors} = $errors;
# Need to recalculate the read from the reference if previously mutated
if ($self->{_mutated}) {
$self->_create_seq;
$self->_create_qual if scalar @{$self->qual_levels};
$self->_create_desc if $self->track;
}
# Now mutate the read, quality score and description
$self->_update_seq_errors;
$self->_update_qual_errors if scalar @{$self->qual_levels};
$self->_update_desc_errors if $self->track;
}
return $self->{errors};
}
sub _scalar_to_arrayref {
# Replace the scalar values in the error specs by more versatile arrayrefs
my ($self, $errors) = @_;
while ( my ($pos, $ops) = each %$errors ) {
while ( my ($op, $res) = each %$ops ) {
if (ref $res eq '') {
my $arr = [ split //, ($res || '') ];
$arr = [undef] if scalar @$arr == 0;
$$errors{$pos}{$op} = $arr;
}
}
}
return $errors;
}
sub _validate_error_specs {
# Clean error specifications and warn of any issues encountered
my ($self, $errors) = @_;
my %valid_ops = ('%' => undef, '-' => undef, '+' => undef); # valid operations
# Calculate read length
my $read_length = $self->length;
while ( my ($pos, $ops) = each %$errors ) {
# Position cannot be greater than the read length
if ( (defined $read_length) && ($pos > $read_length) ) {
$self->warn("Position $pos is beyond end of read ($read_length ".
"residues). Skipping errors specified at this position.\n");
delete $errors->{$pos};
}
# Position has to be 0+ for addition, 1+ for substitution and deletion
if ( $pos < 1 && (exists $ops->{'%'} || exists $ops->{'-'}) ) {
$self->warn("Positions of substitutions and deletions have to be ".
"strictly positive but got $pos. Skipping substitution or deletion".
" at this position\n");
delete $ops->{'%'};
delete $ops->{'-'};
}
if ( $pos < 0 && exists $ops->{'+'}) {
$self->warn("Positions of additions have to be zero or more. ".
"Skipping addition at position $pos.\n");
delete $ops->{'+'};
}
# Valid operations are '%', '+' and '-'
while ( my ($op, $res) = each %$ops ) {
if (not exists $valid_ops{$op}) {
$self->warn("Skipping unknown error operation '$op' at position".
" $pos\n");
delete $ops->{$op};
} else {
# Substitutions: have to have at least one residue to substitute
if ( ($op eq '%') && (scalar @$res < 1) ) {
$self->warn("At least one residue must be provided for substitutions, ".
"but got ".scalar(@$res)." at position $pos.\n");
}
# Additions: have to have at least one residue to add
if ( ($op eq '+') && (scalar @$res < 1) ) {
$self->warn("At least one residue must be provided for additions, ".
"but got ".scalar(@$res)." at position $pos.\n");
}
# Deletions
if ( ($op eq '-') && (scalar @$res < 1) ) {
$self->warn("At least one 'undef' must be provided for deletions, ".
"but got ".scalar(@$res)." at position $pos.\n");
}
}
}
delete $errors->{$pos} unless scalar keys %$ops;
}
return $errors;
}
sub _update_seq_errors {
my $self = shift;
my $seq_str = $self->seq;
my $errors = $self->errors;
if (scalar keys %$errors > 0) {
my $off = 0;
for my $pos ( sort {$a <=> $b} (keys %$errors) ) {
# Process sequencing errors at that position
for my $type ( '%', '-', '+' ) {
next if not exists $$errors{$pos}{$type};
my $arr = $$errors{$pos}{$type};
if ($type eq '%') {
# Substitution at residue position. If there are multiple
# substitutions to do, directly skip to the last one.
substr $seq_str, $pos - 1 + $off, 1, $$arr[-1];
} elsif ($type eq '-') {
# Deletion at residue position
substr $seq_str, $pos - 1 + $off, 1, '';
$off--;
} elsif ($type eq '+') {
# Insertion after residue position
substr $seq_str, $pos + $off, 0, join('', @$arr);
$off += scalar @$arr;
}
}
}
$self->{_mutated} = 1;
} else {
$self->{_mutated} = 0;
}
$self->seq($seq_str);
return 1;
}
sub _update_qual_errors {
my $self = shift;
my $qual = $self->qual;
my $errors = $self->errors;
my $bad_qual = $self->qual_levels->[1];
if (scalar keys %$errors > 0) {
my $off = 0;
for my $pos ( sort {$a <=> $b} (keys %$errors) ) {
# Process sequencing errors at that position
for my $type ( '%', '-', '+' ) {
next if not exists $$errors{$pos}{$type};
my $arr = $$errors{$pos}{$type};
if ($type eq '%') {
# Substitution at residue position
splice @$qual, $pos - 1 + $off, 1, $bad_qual;
} elsif ($type eq '-') {
# Deletion at residue position
splice @$qual, $pos - 1 + $off, 1;
$off--;
} elsif ($type eq '+') {
# Insertion after residue position
splice @$qual, $pos + $off, 0, ($bad_qual) x scalar(@$arr);
$off += scalar @$arr;
}
}
}
}
$self->qual($qual);
return 1;
}
sub _update_desc_errors {
# Add or update error specifications in the read description
my $self = shift;
my $errors = $self->errors;
if (defined $errors and scalar keys %$errors > 0) {
# Sequencing errors introduced in the read
my $err_str = 'errors=';
for my $pos ( sort {$a <=> $b} (keys %$errors) ) {
# Process sequencing errors at that position
for my $type ( '%', '-', '+' ) {
next if not exists $$errors{$pos}{$type};
for my $val ( @{$$errors{$pos}{$type}} ) {
$val = '' if not defined $val;
$err_str .= $pos . $type . $val . ',';
}
}
}
$err_str =~ s/,$//;
my $desc_str = $self->desc;
$desc_str =~ s/((position|strand)=\S+( mid=\S+)?)( errors=\S+)?/$1 $err_str/g;
$self->desc( $desc_str );
}
return 1;
}
=head2 track
Title : track
Function : Get or set the tracking status in the read description. By default,
tracking is on. This method can be called at any time.
Usage : my $track = $read->track();
Arguments: 1 for tracking, 0 otherwise
Returns : 1 for tracking, 0 otherwise
=cut
sub track {
my ($self, $track) = @_;
if (defined $track) {
if (defined $self->reference) {
if ($track == 1) {
$self->_create_desc;
$self->_update_desc_mid($self->mid);
$self->_update_desc_errors;
} else {
$self->desc(undef);
}
}
$self->{track} = $track;
}
return $self->{track};
}
=head2 coord_style
Title : coord_style
Function : When tracking is on, define which 1-based coordinate system to use
in the read description:
* 'bioperl' uses the start, end and strand keywords (default),
similarly to the GFF3 format. Example:
start=1 end=10 strand=+1
start=1 end=10 strand=-1
* 'genbank' provides only the position keyword. Example:
position=1..10
position=complement(1..10)
Usage : my $coord_style = $read->coord_style();
Arguments: 'bioperl' or 'genbank'
Returns : 'bioperl' or 'genbank'
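The two coordinate styles can be reproduced with a short formatting routine. This is a sketch in plain Perl; the function name format_coords is hypothetical and only mirrors the examples given above.

```perl
use strict;
use warnings;

# Hypothetical helper reproducing the two coordinate styles
sub format_coords {
    my ($style, $start, $end, $strand) = @_;
    if ($style eq 'bioperl') {
        my $s = $strand == 1 ? '+1' : $strand;   # forward strand shown as +1
        return "start=$start end=$end strand=$s";
    } elsif ($style eq 'genbank') {
        return $strand == -1 ? "position=complement($start..$end)"
                             : "position=$start..$end";
    }
    die "Error: Invalid coordinate style '$style'\n";
}

print format_coords('bioperl', 1, 10, -1), "\n";
print format_coords('genbank', 1, 10, -1), "\n";
```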
=cut
sub coord_style {
my ($self, $coord_style) = @_;
my %styles = ( 'bioperl' => undef, 'genbank' => undef );
if (defined $coord_style) {
if (not exists $styles{$coord_style}) {
die "Error: Invalid coordinate style '$coord_style'\n";
}
$self->{coord_style} = $coord_style;
}
return $self->{coord_style};
}
1;
Grinder-0.5.3/lib/Bio/DB/ 0000755 0001750 0001750 00000000000 12151575606 015105 5 ustar floflooo floflooo Grinder-0.5.3/lib/Bio/DB/IndexedBase.pm 0000644 0001750 0001750 00000074241 12052601553 017615 0 ustar floflooo floflooo
#
# BioPerl module for Bio::DB::IndexedBase
#
# You may distribute this module under the same terms as perl itself
#
=head1 NAME
Bio::DB::IndexedBase - Base class for modules using indexed sequence files
=head1 SYNOPSIS
use Bio::DB::XXX; # a made-up class that uses Bio::DB::IndexedBase
# 1/ Bio::SeqIO-style access
# Index some sequence files
my $db = Bio::DB::XXX->new('/path/to/file'); # from a single file
my $db = Bio::DB::XXX->new(['file1', 'file2']); # from multiple files
my $db = Bio::DB::XXX->new('/path/to/files/'); # from a directory
# Get IDs of all the sequences in the database
my @ids = $db->get_all_primary_ids;
# Get a specific sequence
my $seq = $db->get_Seq_by_id('CHROMOSOME_I');
# Loop through all sequences
my $stream = $db->get_PrimarySeq_stream;
while (my $seq = $stream->next_seq) {
# Do something...
}
# 2/ Access via filehandle
my $fh = Bio::DB::XXX->newFh('/path/to/file');
while (my $seq = <$fh>) {
# Do something...
}
# 3/ Tied-hash access
tie %sequences, 'Bio::DB::XXX', '/path/to/file';
print $sequences{'CHROMOSOME_I:1,20000'};
=head1 DESCRIPTION
Bio::DB::IndexedBase provides a base class for modules that want to index
and read sequence files and provides persistent, random access to each sequence
entry, without bringing the entire file into memory. This module is compliant
with the Bio::SeqI interface. Bio::DB::Fasta and Bio::DB::Qual both use
Bio::DB::IndexedBase.
When you initialize the module, you point it at a single file, several files, or
a directory of files. The first time it is run, the module generates an index
of the content of the files using the AnyDBM_File module (BerkeleyDB preferred,
followed by GDBM_File, NDBM_File, and SDBM_File). Subsequently, it uses the
index file to find the sequence file and offset for any requested sequence. If
one of the source files is updated, the module reindexes just that one file. You
can also force reindexing manually at any time. For improved performance, the
module keeps a cache of open filehandles, closing less-recently used ones when
the cache is full.
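The filehandle cache described above can be illustrated with a minimal, standalone sketch. It is not the module's own code: the names cached_fh(), %cache, %last_used and $MAX_OPEN are hypothetical, and the sketch evicts a single handle at a time, whereas the module drops a batch of handles when its limit (-maxopen) is reached.

```perl
use strict;
use warnings;
use IO::File;

my $MAX_OPEN = 2;            # hypothetical limit (the module defaults to 32)
my (%cache, %last_used);     # path => filehandle, path => time of last use
my $clock = 0;               # logical clock, incremented on every access

sub cached_fh {
    my ($path) = @_;
    if (not exists $cache{$path}) {
        if (keys(%cache) >= $MAX_OPEN) {
            # Cache is full: close the handle that was used longest ago
            my ($oldest) = sort { $last_used{$a} <=> $last_used{$b} } keys %cache;
            $cache{$oldest}->close;
            delete $cache{$oldest};
            delete $last_used{$oldest};
        }
        my $fh = IO::File->new($path, 'r') or return;
        $cache{$path} = $fh;
    }
    $last_used{$path} = ++$clock;
    return $cache{$path};
}
```

Repeated calls to cached_fh() with the same path reuse the open handle, so random access to many sequences in a few files does not reopen those files each time.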
Entries may have any line length up to 65,535 characters, and different line
lengths are allowed in the same file. However, within a sequence entry, all
lines must be the same length except for the last. An error will be thrown if
this is not the case!
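The fixed line length is what makes random access cheap: the byte position of any residue can be computed rather than searched for. The sketch below mirrors the arithmetic of the module's internal _calc_offset() method in a standalone function; the name residue_offset() and the sample entry are assumptions made for illustration.

```perl
use strict;
use warnings;

# $offset  : byte offset of the first residue of the entry
# $linelen : length of a full sequence line, including its line terminator
# $tl      : length of the line terminator (1 for "\n", 2 for "\r\n")
# $n       : 1-based residue position
sub residue_offset {
    my ($offset, $linelen, $tl, $n) = @_;
    $n--;                              # switch to 0-based counting
    my $per_line = $linelen - $tl;     # residues on each full line
    return $offset + $linelen * int($n / $per_line) + $n % $per_line;
}

# A FASTA entry wrapped at 4 residues per line
my $entry  = ">seq1\nACGT\nTTGA\nCC\n";
my $offset = length(">seq1\n");                    # sequence starts after the header
my $pos    = residue_offset($offset, 5, 1, 7);     # 7th residue of ACGTTTGACC
print substr($entry, $pos, 1), "\n";               # prints "G"
```

If lines within an entry had different lengths, the integer division above would point at the wrong byte, which is why the module rejects such files.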
This module was developed for use with the C. elegans and human genomes, and has
been tested with sequence segments as large as 20 megabases. Indexing the C.
elegans genome (100 megabases of genomic sequence plus 100,000 ESTs) takes ~5
minutes on my 300 MHz Pentium laptop. On the same system, average access time
for any 200-mer within the C. elegans genome was <0.02s.
=head1 DATABASE CREATION AND INDEXING
The two constructors for this class are new() and newFh(). The former creates a
Bio::DB::IndexedBase object which is accessed via method calls. The latter
creates a tied filehandle which can be used Bio::SeqIO style to fetch sequence
objects in a stream fashion. There is also a tied hash interface.
=over
=item $db = Bio::DB::IndexedBase->new($path [,%options])
Create a new Bio::DB::IndexedBase object from the files designated by $path.
$path may be a single file, an arrayref of files, or a directory containing
such files.
After the database is created, you can use methods like get_all_primary_ids()
and get_Seq_by_id() to retrieve sequence objects.
=item $fh = Bio::DB::IndexedBase->newFh($path [,%options])
Create a tied filehandle opened on a Bio::DB::IndexedBase object. Reading
from this filehandle with <> will return a stream of sequence objects,
Bio::SeqIO style. The path and the options should be specified as for new().
=item $obj = tie %db,'Bio::DB::IndexedBase', '/path/to/file' [,@args]
Create a tied hash by tying %db to Bio::DB::IndexedBase using the indicated
path to the files. The optional @args list is the same set used by new(). If
successful, tie() returns the tied object, undef otherwise.
Once tied, you can use the hash to retrieve an individual sequence by
its ID, like this:
my $seq = $db{CHROMOSOME_I};
The keys() and values() functions will return the sequence IDs and their
sequences, respectively. In addition, each() can be used to iterate over the
entire data set:
while (my ($id,$sequence) = each %db) {
print "$id => $sequence\n";
}
When dealing with very large sequences, you can avoid bringing them into memory
by calling each() in a scalar context. This returns the key only. You can then
use tied(%db) to recover the Bio::DB::IndexedBase object and call its methods.
while (my $id = each %db) {
print "$id: ", $db{"$id:1,100"}, "\n";
print "$id: ".tied(%db)->length($id)."\n";
}
In addition, you may invoke the FIRSTKEY and NEXTKEY tied hash methods directly
to retrieve the first and next ID in the database, respectively. This allows you
to write the following iterative loop using just the object-oriented interface:
my $db = Bio::DB::IndexedBase->new('/path/to/file');
for (my $id=$db->FIRSTKEY; $id; $id=$db->NEXTKEY($id)) {
# do something with sequence
}
=back
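Keys such as 'CHROMOSOME_I:1,20000' are compound IDs that combine a sequence ID with coordinates and an optional strand. The standalone sketch below shows how such a key can be split, using the same pattern as the module's internal _parse_compound_id() method. The function name parse_compound_id() is hypothetical, and unlike the module it does not fill in missing coordinates from the sequence length.

```perl
use strict;
use warnings;

# Split "id:start,stop/strand" into its parts. The separator between start
# and stop may be ",", "-" or "..". Coordinates and strand are optional.
sub parse_compound_id {
    my ($id) = @_;
    my ($name, $start, $stop, $strand) =
        $id =~ /^ (.+?) (?: : ([\d_]+) (?: , | - | \.\. ) ([\d_]+) )? (?: \/ (.+) )? $/x;
    $strand ||= 1;
    for ($start, $stop) {
        tr/_//d if defined $_;               # 1_000_000 becomes 1000000
    }
    if (defined $start && defined $stop && $start > $stop) {
        ($start, $stop) = ($stop, $start);   # reversed coordinates select
        $strand *= -1;                       # the opposite strand
    }
    return ($name, $start, $stop, $strand);
}

my @parts = parse_compound_id('chr1:4_100_000..4_000_000');
print "@parts\n";   # prints "chr1 4000000 4100000 -1"
```

Underscores are only stripped from the coordinates, so an ID like CHROMOSOME_I is left intact.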
=head1 INDEX CONTENT
Several attributes of each sequence are stored in the index file. Given a
sequence ID, these attributes can be retrieved using the following methods:
=over
=item offset($id)
Get the offset of the indicated sequence from the beginning of the file in which
it is located. The offset points to the beginning of the sequence, not the
beginning of the header line.
=item strlen($id)
Get the number of characters in the sequence string.
=item length($id)
Get the number of residues of the sequence.
=item linelen($id)
Get the length of the line for this sequence. If the sequence is wrapped, then
linelen() is likely to be much shorter than strlen().
=item headerlen($id)
Get the length of the header line for the indicated sequence.
=item header_offset($id)
Get the offset of the header line for the indicated sequence from the beginning
of the file in which it is located. This attribute is not stored. It is
calculated from offset() and headerlen().
=item alphabet($id)
Get the molecular type (alphabet) of the indicated sequence. This method handles
residues according to the IUPAC convention.
=item file($id)
Get the name of the file in which the indicated sequence can be found.
=back
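As an illustration of how these attributes can fit in a single index value, the sketch below packs and unpacks one record using the 32-bit template found in the source ('NNNnnCa*'). The field values are made up for the example.

```perl
use strict;
use warnings;

use constant STRUCT => 'NNNnnCa*';   # three 32-bit ints, two 16-bit ints, a byte, a string

# offset, strlen, length, linelen, headerlen, alphabet code, file number
my @fields = (6, 10, 10, 5, 6, 1, 0);
my $record = pack STRUCT, @fields;

my ($offset, $strlen, $length, $linelen, $headerlen, $alphabet, $fileno)
    = unpack STRUCT, $record;

# header_offset is not stored: it is derived from offset and headerlen
my $header_offset = $offset - $headerlen;
print "record is ", length($record), " bytes; header starts at byte $header_offset\n";
```

Because the two line-length fields use the 16-bit "n" format, values above 65,535 cannot be stored, which is consistent with the line-length limit stated in the DESCRIPTION.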
=head1 INTERFACE COMPLIANCE NOTES
Bio::DB::IndexedBase is compliant with the Bio::DB::SeqI and hence with the
Bio::DB::RandomAccessI interfaces.
Databases do not necessarily provide any meaningful internal primary ID for the
sequences they store. However, Bio::DB::IndexedBase's internal primary IDs are
the IDs of the sequences. This means that the same ID passed to get_Seq_by_id()
and get_Seq_by_primary_id() will return the same sequence.
Since this database index has no notion of sequence version or namespace, the
get_Seq_by_id(), get_Seq_by_acc() and get_Seq_by_version() methods are identical.
=head1 BUGS
When a sequence is deleted from one of the files, the module does not detect
the deletion and does not remove the entry from the index. As a result, a
"ghost" entry will remain in the index and will return garbage results if accessed.
Also, if you are indexing a directory, it is wise to not add or remove files
from it.
If you have changed the files in a directory, or the sequences in a file,
you can rebuild the entire index, either by deleting it manually, or by
passing -reindex => 1 to new() when initializing the module.
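Automatic reindexing relies on comparing file modification times with that of the index, as the module's _index_files() method does. The standalone sketch below shows that comparison; the function name files_needing_reindex() is hypothetical.

```perl
use strict;
use warnings;

# Return the files whose modification time is newer than that of the index.
# A missing index counts as time 0, so every existing file is reported.
sub files_needing_reindex {
    my ($index, @files) = @_;
    my $indextime = (stat $index)[9] || 0;
    return grep { ((stat $_)[9] || 0) > $indextime } @files;
}
```

A check of this kind only sees files it is told about and only reacts to time stamps, which is consistent with the advice above to force a rebuild after changing the contents of an indexed directory.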
=head1 SEE ALSO
L
L
L
=head1 AUTHOR
Lincoln Stein <lstein@cshl.org>.
Copyright (c) 2001 Cold Spring Harbor Laboratory.
Florent Angly (for the modularization)
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself. See DISCLAIMER.txt for
disclaimers of warranty.
=head1 APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
=cut
package Bio::DB::IndexedBase;
BEGIN {
@AnyDBM_File::ISA = qw(DB_File GDBM_File NDBM_File SDBM_File)
if(!$INC{'AnyDBM_File.pm'});
}
use strict;
use IO::File;
use AnyDBM_File;
use Fcntl;
use File::Spec;
use File::Basename qw(basename dirname);
use Bio::PrimarySeq;
use base qw(Bio::DB::SeqI);
# Store offset, strlen, length, linelen, headerlen, type and fileno
use constant STRUCT => 'NNNnnCa*'; # 32-bit file offset and seq length
use constant STRUCTBIG => 'QQQnnCa*'; # 64-bit
use constant NA => 0;
use constant DNA => 1;
use constant RNA => 2;
use constant PROTEIN => 3;
use constant DIE_ON_MISSMATCHED_LINES => 1;
# you can avoid dying if you want but you may get incorrect results
=head2 new
Title : new
Usage : my $db = Bio::DB::IndexedBase->new($path, -reindex => 1);
Function: Initialize a new database object
Returns : A Bio::DB::IndexedBase object
Args : A single file, or path to dir, or arrayref of files
Optional arguments:
Option Description Default
----------- ----------- -------
-glob Glob expression to search for files in directories *
-makeid A code subroutine for transforming IDs None
-maxopen Maximum size of filehandle cache 32
-debug Turn on status messages 0
-reindex Force the index to be rebuilt 0
-dbmargs Additional arguments to pass to the DBM routine None
-index_name Name of the file that will hold the indices
-clean Remove the index file when finished 0
The -dbmargs option can be used to control the format of the index. For example,
you can pass $DB_BTREE to this argument so as to force the IDs to be sorted and
retrieved alphabetically. Note that you must use the same arguments every time
you open the index!
The -makeid option gives you a chance to modify sequence IDs during indexing.
For example, you may wish to extract a portion of the gi|gb|abc|xyz nonsense
that GenBank Fasta files use. The original header line can be recovered later.
The option value for -makeid should be a code reference that takes a scalar
argument (the full header line) and returns a scalar or an array of scalars (the
ID or IDs you want to assign). For example:
$db = Bio::DB::IndexedBase->new('file.fa', -makeid => \&extract_gi);
sub extract_gi {
# Extract GI from GenBank
my $header = shift;
my ($id) = ($header =~ /gi\|(\d+)/m);
return $id || '';
}
extract_gi() will be called with the full header line, e.g. a Fasta line would
include the ">", the ID and the description:
>gi|352962132|ref|NG_030353.1| Homo sapiens sal-like 3 (Drosophila) (SALL3)
In the database, this sequence can now be retrieved by its GI instead of its
complete ID:
my $seq = $db->get_Seq_by_id(352962132);
The -makeid option is ignored after the index is constructed.
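Since the callback may return several IDs, one entry can be indexed under more than one name. The sketch below is a hypothetical callback, extract_gi_and_acc(), that returns both the GI number and the RefSeq accession when they are present. It assumes the "gi|...|ref|..." header layout shown above and is not part of the module.

```perl
use strict;
use warnings;

# Hypothetical -makeid callback: return the GI number and the accession
sub extract_gi_and_acc {
    my ($header) = @_;
    my @ids;
    push @ids, $1 if $header =~ /gi\|(\d+)/;         # e.g. 352962132
    push @ids, $1 if $header =~ /ref\|([^|\s]+)/;    # e.g. NG_030353.1
    return @ids;
}

my @ids = extract_gi_and_acc(
    '>gi|352962132|ref|NG_030353.1| Homo sapiens sal-like 3 (Drosophila) (SALL3)'
);
print "@ids\n";   # prints "352962132 NG_030353.1"
```

Passed as -makeid => \&extract_gi_and_acc, such a callback would let the same sequence be fetched by either ID.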
=cut
sub new {
my ($class, $path, %opts) = @_;
my $self = bless {
debug => $opts{-debug} || 0,
makeid => $opts{-makeid},
glob => $opts{-glob} || eval '$'.$class.'::file_glob' || '*',
maxopen => $opts{-maxopen} || 32,
clean => $opts{-clean} || 0,
dbmargs => $opts{-dbmargs} || undef,
fhcache => {},
cacheseq => {},
curopen => 0,
openseq => 1,
dirname => undef,
offsets => undef,
index_name => $opts{-index_name},
obj_class => eval '$'.$class.'::obj_class',
offset_meth => \&{$class.'::_calculate_offsets'},
fileno2path => [],
filepath2no => {},
}, $class;
my ($offsets, $dirname);
my $ref = ref $path || '';
if ( $ref eq 'ARRAY' ) {
$offsets = $self->index_files($path, $opts{-reindex});
require Cwd;
$dirname = Cwd::getcwd();
} else {
if (-d $path) {
# because Win32 glob() is broken with respect to long file names
# that contain whitespace.
$path = Win32::GetShortPathName($path)
if $^O =~ /^MSWin/i && eval 'use Win32; 1';
$offsets = $self->index_dir($path, $opts{-reindex});
$dirname = $path;
} elsif (-f _) {
$offsets = $self->index_file($path, $opts{-reindex});
$dirname = dirname($path);
} else {
$self->throw( "$path: Invalid file or dirname");
}
}
@{$self}{qw(dirname offsets)} = ($dirname, $offsets);
return $self;
}
=head2 newFh
Title : newFh
Usage : my $fh = Bio::DB::IndexedBase->newFh('/path/to/files/', %options);
Function: Index and get a new Fh for a single file, several files or a directory
Returns : Filehandle object
Args : Same as new()
=cut
sub newFh {
my ($class, @args) = @_;
my $self = $class->new(@args);
require Symbol;
my $fh = Symbol::gensym;
tie $$fh, 'Bio::DB::Indexed::Stream', $self
or $self->throw("Could not tie filehandle: $!");
return $fh;
}
=head2 dbmargs
Title : dbmargs
Usage : my @args = $db->dbmargs;
Function: Get stored dbm arguments
Returns : Array
Args : None
=cut
sub dbmargs {
my $self = shift;
my $args = $self->{dbmargs} or return;
return ref($args) eq 'ARRAY' ? @$args : $args;
}
=head2 glob
Title : glob
Usage : my $glob = $db->glob;
Function: Get the expression used to match files in directories
Returns : String
Args : None
=cut
sub glob {
my $self = shift;
return $self->{glob};
}
=head2 index_dir
Title : index_dir
Usage : $db->index_dir($dir);
Function: Index the files that match -glob in the given directory
Returns : Hashref of offsets
Args : Dirname
Boolean to force reindexing of the directory
=cut
sub index_dir {
my ($self, $dir, $force_reindex) = @_;
my @files = glob( File::Spec->catfile($dir, $self->{glob}) );
$self->throw("No suitable files found in $dir") if scalar @files == 0;
$self->{index_name} ||= File::Spec->catfile($dir, 'directory.index');
my $offsets = $self->_index_files(\@files, $force_reindex);
return $offsets;
}
=head2 get_all_primary_ids
Title : get_all_primary_ids, get_all_ids, ids
Usage : my @ids = $db->get_all_primary_ids;
Function: Get the IDs stored in all indexes. This is a Bio::DB::SeqI method
implementation. Note that in this implementation, the internal
database primary IDs are also the sequence IDs.
Returns : List of ids
Args : None
=cut
sub get_all_primary_ids {
return keys %{shift->{offsets}};
}
*ids = *get_all_ids = \&get_all_primary_ids;
=head2 index_file
Title : index_file
Usage : $db->index_file($filename);
Function: Index the given file
Returns : Hashref of offsets
Args : Filename
Boolean to force reindexing the file
=cut
sub index_file {
my ($self, $file, $force_reindex) = @_;
$self->{index_name} ||= "$file.index";
my $offsets = $self->_index_files([$file], $force_reindex);
return $offsets;
}
=head2 index_files
Title : index_files
Usage : $db->index_files(\@files);
Function: Index the given files
Returns : Hashref of offsets
Args : Arrayref of filenames
Boolean to force reindexing the files
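When no -index_name is given, the index for a list of files is named after a digest of the file paths, as in the method body below. The following standalone sketch isolates that step; the function name fileset_index_name() is hypothetical.

```perl
use strict;
use warnings;
use File::Spec;
use Digest::MD5 qw(md5_hex);

# Derive an index file name from a set of files. Sorting the absolute paths
# makes the name independent of the order in which the files are listed.
sub fileset_index_name {
    my @paths = map { File::Spec->rel2abs($_) } @_;
    return 'fileset_' . md5_hex( join('', sort @paths) ) . '.index';
}

print fileset_index_name('b.fa', 'a.fa'), "\n";
```

The same set of files therefore maps to the same index name on later runs from the same working directory, so an existing index can be reused.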
=cut
sub index_files {
my ($self, $files, $force_reindex) = @_;
my @paths = map { File::Spec->rel2abs($_) } @$files;
require Digest::MD5;
my $digest = Digest::MD5::md5_hex( join('', sort @paths) );
$self->{index_name} ||= "fileset_$digest.index"; # unique name for the given files
my $offsets = $self->_index_files($files, $force_reindex);
return $offsets;
}
=head2 index_name
Title : index_name
Usage : my $indexname = $db->index_name;
Function: Get the full name of the index file
Returns : String
Args : None
=cut
sub index_name {
return shift->{index_name};
}
=head2 path
Title : path
Usage : my $path = $db->path;
Function: When a simple file or a directory of files is indexed, this returns
the file directory. When indexing an arbitrary list of files, the
return value is the path of the current working directory.
Returns : String
Args : None
=cut
sub path {
return shift->{dirname};
}
=head2 get_PrimarySeq_stream
Title : get_PrimarySeq_stream
Usage : my $stream = $db->get_PrimarySeq_stream();
Function: Get a SeqIO-like stream of sequence objects. The stream supports a
single method, next_seq(). Each call to next_seq() returns a new
PrimarySeqI compliant sequence object, until no more sequences remain.
This is a Bio::DB::SeqI method implementation.
Returns : A Bio::DB::Indexed::Stream object
Args : None
=cut
sub get_PrimarySeq_stream {
my $self = shift;
return Bio::DB::Indexed::Stream->new($self);
}
=head2 get_Seq_by_id
Title : get_Seq_by_id, get_Seq_by_acc, get_Seq_by_version, get_Seq_by_primary_id
Usage : my $seq = $db->get_Seq_by_id($id);
Function: Given an ID, fetch the corresponding sequence from the database.
This is a Bio::DB::SeqI and Bio::DB::RandomAccessI method implementation.
Returns : A sequence object
Args : ID
=cut
sub get_Seq_by_id {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
return if not exists $self->{offsets}{$id};
return $self->{obj_class}->new($self, $id);
}
*get_Seq_by_version = *get_Seq_by_primary_id = *get_Seq_by_acc = \&get_Seq_by_id;
=head2 _calculate_offsets
Title : _calculate_offsets
Usage : $db->_calculate_offsets($fileno, $filename, $offsets);
Function: This method calculates the sequence offsets in a file based on ID and
should be implemented by classes that use Bio::DB::IndexedBase.
Returns : Hashref of offsets
Args : File number
File to process
Hashref of file offsets keyed by IDs.
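To give an idea of what an implementation has to collect, here is a deliberately simplified, standalone scan of FASTA text. It is a sketch, not the code of any subclass: the function name scan_fasta() is hypothetical, it assumes "\n" line endings, performs no validation, and stores plain array references instead of packed records.

```perl
use strict;
use warnings;

# Scan FASTA text and return { id => [offset, strlen, linelen, headerlen] }
sub scan_fasta {
    my ($fh) = @_;
    my (%index, $id);
    while (my $line = <$fh>) {
        if ($line =~ /^>(\S+)/) {
            $id = $1;
            # tell() now points just past the header, at the first residue
            $index{$id} = [ tell($fh), 0, 0, length $line ];
        } elsif (defined $id && $line =~ /\S/) {
            (my $residues = $line) =~ s/\s+$//;
            $index{$id}[1] += length $residues;   # strlen: residues only
            $index{$id}[2] ||= length $line;      # linelen: first line, with terminator
        }
    }
    return \%index;
}

my $fasta = ">seq1 demo\nACGT\nTTGA\nCC\n>seq2\nGG\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my $index = scan_fasta($fh);
close $fh;
print "seq1: @{ $index->{seq1} }\n";   # prints "seq1: 11 10 5 11"
```

A real implementation additionally records the alphabet and file number, and passes each record through the pack method selected for the file size.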
=cut
sub _calculate_offsets {
my $self = shift;
$self->throw_not_implemented();
}
sub _index_files {
# Do the indexing of the given files using the index file on record
my ($self, $files, $force_reindex) = @_;
$self->_set_pack_method( @$files );
# Get name of index file
my $index = $self->index_name;
# If caller has requested reindexing, unlink the index file.
unlink $index if $force_reindex;
# Get the modification time of the index
my $indextime = (stat $index)[9] || 0;
# Register files and find if there has been any update
my $modtime = 0;
my @updated;
for my $file (@$files) {
# Register file
$self->_path2fileno(basename($file));
# Any update?
my $m = (stat $file)[9] || 0;
if ($m > $modtime) {
$modtime = $m;
}
if ($m > $indextime) {
push @updated, $file;
}
}
# Get termination length from first file
$self->{termination_length} = $self->_calc_termination_length( $files->[0] );
# Reindex contents of changed files if needed
my $reindex = $force_reindex || (scalar @updated > 0);
$self->{offsets} = $self->_open_index($index, $reindex) or return;
if ($reindex) {
$self->{indexing} = $index;
for my $file (@updated) {
my $fileno = $self->_path2fileno(basename($file));
&{$self->{offset_meth}}($self, $fileno, $file, $self->{offsets});
}
delete $self->{indexing};
}
# Closing and reopening might help corrupted index file problem on Windows
$self->_close_index($self->{offsets});
return $self->{offsets} = $self->_open_index($index);
}
sub _open_index {
# Open index file in read-only or write mode
my ($self, $index_file, $write) = @_;
my %offsets;
my $flags = $write ? O_CREAT|O_RDWR : O_RDONLY;
my @dbmargs = $self->dbmargs;
tie %offsets, 'AnyDBM_File', $index_file, $flags, 0644, @dbmargs
or $self->throw( "Could not open index file $index_file: $!");
return \%offsets;
}
sub _close_index {
# Close index file
my ($self, $index) = @_;
untie %$index;
return 1;
}
sub _parse_compound_id {
# Handle compound IDs:
# $db->seq($id)
# $db->seq($id, $start, $stop, $strand)
# $db->seq("$id:$start,$stop")
# $db->seq("$id:$start..$stop")
# $db->seq("$id:$start-$stop")
# $db->seq("$id:$start,$stop/$strand")
# $db->seq("$id:$start..$stop/$strand")
# $db->seq("$id:$start-$stop/$strand")
# $db->seq("$id/$strand")
my ($self, $id, $start, $stop, $strand) = @_;
if ( (not defined $start ) &&
(not defined $stop ) &&
(not defined $strand) &&
($id =~ /^ (.+?) (?:\:([\d_]+)(?:,|-|\.\.)([\d_]+))? (?:\/(.+))? $/x) ) {
# Start, stop and strand not provided and ID looks like a compound ID
($id, $start, $stop, $strand) = ($1, $2, $3, $4);
}
# Start, stop and strand defaults
$stop ||= $self->length($id) || 0; # 0 if sequence not found in database
$start ||= ($stop > 0) ? 1 : 0;
$strand ||= 1;
# Convert numbers such as 1_000_000 to 1000000
$start =~ s/_//g;
$stop =~ s/_//g;
if ($start > $stop) {
# Change the strand
($start, $stop) = ($stop, $start);
$strand *= -1;
}
return $id, $start, $stop, $strand;
}
sub _guess_alphabet {
# Determine the molecular type of the given sequence string:
# 'dna', 'rna', 'protein' or '' (unknown/empty)
my ($self, $string) = @_;
# Handle IUPAC residues like PrimarySeq does
my $alphabet = Bio::PrimarySeq::_guess_alphabet_from_string($self, $string, 1);
return $alphabet eq 'dna' ? DNA
: $alphabet eq 'rna' ? RNA
: $alphabet eq 'protein' ? PROTEIN
: NA;
}
sub _makeid {
# Process the header line by applying any transformation given in -makeid
my ($self, $header_line) = @_;
return ref($self->{makeid}) eq 'CODE' ? $self->{makeid}->($header_line) : $1;
}
sub _check_linelength {
# Check that the line length is valid. Generate an error otherwise.
my ($self, $linelength) = @_;
return if not defined $linelength;
$self->throw(
"Each line of the file must be less than 65,536 characters. Line ".
"$. is $linelength chars."
) if $linelength > 65535;
}
sub _calc_termination_length {
# Try the beginning of the file to determine termination length
# Account for crlf-terminated Windows and Mac files
my ($self, $file) = @_;
my $fh = IO::File->new($file) or $self->throw( "Could not open $file: $!");
my $line = <$fh>;
close $fh;
$self->{termination_length} = ($line =~ /\r\n$/) ? 2 : 1;
return $self->{termination_length};
}
sub _calc_offset {
# Get the offset of the n-th residue of the sequence with the given ID
# and termination length (tl)
my ($self, $id, $n) = @_;
my $tl = $self->{termination_length};
$n--;
my ($offset, $seqlen, $linelen) = (&{$self->{unpackmeth}}($self->{offsets}{$id}))[0,1,3];
$n = 0 if $n < 0;
$n = $seqlen-1 if $n >= $seqlen;
return $offset + $linelen * int($n/($linelen-$tl)) + $n % ($linelen-$tl);
}
sub _fh {
# Given a sequence ID, return the filehandle on which to find this sequence
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $file = $self->file($id) or return;
# Note: "return EXPR or throw" would return before the throw could ever run
my $fh = $self->_fhcache( File::Spec->catfile($self->{dirname}, $file) ) or
$self->throw( "Can't open file $file");
return $fh;
}
sub _fhcache {
my ($self, $path) = @_;
if (!$self->{fhcache}{$path}) {
if ($self->{curopen} >= $self->{maxopen}) {
my @lru = sort {$self->{cacheseq}{$a} <=> $self->{cacheseq}{$b};}
keys %{$self->{fhcache}};
splice(@lru, $self->{maxopen} / 3);
$self->{curopen} -= @lru;
for (@lru) {
delete $self->{fhcache}{$_};
}
}
$self->{fhcache}{$path} = IO::File->new($path) || return;
binmode $self->{fhcache}{$path};
$self->{curopen}++;
}
$self->{cacheseq}{$path}++;
return $self->{fhcache}{$path};
}
#-------------------------------------------------------------
# Methods to store and retrieve data from indexed file
#
=head2 offset
Title : offset
Usage : my $offset = $db->offset($id);
Function: Get the offset of the indicated sequence from the beginning of the
file in which it is located. The offset points to the beginning of
the sequence, not the beginning of the header line.
Returns : String
Args : ID of sequence
=cut
sub offset {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $offset = $self->{offsets}{$id} or return;
return (&{$self->{unpackmeth}}($offset))[0];
}
=head2 strlen
Title : strlen
Usage : my $length = $db->strlen($id);
Function: Get the number of characters in the sequence string.
Returns : Integer
Args : ID of sequence
=cut
sub strlen {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $offset = $self->{offsets}{$id} or return;
return (&{$self->{unpackmeth}}($offset))[1];
}
=head2 length
Title : length
Usage : my $length = $db->length($id);
Function: Get the number of residues of the sequence.
Returns : Integer
Args : ID of sequence
=cut
sub length {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $offset = $self->{offsets}{$id} or return;
return (&{$self->{unpackmeth}}($offset))[2];
}
=head2 linelen
Title : linelen
Usage : my $linelen = $db->linelen($id);
Function: Get the length of the line for this sequence.
Returns : Integer
Args : ID of sequence
=cut
sub linelen {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $offset = $self->{offsets}{$id} or return;
return (&{$self->{unpackmeth}}($offset))[3];
}
=head2 headerlen
Title : headerlen
Usage : my $length = $db->headerlen($id);
Function: Get the length of the header line for the indicated sequence.
Returns : Integer
Args : ID of sequence
=cut
sub headerlen {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $offset = $self->{offsets}{$id} or return;
return (&{$self->{unpackmeth}}($offset))[4];
}
=head2 header_offset
Title : header_offset
Usage : my $offset = $db->header_offset($id);
Function: Get the offset of the header line for the indicated sequence from
the beginning of the file in which it is located.
Returns : String
Args : ID of sequence
=cut
sub header_offset {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
return if not $self->{offsets}{$id};
return $self->offset($id) - $self->headerlen($id);
}
=head2 alphabet
Title : alphabet
Usage : my $alphabet = $db->alphabet($id);
Function: Get the molecular type of the indicated sequence: dna, rna or protein
Returns : String
Args : ID of sequence
=cut
sub alphabet {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $offset = $self->{offsets}{$id} or return;
my $alphabet = (&{$self->{unpackmeth}}($offset))[5];
return $alphabet == Bio::DB::IndexedBase::DNA ? 'dna'
: $alphabet == Bio::DB::IndexedBase::RNA ? 'rna'
: $alphabet == Bio::DB::IndexedBase::PROTEIN ? 'protein'
: '';
}
=head2 file
Title : file
Usage : my $file = $db->file($id);
Function: Get the name of the file in which the indicated sequence can be
found.
Returns : String
Args : ID of sequence
=cut
sub file {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my $offset = $self->{offsets}{$id} or return;
return $self->_fileno2path((&{$self->{unpackmeth}}($offset))[6]);
}
sub _fileno2path {
my ($self, $fileno) = @_;
return $self->{fileno2path}->[$fileno];
}
sub _path2fileno {
my ($self, $path) = @_;
if ( not exists $self->{filepath2no}->{$path} ) {
my $fileno = ($self->{filepath2no}->{$path} = 0+ $self->{fileno}++);
$self->{fileno2path}->[$fileno] = $path; # Save path
}
return $self->{filepath2no}->{$path};
}
sub _packSmall {
return pack STRUCT, @_;
}
sub _packBig {
return pack STRUCTBIG, @_;
}
sub _unpackSmall {
return unpack STRUCT, shift;
}
sub _unpackBig {
return unpack STRUCTBIG, shift;
}
sub _set_pack_method {
# Determine whether to use 32 or 64 bit integers for the given files.
my $self = shift;
# Find the maximum file size:
my ($maxsize) = sort { $b <=> $a } map { -s $_ } @_;
my $fourGB = (2 ** 32) - 1;
if ($maxsize > $fourGB) {
# At least one file exceeds 4Gb - we will need to use 64 bit ints
$self->{packmeth} = \&_packBig;
$self->{unpackmeth} = \&_unpackBig;
} else {
$self->{packmeth} = \&_packSmall;
$self->{unpackmeth} = \&_unpackSmall;
}
return 1;
}
#-------------------------------------------------------------
# Tied hash logic
#
sub TIEHASH {
return shift->new(@_);
}
sub FETCH {
return shift->subseq(@_);
}
sub STORE {
shift->throw("Read-only database");
}
sub DELETE {
shift->throw("Read-only database");
}
sub CLEAR {
shift->throw("Read-only database");
}
sub EXISTS {
return defined shift->offset(@_);
}
sub FIRSTKEY {
return tied(%{shift->{offsets}})->FIRSTKEY(@_);
}
sub NEXTKEY {
return tied(%{shift->{offsets}})->NEXTKEY(@_);
}
sub DESTROY {
my $self = shift;
if ( $self->{clean} || $self->{indexing} ) {
# Indexing aborted or cleaning requested. Delete the index file.
unlink $self->{index_name};
}
return 1;
}
#-------------------------------------------------------------
# stream-based access to the database
#
package Bio::DB::Indexed::Stream;
use base qw(Tie::Handle Bio::DB::SeqI);
sub new {
my ($class, $db) = @_;
my $key = $db->FIRSTKEY;
return bless {
db => $db,
key => $key
}, $class;
}
sub next_seq {
my $self = shift;
my ($key, $db) = @{$self}{'key', 'db'};
return if not defined $key;
my $value = $db->get_Seq_by_id($key);
$self->{key} = $db->NEXTKEY($key);
return $value;
}
sub TIEHANDLE {
my ($class, $db) = @_;
return $class->new($db);
}
sub READLINE {
my $self = shift;
return $self->next_seq;
}
1;
Grinder-0.5.3/lib/Bio/DB/Fasta.pm
#
# BioPerl module for Bio::DB::Fasta
#
# You may distribute this module under the same terms as perl itself
#
=head1 NAME
Bio::DB::Fasta - Fast indexed access to fasta files
=head1 SYNOPSIS
use Bio::DB::Fasta;
# Create database from a directory of Fasta files
my $db = Bio::DB::Fasta->new('/path/to/fasta/files/');
my @ids = $db->get_all_primary_ids;
# Simple access
my $seqstr = $db->seq('CHROMOSOME_I', 4_000_000 => 4_100_000);
my $revseq = $db->seq('CHROMOSOME_I', 4_100_000 => 4_000_000);
my $length = $db->length('CHROMOSOME_I');
my $header = $db->header('CHROMOSOME_I');
my $alphabet = $db->alphabet('CHROMOSOME_I');
# Access to sequence objects. See Bio::PrimarySeqI.
my $seq = $db->get_Seq_by_id('CHROMOSOME_I');
my $seqstr = $seq->seq;
my $subseq = $seq->subseq(4_000_000 => 4_100_000);
my $trunc = $seq->trunc(4_000_000 => 4_100_000);
my $length = $seq->length;
# Loop through sequence objects
my $stream = $db->get_PrimarySeq_stream;
while (my $seq = $stream->next_seq) {
# Bio::PrimarySeqI stuff
}
# Filehandle access
my $fh = Bio::DB::Fasta->newFh('/path/to/fasta/files/');
while (my $seq = <$fh>) {
# Bio::PrimarySeqI stuff
}
# Tied hash access
tie %sequences,'Bio::DB::Fasta','/path/to/fasta/files/';
print $sequences{'CHROMOSOME_I:1,20000'};
=head1 DESCRIPTION
Bio::DB::Fasta provides indexed access to a single Fasta file, several files,
or a directory of files. It provides persistent random access to each sequence
entry (either as a Bio::PrimarySeqI-compliant object or a string), and to
subsequences within each entry, allowing you to retrieve portions of very large
sequences without bringing the entire sequence into memory. Bio::DB::Fasta is
based on Bio::DB::IndexedBase. See this module's documentation for details.
The Fasta files may contain any combination of nucleotide and protein sequences;
during indexing the module guesses the molecular type. Entries may have any line
length up to 65,535 characters, and different line lengths are allowed in the
same file. However, within a sequence entry, all lines must be the same length
except for the last. An error will be thrown if this is not the case.
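The line-length rule can be checked before indexing with a few lines of Perl. The sketch below is a standalone illustration with a hypothetical function name, lines_are_uniform(). It takes the sequence lines of one entry, without the header.

```perl
use strict;
use warnings;

# True if every line except the last has the same length
sub lines_are_uniform {
    my @lens = map { length $_ } @_;
    pop @lens;                        # the last line may be shorter
    return 1 if @lens < 2;
    return !grep { $_ != $lens[0] } @lens;
}

print lines_are_uniform('ACGT', 'TTGA', 'CC') ? "ok\n" : "bad\n";   # prints "ok"
print lines_are_uniform('ACGT', 'TTG',  'CC') ? "ok\n" : "bad\n";   # prints "bad"
```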
The module uses /^>(\S+)/ to extract the primary ID of each sequence
from the Fasta header. See -makeid in Bio::DB::IndexedBase to pass a callback
routine to reversibly modify this primary ID, e.g. if you wish to extract a
specific portion of the gi|gb|abc|xyz GenBank IDs.
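The effect of the default pattern is easy to see in isolation. The following standalone sketch applies /^>(\S+)/ to a header line; the function name primary_id() is hypothetical.

```perl
use strict;
use warnings;

# Return the primary ID of a FASTA header line: everything between the
# leading ">" and the first whitespace character
sub primary_id {
    my ($header) = @_;
    my ($id) = $header =~ /^>(\S+)/
        or die "FASTA header doesn't match '>(\\S+)': $header\n";
    return $id;
}

print primary_id('>CHROMOSOME_I C. elegans chromosome I'), "\n";   # prints "CHROMOSOME_I"
```

A GenBank-style header such as '>gi|352962132|ref|NG_030353.1| Homo sapiens' yields the whole 'gi|352962132|ref|NG_030353.1|' string as its ID, which is the case the -makeid option is meant to address.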
=head1 DATABASE CREATION AND INDEXING
The object-oriented constructor is new(), the filehandle constructor is newFh()
and the tied hash constructor is tie(). They all allow you to index a single Fasta
file, several files, or a directory of files. See Bio::DB::IndexedBase.
=head1 SEE ALSO
L
L
L
=head1 AUTHOR
Lincoln Stein <lstein@cshl.org>.
Copyright (c) 2001 Cold Spring Harbor Laboratory.
This library is free software; you can redistribute it and/or modify
it under the same terms as Perl itself. See DISCLAIMER.txt for
disclaimers of warranty.
=head1 APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
For BioPerl-style access, the following methods are provided:
=head2 get_Seq_by_id
Title : get_Seq_by_id, get_Seq_by_acc, get_Seq_by_primary_id
Usage : my $seq = $db->get_Seq_by_id($id);
Function: Given an ID, fetch the corresponding sequence from the database.
Returns : A Bio::PrimarySeq::Fasta object (Bio::PrimarySeqI compliant)
Note that to save resources, Bio::PrimarySeq::Fasta sequence objects
only load the sequence string into memory when requested using seq().
See Bio::PrimarySeqI for methods provided by the sequence objects
returned from get_Seq_by_id() and get_PrimarySeq_stream().
Args : ID
=head2 get_PrimarySeq_stream
Title : get_PrimarySeq_stream
Usage : my $stream = $db->get_PrimarySeq_stream();
Function: Get a stream of Bio::PrimarySeq::Fasta objects. The stream supports a
single method, next_seq(). Each call to next_seq() returns a new
Bio::PrimarySeq::Fasta sequence object, until no more sequences remain.
Returns : A Bio::DB::Indexed::Stream object
Args : None
=head1
For simple access, the following methods are provided:
=cut
package Bio::DB::Fasta;
use strict;
use IO::File;
use File::Spec;
use Bio::PrimarySeqI;
use base qw(Bio::DB::IndexedBase);
our $obj_class = 'Bio::PrimarySeq::Fasta';
our $file_glob = '*.{fa,FA,fasta,FASTA,fast,FAST,dna,DNA,fna,FNA,faa,FAA,fsa,FSA}';
=head2 new
Title : new
Usage : my $db = Bio::DB::Fasta->new( $path, %options);
Function: Initialize a new database object. When indexing a directory, files
ending in .fa,fasta,fast,dna,fna,faa,fsa are indexed by default.
Returns : A new Bio::DB::Fasta object.
Args : A single file, or path to dir, or arrayref of files
Optional arguments: see Bio::DB::IndexedBase
=cut
sub _calculate_offsets {
# Bio::DB::IndexedBase calls this to calculate offsets
my ($self, $fileno, $file, $offsets) = @_;
my $fh = IO::File->new($file) or $self->throw( "Could not open $file: $!");
binmode $fh;
warn "Indexing $file\n" if $self->{debug};
my ($offset, @ids, $linelen, $alphabet, $headerlen, $count, $seq_lines,
$last_line, %offsets);
my ($l3_len, $l2_len, $l_len, $blank_lines) = (0, 0, 0, 0);
my $termination_length = $self->{termination_length};
while (my $line = <$fh>) {
# Account for crlf-terminated Windows files
if (index($line, '>') == 0) {
if ($line =~ /^>(\S+)/) {
print STDERR "Indexed $count sequences...\n"
if $self->{debug} && (++$count%1000) == 0;
$self->_check_linelength($linelen);
my $pos = tell($fh);
if (@ids) {
my $strlen = $pos - $offset - length($line);
$strlen -= $termination_length * $seq_lines;
my $ppos = &{$self->{packmeth}}($offset, $strlen, $strlen,
$linelen, $headerlen, $alphabet, $fileno);
$alphabet = Bio::DB::IndexedBase::NA;
for my $id (@ids) {
$offsets->{$id} = $ppos;
}
}
@ids = $self->_makeid($line);
($offset, $headerlen, $linelen, $seq_lines) = ($pos, length $line, 0, 0);
($l3_len, $l2_len, $l_len, $blank_lines) = (0, 0, 0, 0);
} else {
# Catch bad header lines, bug 3172
$self->throw("FASTA header doesn't match '>(\\S+)': $line");
}
} elsif ($line !~ /\S/) {
# Skip blank line
$blank_lines++;
next;
} else {
# Need to check every line :(
$l3_len = $l2_len;
$l2_len = $l_len;
$l_len = length $line;
if (Bio::DB::IndexedBase::DIE_ON_MISSMATCHED_LINES) {
if ( ($l3_len > 0) && ($l2_len > 0) && ($l3_len != $l2_len) ) {
my $fap = substr($line, 0, 20)."..";
$self->throw("Each line of the fasta entry must be the same ".
"length except the last. Line above #$. '$fap' is $l2_len".
" != $l3_len chars.");
}
if ($blank_lines) {
# Blank lines not allowed in entry
$self->throw("Blank lines can only precede header lines, ".
"found preceding line #$.");
}
}
$linelen ||= length $line;
$alphabet ||= $self->_guess_alphabet($line);
$seq_lines++;
}
$last_line = $line;
}
# Process last entry
$self->_check_linelength($linelen);
my $pos = tell $fh;
if (@ids) {
my $strlen = $pos - $offset;
if ($linelen == 0) { # yet another pesky empty chr_random.fa file
$strlen = 0;
} else {
if ($last_line !~ /\s$/) {
$seq_lines--;
}
$strlen -= $termination_length * $seq_lines;
}
my $ppos = &{$self->{packmeth}}($offset, $strlen, $strlen, $linelen,
$headerlen, $alphabet, $fileno);
for my $id (@ids) {
$offsets->{$id} = $ppos;
}
}
return \%offsets;
}
=head2 seq
Title : seq, sequence, subseq
Usage : # Entire sequence string
my $seqstr = $db->seq($id);
# Subsequence
my $subseqstr = $db->seq($id, $start, $stop, $strand);
# or...
my $subseqstr = $db->seq($compound_id);
Function: Get a subseq of a sequence from the database. For your convenience,
the sequence to extract can be specified with any of the following
compound IDs:
$db->seq("$id:$start,$stop")
$db->seq("$id:$start..$stop")
$db->seq("$id:$start-$stop")
$db->seq("$id:$start,$stop/$strand")
$db->seq("$id:$start..$stop/$strand")
$db->seq("$id:$start-$stop/$strand")
$db->seq("$id/$strand")
In the case of DNA or RNA sequence, if $stop is less than $start,
then the reverse complement of the sequence is returned. Avoid using
it if possible since this goes against Bio::Seq conventions.
Returns : A string
Args : ID of sequence to retrieve
or
Compound ID of subsequence to fetch
or
ID, optional start (defaults to 1), optional end (defaults to length
of sequence) and optional strand (defaults to 1).
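The compound ID forms listed above can be illustrated with a small standalone
sketch. The helper name parse_compound_id() is hypothetical and this is not the
parser used internally by Bio::DB::IndexedBase; it only assumes the syntax
documented here, and that a start greater than the stop means the reverse
strand:

```perl
use strict;
use warnings;

# Hypothetical helper (not part of BioPerl): split a compound ID into
# (id, start, stop, strand), following the forms documented for seq().
sub parse_compound_id {
    my ($compound) = @_;
    my ($id, $start, $stop, $strand) = ($compound, undef, undef, 1);
    if ($compound =~ m{^(.+):(\d+)(?:,|\.\.|-)(\d+)(?:/([+-]?1))?$}) {
        # "$id:$start,$stop", "$id:$start..$stop" or "$id:$start-$stop",
        # optionally followed by "/$strand"
        ($id, $start, $stop) = ($1, $2, $3);
        $strand = $4 + 0 if defined $4;
    } elsif ($compound =~ m{^(.+)/([+-]?1)$}) {
        # "$id/$strand"
        ($id, $strand) = ($1, $2 + 0);
    }
    # Assumption: stop < start is treated as a request for the other strand
    if (defined $start && $start > $stop) {
        ($start, $stop) = ($stop, $start);
        $strand *= -1;
    }
    return ($id, $start, $stop, $strand);
}

my @parts = parse_compound_id('chr1:10..20/-1');
print join(' ', @parts), "\n";   # prints "chr1 10 20 -1"
```

A plain ID with no coordinates is returned unchanged, with an undefined start
and stop and a strand of 1.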
=cut
sub subseq {
my ($self, $id, $start, $stop, $strand) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
($id, $start, $stop, $strand) = $self->_parse_compound_id($id, $start, $stop, $strand);
my $data;
my $fh = $self->_fh($id) or return;
my $filestart = $self->_calc_offset($id, $start);
my $filestop = $self->_calc_offset($id, $stop );
seek($fh, $filestart,0);
read($fh, $data, $filestop-$filestart+1);
$data =~ s/\n//g;
$data =~ s/\r//g;
if ($strand == -1) {
# Reverse-complement the sequence
$data = Bio::PrimarySeqI::_revcom_from_string($self, $data, $self->alphabet($id));
}
return $data;
}
*seq = *sequence = \&subseq;
=head2 length
Title : length
Usage : my $length = $db->length($id);
Function: Get the number of residues in the indicated sequence.
Returns : Number
Args : ID of entry
=head2 header
Title : header
Usage : my $header = $db->header($id);
Function: Get the header line (ID and description fields) of the specified
sequence.
Returns : String
Args : ID of sequence
=cut
sub header {
my ($self, $id) = @_;
$self->throw('Need to provide a sequence ID') if not defined $id;
my ($offset, $headerlen) = (&{$self->{unpackmeth}}($self->{offsets}{$id}))[0,4];
$offset -= $headerlen;
my $data;
my $fh = $self->_fh($id) or return;
seek($fh, $offset, 0);
read($fh, $data, $headerlen);
chomp $data;
substr($data, 0, 1) = '';
return $data;
}
=head2 alphabet
Title : alphabet
Usage : my $alphabet = $db->alphabet($id);
Function: Get the molecular type of the indicated sequence: dna, rna or protein
Returns : String
Args : ID of sequence
=cut
#-------------------------------------------------------------
# Bio::PrimarySeqI compatibility
#
package Bio::PrimarySeq::Fasta;
use overload '""' => 'display_id';
use base qw(Bio::Root::Root Bio::PrimarySeqI);
sub new {
my ($class, @args) = @_;
my $self = $class->SUPER::new(@args);
my ($db, $id, $start, $stop) = $self->_rearrange(
[qw(DATABASE ID START STOP)],
@args);
$self->{db} = $db;
$self->{id} = $id;
$self->{stop} = $stop || $db->length($id);
$self->{start} = $start || ($self->{stop} > 0 ? 1 : 0); # handle 0-length seqs
return $self;
}
sub fetch_sequence {
return shift->seq(@_);
}
sub seq {
my $self = shift;
return $self->{db}->seq($self->{id}, $self->{start}, $self->{stop});
}
sub subseq {
my $self = shift;
return $self->trunc(@_)->seq();
}
sub trunc {
# Override Bio::PrimarySeqI trunc() method. This way, we create an object
# that does not store the sequence in memory.
my ($self, $start, $stop) = @_;
$self->throw("Stop cannot be smaller than start") if $stop < $start;
if ($self->{start} <= $self->{stop}) {
$start = $self->{start}+$start-1;
$stop = $self->{start}+$stop-1;
} else {
$start = $self->{start}-($start-1);
$stop = $self->{start}-($stop-1);
}
return $self->new( $self->{db}, $self->{id}, $start, $stop );
}
sub is_circular {
my $self = shift;
return $self->{is_circular};
}
sub display_id {
my $self = shift;
return $self->{id};
}
sub accession_number {
my $self = shift;
return 'unknown';
}
sub primary_id {
# Following Bio::PrimarySeqI, since this sequence has no accession number,
# its primary_id should be a stringified memory location.
my $self = shift;
return overload::StrVal($self);
}
sub can_call_new {
return 0;
}
sub alphabet {
my $self = shift;
return $self->{db}->alphabet($self->{id});
}
sub revcom {
# Override Bio::PrimarySeqI revcom() with optimized method.
my $self = shift;
return $self->new(@{$self}{'db', 'id', 'stop', 'start'});
}
sub length {
# Get length from sequence location, not the sequence string (too expensive)
my $self = shift;
return $self->{start} < $self->{stop} ?
$self->{stop} - $self->{start} + 1 :
$self->{start} - $self->{stop} + 1 ;
}
sub description {
my $self = shift;
my $header = $self->{'db'}->header($self->{id});
# Remove the ID from the header
return (split(/\s+/, $header, 2))[1];
}
*desc = \&description;
1;
# Grinder-0.5.3/lib/Bio/SeqFeature/Primer.pm
#
# BioPerl module for Bio::SeqFeature::Primer
#
# This is the original copyright statement. I have relied on Chad's module
# extensively for this module.
#
# Copyright (c) 1997-2001 bioperl, Chad Matsalla. All Rights Reserved.
# This module is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
# Copyright Chad Matsalla
#
# You may distribute this module under the same terms as perl itself
# POD documentation - main docs before the code
#
# But I have modified lots of it, so I guess I should add:
#
# Copyright (c) 2003 bioperl, Rob Edwards. All Rights Reserved.
# This module is free software; you can redistribute it and/or
# modify it under the same terms as Perl itself.
#
# Copyright Rob Edwards
#
# You may distribute this module under the same terms as perl itself
# POD documentation - main docs before the code
=head1 NAME
Bio::SeqFeature::Primer - Primer Generic SeqFeature
=head1 SYNOPSIS
use Bio::SeqFeature::Primer;
# Primer object with explicitly-defined sequence object or sequence string
my $primer = Bio::SeqFeature::Primer->new( -seq => 'ACGTAGCT' );
$primer->display_name('test_id');
print "These are the details of the primer:\n".
"Name: ".$primer->display_name."\n".
"Tag: ".$primer->primary_tag."\n". # always 'Primer'
"Sequence: ".$primer->seq->seq."\n".
"Tm: ".$primer->Tm."\n\n"; # melting temperature
# Primer object with implicit sequence object
# It is a lighter approach for when the primer location on a template is known
use Bio::Seq;
my $template = Bio::Seq->new( -seq => 'ACGTAGCTCTTTTCATTCTGACTGCAACG' );
$primer = Bio::SeqFeature::Primer->new( -start => 1, -end => 5, -strand => 1 );
$template->add_SeqFeature($primer);
print "Primer sequence is: ".$primer->seq->seq."\n";
# Primer sequence is 'ACGTA'
=head1 DESCRIPTION
This module handles PCR primer sequences. The L<Bio::SeqFeature::Primer> object
is a L<Bio::SeqFeature::SubSeq> object that can additionally contain a primer
sequence and its coordinates on a template sequence. The primary_tag() for this
object is 'Primer'. A method is provided to calculate the melting temperature Tm
of the primer. L<Bio::SeqFeature::Primer> objects are useful for building
amplicon objects.
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:
https://redmine.open-bio.org/projects/bioperl/
=head1 AUTHOR
Rob Edwards, redwards@utmem.edu
The original concept and much of the code was written by
Chad Matsalla, bioinformatics1@dieselwurks.com
=head1 APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
=cut
package Bio::SeqFeature::Primer;
use strict;
use Bio::PrimarySeq;
use Bio::Tools::SeqStats;
use base qw(Bio::SeqFeature::SubSeq);
=head2 new()
Title : new()
Usage : my $primer = Bio::SeqFeature::Primer->new( -seq => $seq_object );
Function: Instantiate a new Bio::SeqFeature::Primer object
Returns : A Bio::SeqFeature::Primer object
Args : -seq , a sequence object or a sequence string (optional)
-id , the ID to give to the primer sequence, not feature (optional)
=cut
sub new {
my ($class, %args) = @_;
# Legacy stuff
my $sequence = delete $args{-sequence};
if ($sequence) {
Bio::Root::Root->deprecated(
-message => 'Creating a Bio::SeqFeature::Primer with -sequence is deprecated. Use -seq instead.',
-warn_version => '1.006',
-throw_version => '1.008',
);
$args{-seq} = $sequence;
}
# Initialize Primer object
my $self = $class->SUPER::new(%args);
my ($id) = $self->_rearrange([qw(ID)], %args);
$id && $self->seq->id($id);
$self->primary_tag('Primer');
return $self;
}
# Bypass B::SF::Generic's location() when a string is passed (for compatibility)
sub location {
my ($self, $location) = @_;
if ($location) {
if ( not ref $location ) {
# Use location as a string for backward compatibility
Bio::Root::Root->deprecated(
-message => 'Passing a string to location() is deprecated. Pass a Bio::Location::Simple object or use start() and end() instead.',
-warn_version => '1.006',
-throw_version => '1.008',
);
$self->{'_location'} = $location;
} else {
$self->SUPER::location($location);
}
}
return $self->SUPER::location;
}
=head2 Tm()
Title : Tm()
Usage : my $tm = $primer->Tm(-salt => 0.05, -oligo => 0.0000001);
Function: Calculate the Tm (melting temperature) of the primer
Returns : A scalar containing the Tm.
Args : -salt : set the Na+ concentration on which to base the calculation
(default=0.05 molar).
: -oligo : set the oligo concentration on which to base the
calculation (default=0.00000025 molar).
Notes : Calculation of Tm as per Allawi et al., Biochemistry 1997
36:10581-10594. Also see documentation at
http://www.idtdna.com/Scitools/Scitools.aspx as they use this
formula and have a couple of nice help pages. These Tm values will be
about 0.5-3 degrees off from those of the idtdna web tool.
I don't know why.
This was suggested by Barry Moore (thanks!). See the discussion on
the bioperl-l with the subject "Bio::SeqFeature::Primer Calculating
the PrimerTM"
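The calculation can be followed outside of BioPerl with a minimal standalone
sketch. It reuses the thermodynamic values and the formula found in the Tm()
code of this module; the function name nn_tm() and the input validation are
assumptions made for illustration only:

```perl
use strict;
use warnings;

# Nearest-neighbor parameters as used by Tm():
# [ enthalpy in kcal/mol, entropy in cal/(mol*K) ]
my %nn = (
    AA => [ -7.9, -22.2 ], AC => [ -8.4, -22.4 ], AG => [ -7.8, -21.0 ], AT => [ -7.2, -20.4 ],
    CA => [ -8.5, -22.7 ], CC => [ -8.0, -19.9 ], CG => [-10.6, -27.2 ], CT => [ -7.8, -21.0 ],
    GA => [ -8.2, -22.2 ], GC => [ -9.8, -24.4 ], GG => [ -8.0, -19.9 ], GT => [ -8.4, -22.4 ],
    TA => [ -7.2, -21.3 ], TC => [ -8.2, -22.2 ], TG => [ -8.5, -22.7 ], TT => [ -7.9, -22.2 ],
);
# Initiation parameters for the first and last base
my %init = (
    A => [ 2.3, 4.1 ], C => [ 0.1, -2.8 ], G => [ 0.1, -2.8 ], T => [ 2.3, 4.1 ],
);

# Hypothetical helper: nearest-neighbor Tm of a plain ACGT string
sub nn_tm {
    my ($seq, $salt, $oligo) = @_;
    $salt  //= 0.05;        # Na+ concentration (molar), default of Tm()
    $oligo //= 0.00000025;  # oligo concentration (molar), default of Tm()
    $seq = uc $seq;
    die "Need at least two unambiguous bases\n" unless $seq =~ /^[ACGT]{2,}$/;
    my ($enthalpy, $entropy) = (0, 0);
    # Sum over all overlapping dinucleotides
    for my $i (0 .. length($seq) - 2) {
        my $pair = $nn{ substr($seq, $i, 2) };
        $enthalpy += $pair->[0];
        $entropy  += $pair->[1];
    }
    # Initiation terms for both ends
    for my $base (substr($seq, 0, 1), substr($seq, -1, 1)) {
        $enthalpy += $init{$base}[0];
        $entropy  += $init{$base}[1];
    }
    $entropy -= 1.4;   # symmetry correction, applied as in Tm()
    my $r = 1.987;     # molar gas constant in cal/(mol*K)
    return $enthalpy * 1000 / ($entropy + $r * log($oligo))
         - 273.15 + 12 * log($salt) / log(10);
}

printf "Tm: %.1f\n", nn_tm('ACGTAGCTCTTTTCATTC');
```

With these parameters, GC-rich primers come out with a higher Tm than AT-rich
primers of the same length, and raising the salt concentration raises the Tm.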
=cut
sub Tm {
my ($self, %args) = @_;
my $salt_conc = 0.05; # salt concentration (molar units)
my $oligo_conc = 0.00000025; # oligo concentration (molar units)
if ($args{'-salt'}) {
# Accept object defined salt concentration
$salt_conc = $args{'-salt'};
}
if ($args{'-oligo'}) {
# Accept object defined oligo concentration
$oligo_conc = $args{'-oligo'};
}
my $seqobj = $self->seq();
my $length = $seqobj->length();
my $sequence = uc $seqobj->seq();
my @dinucleotides;
my $enthalpy;
my $entropy;
# Break sequence string into an array of all possible dinucleotides
while ($sequence =~ /(.)(?=(.))/g) {
push @dinucleotides, $1.$2;
}
# Build a hash with the thermodynamic values
my %thermo_values = ('AA' => {'enthalpy' => -7.9,
'entropy' => -22.2},
'AC' => {'enthalpy' => -8.4,
'entropy' => -22.4},
'AG' => {'enthalpy' => -7.8,
'entropy' => -21},
'AT' => {'enthalpy' => -7.2,
'entropy' => -20.4},
'CA' => {'enthalpy' => -8.5,
'entropy' => -22.7},
'CC' => {'enthalpy' => -8,
'entropy' => -19.9},
'CG' => {'enthalpy' => -10.6,
'entropy' => -27.2},
'CT' => {'enthalpy' => -7.8,
'entropy' => -21},
'GA' => {'enthalpy' => -8.2,
'entropy' => -22.2},
'GC' => {'enthalpy' => -9.8,
'entropy' => -24.4},
'GG' => {'enthalpy' => -8,
'entropy' => -19.9},
'GT' => {'enthalpy' => -8.4,
'entropy' => -22.4},
'TA' => {'enthalpy' => -7.2,
'entropy' => -21.3},
'TC' => {'enthalpy' => -8.2,
'entropy' => -22.2},
'TG' => {'enthalpy' => -8.5,
'entropy' => -22.7},
'TT' => {'enthalpy' => -7.9,
'entropy' => -22.2},
'A' => {'enthalpy' => 2.3,
'entropy' => 4.1},
'C' => {'enthalpy' => 0.1,
'entropy' => -2.8},
'G' => {'enthalpy' => 0.1,
'entropy' => -2.8},
'T' => {'enthalpy' => 2.3,
'entropy' => 4.1}
);
# Loop through dinucleotides and calculate cumulative enthalpy and entropy values
for (@dinucleotides) {
$enthalpy += $thermo_values{$_}{enthalpy};
$entropy += $thermo_values{$_}{entropy};
}
# Account for initiation parameters
$enthalpy += $thermo_values{substr($sequence, 0, 1)}{enthalpy};
$entropy += $thermo_values{substr($sequence, 0, 1)}{entropy};
$enthalpy += $thermo_values{substr($sequence, -1, 1)}{enthalpy};
$entropy += $thermo_values{substr($sequence, -1, 1)}{entropy};
# Symmetry correction
$entropy -= 1.4;
my $r = 1.987; # molar gas constant
my $tm = $enthalpy * 1000 / ($entropy + ($r * log($oligo_conc))) - 273.15 + (12* (log($salt_conc)/log(10)));
return $tm;
}
=head2 Tm_estimate
Title : Tm_estimate
Usage : my $tm = $primer->Tm_estimate(-salt => 0.05);
Function: Estimate the Tm (melting temperature) of the primer
Returns : A scalar containing the Tm.
Args : -salt set the Na+ concentration on which to base the calculation.
Notes : This is only an estimate of the Tm that is kept in for comparative
reasons. You should probably use Tm instead!
These Tm calculations are taken from the Primer3 docs: They are
based on Bolton and McCarthy, PNAS 48:1390 (1962)
as presented in Sambrook, Fritsch and Maniatis,
Molecular Cloning, p 11.46 (1989, CSHL Press).
Tm = 81.5 + 16.6(log10([Na+])) + .41*(%GC) - 600/length
where [Na+] is the molar sodium concentration, %GC is the
%G+C of the sequence, and length is the length of the sequence.
However.... I can never get this calculation to give me the same result
as primer3 does. Don't ask why, I never figured it out. But I did
want to include a Tm calculation here because I use these modules for
other things besides reading primer3 output.
The primer3 calculation is saved as 'PRIMER_LEFT_TM' or 'PRIMER_RIGHT_TM'
and this calculation is saved as $primer->Tm so you can get both and
average them!
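As a worked illustration of the formula above, here is a minimal standalone
sketch that does not depend on Bio::Tools::SeqStats. The function name
tm_estimate() is hypothetical; the default salt concentration of 0.2 is the one
used in the code of this method:

```perl
use strict;
use warnings;

# Hypothetical helper: Tm = 81.5 + 16.6*log10([Na+]) + 0.41*(%GC) - 600/length
sub tm_estimate {
    my ($seq, $salt) = @_;
    $salt //= 0.2;                   # molar Na+ concentration
    my $length = length $seq;
    die "Empty sequence\n" unless $length;
    my $gc = ($seq =~ tr/GCgc//);    # tr returns the number of G and C residues
    my $percent_gc = 100 * $gc / $length;
    return 81.5 + 16.6 * log($salt) / log(10) + 0.41 * $percent_gc - 600 / $length;
}

# 10 bases, 50% GC, 1 M salt: 81.5 + 0 + 20.5 - 60
printf "%.1f\n", tm_estimate('ACGTACGTAC', 1.0);   # prints 42.0
```

At 1 M salt the logarithmic term vanishes, which makes the contribution of the
GC content and of the length easy to check by hand.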
=cut
sub Tm_estimate {
# This should probably be put into seqstats as it is more generic, but what the heck.
my ($self, %args) = @_;
my $salt = 0.2;
if ($args{'-salt'}) {
$salt = $args{'-salt'};
}
my $seqobj = $self->seq();
my $length = $seqobj->length();
my $seqdata = Bio::Tools::SeqStats->count_monomers($seqobj);
my $gc=$$seqdata{'G'} + $$seqdata{'C'};
my $percent_gc = ($gc/$length)*100;
my $tm = 81.5+(16.6*(log($salt)/log(10)))+(0.41*$percent_gc) - (600/$length);
return $tm;
}
=head2 primary_tag, source_tag, location, start, end, strand...
The documentation of L<Bio::SeqFeature::Generic> describes all the methods that
L<Bio::SeqFeature::Primer> objects inherit.
=cut
1;
# Grinder-0.5.3/lib/Bio/SeqFeature/Amplicon.pm
#
# BioPerl module for Bio::SeqFeature::Amplicon
#
# Please direct questions and support issues to <bioperl-l@bioperl.org>
#
# Copyright Florent Angly
#
# You may distribute this module under the same terms as perl itself
=head1 NAME
Bio::SeqFeature::Amplicon - Amplicon feature
=head1 SYNOPSIS
# Amplicon with explicit sequence
use Bio::SeqFeature::Amplicon;
my $amplicon = Bio::SeqFeature::Amplicon->new(
-seq => $seq_object,
-fwd_primer => $primer_object_1,
-rev_primer => $primer_object_2,
);
# Amplicon with implicit sequence
use Bio::Seq;
my $template = Bio::Seq->new( -seq => 'AAAAACCCCCGGGGGTTTTT' );
$amplicon = Bio::SeqFeature::Amplicon->new(
-start => 6,
-end => 15,
);
$template->add_SeqFeature($amplicon);
print "Amplicon start : ".$amplicon->start."\n";
print "Amplicon end : ".$amplicon->end."\n";
print "Amplicon sequence: ".$amplicon->seq->seq."\n";
# Amplicon sequence should be 'CCCCCGGGGG'
=head1 DESCRIPTION
Bio::SeqFeature::Amplicon extends L<Bio::SeqFeature::SubSeq> to represent an
amplicon sequence and optional primer sequences.
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via
the web:
https://redmine.open-bio.org/projects/bioperl/
=head1 AUTHOR
Florent Angly
=head1 APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
=cut
package Bio::SeqFeature::Amplicon;
use strict;
use base qw(Bio::SeqFeature::SubSeq);
=head2 new
Title : new()
Usage : my $amplicon = Bio::SeqFeature::Amplicon->new( -seq => $seq_object );
Function: Instantiate a new Bio::SeqFeature::Amplicon object
Args : -seq , the sequence object or sequence string of the amplicon (optional)
-fwd_primer , a Bio::SeqFeature primer object with specified location on amplicon (optional)
-rev_primer , a Bio::SeqFeature primer object with specified location on amplicon (optional)
Returns : A Bio::SeqFeature::Amplicon object
=cut
sub new {
my ($class, @args) = @_;
my $self = $class->SUPER::new(@args);
my ($fwd_primer, $rev_primer) =
$self->_rearrange([qw(FWD_PRIMER REV_PRIMER)], @args);
$fwd_primer && $self->fwd_primer($fwd_primer);
$rev_primer && $self->rev_primer($rev_primer);
return $self;
}
sub _primer {
# Get or set a primer. Type is either 'fwd' or 'rev'.
my ($self, $type, $primer) = @_;
if (defined $primer) {
if ( not(ref $primer) || not $primer->isa('Bio::SeqFeature::Primer') ) {
$self->throw("Expected a primer object but got a '".ref($primer)."'\n");
}
if ( not defined $self->location ) {
$self->throw("Location of $type primer on amplicon is not known. ".
"Use start(), end() or location() to set it.");
}
$primer->primary_tag($type.'_primer');
$self->add_SeqFeature($primer);
}
return (grep { $_->primary_tag eq $type.'_primer' } $self->get_SeqFeatures)[0];
}
=head2 fwd_primer
Title : fwd_primer
Usage : my $primer = $feat->fwd_primer();
Function: Get or set the forward primer. When setting it, the primary tag
'fwd_primer' is added to the primer and its start, stop and strand
attributes are set if needed, assuming that the forward primer is
at the beginning of the amplicon and the reverse primer at the end.
Args : A Bio::SeqFeature::Primer object (optional)
Returns : A Bio::SeqFeature::Primer object
=cut
sub fwd_primer {
my ($self, $primer) = @_;
return $self->_primer('fwd', $primer);
}
=head2 rev_primer
Title : rev_primer
Usage : my $primer = $feat->rev_primer();
Function: Get or set the reverse primer. When setting it, the primary tag
'rev_primer' is added to the primer.
Args : A Bio::SeqFeature::Primer object (optional)
Returns : A Bio::SeqFeature::Primer object
=cut
sub rev_primer {
my ($self, $primer) = @_;
return $self->_primer('rev', $primer);
}
1;
# Grinder-0.5.3/lib/Bio/SeqFeature/SubSeq.pm
#
# BioPerl module for Bio::SeqFeature::SubSeq
#
# Please direct questions and support issues to <bioperl-l@bioperl.org>
#
# Copyright Florent Angly
#
# You may distribute this module under the same terms as perl itself
=head1 NAME
Bio::SeqFeature::SubSeq - Feature representing a subsequence
=head1 SYNOPSIS
# SubSeq with implicit sequence
use Bio::Seq;
use Bio::SeqFeature::SubSeq;
my $template = Bio::Seq->new( -seq => 'AAAAACCCCCGGGGGTTTTT' );
my $subseq = Bio::SeqFeature::SubSeq->new(
-start => 6,
-end => 15,
-template => $template,
);
print "Subsequence is: ".$subseq->seq->seq."\n"; # Should be 'CCCCCGGGGG'
# SubSeq with explicit sequence
$subseq = Bio::SeqFeature::SubSeq->new(
-seq => $seq_object,
);
=head1 DESCRIPTION
Bio::SeqFeature::SubSeq extends L<Bio::SeqFeature::Generic> features to
represent a subsequence. When this feature is attached to a template sequence,
the sequence of the feature is the subsequence of the template at this location. The
purpose of this class is to represent a sequence as a feature without having to
explicitly store its sequence string.
Of course, you might have reasons to explicitly set a sequence. In that case,
note that the length of the sequence is allowed to not match the position of the
feature. For example, you can set a sequence of length 10 in a SubSeq feature that
spans positions 30 to 50 of the template if you so desire.
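The idea of an implied sequence can be sketched on plain strings, without any
BioPerl object. The helper implied_subseq() below is hypothetical and only
assumes 1-based inclusive coordinates and a DNA template; it is not how the
feature retrieves its sequence internally:

```perl
use strict;
use warnings;

# Hypothetical helper: derive a subsequence from a template string and a
# feature location, instead of storing the subsequence itself.
sub implied_subseq {
    my ($template, $start, $end, $strand) = @_;
    $strand //= 1;
    # Convert 1-based inclusive coordinates to a substr() offset and length
    my $sub = substr($template, $start - 1, $end - $start + 1);
    if ($strand == -1) {
        # Reverse-complement for features on the reverse strand
        $sub = reverse $sub;
        $sub =~ tr/ACGTacgt/TGCAtgca/;
    }
    return $sub;
}

print implied_subseq('AAAAACCCCCGGGGGTTTTT', 6, 15), "\n";   # prints CCCCCGGGGG
```

Only the template and three numbers need to be kept around, which is the
memory saving that this class is after.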
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via
the web:
https://redmine.open-bio.org/projects/bioperl/
=head1 AUTHOR
Florent Angly
=head1 APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
=cut
package Bio::SeqFeature::SubSeq;
use strict;
use base qw(Bio::SeqFeature::Generic);
=head2 new
Title : new()
Usage : my $subseq = Bio::SeqFeature::SubSeq->new( -start => 1, -end => 10, -strand => -1);
Function: Instantiate a new Bio::SeqFeature::SubSeq feature object
Args : -seq , the sequence object or sequence string of the feature (optional)
-template , attach the feature to the provided parent template sequence or feature (optional).
Note that you must specify the feature location to do this.
-start, -end, -location, -strand and all other L<Bio::SeqFeature::Generic> arguments can be used.
Returns : A Bio::SeqFeature::SubSeq object
=cut
sub new {
my ($class, @args) = @_;
my $self = $class->SUPER::new(@args);
my ($seq, $template) = $self->_rearrange([qw(SEQ TEMPLATE)], @args);
if (defined $seq) {
# Set the subsequence explicitly
if (not ref $seq) {
# Convert string to sequence object
$seq = Bio::PrimarySeq->new( -seq => $seq );
} else {
# Sanity check
if (not $seq->isa('Bio::PrimarySeqI')) {
$self->throw("Expected a sequence object but got a '".ref($seq)."'\n");
}
}
$self->seq($seq);
}
if ($template) {
if ( not($self->start) || not($self->end) ) {
$self->throw("Could not attach feature to template $template because".
" the feature location was not specified.");
}
# Need to attach to parent sequence and then add sequence feature
my $template_seq;
if ($template->isa('Bio::SeqFeature::Generic')) {
$template_seq = $template->entire_seq;
} elsif ($template->isa('Bio::SeqI')) {
$template_seq = $template;
} else {
$self->throw("Expected a Bio::SeqFeature::Generic or Bio::SeqI object".
" as template, but got '$template'.");
}
$self->attach_seq($template_seq);
$template->add_SeqFeature($self);
}
return $self;
}
=head2 seq
Title : seq()
Usage : my $seq = $subseq->seq();
Function: Get or set the sequence object of this SubSeq feature. If no sequence
was provided, but the subseq is attached to a sequence, get the
corresponding subsequence.
Returns : A sequence object or undef
Args : A sequence object (optional)
=cut
sub seq {
my ($self, $value) = @_;
if (defined $value) {
# The sequence is explicit
if ( not(ref $value) || not $value->isa('Bio::PrimarySeqI') ) {
$self->throw("Expected a sequence object but got a '".ref($value)."'\n");
}
$self->{seq} = $value;
}
my $seq = $self->{seq};
if (not defined $seq) {
# The sequence is implied
$seq = $self->SUPER::seq;
}
return $seq;
}
=head2 length
Title : length()
Usage : my $length = $subseq->length();
Function: Get the length of the SubSeq feature. It is similar to the length()
method of L<Bio::SeqFeature::Generic>, which computes length based
on the location of the feature. However, if the feature was not
given a location, return the length of the subsequence if possible.
Returns : integer or undef
Args : None.
=cut
sub length {
my ($self) = @_;
# Try length from location first
if ($self->start && $self->end) {
return $self->SUPER::length();
}
# Then try length from subsequence
my $seq = $self->seq;
if (defined $seq) {
return length $seq->seq;
}
# We failed
return undef;
}
1;
# Grinder-0.5.3/lib/Bio/PrimarySeq.pm
#
# bioperl module for Bio::PrimarySeq
#
# Please direct questions and support issues to <bioperl-l@bioperl.org>
#
# Cared for by Ewan Birney
#
# Copyright Ewan Birney
#
# You may distribute this module under the same terms as perl itself
# POD documentation - main docs before the code
=head1 NAME
Bio::PrimarySeq - Bioperl lightweight sequence object
=head1 SYNOPSIS
# Bio::SeqIO for file reading, Bio::DB::GenBank for
# database reading
use Bio::Seq;
use Bio::SeqIO;
use Bio::DB::GenBank;
# make from memory
$seqobj = Bio::PrimarySeq->new (
-seq => 'ATGGGGTGGGCGGTGGGTGGTTTG',
-id => 'GeneFragment-12',
-accession_number => 'X78121',
-alphabet => 'dna',
-is_circular => 1,
);
print "Sequence ", $seqobj->id(), " with accession ",
$seqobj->accession_number, "\n";
# read from file
$inputstream = Bio::SeqIO->new(
-file => "myseq.fa",
-format => 'Fasta',
);
$seqobj = $inputstream->next_seq();
print "Sequence ", $seqobj->id(), " and desc ", $seqobj->desc, "\n";
# to get out parts of the sequence.
print "Sequence ", $seqobj->id(), " with accession ",
$seqobj->accession_number, " and desc ", $seqobj->desc, "\n";
$string = $seqobj->seq();
$string2 = $seqobj->subseq(1,40);
=head1 DESCRIPTION
PrimarySeq is a lightweight sequence object, storing the sequence, its
name, a computer-useful unique name, and other fundamental attributes.
It does not contain sequence features or other information. To have a
sequence with sequence features you should use the Seq object which uses
this object.
Although new users will use Bio::PrimarySeq a lot, in general you will
be using it from the Bio::Seq object. For more information on Bio::Seq
see L<Bio::Seq>. For interest you might like to know that
Bio::Seq has-a Bio::PrimarySeq and forwards most of the function calls
to do with sequence to it (the has-a relationship lets us get out of an
otherwise nasty cyclical reference in Perl which would leak memory).
Sequence objects are defined by the Bio::PrimarySeqI interface, and this
object is a pure Perl implementation of the interface. If that's
gibberish to you, don't worry. The take home message is that this
object is the bioperl default sequence object, but other people can
use their own objects as sequences if they so wish. If you are
interested in wrapping your own objects as compliant Bioperl sequence
objects, then you should read the Bio::PrimarySeqI documentation.
The documentation of this object is a merge of the Bio::PrimarySeq and
Bio::PrimarySeqI documentation. This way, all the methods that you can
call on sequence objects are documented here.
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:
https://redmine.open-bio.org/projects/bioperl/
=head1 AUTHOR - Ewan Birney
Email birney@ebi.ac.uk
=head1 APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
=cut
package Bio::PrimarySeq;
use strict;
our $MATCHPATTERN = 'A-Za-z\-\.\*\?=~';
our $GAP_SYMBOLS = '-~';
use base qw(Bio::Root::Root Bio::PrimarySeqI
Bio::IdentifiableI Bio::DescribableI);
# Setup the allowed values for alphabet()
my %valid_type = map {$_, 1} qw( dna rna protein );
=head2 new
Title : new
Usage : $seqobj = Bio::PrimarySeq->new( -seq => 'ATGGGGGTGGTGGTACCCT',
-id => 'human_id',
-accession_number => 'AL000012',
);
Function: Returns a new primary seq object from
basic constructors, being a string for the sequence
and strings for id and accession_number.
Note that you can provide an empty sequence string. However, in
this case you MUST specify the type of sequence you wish to
initialize by the parameter -alphabet. See alphabet() for possible
values.
Returns : a new Bio::PrimarySeq object
Args : -seq => sequence string
-ref_to_seq => ... or reference to a sequence string
-display_id => display id of the sequence (locus name)
-accession_number => accession number
-primary_id => primary id (Genbank id)
-version => version number
-namespace => the namespace for the accession
-authority => the authority for the namespace
-description => description text
-desc => alias for description
-alphabet => skip alphabet guess and set it to dna, rna or protein
-id => alias for display id
-is_circular => boolean to indicate that sequence is circular
-direct => boolean to directly set sequences. The next time -seq,
seq() or -ref_to_seq is used, the sequence will not be
validated. Be careful with this...
-nowarnonempty => boolean to avoid warning when sequence is empty
=cut
sub new {
my ($class, @args) = @_;
my $self = $class->SUPER::new(@args);
my ($seq, $id, $acc, $pid, $ns, $auth, $v, $oid, $desc, $description,
$alphabet, $given_id, $is_circular, $direct, $ref_to_seq, $len,
$nowarnonempty) =
$self->_rearrange([qw(SEQ
DISPLAY_ID
ACCESSION_NUMBER
PRIMARY_ID
NAMESPACE
AUTHORITY
VERSION
OBJECT_ID
DESC
DESCRIPTION
ALPHABET
ID
IS_CIRCULAR
DIRECT
REF_TO_SEQ
LENGTH
NOWARNONEMPTY
)],
@args);
# Private var _nowarnonempty, needs to be set before calling _guess_alphabet
$self->{'_nowarnonempty'} = $nowarnonempty;
$self->{'_direct'} = $direct;
if( defined $id && defined $given_id ) {
if( $id ne $given_id ) {
$self->throw("Provided both id and display_id constructors: [$id] [$given_id]");
}
}
if( defined $given_id ) { $id = $given_id; }
# Bernd's idea: set ids now for more informative invalid sequence messages
defined $id && $self->display_id($id);
$acc && $self->accession_number($acc);
defined $pid && $self->primary_id($pid);
# Set alphabet now to avoid guessing it later, when sequence is set
$alphabet && $self->alphabet($alphabet);
# Set the length before the seq. If there is a seq, length will be updated later
$self->{'length'} = $len || 0;
# Set the sequence (but also alphabet and length)
if ($ref_to_seq) {
$self->_set_seq_by_ref($ref_to_seq, $alphabet);
} else {
if (defined $seq) {
# Note: the sequence string may be empty
$self->seq($seq);
}
}
$desc && $self->desc($desc);
$description && $self->description($description);
$is_circular && $self->is_circular($is_circular);
$ns && $self->namespace($ns);
$auth && $self->authority($auth);
defined($v) && $self->version($v);
defined($oid) && $self->object_id($oid);
return $self;
}
=head2 seq
Title : seq
Usage : $string = $seqobj->seq();
Function: Get or set the sequence as a string of letters. The case of
the letters is left up to the implementer. Suggested cases are
upper case for proteins and lower case for DNA sequence (IUPAC
standard), but you should not rely on this. An error is thrown if
the sequence contains invalid characters: see validate_seq().
Returns : A scalar
Args : - Optional new sequence value (a string) to set
- Optional alphabet (it is guessed by default)
=cut
sub seq {
my ($self, @args) = @_;
if( scalar @args == 0 ) {
return $self->{'seq'};
}
my ($seq_str, $alphabet) = @args;
if (@args) {
$self->_set_seq_by_ref(\$seq_str, $alphabet);
}
return $self->{'seq'};
}
sub _set_seq_by_ref {
# Set a sequence by reference. A reference is used to avoid the cost of
# copying the sequence (which can be very large) between functions.
my ($self, $seq_str_ref, $alphabet) = @_;
# Validate sequence if sequence is not empty and we are not in direct mode
if ( (! $self->{'_direct'}) && (defined $$seq_str_ref) ) {
$self->validate_seq($$seq_str_ref, 1);
}
delete $self->{'_direct'}; # next sequence will have to be validated
# Record sequence length
my $len = CORE::length($$seq_str_ref || '');
my $is_changed_seq = (exists $self->{'seq'}) && ($len > 0);
# Note: if the new seq is empty or undef, this is not considered a change
delete $self->{'_freeze_length'} if $is_changed_seq;
$self->{'length'} = $len if not exists $self->{'_freeze_length'};
# Set sequence
$self->{'seq'} = $$seq_str_ref;
# Set or guess alphabet
if ($alphabet) {
# Alphabet specified, set it no matter what
$self->alphabet($alphabet);
} elsif ($is_changed_seq || (! defined($self->alphabet()))) {
# If we changed a previous sequence to a new one or if there is no
# alphabet yet at all, we need to guess the (possibly new) alphabet
$self->_guess_alphabet();
} # else (seq not changed and alphabet was defined) do nothing
return 1;
}
=head2 validate_seq
Title : validate_seq
Usage : if(! $seqobj->validate_seq($seq_str) ) {
print "sequence $seq_str is not valid for an object of
alphabet ",$seqobj->alphabet, "\n";
}
Function: Test that the given sequence is valid, i.e. contains only valid
characters. The allowed characters are all letters (A-Z) and '-','.',
'*','?','=' and '~'. Spaces are not valid. Note that this
implementation does not take alphabet() into account and that empty
sequences are considered valid.
Returns : 1 if the supplied sequence string is valid, 0 otherwise.
Args : - Sequence string to be validated
- Boolean to optionally throw an error if the sequence is invalid
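
The following sketch illustrates the rules described above; the test strings
and the id are arbitrary examples:

  use strict;
  use warnings;
  use Bio::PrimarySeq;

  my $seqobj = Bio::PrimarySeq->new( -seq => 'ACGT', -id => 'example' );

  # Letters and gap symbols are valid, spaces are not, empty strings are valid
  for my $str ('ACGT-N', 'AC GT', '') {
      my $ok = $seqobj->validate_seq($str);
      print "'$str' is ", ($ok ? 'valid' : 'invalid'), "\n";
  }

  # With a true second argument, an invalid string throws an exception
  eval { $seqobj->validate_seq('AC GT', 1) };
  print "Caught an exception\n" if $@;
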
=cut
sub validate_seq {
my ($self, $seqstr, $throw) = @_;
if ( (defined $seqstr ) &&
($seqstr !~ /^[$MATCHPATTERN]*$/) ) {
if ($throw) {
$self->throw("Failed validation of sequence '".(defined($self->id) ?
$self->id : '[unidentified sequence]')."'. Invalid characters were: " .
join('',($seqstr =~ /[^$MATCHPATTERN]/g)));
}
return 0;
}
return 1;
}
=head2 subseq
Title : subseq
Usage : $substring = $seqobj->subseq(10,40);
$substring = $seqobj->subseq(10,40,'nogap');
$substring = $seqobj->subseq(-start=>10, -end=>40, -replace_with=>'tga');
$substring = $seqobj->subseq($location_obj);
$substring = $seqobj->subseq($location_obj, -nogap => 1);
Function: Return the subseq from start to end, where the first sequence
character has coordinate 1 and the end coordinate is inclusive, i.e. 1-2 are
the first two characters of the sequence. The given start coordinate
has to be smaller than or equal to the end, even if the sequence is circular.
Returns : a string
Args : integer for start position
integer for end position
OR
Bio::LocationI location for subseq (strand honored)
Specify -NOGAP=>1 to return subseq with gap characters removed
Specify -REPLACE_WITH=>$new_subseq to replace the subseq returned
with $new_subseq in the sequence object
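
A short sketch of the 1-based, inclusive coordinates and of the gap removal
option, using an arbitrary example sequence:

  use strict;
  use warnings;
  use Bio::PrimarySeq;

  my $seqobj = Bio::PrimarySeq->new( -seq => 'ACGTAC--GT', -alphabet => 'dna' );

  print $seqobj->subseq(1, 2), "\n";             # AC
  print $seqobj->subseq(5, 10), "\n";            # AC--GT
  print $seqobj->subseq(5, 10, 'nogap'), "\n";   # ACGT

  # Named arguments are equivalent to the positional form
  print $seqobj->subseq(-start => 5, -end => 10, -nogap => 1), "\n";   # ACGT
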
=cut
sub subseq {
my $self = shift;
my @args = @_;
my ($start, $end, $nogap, $replace) = $self->_rearrange([qw(START
END
NOGAP
REPLACE_WITH)], @args);
# If -replace_with is specified, validate the replacement sequence
if (defined $replace) {
$self->validate_seq( $replace ) ||
$self->throw("Replacement sequence does not look valid");
}
if( ref($start) && $start->isa('Bio::LocationI') ) {
my $loc = $start;
my $seq = '';
foreach my $subloc ($loc->each_Location()) {
my $piece = $self->subseq(-start => $subloc->start(),
-end => $subloc->end(),
-replace_with => $replace,
-nogap => $nogap);
$piece =~ s/[$GAP_SYMBOLS]//g if $nogap;
if ($subloc->strand() < 0) {
$piece = $self->_revcom_from_string($piece, $self->alphabet);
}
$seq .= $piece;
}
return $seq;
} elsif( defined $start && defined $end ) {
if( $start > $end ){
$self->throw("Bad start,end parameters. Start [$start] has to be ".
"less than end [$end]");
}
if( $start <= 0 ) {
$self->throw("Bad start parameter ($start). Start must be positive.");
}
# Remove one from start, and then length is end-start
$start--;
my $seqstr;
if (defined $replace) {
$seqstr = substr $self->{seq}, $start, $end-$start, $replace;
} else {
$seqstr = substr $self->{seq}, $start, $end-$start;
}
if ($end > $self->length) {
if ($self->is_circular) {
my $start = 0;
my $end = $end - $self->length;
my $appendstr;
if (defined $replace) {
$appendstr = substr $self->{seq}, $start, $end-$start, $replace;
} else {
$appendstr = substr $self->{seq}, $start, $end-$start;
}
$seqstr .= $appendstr;
} else {
$self->throw("Bad end parameter ($end). End must be less than ".
"the total length of sequence (total=".$self->length.")")
}
}
$seqstr =~ s/[$GAP_SYMBOLS]//g if ($nogap);
return $seqstr;
} else {
$self->warn("Incorrect parameters to subseq - must be two integers or ".
"a Bio::LocationI object. Got:", $self,$start,$end,$replace,$nogap);
return;
}
}
=head2 length
Title : length
Usage : $len = $seqobj->length();
Function: Get the stored length of the sequence in number of symbols (bases
or amino acids).
In some circumstances, you can also set this attribute:
1/ For empty sequences, you can set the length to anything you want:
my $seqobj = Bio::PrimarySeq->new( -length => 123 );
my $len = $seqobj->length; # 123
2/ To save memory when using very long sequences, you can set the
length of the sequence to the length of the sequence (and nothing
else):
my $seqobj = Bio::PrimarySeq->new( -seq => 'ACGT...' ); # 1 Mbp sequence
# process $seqobj... then after you're done with it
$seqobj->length($seqobj->length);
$seqobj->seq(undef); # free memory!
my $len = $seqobj->length; # 1 Mbp
Note that if you set seq() to a value other than undef at any time,
the length attribute will be reset.
Returns : integer representing the length of the sequence.
Args : Optionally, the value on set
=cut
sub length {
my ($self, $val) = @_;
if (defined $val) {
my $len = $self->{'length'};
if ($len && ($len != $val)) {
$self->throw("You're trying to lie about the length: ".
"is $len but you say ".$val);
}
$self->{'length'} = $val;
$self->{'_freeze_length'} = undef;
}
return $self->{'length'};
}
=head2 display_id
Title : display_id or display_name
Usage : $id_string = $seqobj->display_id();
Function: Get or set the display id, aka the common name of the sequence object.
The semantics of this is that it is the most likely string to
be used as an identifier of the sequence, and likely to have
"human" readability. The id is equivalent to the ID field of
the GenBank/EMBL databanks and the id field of the
Swissprot/sptrembl database. In fasta format, the >(\S+) is
presumed to be the id, though some people overload the id to
embed other information. Bioperl does not use any embedded
information in the ID field, and people are encouraged to use
other mechanisms (accession field for example, or extending
the sequence object) to solve this.
With the new Bio::DescribableI interface, display_name aliases
to this method.
Returns : A string for the display ID
Args : Optional string for the display ID to set
=cut
sub display_id {
my ($self, $value) = @_;
if( defined $value) {
$self->{'display_id'} = $value;
}
return $self->{'display_id'};
}
=head2 accession_number
Title : accession_number or object_id
Usage : $unique_key = $seqobj->accession_number;
Function: Returns the unique biological id for a sequence, commonly
called the accession_number. For sequences from established
databases, the implementors should try to use the correct
accession number. Notice that primary_id() provides the
unique id for the implementation, allowing multiple objects
to have the same accession number in a particular implementation.
For sequences with no accession number, this method should
return "unknown".
[Note this method name is likely to change in 1.3]
With the new Bio::IdentifiableI interface, this is aliased
to object_id
Returns : A string
Args : A string (optional) for setting
=cut
sub accession_number {
my( $self, $acc ) = @_;
if (defined $acc) {
$self->{'accession_number'} = $acc;
} else {
$acc = $self->{'accession_number'};
$acc = 'unknown' unless defined $acc;
}
return $acc;
}
=head2 primary_id
Title : primary_id
Usage : $unique_key = $seqobj->primary_id;
Function: Returns the unique id for this object in this
implementation. This allows implementations to manage their
own object ids in a way the implementation can control;
clients can expect one id to map to one object.
For sequences with no natural primary id, this method
should return a stringified memory location.
Returns : A string
Args : A string (optional, for setting)
=cut
sub primary_id {
my $self = shift;
if(@_) {
$self->{'primary_id'} = shift;
}
if( ! defined($self->{'primary_id'}) ) {
return "$self";
}
return $self->{'primary_id'};
}
=head2 alphabet
Title : alphabet
Usage : if( $seqobj->alphabet eq 'dna' ) { # Do something }
Function: Get/set the alphabet of sequence, one of
'dna', 'rna' or 'protein'. The value is stored in lower case, so
compare it against lower-case strings.
This is not called 'type' because this would cause
upgrade problems from the 0.5 and earlier Seq objects.
Returns : a string either 'dna','rna','protein'. NB - the object must
make a call of the type - if there is no alphabet specified it
has to guess.
Args : optional string to set : 'dna' | 'rna' | 'protein'
=cut
sub alphabet {
my ($self,$value) = @_;
if (defined $value) {
$value = lc $value;
unless ( $valid_type{$value} ) {
$self->throw("Alphabet '$value' is not a valid alphabet (".
join(',', map "'$_'", sort keys %valid_type) .") lowercase");
}
$self->{'alphabet'} = $value;
}
return $self->{'alphabet'};
}
=head2 desc
Title : desc or description
Usage : $seqobj->desc($newval);
Function: Get/set description of the sequence.
'description' is an alias for this for compliance with the
Bio::DescribableI interface.
Returns : value of desc (a string)
Args : newvalue (a string or undef, optional)
=cut
sub desc{
my $self = shift;
return $self->{'desc'} = shift if @_;
return $self->{'desc'};
}
=head2 can_call_new
Title : can_call_new
Usage :
Function:
Example :
Returns : true
Args :
=cut
sub can_call_new {
my ($self) = @_;
return 1;
}
=head2 id
Title : id
Usage : $id = $seqobj->id();
Function: This is mapped on display_id
Example :
Returns : A string for the display ID
Args : Optional string for the display ID to set
=cut
sub id {
return shift->display_id(@_);
}
=head2 is_circular
Title : is_circular
Usage : if( $seqobj->is_circular) { # Do something }
Function: Get or set whether the molecule is circular
Returns : Boolean value
Args : Optional boolean value to set
=cut
sub is_circular{
my $self = shift;
return $self->{'is_circular'} = shift if @_;
return $self->{'is_circular'};
}
=head1 Methods for Bio::IdentifiableI compliance
=head2 object_id
Title : object_id
Usage : $string = $seqobj->object_id();
Function: Get or set a string which represents the stable primary identifier
in this namespace of this object. For DNA sequences this
is its accession_number, similarly for protein sequences.
This is aliased to accession_number().
Returns : A scalar
Args : Optional object ID to set.
=cut
sub object_id {
return shift->accession_number(@_);
}
=head2 version
Title : version
Usage : $version = $seqobj->version();
Function: Get or set a number which differentiates between versions of
the same object. Higher numbers are considered to be
later and more relevant, but a single object described by
the same identifier should represent the same concept.
Returns : A number
Args : Optional version to set.
=cut
sub version{
my ($self,$value) = @_;
if( defined $value) {
$self->{'_version'} = $value;
}
return $self->{'_version'};
}
=head2 authority
Title : authority
Usage : $authority = $seqobj->authority();
Function: Get or set a string which represents the organisation which
granted the namespace, written as the DNS name of the
organisation (eg, wormbase.org).
Returns : A scalar
Args : Optional authority to set.
=cut
sub authority {
my ($self, $value) = @_;
if( defined $value) {
$self->{'authority'} = $value;
}
return $self->{'authority'};
}
=head2 namespace
Title : namespace
Usage : $string = $seqobj->namespace();
Function: Get or set a string representing the name space this identifier
is valid in, often the database name or the name describing the
collection.
Returns : A scalar
Args : Optional namespace to set.
=cut
sub namespace{
my ($self,$value) = @_;
if( defined $value) {
$self->{'namespace'} = $value;
}
return $self->{'namespace'} || "";
}
=head1 Methods for Bio::DescribableI compliance
This comprises display_name and description.
=head2 display_name
Title : display_name
Usage : $string = $seqobj->display_name();
Function: Get or set a string which is what should be displayed to the user.
The string should have no spaces (ideally, though a cautious
user of this interface would not assume this) and should be
less than thirty characters (though again, double checking
this is a good idea).
This is aliased to display_id().
Returns : A string for the display name
Args : Optional string for the display name to set.
=cut
sub display_name {
return shift->display_id(@_);
}
=head2 description
Title : description
Usage : $string = $seqobj->description();
Function: Get or set a text string suitable for displaying to the user a
description. This string is likely to have spaces, but
should not have any newlines or formatting - just plain
text. The string should not be greater than 255 characters
and clients can feel justified at truncating strings at 255
characters for the purposes of display.
This is aliased to desc().
Returns : A string for the description
Args : Optional string for the description to set.
=cut
sub description {
return shift->desc(@_);
}
=head1 Methods Inherited from Bio::PrimarySeqI
These methods are available on Bio::PrimarySeq, although they are
actually implemented on Bio::PrimarySeqI
=head2 revcom
Title : revcom
Usage : $rev = $seqobj->revcom();
Function: Produces a new Bio::SeqI implementing object which
is the reversed complement of the sequence. For protein
sequences this throws an exception of
"Sequence is a protein. Cannot revcom".
The id is the same id as the original sequence, and the
accession number is also identical. If someone wants to
track that this sequence has been reversed, it needs to
define its own extensions.
To do an inplace edit of an object you can go:
$seqobj = $seqobj->revcom();
This of course, causes Perl to handle the garbage
collection of the old object, but it is roughly speaking as
efficient as an inplace edit.
Returns : A new (fresh) Bio::SeqI object
Args : none
=head2 trunc
Title : trunc
Usage : $subseq = $myseq->trunc(10,100);
Function: Provides a truncation of a sequence.
Returns : A fresh Bio::SeqI implementing object.
Args : Numbers for the start and end positions
=head1 Internal methods
These are internal methods to PrimarySeq
=head2 _guess_alphabet
Title : _guess_alphabet
Usage :
Function: Automatically guess and set the type of sequence: dna, rna, protein
or '' if the sequence was empty. This method first removes dots (.),
dashes (-) and question marks (?) before guessing the alphabet
using the IUPAC conventions for ambiguous residues. Since the DNA and
RNA characters are also valid characters for proteins, there is
no foolproof way of determining the right alphabet. This is our best
guess only!
Returns : string 'dna', 'rna', 'protein' or ''.
Args : none
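
The guessing rules can be observed through the public alphabet() method. The
sketch below uses made-up sequences. The last one contains only IUPAC
nucleotide ambiguity codes and bases, yet it is guessed as protein because
less than 70% of its letters are A, C, G, T, U or N:

  use strict;
  use warnings;
  use Bio::PrimarySeq;

  for my $str ('ACGTTGCA', 'ACGUUGCA', 'MKVLEFQ', 'RYKMACGT') {
      my $seqobj = Bio::PrimarySeq->new( -seq => $str );
      print "$str => ", $seqobj->alphabet, "\n";
  }
  # ACGTTGCA => dna
  # ACGUUGCA => rna
  # MKVLEFQ => protein
  # RYKMACGT => protein

When the alphabet is known in advance, passing -alphabet to new() sidesteps
the guess entirely.
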
=cut
sub _guess_alphabet {
my ($self) = @_;
# Guess alphabet
my $alphabet = $self->_guess_alphabet_from_string($self->seq, $self->{'_nowarnonempty'});
# Set alphabet unless it is unknown
$self->alphabet($alphabet) if $alphabet;
return $alphabet;
}
sub _guess_alphabet_from_string {
# Get the alphabet from a sequence string
my ($self, $str, $nowarnonempty) = @_;
$nowarnonempty = 0 if not defined $nowarnonempty;
# Remove chars that clearly don't denote nucleic or amino acids
$str =~ s/[-.?]//gi;
# Check for sequences without valid letters
my $alphabet;
my $total = CORE::length($str);
if( $total == 0 ) {
if (not $nowarnonempty) {
$self->warn("Got a sequence without letters. Could not guess alphabet");
}
$alphabet = '';
}
# Determine alphabet now
if (not defined $alphabet) {
if ($str =~ m/[EFIJLOPQXZ]/i) {
# Start with a safe method to find proteins.
# Unambiguous IUPAC letters for proteins are: E,F,I,J,L,O,P,Q,X,Z
$alphabet = 'protein';
} else {
# Alphabet is unsure, could still be DNA, RNA or protein
# DNA and RNA contain mostly A, T, U, G, C and N, but the other
# letters they use are also among the 15 valid letters that a
# protein sequence can contain at this stage. Make our best guess
# based on sequence composition. If it contains over 70% of ACGTUN,
# it is likely nucleic.
if( ($str =~ tr/ATUGCNatugcn//) / $total > 0.7 ) {
if ( $str =~ m/U/i ) {
$alphabet = 'rna';
} else {
$alphabet = 'dna';
}
} else {
$alphabet = 'protein';
}
}
}
return $alphabet;
}
############################################################################
# aliases due to name changes or to compensate for our lack of consistency #
############################################################################
sub accession {
my $self = shift;
$self->warn(ref($self)."::accession is deprecated, ".
"use accession_number() instead");
return $self->accession_number(@_);
}
1;
# BioPerl module for Bio::Tools::AmpliconSearch
#
# Copyright Florent Angly
#
# You may distribute this module under the same terms as perl itself
package Bio::Tools::AmpliconSearch;
use strict;
use warnings;
use Bio::Tools::IUPAC;
use Bio::SeqFeature::Amplicon;
use Bio::Tools::SeqPattern;
# we require Bio::SeqIO
# and Bio::SeqFeature::Primer
use base qw(Bio::Root::Root);
my $template_str;
=head1 NAME
Bio::Tools::AmpliconSearch - Find amplicons in a template using degenerate PCR primers
=head1 SYNOPSIS
use Bio::PrimarySeq;
use Bio::Tools::AmpliconSearch;
my $template = Bio::PrimarySeq->new(
-seq => 'aaaaaCCCCaaaaaaaaaaTTTTTTaaaaaCCACaaaaaTTTTTTaaaaaaaaaa',
);
my $fwd_primer = Bio::PrimarySeq->new(
-seq => 'CCNC',
);
my $rev_primer = Bio::PrimarySeq->new(
-seq => 'AAAAA',
);
my $search = Bio::Tools::AmpliconSearch->new(
-template => $template,
-fwd_primer => $fwd_primer,
-rev_primer => $rev_primer,
);
while (my $amplicon = $search->next_amplicon) {
print "Found amplicon at position ".$amplicon->start.'..'.$amplicon->end.":\n";
print $amplicon->seq->seq."\n\n";
}
# Now change the template (but you could change the primers instead) and look
# for amplicons again
$template = Bio::PrimarySeq->new(
-seq => 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa',
);
$search->template($template);
while (my $amplicon = $search->next_amplicon) {
print "Found amplicon at position ".$amplicon->start.'..'.$amplicon->end.":\n";
print $amplicon->seq->seq."\n\n";
}
=head1 DESCRIPTION
Perform an in silico PCR reaction, i.e. search for amplicons in a given template
sequence using the specified degenerate primers.
The template sequence is a sequence object, e.g. L<Bio::Seq>, and the primers
can be a sequence or a L<Bio::SeqFeature::Primer> object and contain ambiguous
residues as defined in the IUPAC conventions. The primer sequences are converted
into regular expressions using L<Bio::Tools::IUPAC> and the matching regions of
the template sequence, i.e. the amplicons, are returned as L<Bio::SeqFeature::Amplicon>
objects.
AmpliconSearch will look for amplicons on both strands (forward and
reverse-complement) of the specified template sequence. If the reverse primer
is not provided, the amplicon returned spans from a match of the forward primer
to the end of the template. Similarly, when no forward primer is given, the
amplicon spans from the beginning of the template sequence to a match of the
reverse primer. When several amplicons overlap, only the shortest one is
returned, to more accurately represent the biases of PCR. Future improvements
may include modelling the effects of the number of PCR cycles or temperature on
the PCR products.
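
The shortest-amplicon rule can be illustrated with plain regular expressions.
The sketch below is a simplified, single-strand illustration with literal
(non-degenerate) primers and a made-up template. It is not the exact pattern
built by this module, which also handles the reverse-complement strand:

  use strict;
  use warnings;

  my $template = 'aaCCCCaaCCCCaaaTTTTaa';
  my ($fwd, $rev) = ('CCCC', 'TTTT');

  # Naive pattern: starts at the first forward primer site
  my ($naive) = $template =~ /($fwd.*?$rev)/;

  # Tempered pattern: never step over another forward primer site
  my ($short) = $template =~ /($fwd(?:(?!$fwd).)*?$rev)/;

  print "$naive\n";   # CCCCaaCCCCaaaTTTT
  print "$short\n";   # CCCCaaaTTTT

The negative lookahead is what makes the match start at the forward primer
site closest to the reverse primer site.
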
=head1 TODO
Future improvements may include:
=over
=item *
Allowing a small number of primer mismatches
=item *
Reporting all amplicons, including overlapping ones
=item *
Putting a limit on the length of amplicons, in accordance with the processivity
of the polymerase used
=back
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I<bioperl-l@bioperl.org>
rather than to the module maintainer directly. Many experienced and
responsive experts will be able to look at the problem and quickly
address it. Please include a thorough description of the problem
with code and data examples if at all possible.
=head2 Reporting Bugs
Report bugs to the Bioperl bug tracking system to help us keep track
of the bugs and their resolution. Bug reports can be submitted via the
web:
https://redmine.open-bio.org/projects/bioperl/
=head1 AUTHOR
Florent Angly
=head1 APPENDIX
The rest of the documentation details each of the object
methods. Internal methods are usually preceded with a _
=head2 new
Title : new
Usage : my $search = Bio::Tools::AmpliconSearch->new( );
Function : Initialize an amplicon search
Args : -template Sequence object for the template sequence. This object
will be converted to Bio::Seq if needed, since features
(amplicons and primers) will be added to this object.
-fwd_primer A sequence object representing the forward primer
-rev_primer A sequence object representing the reverse primer
-primer_file Read primers from a sequence file. It replaces
-fwd_primer and -rev_primer (optional)
-attach_primers Whether or not to attach primers to Amplicon objects. Default: 0 (off)
Returns : A Bio::Tools::AmpliconSearch object
=cut
sub new {
my ($class, @args) = @_;
my $self = $class->SUPER::new(@args);
my ($template, $primer_file, $fwd_primer, $rev_primer, $attach_primers) =
$self->_rearrange([qw(TEMPLATE PRIMER_FILE FWD_PRIMER REV_PRIMER ATTACH_PRIMERS)],
@args);
# Get primers
if (defined $primer_file) {
$self->primer_file($primer_file);
} else {
$self->fwd_primer($fwd_primer || '');
$self->rev_primer($rev_primer || '');
}
# Get template sequence
$self->template($template) if defined $template;
$self->attach_primers($attach_primers) if defined $attach_primers;
return $self;
}
=head2 template
Title : template
Usage : my $template = $search->template;
Function : Get/set the template sequence. Setting a new template resets any
search in progress.
Args : Optional Bio::Seq object
Returns : A Bio::Seq object
=cut
sub template {
my ($self, $template) = @_;
if (defined $template) {
if ( not(ref $template) || not $template->isa('Bio::PrimarySeqI') ) {
# Not a Bio::Seq or Bio::PrimarySeq
$self->throw("Expected a sequence object as input but got a '".ref($template)."'\n");
}
if (not $template->isa('Bio::SeqI')) {
# Convert sequence object to Bio::Seq so that features can be added
my $primary_seq = $template;
$template = Bio::Seq->new();
$template->primary_seq($primary_seq);
}
$self->{template} = $template;
# Reset search in progress
$template_str = undef;
}
return $self->{template};
}
=head2 fwd_primer
Title : fwd_primer
Usage : my $primer = $search->fwd_primer;
Function : Get/set the forward primer. Setting a new forward primer resets any
search in progress.
Args : Optional sequence object or primer object or '' to match beginning
of sequence.
Returns : A sequence object or primer object or undef
=cut
sub fwd_primer {
my ($self, $primer) = @_;
if (defined $primer) {
$self->_set_primer('fwd', $primer);
}
return $self->{fwd_primer};
}
=head2 rev_primer
Title : rev_primer
Usage : my $primer = $search->rev_primer;
Function : Get/set the reverse primer. Setting a new reverse primer resets any
search in progress.
Args : Optional sequence object or primer object or '' to match end of
sequence.
Returns : A sequence object or primer object or undef
=cut
sub rev_primer {
my ($self, $primer) = @_;
if (defined $primer) {
$self->_set_primer('rev', $primer);
}
return $self->{rev_primer};
}
sub _set_primer {
# Save a primer (sequence object) and convert it to regexp. Type is 'fwd' for
# the forward primer or 'rev' for the reverse primer.
my ($self, $type, $primer) = @_;
my $re;
my $match_rna = 1;
if ($primer eq '') {
$re = $type eq 'fwd' ? '^' : '$';
} else {
if ( not(ref $primer) || (
not($primer->isa('Bio::PrimarySeqI')) &&
not($primer->isa('Bio::SeqFeature::Primer')) ) ) {
$self->throw('Expected a sequence or primer object as input but got a '.ref($primer)."\n");
}
$self->{$type.'_primer'} = $primer;
my $seq = $primer->isa('Bio::SeqFeature::Primer') ? $primer->seq : $primer;
$re = Bio::Tools::IUPAC->new(
-seq => $type eq 'fwd' ? $seq : $seq->revcom,
)->regexp($match_rna);
}
$self->{$type.'_regexp'} = $re;
# Reset search in progress
$template_str = undef;
$self->{regexp} = undef;
return $self->{$type.'_primer'};
}
=head2 primer_file
Title : primer_file
Usage : my ($fwd, $rev) = $search->primer_file;
Function : Get/set a sequence file to read the primer from. The first sequence
must be the forward primer, and the second is the optional reverse
primer. After reading the file, the primers are set using fwd_primer()
and rev_primer() and returned.
Args : Sequence file
Returns : Array containing forward and reverse primers as sequence objects.
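
A minimal sketch of reading primers from a file. The file name primers.fa and
the primer sequences are hypothetical, and the file is written by the example
itself so that it is self-contained:

  use strict;
  use warnings;
  use Bio::Tools::AmpliconSearch;

  # Write a small FASTA file: forward primer first, then reverse primer
  my $file = 'primers.fa';
  open my $fh, '>', $file or die "Cannot write $file: $!";
  print {$fh} ">fwd\nCCNC\n>rev\nAAAAA\n";
  close $fh;

  my $search = Bio::Tools::AmpliconSearch->new( -primer_file => $file );
  my $fwd = $search->fwd_primer;
  my $rev = $search->rev_primer;
  print $fwd->id, ': ', $fwd->seq, "\n";
  print $rev->id, ': ', $rev->seq, "\n";
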
=cut
sub primer_file {
my ($self, $primer_file) = @_;
# Read primer file and convert primers into regular expressions to catch
# amplicons present in the database
if (not defined $primer_file) {
$self->throw("Need to provide an input file\n");
}
# Mandatory first primer
require Bio::SeqIO;
my $in = Bio::SeqIO->new( -file => $primer_file );
my $fwd_primer = $in->next_seq;
if (not defined $fwd_primer) {
$self->throw("The file '$primer_file' contains no primers\n");
}
$fwd_primer->alphabet('dna'); # Force the alphabet since degenerate primers can look like protein sequences
# Optional reverse primers
my $rev_primer = $in->next_seq;
if (defined $rev_primer) {
$rev_primer->alphabet('dna');
} else {
$rev_primer = '';
}
$in->close;
$self->fwd_primer($fwd_primer);
$self->rev_primer($rev_primer);
return ($fwd_primer, $rev_primer);
}
=head2 attach_primers
Title : attach_primers
Usage : my $attached = $search->attach_primers;
Function : Get/set whether or not to attach primer objects to the amplicon
objects.
Args : Optional integer (1 for yes, 0 for no)
Returns : Integer (1 for yes, 0 for no)
=cut
sub attach_primers {
my ($self, $attach) = @_;
if (defined $attach) {
$self->{attach_primers} = $attach;
require Bio::SeqFeature::Primer;
}
return $self->{attach_primers} || 0;
}
=head2 next_amplicon
Title : next_amplicon
Usage : my $amplicon = $search->next_amplicon;
Function : Get the next amplicon
Args : None
Returns : A Bio::SeqFeature::Amplicon object
=cut
sub next_amplicon {
my ($self) = @_;
# Initialize search
if (not defined $template_str) {
$self->_init;
}
my $re = $self->_regexp;
my $amplicon;
if ($template_str =~ m/$re/g) {
my ($match, $rev_match) = ($1, $2);
my $strand = $rev_match ? -1 : 1;
$match = $match || $rev_match;
my $end = pos($template_str);
my $start = $end - length($match) + 1;
$amplicon = $self->_attach_amplicon($start, $end, $strand);
}
# If there are no more matches, make sure further calls to next_amplicon() return undef.
if (not $amplicon) {
$template_str = '';
}
return $amplicon;
}
sub _init {
my ($self) = @_;
# Sanity checks
if ( not $self->template ) {
$self->throw('Need to provide a template sequence');
}
if ( not($self->fwd_primer) && not($self->rev_primer) ) {
$self->throw('Need to provide at least a primer');
}
# Set the template sequence string
$template_str = $self->template->seq;
# Set the regular expression to match amplicons
$self->_regexp;
return 1;
}
sub _regexp {
# Get the regexp to match amplicon. If the regexp is not set, initialize it.
my ($self, $regexp) = @_;
if ( not defined $self->{regexp} ) {
# Build regexp that matches amplicons on both strands and reports shortest
# amplicon when there are several overlapping amplicons
my $fwd_regexp = $self->_fwd_regexp;
my $rev_regexp = $self->_rev_regexp;
my ($fwd_regexp_rc, $basic_fwd_match, $rev_regexp_rc, $basic_rev_match);
if ($fwd_regexp eq '^') {
$fwd_regexp_rc = '';
$basic_fwd_match = "(?:.*?$rev_regexp)";
} else {
$fwd_regexp_rc = Bio::Tools::SeqPattern->new(
-seq => $fwd_regexp,
-type => 'dna',
)->revcom->str;
$basic_fwd_match = "(?:$fwd_regexp.*?$rev_regexp)";
}
if ($rev_regexp eq '$') {
$rev_regexp_rc = '';
$basic_rev_match = "(?:.*?$fwd_regexp_rc)";
} else {
$rev_regexp_rc = Bio::Tools::SeqPattern->new(
-seq => $rev_regexp,
-type => 'dna',
)->revcom->str;
$basic_rev_match = "(?:$rev_regexp_rc.*?$fwd_regexp_rc)";
}
my $fwd_exclude = "(?!$basic_rev_match".
($fwd_regexp eq '^' ? '' : "|$fwd_regexp").
")";
my $rev_exclude = "(?!$basic_fwd_match".
($rev_regexp eq '$' ? '' : "|$rev_regexp_rc").
')';
$self->{regexp} = qr/
( $fwd_regexp (?:$fwd_exclude.)*? $rev_regexp ) |
( $rev_regexp_rc (?:$rev_exclude.)*? $fwd_regexp_rc )
/xi;
}
return $self->{regexp};
}
=head2 annotate_template
Title : annotate_template
Usage : my $template = $search->annotate_template;
Function : Search for all amplicons and attach them to the template.
This is equivalent to running:
while (my $amplicon = $self->next_amplicon) {
# do something
}
my $annotated = $self->template;
Args : None
Returns : A Bio::Seq object with attached Bio::SeqFeature::Amplicons (and
Bio::SeqFeature::Primers if you set -attach_primers to 1).
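
A short sketch of annotating a template and listing what was attached. It
reuses the primers from the SYNOPSIS with a made-up template, and assumes that
the amplicons are stored on the template as ordinary sequence features that
get_SeqFeatures() returns:

  use strict;
  use warnings;
  use Bio::Seq;
  use Bio::PrimarySeq;
  use Bio::Tools::AmpliconSearch;

  my $search = Bio::Tools::AmpliconSearch->new(
      -template   => Bio::Seq->new( -seq => 'aaaaaCCCCaaaaaaaaaaTTTTTTaaaaa' ),
      -fwd_primer => Bio::PrimarySeq->new( -seq => 'CCNC' ),
      -rev_primer => Bio::PrimarySeq->new( -seq => 'AAAAA' ),
  );

  my $annotated = $search->annotate_template;
  for my $feat ($annotated->get_SeqFeatures) {
      print ref($feat), ' at ', $feat->start, '..', $feat->end, "\n";
  }
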
=cut
sub annotate_template {
my ($self) = @_;
# Search all amplicons and attach them to template
1 while $self->next_amplicon;
# Return annotated template
return $self->template;
}
sub _fwd_regexp {
my ($self) = @_;
return $self->{fwd_regexp};
}
sub _rev_regexp {
my ($self) = @_;
return $self->{rev_regexp};
}
sub _attach_amplicon {
# Create an amplicon object and attach it to template
my ($self, $start, $end, $strand) = @_;
# Create Bio::SeqFeature::Amplicon feature and attach it to the template
my $amplicon = Bio::SeqFeature::Amplicon->new(
-start => $start,
-end => $end,
-strand => $strand,
-template => $self->template,
);
# Create Bio::SeqFeature::Primer features and attach them to the amplicon
if ($self->attach_primers) {
for my $type ('fwd', 'rev') {
my ($pstart, $pend, $pstrand, $primer_seq);
# Coordinates relative to amplicon
if ($type eq 'fwd') {
# Forward primer
$primer_seq = $self->fwd_primer;
next if not defined $primer_seq;
$pstart = 1;
$pend = $primer_seq->length;
$pstrand = $amplicon->strand;
} else {
# Optional reverse primer
$primer_seq = $self->rev_primer;
next if not defined $primer_seq;
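# Position within the amplicon (1-based); the shift to absolute
# coordinates is applied below
$pend = $end - $start + 1;
$pstart = $pend - $primer_seq->length + 1;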
$pstrand = -1 * $amplicon->strand;
}
# Absolute coordinates needed
$pstart += $start - 1;
$pend += $start - 1;
my $primer = Bio::SeqFeature::Primer->new(
-start => $pstart,
-end => $pend,
-strand => $pstrand,
-template => $amplicon,
);
# Attach primer to amplicon
if ($type eq 'fwd') {
$amplicon->fwd_primer($primer);
} else {
$amplicon->rev_primer($primer);
}
}
}
return $amplicon;
}
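=pod

The primer placement above works in two steps: positions are first expressed relative to the amplicon, then shifted by C<$start - 1> to give template coordinates. The arithmetic can be sketched on its own as follows. The helper name C<primer_spans> and its arguments are hypothetical, coordinates are 1-based and inclusive, and strand handling is left out:

```perl
use strict;
use warnings;

# Hypothetical helper: absolute template coordinates of both primers, given
# the absolute amplicon span and the two primer lengths.
sub primer_spans {
    my ($start, $end, $fwd_len, $rev_len) = @_;
    my $amplicon_len = $end - $start + 1;

    # Step 1: coordinates relative to the amplicon
    my @fwd = (1, $fwd_len);
    my @rev = ($amplicon_len - $rev_len + 1, $amplicon_len);

    # Step 2: shift to absolute template coordinates
    my $offset = $start - 1;
    return {
        fwd => [ map { $_ + $offset } @fwd ],
        rev => [ map { $_ + $offset } @rev ],
    };
}

# Amplicon spanning template positions 10..50, primers of length 5 and 6
my $spans = primer_spans(10, 50, 5, 6);
printf "fwd %d..%d, rev %d..%d\n", @{ $spans->{fwd} }, @{ $spans->{rev} };
```

With these example values the forward primer covers 10..14 and the reverse primer covers 45..50, so both stay inside the amplicon.

=cut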
1;
# Grinder-0.5.3/lib/Bio/Tools/IUPAC.pm
#
# BioPerl module for IUPAC
#
# Please direct questions and support issues to
#
# Cared for by Aaron Mackey
#
# Copyright Aaron Mackey
#
# You may distribute this module under the same terms as perl itself
# POD documentation - main docs before the code
=head1 NAME
Bio::Tools::IUPAC - Generates unique sequence objects or regular expressions from
an ambiguous IUPAC sequence
=head1 SYNOPSIS
use Bio::PrimarySeq;
use Bio::Tools::IUPAC;
# Get the IUPAC code for proteins
my %iupac_prot = Bio::Tools::IUPAC->new->iupac_iup;
# Create a sequence with degenerate residues
my $ambiseq = Bio::PrimarySeq->new(-seq => 'ARTCGUTGN', -alphabet => 'dna');
# Create all possible non-degenerate sequences
my $iupac = Bio::Tools::IUPAC->new(-seq => $ambiseq);
while (my $uniqueseq = $iupac->next_seq()) {
# process the unique Bio::Seq object.
}
# Get a regular expression that matches all possible sequences
my $regexp = $iupac->regexp();
=head1 DESCRIPTION
Bio::Tools::IUPAC is a tool that manipulates sequences with ambiguous residues
following the IUPAC conventions. Non-standard characters have the meaning
described below:
  IUPAC-IUB SYMBOLS FOR NUCLEOTIDE (DNA OR RNA) NOMENCLATURE:
  Cornish-Bowden (1985) Nucl. Acids Res. 13: 3021-3030

  ------------------------------------------
  Symbol  Meaning             Nucleic Acid
  ------------------------------------------
  A       A                   Adenine
  C       C                   Cytosine
  G       G                   Guanine
  T       T                   Thymine
  U       U                   Uracil
  M       A or C
  R       A or G
  W       A or T
  S       C or G
  Y       C or T
  K       G or T
  V       A or C or G
  H       A or C or T
  D       A or G or T
  B       C or G or T
  X       G or A or T or C
  N       G or A or T or C
  IUPAC-IUP AMINO ACID SYMBOLS:
  Biochem J. 1984 Apr 15; 219(2): 345-373
  Eur J Biochem. 1993 Apr 1; 213(1): 2

  ------------------------------------------
  Symbol  Meaning
  ------------------------------------------
  A       Alanine
  B       Aspartic Acid, Asparagine
  C       Cysteine
  D       Aspartic Acid
  E       Glutamic Acid
  F       Phenylalanine
  G       Glycine
  H       Histidine
  I       Isoleucine
  J       Isoleucine/Leucine
  K       Lysine
  L       Leucine
  M       Methionine
  N       Asparagine
  O       Pyrrolysine
  P       Proline
  Q       Glutamine
  R       Arginine
  S       Serine
  T       Threonine
  U       Selenocysteine
  V       Valine
  W       Tryptophan
  X       Unknown
  Y       Tyrosine
  Z       Glutamic Acid, Glutamine
  *       Terminator
There are a few things Bio::Tools::IUPAC can do for you:
=over
=item *
report the IUPAC mapping between ambiguous and non-ambiguous residues
=item *
produce a stream of all possible corresponding unambiguous Bio::Seq objects given
an ambiguous sequence object
=item *
convert an ambiguous sequence object to a corresponding regular expression
=back
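
The last two capabilities in the list above can be illustrated without BioPerl. The following is a minimal sketch, not this module's implementation. The helpers C<ambiguous_to_regexp> and C<expand_ambiguous> are hypothetical names, the lookup table is copied from the nucleotide table in this document, and treating U like T is an assumption of the sketch:

```perl
use strict;
use warnings;

# Nucleotide ambiguity codes. Mapping U to T is an assumption of this sketch.
my %iupac_dna = (
    A => 'A',   C => 'C',   G => 'G',   T => 'T',   U => 'T',
    M => 'AC',  R => 'AG',  W => 'AT',  S => 'CG',  Y => 'CT',  K => 'GT',
    V => 'ACG', H => 'ACT', D => 'AGT', B => 'CGT',
    X => 'ACGT', N => 'ACGT',
);

# Hypothetical helper: turn an ambiguous DNA string into a regexp string,
# using a character class for every ambiguous position.
sub ambiguous_to_regexp {
    my ($seq) = @_;
    my $re = '';
    for my $char (split //, uc $seq) {
        my $bases = $iupac_dna{$char};
        die "Unknown IUPAC symbol '$char'\n" if not defined $bases;
        $re .= length($bases) > 1 ? "[$bases]" : $bases;
    }
    return $re;
}

# Hypothetical helper: list every unambiguous sequence that an ambiguous
# DNA string can stand for, extending all prefixes one position at a time.
sub expand_ambiguous {
    my ($seq) = @_;
    my @expanded = ('');
    for my $char (split //, uc $seq) {
        my $bases = $iupac_dna{$char};
        die "Unknown IUPAC symbol '$char'\n" if not defined $bases;
        my @next;
        for my $prefix (@expanded) {
            push @next, $prefix . $_ for split //, $bases;
        }
        @expanded = @next;
    }
    return @expanded;
}

print ambiguous_to_regexp('ARTN'), "\n";
print join(',', expand_ambiguous('ART')), "\n";
```

The number of expanded sequences is the product of the number of bases allowed at each position, so it grows quickly with the number of ambiguous symbols. This is one reason to prefer a regular expression when only matching is needed.
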
=head1 FEEDBACK
=head2 Mailing Lists
User feedback is an integral part of the evolution of this and other
Bioperl modules. Send your comments and suggestions preferably to one
of the Bioperl mailing lists. Your participation is much appreciated.
bioperl-l@bioperl.org - General discussion
http://bioperl.org/wiki/Mailing_lists - About the mailing lists
=head2 Support
Please direct usage questions or support issues to the mailing list:
I