Grinder-0.5.4/0000755000175000017500000000000012647202511013332 5ustar flofloooflofloooGrinder-0.5.4/MYMETA.yml0000644000175000017500000000233312647202457015063 0ustar floflooofloflooo--- abstract: 'A versatile omics shotgun and amplicon sequencing read simulator' author: - 'Florent Angly ' build_requires: ExtUtils::MakeMaker: '6.59' Test::More: '0' Test::Warn: '0' configure_requires: ExtUtils::MakeMaker: '0' dynamic_config: 0 generated_by: 'Module::Install version 1.16, CPAN::Meta::Converter version 2.150005' license: unknown meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: '1.4' name: Grinder no_index: directory: - inc - t requires: Bio::DB::Fasta: '0' Bio::Location::Split: '0' Bio::PrimarySeq: '0' Bio::Root::Root: '0' Bio::Root::Version: '1.006923' Bio::Seq::SimulatedRead: '0' Bio::SeqFeature::SubSeq: '0' Bio::SeqIO: '0' Bio::Tools::AmpliconSearch: '0' Getopt::Euclid: v0.4.4 List::Util: '0' Math::Random::MT: '1.16' perl: '5.006' version: '0.77' resources: bugtracker: http://sourceforge.net/tracker/?group_id=244196&atid=1124737 homepage: http://sourceforge.net/projects/biogrinder/ license: http://opensource.org/licenses/gpl-3.0.html repository: git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder version: '0.005004' x_serialization_backend: 'CPAN::Meta::YAML version 0.012' Grinder-0.5.4/README.htm0000644000175000017500000011337412647202457015023 0ustar floflooofloflooo

NAME

grinder - A versatile omics shotgun and amplicon sequencing read simulator

DESCRIPTION

Grinder is a versatile program to create random shotgun and amplicon sequence libraries based on DNA, RNA or proteic reference sequences provided in a FASTA file.

Grinder can produce genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, metaproteomic shotgun and amplicon datasets from current sequencing technologies such as Sanger, 454, Illumina. These simulated datasets can be used to test the accuracy of bioinformatic tools under specific hypothesis, e.g. with or without sequencing errors, or with low or high community diversity. Grinder may also be used to help decide between alternative sequencing methods for a sequence-based project, e.g. should the library be paired-end or not, how many reads should be sequenced.

Grinder features include:

Briefly, given a FASTA file containing reference sequence (genomes, genes, transcripts or proteins), Grinder performs the following steps:

  1. Read the reference sequences, and for amplicon datasets, extracts full-length reference PCR amplicons using the provided degenerate PCR primers.

  2. Determine the community structure based on the provided alpha diversity (number of reference sequences in the library), beta diversity (number of reference sequences in common between several independent libraries) and specified rank- abundance model.

  3. Take shotgun reads from the reference sequences or amplicon reads from the full- length reference PCR amplicons. The reads may be paired-end reads when an insert size distribution is specified. The length of the reads depends on the provided read length distribution and their abundance depends on the relative abundance in the community structure. Genome length may also biases the number of reads to take for shotgun datasets at this step. Similarly, for amplicon datasets, the number of copies of the target gene in the reference genomes may bias the number of reads to take.

  4. Alter reads by inserting sequencing errors (indels, substitutions and homopolymer errors) following a position-specific model to simulate reads created by current sequencing technologies (Sanger, 454, Illumina). Write the reads and their quality scores in FASTA, QUAL and FASTQ files.

CITATION

If you use Grinder in your research, please cite:

   Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW (2012), Grinder: a
   versatile amplicon and shotgun sequence simulator, Nucleic Acids Reseach

Available from http://dx.doi.org/10.1093/nar/gks251.

VERSION

This document refers to grinder version 0.5.3

AUTHOR

Florent Angly <florent.angly@gmail.com>

INSTALLATION

Dependencies

You need to install these dependencies first:

The following CPAN Perl modules are dependencies that will be installed automatically for you:

Procedure

To install Grinder globally on your system, run the following commands in a terminal or command prompt:

On Linux, Unix, MacOS:

   perl Makefile.PL
   make

And finally, with administrator privileges:

   make install

On Windows, run the same commands but with nmake instead of make.

No administrator privileges?

If you do not have administrator privileges, Grinder needs to be installed in your home directory.

First, follow the instructions to install local::lib at http://search.cpan.org/~apeiron/local-lib-1.008004/lib/local/lib.pm#The_bootstrapping_technique. After local::lib is installed, every Perl module that you install manually or through the CPAN command-line application will be installed in your home directory.

Then, install Grinder by following the instructions detailed in the "Procedure" section.

RUNNING GRINDER

After installation, you can run Grinder using a command-line interface (CLI), an application programming interface (API) or a graphical user interface (GUI) in Galaxy.

To get the usage of the CLI, type:

  grinder --help

More information, including the documentation of the Grinder API, which allows you to run Grinder from within other Perl programs, is available by typing:

  perldoc Grinder

To run the GUI, refer to the Galaxy documentation at http://wiki.g2.bx.psu.edu/FrontPage.

The 'utils' folder included in the Grinder package contains some utilities:

average genome size:

This calculates the average genome size (in bp) of a simulated random library produced by Grinder.

change_paired_read_orientation:

This reverses the orientation of each second mate-pair read (ID ending in /2) in a FASTA file.

REFERENCE SEQUENCE DATABASE

A variety of FASTA databases can be used as input for Grinder. For example, the GreenGenes database (http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Isolated_named_strains_16S_aligned.fasta) contains over 180,000 16S rRNA clone sequences from various species which would be appropriate to produce a 16S rRNA amplicon dataset. A set of over 41,000 OTU representative sequences and their affiliation in seven different taxonomic sytems can also be used for the same purpose (http://greengenes.lbl.gov/Download/OTUs/gg_otus_6oct2010/rep_set/gg_97_otus_6oct2010.fasta and http://greengenes.lbl.gov/Download/OTUs/gg_otus_6oct2010/taxonomies/). The RDP (http://rdp.cme.msu.edu/download/release10_27_unaligned.fa.gz) and Silva (http://www.arb-silva.de/no_cache/download/archive/release_108/Exports/) databases also provide many 16S rRNA sequences and Silva includes eukaryotic sequences. While 16S rRNA is a popular gene, datasets containing any type of gene could be used in the same fashion to generate simulated amplicon datasets, provided appropriate primers are used.

The >2,400 curated microbial genome sequences in the NCBI RefSeq collection (ftp://ftp.ncbi.nih.gov/refseq/release/microbial/) would also be suitable for producing 16S rRNA simulated datasets (using the adequate primers). However, the lower diversity of this database compared to the previous two makes it more appropriate for producing artificial microbial metagenomes. Individual genomes from this database are also very suitable for the simulation of single or double-barreled shotgun libraries. Similarly, the RefSeq database contains over 3,100 curated viral sequences (ftp://ftp.ncbi.nih.gov/refseq/release/viral/) which can be used to produce artificial viral metagenomes.

Quite a few eukaryotic organisms have been sequenced and their genome or genes can be the basis for simulating genomic, transcriptomic (RNA-seq) or proteomic datasets. For example, you can use the human genome available at ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/, the human transcripts downloadable from ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz or the human proteome at ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.protein.faa.gz.

CLI EXAMPLES

Here are a few examples that illustrate the use of Grinder in a terminal:

  1. A shotgun DNA library with a coverage of 0.1X

       grinder -reference_file genomes.fna -coverage_fold 0.1
  2. Same thing but save the result files in a specific folder and with a specific name

       grinder -reference_file genomes.fna -coverage_fold 0.1 -base_name my_name -output_dir my_dir
  3. A DNA shotgun library with 1000 reads

       grinder -reference_file genomes.fna -total_reads 1000
  4. A DNA shotgun library where species are distributed according to a power law

       grinder -reference_file genomes.fna -abundance_model powerlaw 0.1
  5. A DNA shotgun library with 123 genomes taken random from the given genomes

       grinder -reference_file genomes.fna -diversity 123
  6. Two DNA shotgun libraries that have 50% of the species in common

       grinder -reference_file genomes.fna -num_libraries 2 -shared_perc 50
  7. Two DNA shotgun library with no species in common and distributed according to a exponential rank-abundance model. Note that because the parameter value for the exponential model is omitted, each library uses a different randomly chosen value:

       grinder -reference_file genomes.fna -num_libraries 2 -abundance_model exponential
  8. A DNA shotgun library where species relative abundances are manually specified

       grinder -reference_file genomes.fna -abundance_file my_abundances.txt
  9. A DNA shotgun library with Sanger reads

       grinder -reference_file genomes.fna -read_dist 800 -mutation_dist linear 1 2 -mutation_ratio 80 20
  10. A DNA shotgun library with first-generation 454 reads

       grinder -reference_file genomes.fna -read_dist 100 normal 10 -homopolymer_dist balzer
  11. A paired-end DNA shotgun library, where the insert size is normally distributed around 2.5 kbp and has 0.2 kbp standard deviation

       grinder -reference_file genomes.fna -insert_dist 2500 normal 200
  12. A transcriptomic dataset

       grinder -reference_file transcripts.fna
  13. A unidirectional transcriptomic dataset

       grinder -reference_file transcripts.fna -unidirectional 1

    Note the use of -unidirectional 1 to prevent reads to be taken from the reverse- complement of the reference sequences.

  14. A proteomic dataset

       grinder -reference_file proteins.faa -unidirectional 1
  15. A 16S rRNA amplicon library

       grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1

    Note the use of -length_bias 0 because reference sequence length should not affect the relative abundance of amplicons.

  16. The same amplicon library with 20% of chimeric reads (90% bimera, 10% trimera)

       grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -chimera_perc 20 -chimera_dist 90 10
  17. Three 16S rRNA amplicon libraries with specified MIDs and no reference sequences in common

       grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -num_libraries 3 -multiplex_ids MIDs.fna
  18. Reading reference sequences from the standard input, which allows you to decompress FASTA files on the fly:

       zcat microbial_db.fna.gz | grinder -reference_file - -total_reads 100

CLI REQUIRED ARGUMENTS

-rf <reference_file> | -reference_file <reference_file> | -gf <reference_file> | -genome_file <reference_file>

FASTA file that contains the input reference sequences (full genomes, 16S rRNA genes, transcripts, proteins...) or '-' to read them from the standard input. See the README file for examples of databases you can use and where to get them from. Default: -

CLI OPTIONAL ARGUMENTS

-tr <total_reads> | -total_reads <total_reads>

Number of shotgun or amplicon reads to generate for each library. Do not specify this if you specify the fold coverage. Default: 100

-cf <coverage_fold> | -coverage_fold <coverage_fold>

Desired fold coverage of the input reference sequences (the output FASTA length divided by the input FASTA length). Do not specify this if you specify the number of reads directly.

-rd <read_dist>... | -read_dist <read_dist>...

Desired shotgun or amplicon read length distribution specified as: average length, distribution ('uniform' or 'normal') and standard deviation.

Only the first element is required. Examples:

  All reads exactly 101 bp long (Illumina GA 2x): 101
  Uniform read distribution around 100+-10 bp: 100 uniform 10
  Reads normally distributed with an average of 800 and a standard deviation of 100
    bp (Sanger reads): 800 normal 100
  Reads normally distributed with an average of 450 and a standard deviation of 50
    bp (454 GS-FLX Ti): 450 normal 50

Reference sequences smaller than the specified read length are not used. Default: 100

-id <insert_dist>... | -insert_dist <insert_dist>...

Create paired-end or mate-pair reads spanning the given insert length. Important: the insert is defined in the biological sense, i.e. its length includes the length of both reads and of the stretch of DNA between them: 0 : off, or: insert size distribution in bp, in the same format as the read length distribution (a typical value is 2,500 bp for mate pairs) Two distinct reads are generated whether or not the mate pair overlaps. Default: 0

-mo <mate_orientation> | -mate_orientation <mate_orientation>

When generating paired-end or mate-pair reads (see <insert_dist>), specify the orientation of the reads (F: forward, R: reverse):

   FR:  ---> <---  e.g. Sanger, Illumina paired-end, IonTorrent mate-pair
   FF:  ---> --->  e.g. 454
   RF:  <--- --->  e.g. Illumina mate-pair
   RR:  <--- <---

Default: FR

-ec <exclude_chars> | -exclude_chars <exclude_chars>

Do not create reads containing any of the specified characters (case insensitive). For example, use 'NX' to prevent reads with ambiguities (N or X). Grinder will error if it fails to find a suitable read (or pair of reads) after 10 attempts. Consider using <delete_chars>, which may be more appropriate for your case. Default: ''

-dc <delete_chars> | -delete_chars <delete_chars>

Remove the specified characters from the reference sequences (case-insensitive), e.g. '-~*' to remove gaps (- or ~) or terminator (*). Removing these characters is done once, when reading the reference sequences, prior to taking reads. Hence it is more efficient than <exclude_chars>. Default:

-fr <forward_reverse> | -forward_reverse <forward_reverse>

Use DNA amplicon sequencing using a forward and reverse PCR primer sequence provided in a FASTA file. The reference sequences and their reverse complement will be searched for PCR primer matches. The primer sequences should use the IUPAC convention for degenerate residues and the reference sequences that that do not match the specified primers are excluded. If your reference sequences are full genomes, it is recommended to use <copy_bias> = 1 and <length_bias> = 0 to generate amplicon reads. To sequence from the forward strand, set <unidirectional> to 1 and put the forward primer first and reverse primer second in the FASTA file. To sequence from the reverse strand, invert the primers in the FASTA file and use <unidirectional> = -1. The second primer sequence in the FASTA file is always optional. Example: AAACTYAAAKGAATTGRCGG and ACGGGCGGTGTGTRC for the 926F and 1392R primers that target the V6 to V9 region of the 16S rRNA gene.

-un <unidirectional> | -unidirectional <unidirectional>

Instead of producing reads bidirectionally, from the reference strand and its reverse complement, proceed unidirectionally, from one strand only (forward or reverse). Values: 0 (off, i.e. bidirectional), 1 (forward), -1 (reverse). Use <unidirectional> = 1 for amplicon and strand-specific transcriptomic or proteomic datasets. Default: 0

-lb <length_bias> | -length_bias <length_bias>

In shotgun libraries, sample reference sequences proportionally to their length. For example, in simulated microbial datasets, this means that at the same relative abundance, larger genomes contribute more reads than smaller genomes (and all genomes have the same fold coverage). 0 = no, 1 = yes. Default: 1

-cb <copy_bias> | -copy_bias <copy_bias>

In amplicon libraries where full genomes are used as input, sample species proportionally to the number of copies of the target gene: at equal relative abundance, genomes that have multiple copies of the target gene contribute more amplicon reads than genomes that have a single copy. 0 = no, 1 = yes. Default: 1

-md <mutation_dist>... | -mutation_dist <mutation_dist>...

Introduce sequencing errors in the reads, under the form of mutations (substitutions, insertions and deletions) at positions that follow a specified distribution (with replacement): model (uniform, linear, poly4), model parameters. For example, for a uniform 0.1% error rate, use: uniform 0.1. To simulate Sanger errors, use a linear model where the errror rate is 1% at the 5' end of reads and 2% at the 3' end: linear 1 2. To model Illumina errors using the 4th degree polynome 3e-3 + 3.3e-8 * i^4 (Korbel et al 2009), use: poly4 3e-3 3.3e-8. Use the <mutation_ratio> option to alter how many of these mutations are substitutions or indels. Default: uniform 0 0

-mr <mutation_ratio>... | -mutation_ratio <mutation_ratio>...

Indicate the percentage of substitutions and the number of indels (insertions and deletions). For example, use '80 20' (4 substitutions for each indel) for Sanger reads. Note that this parameter has no effect unless you specify the <mutation_dist> option. Default: 80 20

-hd <homopolymer_dist> | -homopolymer_dist <homopolymer_dist>

Introduce sequencing errors in the reads under the form of homopolymeric stretches (e.g. AAA, CCCCC) using a specified model where the homopolymer length follows a normal distribution N(mean, standard deviation) that is function of the homopolymer length n:

  Margulies: N(n, 0.15 * n)              ,  Margulies et al. 2005.
  Richter  : N(n, 0.15 * sqrt(n))        ,  Richter et al. 2008.
  Balzer   : N(n, 0.03494 + n * 0.06856) ,  Balzer et al. 2010.

Default: 0

-cp <chimera_perc> | -chimera_perc <chimera_perc>

Specify the percent of reads in amplicon libraries that should be chimeric sequences. The 'reference' field in the description of chimeric reads will contain the ID of all the reference sequences forming the chimeric template. A typical value is 10% for amplicons. This option can be used to generate chimeric shotgun reads as well. Default: 0 %

-cd <chimera_dist>... | -chimera_dist <chimera_dist>...

Specify the distribution of chimeras: bimeras, trimeras, quadrameras and multimeras of higher order. The default is the average values from Quince et al. 2011: '314 38 1', which corresponds to 89% of bimeras, 11% of trimeras and 0.3% of quadrameras. Note that this option only takes effect when you request the generation of chimeras with the <chimera_perc> option. Default: 314 38 1

-ck <chimera_kmer> | -chimera_kmer <chimera_kmer>

Activate a method to form chimeras by picking breakpoints at places where k-mers are shared between sequences. <chimera_kmer> represents k, the length of the k-mers (in bp). The longer the kmer, the more similar the sequences have to be to be eligible to form chimeras. The more frequent a k-mer is in the pool of reference sequences (taking into account their relative abundance), the more often this k-mer will be chosen. For example, CHSIM (Edgar et al. 2011) uses this method with a k-mer length of 10 bp. If you do not want to use k-mer information to form chimeras, use 0, which will result in the reference sequences and breakpoints to be taken randomly on the "aligned" reference sequences. Note that this option only takes effect when you request the generation of chimeras with the <chimera_perc> option. Also, this options is quite memory intensive, so you should probably limit yourself to a relatively small number of reference sequences if you want to use it. Default: 10 bp

-af <abundance_file> | -abundance_file <abundance_file>

Specify the relative abundance of the reference sequences manually in an input file. Each line of the file should contain a sequence name and its relative abundance (%), e.g. 'seqABC 82.1' or 'seqABC 82.1 10.2' if you are specifying two different libraries.

-am <abundance_model>... | -abundance_model <abundance_model>...

Relative abundance model for the input reference sequences: uniform, linear, powerlaw, logarithmic or exponential. The uniform and linear models do not require a parameter, but the other models take a parameter in the range [0, infinity). If this parameter is not specified, then it is randomly chosen. Examples:

  uniform distribution: uniform
  powerlaw distribution with parameter 0.1: powerlaw 0.1
  exponential distribution with automatically chosen parameter: exponential

Default: uniform 1

-nl <num_libraries> | -num_libraries <num_libraries>

Number of independent libraries to create. Specify how diverse and similar they should be with <diversity>, <shared_perc> and <permuted_perc>. Assign them different MID tags with <multiplex_mids>. Default: 1

-mi <multiplex_ids> | -multiplex_ids <multiplex_ids>

Specify an optional FASTA file that contains multiplex sequence identifiers (a.k.a MIDs or barcodes) to add to the sequences (one sequence per library, in the order given). The MIDs are included in the length specified with the -read_dist option and can be altered by sequencing errors. See the MIDesigner or BarCrawl programs to generate MID sequences.

-di <diversity>... | -diversity <diversity>...

This option specifies alpha diversity, specifically the richness, i.e. number of reference sequences to take randomly and include in each library. Use 0 for the maximum richness possible (based on the number of reference sequences available). Provide one value to make all libraries have the same diversity, or one richness value per library otherwise. Default: 0

-sp <shared_perc> | -shared_perc <shared_perc>

This option controls an aspect of beta-diversity. When creating multiple libraries, specify the percent of reference sequences they should have in common (relative to the diversity of the least diverse library). Default: 0 %

-pp <permuted_perc> | -permuted_perc <permuted_perc>

This option controls another aspect of beta-diversity. For multiple libraries, choose the percent of the most-abundant reference sequences to permute (randomly shuffle) the rank-abundance of. Default: 100 %

-rs <random_seed> | -random_seed <random_seed>

Seed number to use for the pseudo-random number generator.

-dt <desc_track> | -desc_track <desc_track>

Track read information (reference sequence, position, errors, ...) by writing it in the read description. Default: 1

-ql <qual_levels>... | -qual_levels <qual_levels>...

Generate basic quality scores for the simulated reads. Good residues are given a specified good score (e.g. 30) and residues that are the result of an insertion or substitution are given a specified bad score (e.g. 10). Specify first the good score and then the bad score on the command-line, e.g.: 30 10. Default:

-fq <fastq_output> | -fastq_output <fastq_output>

Whether to write the generated reads in FASTQ format (with Sanger-encoded quality scores) instead of FASTA and QUAL or not (1: yes, 0: no). <qual_levels> need to be specified for this option to be effective. Default: 0

-bn <base_name> | -base_name <base_name>

Prefix of the output files. Default: grinder

-od <output_dir> | -output_dir <output_dir>

Directory where the results should be written. This folder will be created if needed. Default: .

-pf <profile_file> | -profile_file <profile_file>

A file that contains Grinder arguments. This is useful if you use many options or often use the same options. Lines with comments (#) are ignored. Consider the profile file, 'simple_profile.txt':

  # A simple Grinder profile
  -read_dist 105 normal 12
  -total_reads 1000

Running: grinder -reference_file viral_genomes.fa -profile_file simple_profile.txt

Translates into: grinder -reference_file viral_genomes.fa -read_dist 105 normal 12 -total_reads 1000

Note that the arguments specified in the profile should not be specified again on the command line.

CLI OUTPUT

For each shotgun or amplicon read library requested, the following files are generated:

API EXAMPLES

The Grinder API allows to conveniently use Grinder within Perl scripts. Here is a synopsis:

  use Grinder;

  # Set up a new factory (see the OPTIONS section for a complete list of parameters)
  my $factory = Grinder->new( -reference_file => 'genomes.fna' );

  # Process all shotgun libraries requested
  while ( my $struct = $factory->next_lib ) {

    # The ID and abundance of the 3rd most abundant genome in this community
    my $id = $struct->{ids}->[2];
    my $ab = $struct->{abs}->[2];

    # Create shotgun reads
    while ( my $read = $factory->next_read) {

      # The read is a Bioperl sequence object with these properties:
      my $read_id     = $read->id;     # read ID given by Grinder
      my $read_seq    = $read->seq;    # nucleotide sequence
      my $read_mid    = $read->mid;    # MID or tag attached to the read
      my $read_errors = $read->errors; # errors that the read contains
 
      # Where was the read taken from? The reference sequence refers to the
      # database sequence for shotgun libraries, amplicon obtained from the
      # database sequence, or could even be a chimeric sequence
      my $ref_id     = $read->reference->id; # ID of the reference sequence
      my $ref_start  = $read->start;         # start of the read on the reference
      my $ref_end    = $read->end;           # end of the read on the reference
      my $ref_strand = $read->strand;        # strand of the reference
      
    }
  }

  # Similarly, for shotgun mate pairs
  my $factory = Grinder->new( -reference_file => 'genomes.fna',
                              -insert_dist    => 250            );
  while ( $factory->next_lib ) {
    while ( my $read = $factory->next_read ) {
      # The first read is the first mate of the mate pair
      # The second read is the second mate of the mate pair
      # The third read is the first mate of the next mate pair
      # ...
    }
  }

  # To generate an amplicon library
  my $factory = Grinder->new( -reference_file  => 'genomes.fna',
                              -forward_reverse => '16Sgenes.fna',
                              -length_bias     => 0,
                              -unidirectional  => 1              );
  while ( $factory->next_lib ) {
    while ( my $read = $factory->next_read) {
      # ...
    }
  }

API METHODS

The rest of the documentation details the available Grinder API methods.

new

Title : new

Function: Create a new Grinder factory initialized with the passed arguments. Available parameters described in the OPTIONS section.

Usage : my $factory = Grinder->new( -reference_file => 'genomes.fna' );

Returns : a new Grinder object

next_lib

Title : next_lib

Function: Go to the next shotgun library to process.

Usage : my $struct = $factory->next_lib;

Returns : Community structure to be used for this library, where $struct->{ids} is an array reference containing the IDs of the genome making up the community (sorted by decreasing relative abundance) and $struct->{abs} is an array reference of the genome abundances (in the same order as the IDs).

next_read

Title : next_read

Function: Create an amplicon or shotgun read for the current library.

Usage : my $read = $factory->next_read; # for single read my $mate1 = $factory->next_read; # for mate pairs my $mate2 = $factory->next_read;

Returns : A sequence represented as a Bio::Seq::SimulatedRead object

get_random_seed

Title : get_random_seed

Function: Return the number used to seed the pseudo-random number generator

Usage : my $seed = $factory->get_random_seed;

Returns : seed number

COPYRIGHT

Copyright 2009-2013 Florent ANGLY <florent.angly@gmail.com>

Grinder is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Grinder. If not, see <http://www.gnu.org/licenses/>.

BUGS

All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: http://sourceforge.net/tracker/?group_id=244196&atid=1124737

Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge (http://sourceforge.net/scm/?type=git&group_id=244196) and is under Git revision control. To get started with a patch, do:

   git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder
Grinder-0.5.4/inc/0000755000175000017500000000000012647202511014103 5ustar flofloooflofloooGrinder-0.5.4/inc/Module/0000755000175000017500000000000012647202511015330 5ustar flofloooflofloooGrinder-0.5.4/inc/Module/AutoInstall.pm0000644000175000017500000006231112647202456020140 0ustar floflooofloflooo#line 1 package Module::AutoInstall; use strict; use Cwd (); use File::Spec (); use ExtUtils::MakeMaker (); use vars qw{$VERSION}; BEGIN { $VERSION = '1.16'; } # special map on pre-defined feature sets my %FeatureMap = ( '' => 'Core Features', # XXX: deprecated '-core' => 'Core Features', ); # various lexical flags my ( @Missing, @Existing, %DisabledTests, $UnderCPAN, $InstallDepsTarget, $HasCPANPLUS ); my ( $Config, $CheckOnly, $SkipInstall, $AcceptDefault, $TestOnly, $AllDeps, $UpgradeDeps ); my ( $PostambleActions, $PostambleActionsNoTest, $PostambleActionsUpgradeDeps, $PostambleActionsUpgradeDepsNoTest, $PostambleActionsListDeps, $PostambleActionsListAllDeps, $PostambleUsed, $NoTest); # See if it's a testing or non-interactive session _accept_default( $ENV{AUTOMATED_TESTING} or ! -t STDIN ); _init(); sub _accept_default { $AcceptDefault = shift; } sub _installdeps_target { $InstallDepsTarget = shift; } sub missing_modules { return @Missing; } sub do_install { __PACKAGE__->install( [ $Config ? ( UNIVERSAL::isa( $Config, 'HASH' ) ? %{$Config} : @{$Config} ) : () ], @Missing, ); } # initialize various flags, and/or perform install sub _init { foreach my $arg ( @ARGV, split( /[\s\t]+/, $ENV{PERL_AUTOINSTALL} || $ENV{PERL_EXTUTILS_AUTOINSTALL} || '' ) ) { if ( $arg =~ /^--config=(.*)$/ ) { $Config = [ split( ',', $1 ) ]; } elsif ( $arg =~ /^--installdeps=(.*)$/ ) { __PACKAGE__->install( $Config, @Missing = split( /,/, $1 ) ); exit 0; } elsif ( $arg =~ /^--upgradedeps=(.*)$/ ) { $UpgradeDeps = 1; __PACKAGE__->install( $Config, @Missing = split( /,/, $1 ) ); exit 0; } elsif ( $arg =~ /^--default(?:deps)?$/ ) { $AcceptDefault = 1; } elsif ( $arg =~ /^--check(?:deps)?$/ ) { $CheckOnly = 1; } elsif ( $arg =~ /^--skip(?:deps)?$/ ) { $SkipInstall = 1; } elsif ( $arg =~ /^--test(?:only)?$/ ) { $TestOnly = 1; } elsif ( $arg =~ /^--all(?:deps)?$/ ) { $AllDeps = 1; } } } # overrides MakeMaker's prompt() to automatically accept the default choice sub _prompt { goto &ExtUtils::MakeMaker::prompt unless $AcceptDefault; my ( $prompt, $default ) = @_; my $y = ( $default =~ /^[Yy]/ ); print $prompt, ' [', ( $y ? 'Y' : 'y' ), '/', ( $y ? 'n' : 'N' ), '] '; print "$default\n"; return $default; } # the workhorse sub import { my $class = shift; my @args = @_ or return; my $core_all; print "*** $class version " . $class->VERSION . "\n"; print "*** Checking for Perl dependencies...\n"; my $cwd = Cwd::getcwd(); $Config = []; my $maxlen = length( ( sort { length($b) <=> length($a) } grep { /^[^\-]/ } map { ref($_) ? ( ( ref($_) eq 'HASH' ) ? keys(%$_) : @{$_} ) : '' } map { +{@args}->{$_} } grep { /^[^\-]/ or /^-core$/i } keys %{ +{@args} } )[0] ); # We want to know if we're under CPAN early to avoid prompting, but # if we aren't going to try and install anything anyway then skip the # check entirely since we don't want to have to load (and configure) # an old CPAN just for a cosmetic message $UnderCPAN = _check_lock(1) unless $SkipInstall || $InstallDepsTarget; while ( my ( $feature, $modules ) = splice( @args, 0, 2 ) ) { my ( @required, @tests, @skiptests ); my $default = 1; my $conflict = 0; if ( $feature =~ m/^-(\w+)$/ ) { my $option = lc($1); # check for a newer version of myself _update_to( $modules, @_ ) and return if $option eq 'version'; # sets CPAN configuration options $Config = $modules if $option eq 'config'; # promote every features to core status $core_all = ( $modules =~ /^all$/i ) and next if $option eq 'core'; next unless $option eq 'core'; } print "[" . ( $FeatureMap{ lc($feature) } || $feature ) . "]\n"; $modules = [ %{$modules} ] if UNIVERSAL::isa( $modules, 'HASH' ); unshift @$modules, -default => &{ shift(@$modules) } if ( ref( $modules->[0] ) eq 'CODE' ); # XXX: bugward compatibility while ( my ( $mod, $arg ) = splice( @$modules, 0, 2 ) ) { if ( $mod =~ m/^-(\w+)$/ ) { my $option = lc($1); $default = $arg if ( $option eq 'default' ); $conflict = $arg if ( $option eq 'conflict' ); @tests = @{$arg} if ( $option eq 'tests' ); @skiptests = @{$arg} if ( $option eq 'skiptests' ); next; } printf( "- %-${maxlen}s ...", $mod ); if ( $arg and $arg =~ /^\D/ ) { unshift @$modules, $arg; $arg = 0; } # XXX: check for conflicts and uninstalls(!) them. my $cur = _version_of($mod); if (_version_cmp ($cur, $arg) >= 0) { print "loaded. ($cur" . ( $arg ? " >= $arg" : '' ) . ")\n"; push @Existing, $mod => $arg; $DisabledTests{$_} = 1 for map { glob($_) } @skiptests; } else { if (not defined $cur) # indeed missing { print "missing." . ( $arg ? " (would need $arg)" : '' ) . "\n"; } else { # no need to check $arg as _version_cmp ($cur, undef) would satisfy >= above print "too old. ($cur < $arg)\n"; } push @required, $mod => $arg; } } next unless @required; my $mandatory = ( $feature eq '-core' or $core_all ); if ( !$SkipInstall and ( $CheckOnly or ($mandatory and $UnderCPAN) or $AllDeps or $InstallDepsTarget or _prompt( qq{==> Auto-install the } . ( @required / 2 ) . ( $mandatory ? ' mandatory' : ' optional' ) . qq{ module(s) from CPAN?}, $default ? 'y' : 'n', ) =~ /^[Yy]/ ) ) { push( @Missing, @required ); $DisabledTests{$_} = 1 for map { glob($_) } @skiptests; } elsif ( !$SkipInstall and $default and $mandatory and _prompt( qq{==> The module(s) are mandatory! Really skip?}, 'n', ) =~ /^[Nn]/ ) { push( @Missing, @required ); $DisabledTests{$_} = 1 for map { glob($_) } @skiptests; } else { $DisabledTests{$_} = 1 for map { glob($_) } @tests; } } if ( @Missing and not( $CheckOnly or $UnderCPAN) ) { require Config; my $make = $Config::Config{make}; if ($InstallDepsTarget) { print "*** To install dependencies type '$make installdeps' or '$make installdeps_notest'.\n"; } else { print "*** Dependencies will be installed the next time you type '$make'.\n"; } # make an educated guess of whether we'll need root permission. print " (You may need to do that as the 'root' user.)\n" if eval '$>'; } print "*** $class configuration finished.\n"; chdir $cwd; # import to main:: no strict 'refs'; *{'main::WriteMakefile'} = \&Write if caller(0) eq 'main'; return (@Existing, @Missing); } sub _running_under { my $thing = shift; print <<"END_MESSAGE"; *** Since we're running under ${thing}, I'll just let it take care of the dependency's installation later. END_MESSAGE return 1; } # Check to see if we are currently running under CPAN.pm and/or CPANPLUS; # if we are, then we simply let it taking care of our dependencies sub _check_lock { return unless @Missing or @_; if ($ENV{PERL5_CPANM_IS_RUNNING}) { return _running_under('cpanminus'); } my $cpan_env = $ENV{PERL5_CPAN_IS_RUNNING}; if ($ENV{PERL5_CPANPLUS_IS_RUNNING}) { return _running_under($cpan_env ? 'CPAN' : 'CPANPLUS'); } require CPAN; if ($CPAN::VERSION > '1.89') { if ($cpan_env) { return _running_under('CPAN'); } return; # CPAN.pm new enough, don't need to check further } # last ditch attempt, this -will- configure CPAN, very sorry _load_cpan(1); # force initialize even though it's already loaded # Find the CPAN lock-file my $lock = MM->catfile( $CPAN::Config->{cpan_home}, ".lock" ); return unless -f $lock; # Check the lock local *LOCK; return unless open(LOCK, $lock); if ( ( $^O eq 'MSWin32' ? _under_cpan() : == getppid() ) and ( $CPAN::Config->{prerequisites_policy} || '' ) ne 'ignore' ) { print <<'END_MESSAGE'; *** Since we're running under CPAN, I'll just let it take care of the dependency's installation later. END_MESSAGE return 1; } close LOCK; return; } sub install { my $class = shift; my $i; # used below to strip leading '-' from config keys my @config = ( map { s/^-// if ++$i; $_ } @{ +shift } ); my ( @modules, @installed, @modules_to_upgrade ); while (my ($pkg, $ver) = splice(@_, 0, 2)) { # grep out those already installed if (_version_cmp(_version_of($pkg), $ver) >= 0) { push @installed, $pkg; if ($UpgradeDeps) { push @modules_to_upgrade, $pkg, $ver; } } else { push @modules, $pkg, $ver; } } if ($UpgradeDeps) { push @modules, @modules_to_upgrade; @installed = (); @modules_to_upgrade = (); } return @installed unless @modules; # nothing to do return @installed if _check_lock(); # defer to the CPAN shell print "*** Installing dependencies...\n"; return unless _connected_to('cpan.org'); my %args = @config; my %failed; local *FAILED; if ( $args{do_once} and open( FAILED, '.#autoinstall.failed' ) ) { while () { chomp; $failed{$_}++ } close FAILED; my @newmod; while ( my ( $k, $v ) = splice( @modules, 0, 2 ) ) { push @newmod, ( $k => $v ) unless $failed{$k}; } @modules = @newmod; } if ( _has_cpanplus() and not $ENV{PERL_AUTOINSTALL_PREFER_CPAN} ) { _install_cpanplus( \@modules, \@config ); } else { _install_cpan( \@modules, \@config ); } print "*** $class installation finished.\n"; # see if we have successfully installed them while ( my ( $pkg, $ver ) = splice( @modules, 0, 2 ) ) { if ( _version_cmp( _version_of($pkg), $ver ) >= 0 ) { push @installed, $pkg; } elsif ( $args{do_once} and open( FAILED, '>> .#autoinstall.failed' ) ) { print FAILED "$pkg\n"; } } close FAILED if $args{do_once}; return @installed; } sub _install_cpanplus { my @modules = @{ +shift }; my @config = _cpanplus_config( @{ +shift } ); my $installed = 0; require CPANPLUS::Backend; my $cp = CPANPLUS::Backend->new; my $conf = $cp->configure_object; return unless $conf->can('conf') # 0.05x+ with "sudo" support or _can_write($conf->_get_build('base')); # 0.04x # if we're root, set UNINST=1 to avoid trouble unless user asked for it. my $makeflags = $conf->get_conf('makeflags') || ''; if ( UNIVERSAL::isa( $makeflags, 'HASH' ) ) { # 0.03+ uses a hashref here $makeflags->{UNINST} = 1 unless exists $makeflags->{UNINST}; } else { # 0.02 and below uses a scalar $makeflags = join( ' ', split( ' ', $makeflags ), 'UNINST=1' ) if ( $makeflags !~ /\bUNINST\b/ and eval qq{ $> eq '0' } ); } $conf->set_conf( makeflags => $makeflags ); $conf->set_conf( prereqs => 1 ); while ( my ( $key, $val ) = splice( @config, 0, 2 ) ) { $conf->set_conf( $key, $val ); } my $modtree = $cp->module_tree; while ( my ( $pkg, $ver ) = splice( @modules, 0, 2 ) ) { print "*** Installing $pkg...\n"; MY::preinstall( $pkg, $ver ) or next if defined &MY::preinstall; my $success; my $obj = $modtree->{$pkg}; if ( $obj and _version_cmp( $obj->{version}, $ver ) >= 0 ) { my $pathname = $pkg; $pathname =~ s/::/\\W/; foreach my $inc ( grep { m/$pathname.pm/i } keys(%INC) ) { delete $INC{$inc}; } my $rv = $cp->install( modules => [ $obj->{module} ] ); if ( $rv and ( $rv->{ $obj->{module} } or $rv->{ok} ) ) { print "*** $pkg successfully installed.\n"; $success = 1; } else { print "*** $pkg installation cancelled.\n"; $success = 0; } $installed += $success; } else { print << "."; *** Could not find a version $ver or above for $pkg; skipping. . } MY::postinstall( $pkg, $ver, $success ) if defined &MY::postinstall; } return $installed; } sub _cpanplus_config { my @config = (); while ( @_ ) { my ($key, $value) = (shift(), shift()); if ( $key eq 'prerequisites_policy' ) { if ( $value eq 'follow' ) { $value = CPANPLUS::Internals::Constants::PREREQ_INSTALL(); } elsif ( $value eq 'ask' ) { $value = CPANPLUS::Internals::Constants::PREREQ_ASK(); } elsif ( $value eq 'ignore' ) { $value = CPANPLUS::Internals::Constants::PREREQ_IGNORE(); } else { die "*** Cannot convert option $key = '$value' to CPANPLUS version.\n"; } push @config, 'prereqs', $value; } elsif ( $key eq 'force' ) { push @config, $key, $value; } elsif ( $key eq 'notest' ) { push @config, 'skiptest', $value; } else { die "*** Cannot convert option $key to CPANPLUS version.\n"; } } return @config; } sub _install_cpan { my @modules = @{ +shift }; my @config = @{ +shift }; my $installed = 0; my %args; _load_cpan(); require Config; if (CPAN->VERSION < 1.80) { # no "sudo" support, probe for writableness return unless _can_write( MM->catfile( $CPAN::Config->{cpan_home}, 'sources' ) ) and _can_write( $Config::Config{sitelib} ); } # if we're root, set UNINST=1 to avoid trouble unless user asked for it. my $makeflags = $CPAN::Config->{make_install_arg} || ''; $CPAN::Config->{make_install_arg} = join( ' ', split( ' ', $makeflags ), 'UNINST=1' ) if ( $makeflags !~ /\bUNINST\b/ and eval qq{ $> eq '0' } ); # don't show start-up info $CPAN::Config->{inhibit_startup_message} = 1; # set additional options while ( my ( $opt, $arg ) = splice( @config, 0, 2 ) ) { ( $args{$opt} = $arg, next ) if $opt =~ /^(?:force|notest)$/; # pseudo-option $CPAN::Config->{$opt} = $opt eq 'urllist' ? [$arg] : $arg; } if ($args{notest} && (not CPAN::Shell->can('notest'))) { die "Your version of CPAN is too old to support the 'notest' pragma"; } local $CPAN::Config->{prerequisites_policy} = 'follow'; while ( my ( $pkg, $ver ) = splice( @modules, 0, 2 ) ) { MY::preinstall( $pkg, $ver ) or next if defined &MY::preinstall; print "*** Installing $pkg...\n"; my $obj = CPAN::Shell->expand( Module => $pkg ); my $success = 0; if ( $obj and _version_cmp( $obj->cpan_version, $ver ) >= 0 ) { my $pathname = $pkg; $pathname =~ s/::/\\W/; foreach my $inc ( grep { m/$pathname.pm/i } keys(%INC) ) { delete $INC{$inc}; } my $rv = do { if ($args{force}) { CPAN::Shell->force( install => $pkg ) } elsif ($args{notest}) { CPAN::Shell->notest( install => $pkg ) } else { CPAN::Shell->install($pkg) } }; $rv ||= eval { $CPAN::META->instance( 'CPAN::Distribution', $obj->cpan_file, ) ->{install} if $CPAN::META; }; if ( $rv eq 'YES' ) { print "*** $pkg successfully installed.\n"; $success = 1; } else { print "*** $pkg installation failed.\n"; $success = 0; } $installed += $success; } else { print << "."; *** Could not find a version $ver or above for $pkg; skipping. . } MY::postinstall( $pkg, $ver, $success ) if defined &MY::postinstall; } return $installed; } sub _has_cpanplus { return ( $HasCPANPLUS = ( $INC{'CPANPLUS/Config.pm'} or _load('CPANPLUS::Shell::Default') ) ); } # make guesses on whether we're under the CPAN installation directory sub _under_cpan { require Cwd; require File::Spec; my $cwd = File::Spec->canonpath( Cwd::getcwd() ); my $cpan = File::Spec->canonpath( $CPAN::Config->{cpan_home} ); return ( index( $cwd, $cpan ) > -1 ); } sub _update_to { my $class = __PACKAGE__; my $ver = shift; return if _version_cmp( _version_of($class), $ver ) >= 0; # no need to upgrade if ( _prompt( "==> A newer version of $class ($ver) is required. Install?", 'y' ) =~ /^[Nn]/ ) { die "*** Please install $class $ver manually.\n"; } print << "."; *** Trying to fetch it from CPAN... . # install ourselves _load($class) and return $class->import(@_) if $class->install( [], $class, $ver ); print << '.'; exit 1; *** Cannot bootstrap myself. :-( Installation terminated. . } # check if we're connected to some host, using inet_aton sub _connected_to { my $site = shift; return ( ( _load('Socket') and Socket::inet_aton($site) ) or _prompt( qq( *** Your host cannot resolve the domain name '$site', which probably means the Internet connections are unavailable. ==> Should we try to install the required module(s) anyway?), 'n' ) =~ /^[Yy]/ ); } # check if a directory is writable; may create it on demand sub _can_write { my $path = shift; mkdir( $path, 0755 ) unless -e $path; return 1 if -w $path; print << "."; *** You are not allowed to write to the directory '$path'; the installation may fail due to insufficient permissions. . if ( eval '$>' and lc(`sudo -V`) =~ /version/ and _prompt( qq( ==> Should we try to re-execute the autoinstall process with 'sudo'?), ((-t STDIN) ? 'y' : 'n') ) =~ /^[Yy]/ ) { # try to bootstrap ourselves from sudo print << "."; *** Trying to re-execute the autoinstall process with 'sudo'... . my $missing = join( ',', @Missing ); my $config = join( ',', UNIVERSAL::isa( $Config, 'HASH' ) ? %{$Config} : @{$Config} ) if $Config; return unless system( 'sudo', $^X, $0, "--config=$config", "--installdeps=$missing" ); print << "."; *** The 'sudo' command exited with error! Resuming... . } return _prompt( qq( ==> Should we try to install the required module(s) anyway?), 'n' ) =~ /^[Yy]/; } # load a module and return the version it reports sub _load { my $mod = pop; # method/function doesn't matter my $file = $mod; $file =~ s|::|/|g; $file .= '.pm'; local $@; return eval { require $file; $mod->VERSION } || ( $@ ? undef: 0 ); } # report version without loading a module sub _version_of { my $mod = pop; # method/function doesn't matter my $file = $mod; $file =~ s|::|/|g; $file .= '.pm'; foreach my $dir ( @INC ) { next if ref $dir; my $path = File::Spec->catfile($dir, $file); next unless -e $path; require ExtUtils::MM_Unix; return ExtUtils::MM_Unix->parse_version($path); } return undef; } # Load CPAN.pm and it's configuration sub _load_cpan { return if $CPAN::VERSION and $CPAN::Config and not @_; require CPAN; # CPAN-1.82+ adds CPAN::Config::AUTOLOAD to redirect to # CPAN::HandleConfig->load. CPAN reports that the redirection # is deprecated in a warning printed at the user. # CPAN-1.81 expects CPAN::HandleConfig->load, does not have # $CPAN::HandleConfig::VERSION but cannot handle # CPAN::Config->load # Which "versions expect CPAN::Config->load? if ( $CPAN::HandleConfig::VERSION || CPAN::HandleConfig->can('load') ) { # Newer versions of CPAN have a HandleConfig module CPAN::HandleConfig->load; } else { # Older versions had the load method in Config directly CPAN::Config->load; } } # compare two versions, either use Sort::Versions or plain comparison # return values same as <=> sub _version_cmp { my ( $cur, $min ) = @_; return -1 unless defined $cur; # if 0 keep comparing return 1 unless $min; $cur =~ s/\s+$//; # check for version numbers that are not in decimal format if ( ref($cur) or ref($min) or $cur =~ /v|\..*\./ or $min =~ /v|\..*\./ ) { if ( ( $version::VERSION or defined( _load('version') )) and version->can('new') ) { # use version.pm if it is installed. return version->new($cur) <=> version->new($min); } elsif ( $Sort::Versions::VERSION or defined( _load('Sort::Versions') ) ) { # use Sort::Versions as the sorting algorithm for a.b.c versions return Sort::Versions::versioncmp( $cur, $min ); } warn "Cannot reliably compare non-decimal formatted versions.\n" . "Please install version.pm or Sort::Versions.\n"; } # plain comparison local $^W = 0; # shuts off 'not numeric' bugs return $cur <=> $min; } # nothing; this usage is deprecated. sub main::PREREQ_PM { return {}; } sub _make_args { my %args = @_; $args{PREREQ_PM} = { %{ $args{PREREQ_PM} || {} }, @Existing, @Missing } if $UnderCPAN or $TestOnly; if ( $args{EXE_FILES} and -e 'MANIFEST' ) { require ExtUtils::Manifest; my $manifest = ExtUtils::Manifest::maniread('MANIFEST'); $args{EXE_FILES} = [ grep { exists $manifest->{$_} } @{ $args{EXE_FILES} } ]; } $args{test}{TESTS} ||= 't/*.t'; $args{test}{TESTS} = join( ' ', grep { !exists( $DisabledTests{$_} ) } map { glob($_) } split( /\s+/, $args{test}{TESTS} ) ); my $missing = join( ',', @Missing ); my $config = join( ',', UNIVERSAL::isa( $Config, 'HASH' ) ? %{$Config} : @{$Config} ) if $Config; $PostambleActions = ( ($missing and not $UnderCPAN) ? "\$(PERL) $0 --config=$config --installdeps=$missing" : "\$(NOECHO) \$(NOOP)" ); my $deps_list = join( ',', @Missing, @Existing ); $PostambleActionsUpgradeDeps = "\$(PERL) $0 --config=$config --upgradedeps=$deps_list"; my $config_notest = join( ',', (UNIVERSAL::isa( $Config, 'HASH' ) ? %{$Config} : @{$Config}), 'notest', 1 ) if $Config; $PostambleActionsNoTest = ( ($missing and not $UnderCPAN) ? "\$(PERL) $0 --config=$config_notest --installdeps=$missing" : "\$(NOECHO) \$(NOOP)" ); $PostambleActionsUpgradeDepsNoTest = "\$(PERL) $0 --config=$config_notest --upgradedeps=$deps_list"; $PostambleActionsListDeps = '@$(PERL) -le "print for @ARGV" ' . join(' ', map $Missing[$_], grep $_ % 2 == 0, 0..$#Missing); my @all = (@Missing, @Existing); $PostambleActionsListAllDeps = '@$(PERL) -le "print for @ARGV" ' . join(' ', map $all[$_], grep $_ % 2 == 0, 0..$#all); return %args; } # a wrapper to ExtUtils::MakeMaker::WriteMakefile sub Write { require Carp; Carp::croak "WriteMakefile: Need even number of args" if @_ % 2; if ($CheckOnly) { print << "."; *** Makefile not written in check-only mode. . return; } my %args = _make_args(@_); no strict 'refs'; $PostambleUsed = 0; local *MY::postamble = \&postamble unless defined &MY::postamble; ExtUtils::MakeMaker::WriteMakefile(%args); print << "." unless $PostambleUsed; *** WARNING: Makefile written with customized MY::postamble() without including contents from Module::AutoInstall::postamble() -- auto installation features disabled. Please contact the author. . return 1; } sub postamble { $PostambleUsed = 1; my $fragment; $fragment .= <<"AUTO_INSTALL" if !$InstallDepsTarget; config :: installdeps \t\$(NOECHO) \$(NOOP) AUTO_INSTALL $fragment .= <<"END_MAKE"; checkdeps :: \t\$(PERL) $0 --checkdeps installdeps :: \t$PostambleActions installdeps_notest :: \t$PostambleActionsNoTest upgradedeps :: \t$PostambleActionsUpgradeDeps upgradedeps_notest :: \t$PostambleActionsUpgradeDepsNoTest listdeps :: \t$PostambleActionsListDeps listalldeps :: \t$PostambleActionsListAllDeps END_MAKE return $fragment; } 1; __END__ #line 1197 Grinder-0.5.4/inc/Module/Install/0000755000175000017500000000000012647202511016736 5ustar flofloooflofloooGrinder-0.5.4/inc/Module/Install/AutoManifest.pm0000644000175000017500000000125712647202456021710 0ustar floflooofloflooo#line 1 use strict; use warnings; package Module::Install::AutoManifest; use Module::Install::Base; BEGIN { our $VERSION = '0.003'; our $ISCORE = 1; our @ISA = qw(Module::Install::Base); } sub auto_manifest { my ($self) = @_; return unless $Module::Install::AUTHOR; die "auto_manifest requested, but no MANIFEST.SKIP exists\n" unless -e "MANIFEST.SKIP"; if (-e "MANIFEST") { unlink('MANIFEST') or die "Can't remove MANIFEST: $!"; } $self->postamble(<<"END"); create_distdir: manifest_clean manifest distclean :: manifest_clean manifest_clean: \t\$(RM_F) MANIFEST END } 1; __END__ #line 48 #line 131 1; # End of Module::Install::AutoManifest Grinder-0.5.4/inc/Module/Install/AutoInstall.pm0000644000175000017500000000416212647202456021546 0ustar floflooofloflooo#line 1 package Module::Install::AutoInstall; use strict; use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } sub AutoInstall { $_[0] } sub run { my $self = shift; $self->auto_install_now(@_); } sub write { my $self = shift; $self->auto_install(@_); } sub auto_install { my $self = shift; return if $self->{done}++; # Flatten array of arrays into a single array my @core = map @$_, map @$_, grep ref, $self->build_requires, $self->requires; my @config = @_; # We'll need Module::AutoInstall $self->include('Module::AutoInstall'); require Module::AutoInstall; my @features_require = Module::AutoInstall->import( (@config ? (-config => \@config) : ()), (@core ? (-core => \@core) : ()), $self->features, ); my %seen; my @requires = map @$_, map @$_, grep ref, $self->requires; while (my ($mod, $ver) = splice(@requires, 0, 2)) { $seen{$mod}{$ver}++; } my @build_requires = map @$_, map @$_, grep ref, $self->build_requires; while (my ($mod, $ver) = splice(@build_requires, 0, 2)) { $seen{$mod}{$ver}++; } my @configure_requires = map @$_, map @$_, grep ref, $self->configure_requires; while (my ($mod, $ver) = splice(@configure_requires, 0, 2)) { $seen{$mod}{$ver}++; } my @deduped; while (my ($mod, $ver) = splice(@features_require, 0, 2)) { push @deduped, $mod => $ver unless $seen{$mod}{$ver}++; } $self->requires(@deduped); $self->makemaker_args( Module::AutoInstall::_make_args() ); my $class = ref($self); $self->postamble( "# --- $class section:\n" . Module::AutoInstall::postamble() ); } sub installdeps_target { my ($self, @args) = @_; $self->include('Module::AutoInstall'); require Module::AutoInstall; Module::AutoInstall::_installdeps_target(1); $self->auto_install(@args); } sub auto_install_now { my $self = shift; $self->auto_install(@_); Module::AutoInstall::do_install(); } 1; Grinder-0.5.4/inc/Module/Install/AuthorRequires.pm0000644000175000017500000000113112647202456022262 0ustar floflooofloflooo#line 1 use strict; use warnings; package Module::Install::AuthorRequires; use base 'Module::Install::Base'; # cargo cult BEGIN { our $VERSION = '0.02'; our $ISCORE = 1; } sub author_requires { my $self = shift; return $self->{values}->{author_requires} unless @_; my @added; while (@_) { my $mod = shift or last; my $version = shift || 0; push @added, [$mod => $version]; } push @{ $self->{values}->{author_requires} }, @added; $self->admin->author_requires(@added); return map { @$_ } @added; } 1; __END__ #line 92 Grinder-0.5.4/inc/Module/Install/ReadmeFromPod.pm0000644000175000017500000000631112647202457021772 0ustar floflooofloflooo#line 1 package Module::Install::ReadmeFromPod; use 5.006; use strict; use warnings; use base qw(Module::Install::Base); use vars qw($VERSION); $VERSION = '0.22'; sub readme_from { my $self = shift; return unless $self->is_admin; # Input file my $in_file = shift || $self->_all_from or die "Can't determine file to make readme_from"; # Get optional arguments my ($clean, $format, $out_file, $options); my $args = shift; if ( ref $args ) { # Arguments are in a hashref if ( ref($args) ne 'HASH' ) { die "Expected a hashref but got a ".ref($args)."\n"; } else { $clean = $args->{'clean'}; $format = $args->{'format'}; $out_file = $args->{'output_file'}; $options = $args->{'options'}; } } else { # Arguments are in a list $clean = $args; $format = shift; $out_file = shift; $options = \@_; } # Default values; $clean ||= 0; $format ||= 'txt'; # Generate README print "readme_from $in_file to $format\n"; if ($format =~ m/te?xt/) { $out_file = $self->_readme_txt($in_file, $out_file, $options); } elsif ($format =~ m/html?/) { $out_file = $self->_readme_htm($in_file, $out_file, $options); } elsif ($format eq 'man') { $out_file = $self->_readme_man($in_file, $out_file, $options); } elsif ($format eq 'pdf') { $out_file = $self->_readme_pdf($in_file, $out_file, $options); } if ($clean) { $self->clean_files($out_file); } return 1; } sub _readme_txt { my ($self, $in_file, $out_file, $options) = @_; $out_file ||= 'README'; require Pod::Text; my $parser = Pod::Text->new( @$options ); open my $out_fh, '>', $out_file or die "Could not write file $out_file:\n$!\n"; $parser->output_fh( *$out_fh ); $parser->parse_file( $in_file ); close $out_fh; return $out_file; } sub _readme_htm { my ($self, $in_file, $out_file, $options) = @_; $out_file ||= 'README.htm'; require Pod::Html; Pod::Html::pod2html( "--infile=$in_file", "--outfile=$out_file", @$options, ); # Remove temporary files if needed for my $file ('pod2htmd.tmp', 'pod2htmi.tmp') { if (-e $file) { unlink $file or warn "Warning: Could not remove file '$file'.\n$!\n"; } } return $out_file; } sub _readme_man { my ($self, $in_file, $out_file, $options) = @_; $out_file ||= 'README.1'; require Pod::Man; my $parser = Pod::Man->new( @$options ); $parser->parse_from_file($in_file, $out_file); return $out_file; } sub _readme_pdf { my ($self, $in_file, $out_file, $options) = @_; $out_file ||= 'README.pdf'; eval { require App::pod2pdf; } or die "Could not generate $out_file because pod2pdf could not be found\n"; my $parser = App::pod2pdf->new( @$options ); $parser->parse_from_file($in_file); open my $out_fh, '>', $out_file or die "Could not write file $out_file:\n$!\n"; select $out_fh; $parser->output; select STDOUT; close $out_fh; return $out_file; } sub _all_from { my $self = shift; return unless $self->admin->{extensions}; my ($metadata) = grep { ref($_) eq 'Module::Install::Metadata'; } @{$self->admin->{extensions}}; return unless $metadata; return $metadata->{values}{all_from} || ''; } 'Readme!'; __END__ #line 254 Grinder-0.5.4/inc/Module/Install/Include.pm0000644000175000017500000000101512647202456020664 0ustar floflooofloflooo#line 1 package Module::Install::Include; use strict; use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } sub include { shift()->admin->include(@_); } sub include_deps { shift()->admin->include_deps(@_); } sub auto_include { shift()->admin->auto_include(@_); } sub auto_include_deps { shift()->admin->auto_include_deps(@_); } sub auto_include_dependent_dists { shift()->admin->auto_include_dependent_dists(@_); } 1; Grinder-0.5.4/inc/Module/Install/WriteAll.pm0000644000175000017500000000237612647202456021037 0ustar floflooofloflooo#line 1 package Module::Install::WriteAll; use strict; use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = qw{Module::Install::Base}; $ISCORE = 1; } sub WriteAll { my $self = shift; my %args = ( meta => 1, sign => 0, inline => 0, check_nmake => 1, @_, ); $self->sign(1) if $args{sign}; $self->admin->WriteAll(%args) if $self->is_admin; $self->check_nmake if $args{check_nmake}; unless ( $self->makemaker_args->{PL_FILES} ) { # XXX: This still may be a bit over-defensive... unless ($self->makemaker(6.25)) { $self->makemaker_args( PL_FILES => {} ) if -f 'Build.PL'; } } # Until ExtUtils::MakeMaker support MYMETA.yml, make sure # we clean it up properly ourself. $self->realclean_files('MYMETA.yml'); if ( $args{inline} ) { $self->Inline->write; } else { $self->Makefile->write; } # The Makefile write process adds a couple of dependencies, # so write the META.yml files after the Makefile. if ( $args{meta} ) { $self->Meta->write; } # Experimental support for MYMETA if ( $ENV{X_MYMETA} ) { if ( $ENV{X_MYMETA} eq 'JSON' ) { $self->Meta->write_mymeta_json; } else { $self->Meta->write_mymeta_yaml; } } return 1; } 1; Grinder-0.5.4/inc/Module/Install/Makefile.pm0000644000175000017500000002743712647202456021036 0ustar floflooofloflooo#line 1 package Module::Install::Makefile; use strict 'vars'; use ExtUtils::MakeMaker (); use Module::Install::Base (); use Fcntl qw/:flock :seek/; use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } sub Makefile { $_[0] } my %seen = (); sub prompt { shift; # Infinite loop protection my @c = caller(); if ( ++$seen{"$c[1]|$c[2]|$_[0]"} > 3 ) { die "Caught an potential prompt infinite loop ($c[1]|$c[2]|$_[0])"; } # In automated testing or non-interactive session, always use defaults if ( ($ENV{AUTOMATED_TESTING} or -! -t STDIN) and ! $ENV{PERL_MM_USE_DEFAULT} ) { local $ENV{PERL_MM_USE_DEFAULT} = 1; goto &ExtUtils::MakeMaker::prompt; } else { goto &ExtUtils::MakeMaker::prompt; } } # Store a cleaned up version of the MakeMaker version, # since we need to behave differently in a variety of # ways based on the MM version. my $makemaker = eval $ExtUtils::MakeMaker::VERSION; # If we are passed a param, do a "newer than" comparison. # Otherwise, just return the MakeMaker version. sub makemaker { ( @_ < 2 or $makemaker >= eval($_[1]) ) ? $makemaker : 0 } # Ripped from ExtUtils::MakeMaker 6.56, and slightly modified # as we only need to know here whether the attribute is an array # or a hash or something else (which may or may not be appendable). my %makemaker_argtype = ( C => 'ARRAY', CONFIG => 'ARRAY', # CONFIGURE => 'CODE', # ignore DIR => 'ARRAY', DL_FUNCS => 'HASH', DL_VARS => 'ARRAY', EXCLUDE_EXT => 'ARRAY', EXE_FILES => 'ARRAY', FUNCLIST => 'ARRAY', H => 'ARRAY', IMPORTS => 'HASH', INCLUDE_EXT => 'ARRAY', LIBS => 'ARRAY', # ignore '' MAN1PODS => 'HASH', MAN3PODS => 'HASH', META_ADD => 'HASH', META_MERGE => 'HASH', PL_FILES => 'HASH', PM => 'HASH', PMLIBDIRS => 'ARRAY', PMLIBPARENTDIRS => 'ARRAY', PREREQ_PM => 'HASH', CONFIGURE_REQUIRES => 'HASH', SKIP => 'ARRAY', TYPEMAPS => 'ARRAY', XS => 'HASH', # VERSION => ['version',''], # ignore # _KEEP_AFTER_FLUSH => '', clean => 'HASH', depend => 'HASH', dist => 'HASH', dynamic_lib=> 'HASH', linkext => 'HASH', macro => 'HASH', postamble => 'HASH', realclean => 'HASH', test => 'HASH', tool_autosplit => 'HASH', # special cases where you can use makemaker_append CCFLAGS => 'APPENDABLE', DEFINE => 'APPENDABLE', INC => 'APPENDABLE', LDDLFLAGS => 'APPENDABLE', LDFROM => 'APPENDABLE', ); sub makemaker_args { my ($self, %new_args) = @_; my $args = ( $self->{makemaker_args} ||= {} ); foreach my $key (keys %new_args) { if ($makemaker_argtype{$key}) { if ($makemaker_argtype{$key} eq 'ARRAY') { $args->{$key} = [] unless defined $args->{$key}; unless (ref $args->{$key} eq 'ARRAY') { $args->{$key} = [$args->{$key}] } push @{$args->{$key}}, ref $new_args{$key} eq 'ARRAY' ? @{$new_args{$key}} : $new_args{$key}; } elsif ($makemaker_argtype{$key} eq 'HASH') { $args->{$key} = {} unless defined $args->{$key}; foreach my $skey (keys %{ $new_args{$key} }) { $args->{$key}{$skey} = $new_args{$key}{$skey}; } } elsif ($makemaker_argtype{$key} eq 'APPENDABLE') { $self->makemaker_append($key => $new_args{$key}); } } else { if (defined $args->{$key}) { warn qq{MakeMaker attribute "$key" is overriden; use "makemaker_append" to append values\n}; } $args->{$key} = $new_args{$key}; } } return $args; } # For mm args that take multiple space-separated args, # append an argument to the current list. sub makemaker_append { my $self = shift; my $name = shift; my $args = $self->makemaker_args; $args->{$name} = defined $args->{$name} ? join( ' ', $args->{$name}, @_ ) : join( ' ', @_ ); } sub build_subdirs { my $self = shift; my $subdirs = $self->makemaker_args->{DIR} ||= []; for my $subdir (@_) { push @$subdirs, $subdir; } } sub clean_files { my $self = shift; my $clean = $self->makemaker_args->{clean} ||= {}; %$clean = ( %$clean, FILES => join ' ', grep { length $_ } ($clean->{FILES} || (), @_), ); } sub realclean_files { my $self = shift; my $realclean = $self->makemaker_args->{realclean} ||= {}; %$realclean = ( %$realclean, FILES => join ' ', grep { length $_ } ($realclean->{FILES} || (), @_), ); } sub libs { my $self = shift; my $libs = ref $_[0] ? shift : [ shift ]; $self->makemaker_args( LIBS => $libs ); } sub inc { my $self = shift; $self->makemaker_args( INC => shift ); } sub _wanted_t { } sub tests_recursive { my $self = shift; my $dir = shift || 't'; unless ( -d $dir ) { die "tests_recursive dir '$dir' does not exist"; } my %tests = map { $_ => 1 } split / /, ($self->tests || ''); require File::Find; File::Find::find( sub { /\.t$/ and -f $_ and $tests{"$File::Find::dir/*.t"} = 1 }, $dir ); $self->tests( join ' ', sort keys %tests ); } sub write { my $self = shift; die "&Makefile->write() takes no arguments\n" if @_; # Check the current Perl version my $perl_version = $self->perl_version; if ( $perl_version ) { eval "use $perl_version; 1" or die "ERROR: perl: Version $] is installed, " . "but we need version >= $perl_version"; } # Make sure we have a new enough MakeMaker require ExtUtils::MakeMaker; if ( $perl_version and $self->_cmp($perl_version, '5.006') >= 0 ) { # This previous attempted to inherit the version of # ExtUtils::MakeMaker in use by the module author, but this # was found to be untenable as some authors build releases # using future dev versions of EU:MM that nobody else has. # Instead, #toolchain suggests we use 6.59 which is the most # stable version on CPAN at time of writing and is, to quote # ribasushi, "not terminally fucked, > and tested enough". # TODO: We will now need to maintain this over time to push # the version up as new versions are released. $self->build_requires( 'ExtUtils::MakeMaker' => 6.59 ); $self->configure_requires( 'ExtUtils::MakeMaker' => 6.59 ); } else { # Allow legacy-compatibility with 5.005 by depending on the # most recent EU:MM that supported 5.005. $self->build_requires( 'ExtUtils::MakeMaker' => 6.36 ); $self->configure_requires( 'ExtUtils::MakeMaker' => 6.36 ); } # Generate the MakeMaker params my $args = $self->makemaker_args; $args->{DISTNAME} = $self->name; $args->{NAME} = $self->module_name || $self->name; $args->{NAME} =~ s/-/::/g; $args->{VERSION} = $self->version or die <<'EOT'; ERROR: Can't determine distribution version. Please specify it explicitly via 'version' in Makefile.PL, or set a valid $VERSION in a module, and provide its file path via 'version_from' (or 'all_from' if you prefer) in Makefile.PL. EOT if ( $self->tests ) { my @tests = split ' ', $self->tests; my %seen; $args->{test} = { TESTS => (join ' ', grep {!$seen{$_}++} @tests), }; } elsif ( $Module::Install::ExtraTests::use_extratests ) { # Module::Install::ExtraTests doesn't set $self->tests and does its own tests via harness. # So, just ignore our xt tests here. } elsif ( -d 'xt' and ($Module::Install::AUTHOR or $ENV{RELEASE_TESTING}) ) { $args->{test} = { TESTS => join( ' ', map { "$_/*.t" } grep { -d $_ } qw{ t xt } ), }; } if ( $] >= 5.005 ) { $args->{ABSTRACT} = $self->abstract; $args->{AUTHOR} = join ', ', @{$self->author || []}; } if ( $self->makemaker(6.10) ) { $args->{NO_META} = 1; #$args->{NO_MYMETA} = 1; } if ( $self->makemaker(6.17) and $self->sign ) { $args->{SIGN} = 1; } unless ( $self->is_admin ) { delete $args->{SIGN}; } if ( $self->makemaker(6.31) and $self->license ) { $args->{LICENSE} = $self->license; } my $prereq = ($args->{PREREQ_PM} ||= {}); %$prereq = ( %$prereq, map { @$_ } # flatten [module => version] map { @$_ } grep $_, ($self->requires) ); # Remove any reference to perl, PREREQ_PM doesn't support it delete $args->{PREREQ_PM}->{perl}; # Merge both kinds of requires into BUILD_REQUIRES my $build_prereq = ($args->{BUILD_REQUIRES} ||= {}); %$build_prereq = ( %$build_prereq, map { @$_ } # flatten [module => version] map { @$_ } grep $_, ($self->configure_requires, $self->build_requires) ); # Remove any reference to perl, BUILD_REQUIRES doesn't support it delete $args->{BUILD_REQUIRES}->{perl}; # Delete bundled dists from prereq_pm, add it to Makefile DIR my $subdirs = ($args->{DIR} || []); if ($self->bundles) { my %processed; foreach my $bundle (@{ $self->bundles }) { my ($mod_name, $dist_dir) = @$bundle; delete $prereq->{$mod_name}; $dist_dir = File::Basename::basename($dist_dir); # dir for building this module if (not exists $processed{$dist_dir}) { if (-d $dist_dir) { # List as sub-directory to be processed by make push @$subdirs, $dist_dir; } # Else do nothing: the module is already present on the system $processed{$dist_dir} = undef; } } } unless ( $self->makemaker('6.55_03') ) { %$prereq = (%$prereq,%$build_prereq); delete $args->{BUILD_REQUIRES}; } if ( my $perl_version = $self->perl_version ) { eval "use $perl_version; 1" or die "ERROR: perl: Version $] is installed, " . "but we need version >= $perl_version"; if ( $self->makemaker(6.48) ) { $args->{MIN_PERL_VERSION} = $perl_version; } } if ($self->installdirs) { warn qq{old INSTALLDIRS (probably set by makemaker_args) is overriden by installdirs\n} if $args->{INSTALLDIRS}; $args->{INSTALLDIRS} = $self->installdirs; } my %args = map { ( $_ => $args->{$_} ) } grep {defined($args->{$_} ) } keys %$args; my $user_preop = delete $args{dist}->{PREOP}; if ( my $preop = $self->admin->preop($user_preop) ) { foreach my $key ( keys %$preop ) { $args{dist}->{$key} = $preop->{$key}; } } my $mm = ExtUtils::MakeMaker::WriteMakefile(%args); $self->fix_up_makefile($mm->{FIRST_MAKEFILE} || 'Makefile'); } sub fix_up_makefile { my $self = shift; my $makefile_name = shift; my $top_class = ref($self->_top) || ''; my $top_version = $self->_top->VERSION || ''; my $preamble = $self->preamble ? "# Preamble by $top_class $top_version\n" . $self->preamble : ''; my $postamble = "# Postamble by $top_class $top_version\n" . ($self->postamble || ''); local *MAKEFILE; open MAKEFILE, "+< $makefile_name" or die "fix_up_makefile: Couldn't open $makefile_name: $!"; eval { flock MAKEFILE, LOCK_EX }; my $makefile = do { local $/; }; $makefile =~ s/\b(test_harness\(\$\(TEST_VERBOSE\), )/$1'inc', /; $makefile =~ s/( -I\$\(INST_ARCHLIB\))/ -Iinc$1/g; $makefile =~ s/( "-I\$\(INST_LIB\)")/ "-Iinc"$1/g; $makefile =~ s/^(FULLPERL = .*)/$1 "-Iinc"/m; $makefile =~ s/^(PERL = .*)/$1 "-Iinc"/m; # Module::Install will never be used to build the Core Perl # Sometimes PERL_LIB and PERL_ARCHLIB get written anyway, which breaks # PREFIX/PERL5LIB, and thus, install_share. Blank them if they exist $makefile =~ s/^PERL_LIB = .+/PERL_LIB =/m; #$makefile =~ s/^PERL_ARCHLIB = .+/PERL_ARCHLIB =/m; # Perl 5.005 mentions PERL_LIB explicitly, so we have to remove that as well. $makefile =~ s/(\"?)-I\$\(PERL_LIB\)\1//g; # XXX - This is currently unused; not sure if it breaks other MM-users # $makefile =~ s/^pm_to_blib\s+:\s+/pm_to_blib :: /mg; seek MAKEFILE, 0, SEEK_SET; truncate MAKEFILE, 0; print MAKEFILE "$preamble$makefile$postamble" or die $!; close MAKEFILE or die $!; 1; } sub preamble { my ($self, $text) = @_; $self->{preamble} = $text . $self->{preamble} if defined $text; $self->{preamble}; } sub postamble { my ($self, $text) = @_; $self->{postamble} ||= $self->admin->postamble; $self->{postamble} .= $text if defined $text; $self->{postamble} } 1; __END__ #line 544 Grinder-0.5.4/inc/Module/Install/Fetch.pm0000644000175000017500000000462712647202456020346 0ustar floflooofloflooo#line 1 package Module::Install::Fetch; use strict; use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } sub get_file { my ($self, %args) = @_; my ($scheme, $host, $path, $file) = $args{url} =~ m|^(\w+)://([^/]+)(.+)/(.+)| or return; if ( $scheme eq 'http' and ! eval { require LWP::Simple; 1 } ) { $args{url} = $args{ftp_url} or (warn("LWP support unavailable!\n"), return); ($scheme, $host, $path, $file) = $args{url} =~ m|^(\w+)://([^/]+)(.+)/(.+)| or return; } $|++; print "Fetching '$file' from $host... "; unless (eval { require Socket; Socket::inet_aton($host) }) { warn "'$host' resolve failed!\n"; return; } return unless $scheme eq 'ftp' or $scheme eq 'http'; require Cwd; my $dir = Cwd::getcwd(); chdir $args{local_dir} or return if exists $args{local_dir}; if (eval { require LWP::Simple; 1 }) { LWP::Simple::mirror($args{url}, $file); } elsif (eval { require Net::FTP; 1 }) { eval { # use Net::FTP to get past firewall my $ftp = Net::FTP->new($host, Passive => 1, Timeout => 600); $ftp->login("anonymous", 'anonymous@example.com'); $ftp->cwd($path); $ftp->binary; $ftp->get($file) or (warn("$!\n"), return); $ftp->quit; } } elsif (my $ftp = $self->can_run('ftp')) { eval { # no Net::FTP, fallback to ftp.exe require FileHandle; my $fh = FileHandle->new; local $SIG{CHLD} = 'IGNORE'; unless ($fh->open("|$ftp -n")) { warn "Couldn't open ftp: $!\n"; chdir $dir; return; } my @dialog = split(/\n/, <<"END_FTP"); open $host user anonymous anonymous\@example.com cd $path binary get $file $file quit END_FTP foreach (@dialog) { $fh->print("$_\n") } $fh->close; } } else { warn "No working 'ftp' program available!\n"; chdir $dir; return; } unless (-f $file) { warn "Fetching failed: $@\n"; chdir $dir; return; } return if exists $args{size} and -s $file != $args{size}; system($args{run}) if exists $args{run}; unlink($file) if $args{remove}; print(((!exists $args{check_for} or -e $args{check_for}) ? "done!" : "failed! ($!)"), "\n"); chdir $dir; return !$?; } 1; Grinder-0.5.4/inc/Module/Install/Metadata.pm0000644000175000017500000004330212647202456021026 0ustar floflooofloflooo#line 1 package Module::Install::Metadata; use strict 'vars'; use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } my @boolean_keys = qw{ sign }; my @scalar_keys = qw{ name module_name abstract version distribution_type tests installdirs }; my @tuple_keys = qw{ configure_requires build_requires requires recommends bundles resources }; my @resource_keys = qw{ homepage bugtracker repository }; my @array_keys = qw{ keywords author }; *authors = \&author; sub Meta { shift } sub Meta_BooleanKeys { @boolean_keys } sub Meta_ScalarKeys { @scalar_keys } sub Meta_TupleKeys { @tuple_keys } sub Meta_ResourceKeys { @resource_keys } sub Meta_ArrayKeys { @array_keys } foreach my $key ( @boolean_keys ) { *$key = sub { my $self = shift; if ( defined wantarray and not @_ ) { return $self->{values}->{$key}; } $self->{values}->{$key} = ( @_ ? $_[0] : 1 ); return $self; }; } foreach my $key ( @scalar_keys ) { *$key = sub { my $self = shift; return $self->{values}->{$key} if defined wantarray and !@_; $self->{values}->{$key} = shift; return $self; }; } foreach my $key ( @array_keys ) { *$key = sub { my $self = shift; return $self->{values}->{$key} if defined wantarray and !@_; $self->{values}->{$key} ||= []; push @{$self->{values}->{$key}}, @_; return $self; }; } foreach my $key ( @resource_keys ) { *$key = sub { my $self = shift; unless ( @_ ) { return () unless $self->{values}->{resources}; return map { $_->[1] } grep { $_->[0] eq $key } @{ $self->{values}->{resources} }; } return $self->{values}->{resources}->{$key} unless @_; my $uri = shift or die( "Did not provide a value to $key()" ); $self->resources( $key => $uri ); return 1; }; } foreach my $key ( grep { $_ ne "resources" } @tuple_keys) { *$key = sub { my $self = shift; return $self->{values}->{$key} unless @_; my @added; while ( @_ ) { my $module = shift or last; my $version = shift || 0; push @added, [ $module, $version ]; } push @{ $self->{values}->{$key} }, @added; return map {@$_} @added; }; } # Resource handling my %lc_resource = map { $_ => 1 } qw{ homepage license bugtracker repository }; sub resources { my $self = shift; while ( @_ ) { my $name = shift or last; my $value = shift or next; if ( $name eq lc $name and ! $lc_resource{$name} ) { die("Unsupported reserved lowercase resource '$name'"); } $self->{values}->{resources} ||= []; push @{ $self->{values}->{resources} }, [ $name, $value ]; } $self->{values}->{resources}; } # Aliases for build_requires that will have alternative # meanings in some future version of META.yml. sub test_requires { shift->build_requires(@_) } sub install_requires { shift->build_requires(@_) } # Aliases for installdirs options sub install_as_core { $_[0]->installdirs('perl') } sub install_as_cpan { $_[0]->installdirs('site') } sub install_as_site { $_[0]->installdirs('site') } sub install_as_vendor { $_[0]->installdirs('vendor') } sub dynamic_config { my $self = shift; my $value = @_ ? shift : 1; if ( $self->{values}->{dynamic_config} ) { # Once dynamic we never change to static, for safety return 0; } $self->{values}->{dynamic_config} = $value ? 1 : 0; return 1; } # Convenience command sub static_config { shift->dynamic_config(0); } sub perl_version { my $self = shift; return $self->{values}->{perl_version} unless @_; my $version = shift or die( "Did not provide a value to perl_version()" ); # Normalize the version $version = $self->_perl_version($version); # We don't support the really old versions unless ( $version >= 5.005 ) { die "Module::Install only supports 5.005 or newer (use ExtUtils::MakeMaker)\n"; } $self->{values}->{perl_version} = $version; } sub all_from { my ( $self, $file ) = @_; unless ( defined($file) ) { my $name = $self->name or die( "all_from called with no args without setting name() first" ); $file = join('/', 'lib', split(/-/, $name)) . '.pm'; $file =~ s{.*/}{} unless -e $file; unless ( -e $file ) { die("all_from cannot find $file from $name"); } } unless ( -f $file ) { die("The path '$file' does not exist, or is not a file"); } $self->{values}{all_from} = $file; # Some methods pull from POD instead of code. # If there is a matching .pod, use that instead my $pod = $file; $pod =~ s/\.pm$/.pod/i; $pod = $file unless -e $pod; # Pull the different values $self->name_from($file) unless $self->name; $self->version_from($file) unless $self->version; $self->perl_version_from($file) unless $self->perl_version; $self->author_from($pod) unless @{$self->author || []}; $self->license_from($pod) unless $self->license; $self->abstract_from($pod) unless $self->abstract; return 1; } sub provides { my $self = shift; my $provides = ( $self->{values}->{provides} ||= {} ); %$provides = (%$provides, @_) if @_; return $provides; } sub auto_provides { my $self = shift; return $self unless $self->is_admin; unless (-e 'MANIFEST') { warn "Cannot deduce auto_provides without a MANIFEST, skipping\n"; return $self; } # Avoid spurious warnings as we are not checking manifest here. local $SIG{__WARN__} = sub {1}; require ExtUtils::Manifest; local *ExtUtils::Manifest::manicheck = sub { return }; require Module::Build; my $build = Module::Build->new( dist_name => $self->name, dist_version => $self->version, license => $self->license, ); $self->provides( %{ $build->find_dist_packages || {} } ); } sub feature { my $self = shift; my $name = shift; my $features = ( $self->{values}->{features} ||= [] ); my $mods; if ( @_ == 1 and ref( $_[0] ) ) { # The user used ->feature like ->features by passing in the second # argument as a reference. Accomodate for that. $mods = $_[0]; } else { $mods = \@_; } my $count = 0; push @$features, ( $name => [ map { ref($_) ? ( ref($_) eq 'HASH' ) ? %$_ : @$_ : $_ } @$mods ] ); return @$features; } sub features { my $self = shift; while ( my ( $name, $mods ) = splice( @_, 0, 2 ) ) { $self->feature( $name, @$mods ); } return $self->{values}->{features} ? @{ $self->{values}->{features} } : (); } sub no_index { my $self = shift; my $type = shift; push @{ $self->{values}->{no_index}->{$type} }, @_ if $type; return $self->{values}->{no_index}; } sub read { my $self = shift; $self->include_deps( 'YAML::Tiny', 0 ); require YAML::Tiny; my $data = YAML::Tiny::LoadFile('META.yml'); # Call methods explicitly in case user has already set some values. while ( my ( $key, $value ) = each %$data ) { next unless $self->can($key); if ( ref $value eq 'HASH' ) { while ( my ( $module, $version ) = each %$value ) { $self->can($key)->($self, $module => $version ); } } else { $self->can($key)->($self, $value); } } return $self; } sub write { my $self = shift; return $self unless $self->is_admin; $self->admin->write_meta; return $self; } sub version_from { require ExtUtils::MM_Unix; my ( $self, $file ) = @_; $self->version( ExtUtils::MM_Unix->parse_version($file) ); # for version integrity check $self->makemaker_args( VERSION_FROM => $file ); } sub abstract_from { require ExtUtils::MM_Unix; my ( $self, $file ) = @_; $self->abstract( bless( { DISTNAME => $self->name }, 'ExtUtils::MM_Unix' )->parse_abstract($file) ); } # Add both distribution and module name sub name_from { my ($self, $file) = @_; if ( Module::Install::_read($file) =~ m/ ^ \s* package \s* ([\w:]+) [\s|;]* /ixms ) { my ($name, $module_name) = ($1, $1); $name =~ s{::}{-}g; $self->name($name); unless ( $self->module_name ) { $self->module_name($module_name); } } else { die("Cannot determine name from $file\n"); } } sub _extract_perl_version { if ( $_[0] =~ m/ ^\s* (?:use|require) \s* v? ([\d_\.]+) \s* ; /ixms ) { my $perl_version = $1; $perl_version =~ s{_}{}g; return $perl_version; } else { return; } } sub perl_version_from { my $self = shift; my $perl_version=_extract_perl_version(Module::Install::_read($_[0])); if ($perl_version) { $self->perl_version($perl_version); } else { warn "Cannot determine perl version info from $_[0]\n"; return; } } sub author_from { my $self = shift; my $content = Module::Install::_read($_[0]); if ($content =~ m/ =head \d \s+ (?:authors?)\b \s* ([^\n]*) | =head \d \s+ (?:licen[cs]e|licensing|copyright|legal)\b \s* .*? copyright .*? \d\d\d[\d.]+ \s* (?:\bby\b)? \s* ([^\n]*) /ixms) { my $author = $1 || $2; # XXX: ugly but should work anyway... if (eval "require Pod::Escapes; 1") { # Pod::Escapes has a mapping table. # It's in core of perl >= 5.9.3, and should be installed # as one of the Pod::Simple's prereqs, which is a prereq # of Pod::Text 3.x (see also below). $author =~ s{ E<( (\d+) | ([A-Za-z]+) )> } { defined $2 ? chr($2) : defined $Pod::Escapes::Name2character_number{$1} ? chr($Pod::Escapes::Name2character_number{$1}) : do { warn "Unknown escape: E<$1>"; "E<$1>"; }; }gex; } elsif (eval "require Pod::Text; 1" && $Pod::Text::VERSION < 3) { # Pod::Text < 3.0 has yet another mapping table, # though the table name of 2.x and 1.x are different. # (1.x is in core of Perl < 5.6, 2.x is in core of # Perl < 5.9.3) my $mapping = ($Pod::Text::VERSION < 2) ? \%Pod::Text::HTML_Escapes : \%Pod::Text::ESCAPES; $author =~ s{ E<( (\d+) | ([A-Za-z]+) )> } { defined $2 ? chr($2) : defined $mapping->{$1} ? $mapping->{$1} : do { warn "Unknown escape: E<$1>"; "E<$1>"; }; }gex; } else { $author =~ s{E}{<}g; $author =~ s{E}{>}g; } $self->author($author); } else { warn "Cannot determine author info from $_[0]\n"; } } #Stolen from M::B my %license_urls = ( perl => 'http://dev.perl.org/licenses/', apache => 'http://apache.org/licenses/LICENSE-2.0', apache_1_1 => 'http://apache.org/licenses/LICENSE-1.1', artistic => 'http://opensource.org/licenses/artistic-license.php', artistic_2 => 'http://opensource.org/licenses/artistic-license-2.0.php', lgpl => 'http://opensource.org/licenses/lgpl-license.php', lgpl2 => 'http://opensource.org/licenses/lgpl-2.1.php', lgpl3 => 'http://opensource.org/licenses/lgpl-3.0.html', bsd => 'http://opensource.org/licenses/bsd-license.php', gpl => 'http://opensource.org/licenses/gpl-license.php', gpl2 => 'http://opensource.org/licenses/gpl-2.0.php', gpl3 => 'http://opensource.org/licenses/gpl-3.0.html', mit => 'http://opensource.org/licenses/mit-license.php', mozilla => 'http://opensource.org/licenses/mozilla1.1.php', open_source => undef, unrestricted => undef, restrictive => undef, unknown => undef, ); sub license { my $self = shift; return $self->{values}->{license} unless @_; my $license = shift or die( 'Did not provide a value to license()' ); $license = __extract_license($license) || lc $license; $self->{values}->{license} = $license; # Automatically fill in license URLs if ( $license_urls{$license} ) { $self->resources( license => $license_urls{$license} ); } return 1; } sub _extract_license { my $pod = shift; my $matched; return __extract_license( ($matched) = $pod =~ m/ (=head \d \s+ L(?i:ICEN[CS]E|ICENSING)\b.*?) (=head \d.*|=cut.*|)\z /xms ) || __extract_license( ($matched) = $pod =~ m/ (=head \d \s+ (?:C(?i:OPYRIGHTS?)|L(?i:EGAL))\b.*?) (=head \d.*|=cut.*|)\z /xms ); } sub __extract_license { my $license_text = shift or return; my @phrases = ( '(?:under )?the same (?:terms|license) as (?:perl|the perl (?:\d )?programming language)' => 'perl', 1, '(?:under )?the terms of (?:perl|the perl programming language) itself' => 'perl', 1, 'Artistic and GPL' => 'perl', 1, 'GNU general public license' => 'gpl', 1, 'GNU public license' => 'gpl', 1, 'GNU lesser general public license' => 'lgpl', 1, 'GNU lesser public license' => 'lgpl', 1, 'GNU library general public license' => 'lgpl', 1, 'GNU library public license' => 'lgpl', 1, 'GNU Free Documentation license' => 'unrestricted', 1, 'GNU Affero General Public License' => 'open_source', 1, '(?:Free)?BSD license' => 'bsd', 1, 'Artistic license 2\.0' => 'artistic_2', 1, 'Artistic license' => 'artistic', 1, 'Apache (?:Software )?license' => 'apache', 1, 'GPL' => 'gpl', 1, 'LGPL' => 'lgpl', 1, 'BSD' => 'bsd', 1, 'Artistic' => 'artistic', 1, 'MIT' => 'mit', 1, 'Mozilla Public License' => 'mozilla', 1, 'Q Public License' => 'open_source', 1, 'OpenSSL License' => 'unrestricted', 1, 'SSLeay License' => 'unrestricted', 1, 'zlib License' => 'open_source', 1, 'proprietary' => 'proprietary', 0, ); while ( my ($pattern, $license, $osi) = splice(@phrases, 0, 3) ) { $pattern =~ s#\s+#\\s+#gs; if ( $license_text =~ /\b$pattern\b/i ) { return $license; } } return ''; } sub license_from { my $self = shift; if (my $license=_extract_license(Module::Install::_read($_[0]))) { $self->license($license); } else { warn "Cannot determine license info from $_[0]\n"; return 'unknown'; } } sub _extract_bugtracker { my @links = $_[0] =~ m#L<( https?\Q://rt.cpan.org/\E[^>]+| https?\Q://github.com/\E[\w_]+/[\w_]+/issues| https?\Q://code.google.com/p/\E[\w_\-]+/issues/list )>#gx; my %links; @links{@links}=(); @links=keys %links; return @links; } sub bugtracker_from { my $self = shift; my $content = Module::Install::_read($_[0]); my @links = _extract_bugtracker($content); unless ( @links ) { warn "Cannot determine bugtracker info from $_[0]\n"; return 0; } if ( @links > 1 ) { warn "Found more than one bugtracker link in $_[0]\n"; return 0; } # Set the bugtracker bugtracker( $links[0] ); return 1; } sub requires_from { my $self = shift; my $content = Module::Install::_readperl($_[0]); my @requires = $content =~ m/^use\s+([^\W\d]\w*(?:::\w+)*)\s+(v?[\d\.]+)/mg; while ( @requires ) { my $module = shift @requires; my $version = shift @requires; $self->requires( $module => $version ); } } sub test_requires_from { my $self = shift; my $content = Module::Install::_readperl($_[0]); my @requires = $content =~ m/^use\s+([^\W\d]\w*(?:::\w+)*)\s+([\d\.]+)/mg; while ( @requires ) { my $module = shift @requires; my $version = shift @requires; $self->test_requires( $module => $version ); } } # Convert triple-part versions (eg, 5.6.1 or 5.8.9) to # numbers (eg, 5.006001 or 5.008009). # Also, convert double-part versions (eg, 5.8) sub _perl_version { my $v = $_[-1]; $v =~ s/^([1-9])\.([1-9]\d?\d?)$/sprintf("%d.%03d",$1,$2)/e; $v =~ s/^([1-9])\.([1-9]\d?\d?)\.(0|[1-9]\d?\d?)$/sprintf("%d.%03d%03d",$1,$2,$3 || 0)/e; $v =~ s/(\.\d\d\d)000$/$1/; $v =~ s/_.+$//; if ( ref($v) ) { # Numify $v = $v + 0; } return $v; } sub add_metadata { my $self = shift; my %hash = @_; for my $key (keys %hash) { warn "add_metadata: $key is not prefixed with 'x_'.\n" . "Use appopriate function to add non-private metadata.\n" unless $key =~ /^x_/; $self->{values}->{$key} = $hash{$key}; } } ###################################################################### # MYMETA Support sub WriteMyMeta { die "WriteMyMeta has been deprecated"; } sub write_mymeta_yaml { my $self = shift; # We need YAML::Tiny to write the MYMETA.yml file unless ( eval { require YAML::Tiny; 1; } ) { return 1; } # Generate the data my $meta = $self->_write_mymeta_data or return 1; # Save as the MYMETA.yml file print "Writing MYMETA.yml\n"; YAML::Tiny::DumpFile('MYMETA.yml', $meta); } sub write_mymeta_json { my $self = shift; # We need JSON to write the MYMETA.json file unless ( eval { require JSON; 1; } ) { return 1; } # Generate the data my $meta = $self->_write_mymeta_data or return 1; # Save as the MYMETA.yml file print "Writing MYMETA.json\n"; Module::Install::_write( 'MYMETA.json', JSON->new->pretty(1)->canonical->encode($meta), ); } sub _write_mymeta_data { my $self = shift; # If there's no existing META.yml there is nothing we can do return undef unless -f 'META.yml'; # We need Parse::CPAN::Meta to load the file unless ( eval { require Parse::CPAN::Meta; 1; } ) { return undef; } # Merge the perl version into the dependencies my $val = $self->Meta->{values}; my $perl = delete $val->{perl_version}; if ( $perl ) { $val->{requires} ||= []; my $requires = $val->{requires}; # Canonize to three-dot version after Perl 5.6 if ( $perl >= 5.006 ) { $perl =~ s{^(\d+)\.(\d\d\d)(\d*)}{join('.', $1, int($2||0), int($3||0))}e } unshift @$requires, [ perl => $perl ]; } # Load the advisory META.yml file my @yaml = Parse::CPAN::Meta::LoadFile('META.yml'); my $meta = $yaml[0]; # Overwrite the non-configure dependency hashes delete $meta->{requires}; delete $meta->{build_requires}; delete $meta->{recommends}; if ( exists $val->{requires} ) { $meta->{requires} = { map { @$_ } @{ $val->{requires} } }; } if ( exists $val->{build_requires} ) { $meta->{build_requires} = { map { @$_ } @{ $val->{build_requires} } }; } return $meta; } 1; Grinder-0.5.4/inc/Module/Install/AutoLicense.pm0000644000175000017500000000316612647202457021526 0ustar floflooofloflooo#line 1 package Module::Install::AutoLicense; use strict; use warnings; use base qw(Module::Install::Base); use vars qw($VERSION); $VERSION = '0.08'; my %licenses = ( perl => 'Software::License::Perl_5', apache => 'Software::License::Apache_2_0', artistic => 'Software::License::Artistic_1_0', artistic_2 => 'Software::License::Artistic_2_0', lgpl2 => 'Software::License::LGPL_2_1', lgpl3 => 'Software::License::LGPL_3_0', bsd => 'Software::License::BSD', gpl => 'Software::License::GPL_1', gpl2 => 'Software::License::GPL_2', gpl3 => 'Software::License::GPL_3', mit => 'Software::License::MIT', mozilla => 'Software::License::Mozilla_1_1', ); sub auto_license { my $self = shift; return unless $Module::Install::AUTHOR; my %opts = @_; $opts{lc $_} = delete $opts{$_} for keys %opts; my $holder = $opts{holder} || _get_authors( $self ); #my $holder = $opts{holder} || $self->author; my $license = $self->license(); unless ( defined $licenses{ $license } ) { warn "No license definition for '$license', aborting\n"; return 1; } my $class = $licenses{ $license }; eval "require $class"; my $sl = $class->new( { holder => $holder } ); open LICENSE, '>LICENSE' or die "$!\n"; print LICENSE $sl->fulltext; close LICENSE; $self->postamble(<<"END"); distclean :: license_clean license_clean: \t\$(RM_F) LICENSE END return 1; } sub _get_authors { my $self = shift; my $joined = join ', ', @{ $self->author() || [] }; return $joined; } 'Licensed to auto'; __END__ #line 125 Grinder-0.5.4/inc/Module/Install/Base.pm0000644000175000017500000000214712647202456020162 0ustar floflooofloflooo#line 1 package Module::Install::Base; use strict 'vars'; use vars qw{$VERSION}; BEGIN { $VERSION = '1.16'; } # Suspend handler for "redefined" warnings BEGIN { my $w = $SIG{__WARN__}; $SIG{__WARN__} = sub { $w }; } #line 42 sub new { my $class = shift; unless ( defined &{"${class}::call"} ) { *{"${class}::call"} = sub { shift->_top->call(@_) }; } unless ( defined &{"${class}::load"} ) { *{"${class}::load"} = sub { shift->_top->load(@_) }; } bless { @_ }, $class; } #line 61 sub AUTOLOAD { local $@; my $func = eval { shift->_top->autoload } or return; goto &$func; } #line 75 sub _top { $_[0]->{_top}; } #line 90 sub admin { $_[0]->_top->{admin} or Module::Install::Base::FakeAdmin->new; } #line 106 sub is_admin { ! $_[0]->admin->isa('Module::Install::Base::FakeAdmin'); } sub DESTROY {} package Module::Install::Base::FakeAdmin; use vars qw{$VERSION}; BEGIN { $VERSION = $Module::Install::Base::VERSION; } my $fake; sub new { $fake ||= bless(\@_, $_[0]); } sub AUTOLOAD {} sub DESTROY {} # Restore warning handler BEGIN { $SIG{__WARN__} = $SIG{__WARN__}->(); } 1; #line 159 Grinder-0.5.4/inc/Module/Install/PodFromEuclid.pm0000644000175000017500000000164712647202457022011 0ustar floflooofloflooo#line 1 package Module::Install::PodFromEuclid; #line 72 use 5.006; use strict; use warnings; use File::Spec; use Env qw(@INC); use base qw(Module::Install::Base); our $VERSION = '0.01'; sub pod_from { my ($self, $in_file) = @_; return unless $self->is_admin; if (not defined $in_file) { $in_file = $self->_all_from or die "Error: Could not determine file to make pod_from"; } my @inc = map { ( '-I', File::Spec->rel2abs($_) ) } @INC; # use same -I included modules as caller my @args = ($^X, @inc, $in_file, '--podfile'); system(@args) == 0 or die "Error: Could not run command ".join(' ',@args).": $?\n"; return 1; } sub _all_from { my $self = shift; return unless $self->admin->{extensions}; my ($metadata) = grep { ref($_) eq 'Module::Install::Metadata'; } @{$self->admin->{extensions}}; return unless $metadata; return $metadata->{values}{all_from} || ''; } 1; Grinder-0.5.4/inc/Module/Install/Scripts.pm0000644000175000017500000000101112647202456020724 0ustar floflooofloflooo#line 1 package Module::Install::Scripts; use strict 'vars'; use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } sub install_script { my $self = shift; my $args = $self->makemaker_args; my $exe = $args->{EXE_FILES} ||= []; foreach ( @_ ) { if ( -f $_ ) { push @$exe, $_; } elsif ( -d 'script' and -f "script/$_" ) { push @$exe, "script/$_"; } else { die("Cannot find script '$_'"); } } } 1; Grinder-0.5.4/inc/Module/Install/Can.pm0000644000175000017500000000615712647202456020016 0ustar floflooofloflooo#line 1 package Module::Install::Can; use strict; use Config (); use ExtUtils::MakeMaker (); use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } # check if we can load some module ### Upgrade this to not have to load the module if possible sub can_use { my ($self, $mod, $ver) = @_; $mod =~ s{::|\\}{/}g; $mod .= '.pm' unless $mod =~ /\.pm$/i; my $pkg = $mod; $pkg =~ s{/}{::}g; $pkg =~ s{\.pm$}{}i; local $@; eval { require $mod; $pkg->VERSION($ver || 0); 1 }; } # Check if we can run some command sub can_run { my ($self, $cmd) = @_; my $_cmd = $cmd; return $_cmd if (-x $_cmd or $_cmd = MM->maybe_command($_cmd)); for my $dir ((split /$Config::Config{path_sep}/, $ENV{PATH}), '.') { next if $dir eq ''; require File::Spec; my $abs = File::Spec->catfile($dir, $cmd); return $abs if (-x $abs or $abs = MM->maybe_command($abs)); } return; } # Can our C compiler environment build XS files sub can_xs { my $self = shift; # Ensure we have the CBuilder module $self->configure_requires( 'ExtUtils::CBuilder' => 0.27 ); # Do we have the configure_requires checker? local $@; eval "require ExtUtils::CBuilder;"; if ( $@ ) { # They don't obey configure_requires, so it is # someone old and delicate. Try to avoid hurting # them by falling back to an older simpler test. return $self->can_cc(); } # Do we have a working C compiler my $builder = ExtUtils::CBuilder->new( quiet => 1, ); unless ( $builder->have_compiler ) { # No working C compiler return 0; } # Write a C file representative of what XS becomes require File::Temp; my ( $FH, $tmpfile ) = File::Temp::tempfile( "compilexs-XXXXX", SUFFIX => '.c', ); binmode $FH; print $FH <<'END_C'; #include "EXTERN.h" #include "perl.h" #include "XSUB.h" int main(int argc, char **argv) { return 0; } int boot_sanexs() { return 1; } END_C close $FH; # Can the C compiler access the same headers XS does my @libs = (); my $object = undef; eval { local $^W = 0; $object = $builder->compile( source => $tmpfile, ); @libs = $builder->link( objects => $object, module_name => 'sanexs', ); }; my $result = $@ ? 0 : 1; # Clean up all the build files foreach ( $tmpfile, $object, @libs ) { next unless defined $_; 1 while unlink; } return $result; } # Can we locate a (the) C compiler sub can_cc { my $self = shift; my @chunks = split(/ /, $Config::Config{cc}) or return; # $Config{cc} may contain args; try to find out the program part while (@chunks) { return $self->can_run("@chunks") || (pop(@chunks), next); } return; } # Fix Cygwin bug on maybe_command(); if ( $^O eq 'cygwin' ) { require ExtUtils::MM_Cygwin; require ExtUtils::MM_Win32; if ( ! defined(&ExtUtils::MM_Cygwin::maybe_command) ) { *ExtUtils::MM_Cygwin::maybe_command = sub { my ($self, $file) = @_; if ($file =~ m{^/cygdrive/}i and ExtUtils::MM_Win32->can('maybe_command')) { ExtUtils::MM_Win32->maybe_command($file); } else { ExtUtils::MM_Unix->maybe_command($file); } } } } 1; __END__ #line 236 Grinder-0.5.4/inc/Module/Install/Win32.pm0000644000175000017500000000340312647202456020206 0ustar floflooofloflooo#line 1 package Module::Install::Win32; use strict; use Module::Install::Base (); use vars qw{$VERSION @ISA $ISCORE}; BEGIN { $VERSION = '1.16'; @ISA = 'Module::Install::Base'; $ISCORE = 1; } # determine if the user needs nmake, and download it if needed sub check_nmake { my $self = shift; $self->load('can_run'); $self->load('get_file'); require Config; return unless ( $^O eq 'MSWin32' and $Config::Config{make} and $Config::Config{make} =~ /^nmake\b/i and ! $self->can_run('nmake') ); print "The required 'nmake' executable not found, fetching it...\n"; require File::Basename; my $rv = $self->get_file( url => 'http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/Nmake15.exe', ftp_url => 'ftp://ftp.microsoft.com/Softlib/MSLFILES/Nmake15.exe', local_dir => File::Basename::dirname($^X), size => 51928, run => 'Nmake15.exe /o > nul', check_for => 'Nmake.exe', remove => 1, ); die <<'END_MESSAGE' unless $rv; ------------------------------------------------------------------------------- Since you are using Microsoft Windows, you will need the 'nmake' utility before installation. It's available at: http://download.microsoft.com/download/vc15/Patch/1.52/W95/EN-US/Nmake15.exe or ftp://ftp.microsoft.com/Softlib/MSLFILES/Nmake15.exe Please download the file manually, save it to a directory in %PATH% (e.g. C:\WINDOWS\COMMAND\), then launch the MS-DOS command line shell, "cd" to that directory, and run "Nmake15.exe" from there; that will create the 'nmake.exe' file needed by this module. You may then resume the installation process described in README. ------------------------------------------------------------------------------- END_MESSAGE } 1; Grinder-0.5.4/inc/Module/Install.pm0000644000175000017500000003021712647202456017307 0ustar floflooofloflooo#line 1 package Module::Install; # For any maintainers: # The load order for Module::Install is a bit magic. # It goes something like this... # # IF ( host has Module::Install installed, creating author mode ) { # 1. Makefile.PL calls "use inc::Module::Install" # 2. $INC{inc/Module/Install.pm} set to installed version of inc::Module::Install # 3. The installed version of inc::Module::Install loads # 4. inc::Module::Install calls "require Module::Install" # 5. The ./inc/ version of Module::Install loads # } ELSE { # 1. Makefile.PL calls "use inc::Module::Install" # 2. $INC{inc/Module/Install.pm} set to ./inc/ version of Module::Install # 3. The ./inc/ version of Module::Install loads # } use 5.006; use strict 'vars'; use Cwd (); use File::Find (); use File::Path (); use vars qw{$VERSION $MAIN}; BEGIN { # All Module::Install core packages now require synchronised versions. # This will be used to ensure we don't accidentally load old or # different versions of modules. # This is not enforced yet, but will be some time in the next few # releases once we can make sure it won't clash with custom # Module::Install extensions. $VERSION = '1.16'; # Storage for the pseudo-singleton $MAIN = undef; *inc::Module::Install::VERSION = *VERSION; @inc::Module::Install::ISA = __PACKAGE__; } sub import { my $class = shift; my $self = $class->new(@_); my $who = $self->_caller; #------------------------------------------------------------- # all of the following checks should be included in import(), # to allow "eval 'require Module::Install; 1' to test # installation of Module::Install. (RT #51267) #------------------------------------------------------------- # Whether or not inc::Module::Install is actually loaded, the # $INC{inc/Module/Install.pm} is what will still get set as long as # the caller loaded module this in the documented manner. # If not set, the caller may NOT have loaded the bundled version, and thus # they may not have a MI version that works with the Makefile.PL. This would # result in false errors or unexpected behaviour. And we don't want that. my $file = join( '/', 'inc', split /::/, __PACKAGE__ ) . '.pm'; unless ( $INC{$file} ) { die <<"END_DIE" } Please invoke ${\__PACKAGE__} with: use inc::${\__PACKAGE__}; not: use ${\__PACKAGE__}; END_DIE # This reportedly fixes a rare Win32 UTC file time issue, but # as this is a non-cross-platform XS module not in the core, # we shouldn't really depend on it. See RT #24194 for detail. # (Also, this module only supports Perl 5.6 and above). eval "use Win32::UTCFileTime" if $^O eq 'MSWin32' && $] >= 5.006; # If the script that is loading Module::Install is from the future, # then make will detect this and cause it to re-run over and over # again. This is bad. Rather than taking action to touch it (which # is unreliable on some platforms and requires write permissions) # for now we should catch this and refuse to run. if ( -f $0 ) { my $s = (stat($0))[9]; # If the modification time is only slightly in the future, # sleep briefly to remove the problem. my $a = $s - time; if ( $a > 0 and $a < 5 ) { sleep 5 } # Too far in the future, throw an error. my $t = time; if ( $s > $t ) { die <<"END_DIE" } Your installer $0 has a modification time in the future ($s > $t). This is known to create infinite loops in make. Please correct this, then run $0 again. END_DIE } # Build.PL was formerly supported, but no longer is due to excessive # difficulty in implementing every single feature twice. if ( $0 =~ /Build.PL$/i ) { die <<"END_DIE" } Module::Install no longer supports Build.PL. It was impossible to maintain duel backends, and has been deprecated. Please remove all Build.PL files and only use the Makefile.PL installer. END_DIE #------------------------------------------------------------- # To save some more typing in Module::Install installers, every... # use inc::Module::Install # ...also acts as an implicit use strict. $^H |= strict::bits(qw(refs subs vars)); #------------------------------------------------------------- unless ( -f $self->{file} ) { foreach my $key (keys %INC) { delete $INC{$key} if $key =~ /Module\/Install/; } local $^W; require "$self->{path}/$self->{dispatch}.pm"; File::Path::mkpath("$self->{prefix}/$self->{author}"); $self->{admin} = "$self->{name}::$self->{dispatch}"->new( _top => $self ); $self->{admin}->init; @_ = ($class, _self => $self); goto &{"$self->{name}::import"}; } local $^W; *{"${who}::AUTOLOAD"} = $self->autoload; $self->preload; # Unregister loader and worker packages so subdirs can use them again delete $INC{'inc/Module/Install.pm'}; delete $INC{'Module/Install.pm'}; # Save to the singleton $MAIN = $self; return 1; } sub autoload { my $self = shift; my $who = $self->_caller; my $cwd = Cwd::getcwd(); my $sym = "${who}::AUTOLOAD"; $sym->{$cwd} = sub { my $pwd = Cwd::getcwd(); if ( my $code = $sym->{$pwd} ) { # Delegate back to parent dirs goto &$code unless $cwd eq $pwd; } unless ($$sym =~ s/([^:]+)$//) { # XXX: it looks like we can't retrieve the missing function # via $$sym (usually $main::AUTOLOAD) in this case. # I'm still wondering if we should slurp Makefile.PL to # get some context or not ... my ($package, $file, $line) = caller; die <<"EOT"; Unknown function is found at $file line $line. Execution of $file aborted due to runtime errors. If you're a contributor to a project, you may need to install some Module::Install extensions from CPAN (or other repository). If you're a user of a module, please contact the author. EOT } my $method = $1; if ( uc($method) eq $method ) { # Do nothing return; } elsif ( $method =~ /^_/ and $self->can($method) ) { # Dispatch to the root M:I class return $self->$method(@_); } # Dispatch to the appropriate plugin unshift @_, ( $self, $1 ); goto &{$self->can('call')}; }; } sub preload { my $self = shift; unless ( $self->{extensions} ) { $self->load_extensions( "$self->{prefix}/$self->{path}", $self ); } my @exts = @{$self->{extensions}}; unless ( @exts ) { @exts = $self->{admin}->load_all_extensions; } my %seen; foreach my $obj ( @exts ) { while (my ($method, $glob) = each %{ref($obj) . '::'}) { next unless $obj->can($method); next if $method =~ /^_/; next if $method eq uc($method); $seen{$method}++; } } my $who = $self->_caller; foreach my $name ( sort keys %seen ) { local $^W; *{"${who}::$name"} = sub { ${"${who}::AUTOLOAD"} = "${who}::$name"; goto &{"${who}::AUTOLOAD"}; }; } } sub new { my ($class, %args) = @_; delete $INC{'FindBin.pm'}; { # to suppress the redefine warning local $SIG{__WARN__} = sub {}; require FindBin; } # ignore the prefix on extension modules built from top level. my $base_path = Cwd::abs_path($FindBin::Bin); unless ( Cwd::abs_path(Cwd::getcwd()) eq $base_path ) { delete $args{prefix}; } return $args{_self} if $args{_self}; $args{dispatch} ||= 'Admin'; $args{prefix} ||= 'inc'; $args{author} ||= ($^O eq 'VMS' ? '_author' : '.author'); $args{bundle} ||= 'inc/BUNDLES'; $args{base} ||= $base_path; $class =~ s/^\Q$args{prefix}\E:://; $args{name} ||= $class; $args{version} ||= $class->VERSION; unless ( $args{path} ) { $args{path} = $args{name}; $args{path} =~ s!::!/!g; } $args{file} ||= "$args{base}/$args{prefix}/$args{path}.pm"; $args{wrote} = 0; bless( \%args, $class ); } sub call { my ($self, $method) = @_; my $obj = $self->load($method) or return; splice(@_, 0, 2, $obj); goto &{$obj->can($method)}; } sub load { my ($self, $method) = @_; $self->load_extensions( "$self->{prefix}/$self->{path}", $self ) unless $self->{extensions}; foreach my $obj (@{$self->{extensions}}) { return $obj if $obj->can($method); } my $admin = $self->{admin} or die <<"END_DIE"; The '$method' method does not exist in the '$self->{prefix}' path! Please remove the '$self->{prefix}' directory and run $0 again to load it. END_DIE my $obj = $admin->load($method, 1); push @{$self->{extensions}}, $obj; $obj; } sub load_extensions { my ($self, $path, $top) = @_; my $should_reload = 0; unless ( grep { ! ref $_ and lc $_ eq lc $self->{prefix} } @INC ) { unshift @INC, $self->{prefix}; $should_reload = 1; } foreach my $rv ( $self->find_extensions($path) ) { my ($file, $pkg) = @{$rv}; next if $self->{pathnames}{$pkg}; local $@; my $new = eval { local $^W; require $file; $pkg->can('new') }; unless ( $new ) { warn $@ if $@; next; } $self->{pathnames}{$pkg} = $should_reload ? delete $INC{$file} : $INC{$file}; push @{$self->{extensions}}, &{$new}($pkg, _top => $top ); } $self->{extensions} ||= []; } sub find_extensions { my ($self, $path) = @_; my @found; File::Find::find( sub { my $file = $File::Find::name; return unless $file =~ m!^\Q$path\E/(.+)\.pm\Z!is; my $subpath = $1; return if lc($subpath) eq lc($self->{dispatch}); $file = "$self->{path}/$subpath.pm"; my $pkg = "$self->{name}::$subpath"; $pkg =~ s!/!::!g; # If we have a mixed-case package name, assume case has been preserved # correctly. Otherwise, root through the file to locate the case-preserved # version of the package name. if ( $subpath eq lc($subpath) || $subpath eq uc($subpath) ) { my $content = Module::Install::_read($subpath . '.pm'); my $in_pod = 0; foreach ( split /\n/, $content ) { $in_pod = 1 if /^=\w/; $in_pod = 0 if /^=cut/; next if ($in_pod || /^=cut/); # skip pod text next if /^\s*#/; # and comments if ( m/^\s*package\s+($pkg)\s*;/i ) { $pkg = $1; last; } } } push @found, [ $file, $pkg ]; }, $path ) if -d $path; @found; } ##################################################################### # Common Utility Functions sub _caller { my $depth = 0; my $call = caller($depth); while ( $call eq __PACKAGE__ ) { $depth++; $call = caller($depth); } return $call; } # Done in evals to avoid confusing Perl::MinimumVersion eval( $] >= 5.006 ? <<'END_NEW' : <<'END_OLD' ); die $@ if $@; sub _read { local *FH; open( FH, '<', $_[0] ) or die "open($_[0]): $!"; binmode FH; my $string = do { local $/; }; close FH or die "close($_[0]): $!"; return $string; } END_NEW sub _read { local *FH; open( FH, "< $_[0]" ) or die "open($_[0]): $!"; binmode FH; my $string = do { local $/; }; close FH or die "close($_[0]): $!"; return $string; } END_OLD sub _readperl { my $string = Module::Install::_read($_[0]); $string =~ s/(?:\015{1,2}\012|\015|\012)/\n/sg; $string =~ s/(\n)\n*__(?:DATA|END)__\b.*\z/$1/s; $string =~ s/\n\n=\w+.+?\n\n=cut\b.+?\n+/\n\n/sg; return $string; } sub _readpod { my $string = Module::Install::_read($_[0]); $string =~ s/(?:\015{1,2}\012|\015|\012)/\n/sg; return $string if $_[0] =~ /\.pod\z/; $string =~ s/(^|\n=cut\b.+?\n+)[^=\s].+?\n(\n=\w+|\z)/$1$2/sg; $string =~ s/\n*=pod\b[^\n]*\n+/\n\n/sg; $string =~ s/\n*=cut\b[^\n]*\n+/\n\n/sg; $string =~ s/^\n+//s; return $string; } # Done in evals to avoid confusing Perl::MinimumVersion eval( $] >= 5.006 ? <<'END_NEW' : <<'END_OLD' ); die $@ if $@; sub _write { local *FH; open( FH, '>', $_[0] ) or die "open($_[0]): $!"; binmode FH; foreach ( 1 .. $#_ ) { print FH $_[$_] or die "print($_[0]): $!"; } close FH or die "close($_[0]): $!"; } END_NEW sub _write { local *FH; open( FH, "> $_[0]" ) or die "open($_[0]): $!"; binmode FH; foreach ( 1 .. $#_ ) { print FH $_[$_] or die "print($_[0]): $!"; } close FH or die "close($_[0]): $!"; } END_OLD # _version is for processing module versions (eg, 1.03_05) not # Perl versions (eg, 5.8.1). sub _version { my $s = shift || 0; my $d =()= $s =~ /(\.)/g; if ( $d >= 2 ) { # Normalise multipart versions $s =~ s/(\.)(\d{1,3})/sprintf("$1%03d",$2)/eg; } $s =~ s/^(\d+)\.?//; my $l = $1 || 0; my @v = map { $_ . '0' x (3 - length $_) } $s =~ /(\d{1,3})\D?/g; $l = $l . '.' . join '', @v if @v; return $l + 0; } sub _cmp { _version($_[1]) <=> _version($_[2]); } # Cloned from Params::Util::_CLASS sub _CLASS { ( defined $_[0] and ! ref $_[0] and $_[0] =~ m/^[^\W\d]\w*(?:::\w+)*\z/s ) ? $_[0] : undef; } 1; # Copyright 2008 - 2012 Adam Kennedy. Grinder-0.5.4/utils/0000755000175000017500000000000012647202511014472 5ustar flofloooflofloooGrinder-0.5.4/utils/average_genome_size0000755000175000017500000001220312263016714020416 0ustar floflooofloflooo#! /usr/bin/env perl # This file is part of the Grinder package, copyright 2009-2012 # Florent Angly , under the GPLv3 license =head1 NAME average_genome_size - Calculate the average genome size (in bp) of species in a Grinder library =head1 DESCRIPTION Calculate the average genome size (in bp) of species in a Grinder library given the library composition and the full-genomes used to produce it. =head1 REQUIRED ARGUMENTS =over =item FASTA file containing the full-genomes used to produce the Grinder library. =for Euclid: db_fasta.type: readable =item Grinder rank file that describes the library composition. =for Euclid: rank_file.type: readable =back =head1 COPYRIGHT Copyright 2009-2012 Florent ANGLY Grinder is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Grinder. If not, see . =head1 BUGS All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: L Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge (L) and is under Git revision control. To get started with a patch, do: git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder =cut use strict; use warnings; use Getopt::Euclid qw( :minimal_keys ); average_genome_size($ARGV{'db_fasta'}, $ARGV{'rank_file'}); exit; sub average_genome_size { my ( $db_fasta, $rank_file ) = @_; # Calculate the average genome size of a Grinder simulated random library # Read size of the genomes my ($gen_size) = get_sequence_size($db_fasta); my $nof_gens = scalar(keys(%$gen_size)); # Read relative abundance of the genomes my $gen_rel_ab = read_rel_ab($rank_file); # Calculate average my ($avg_gen_size, $gen_size_stdev, $gen_size_stderr) = avg_genome_size($gen_rel_ab, $gen_size, $nof_gens); # Display results print "$avg_gen_size bp\n"; return 1; } sub get_sequence_size { # Get the size of sequences in a FASTA file # Input: path to FASTA file containing metagenomic sequences # Output: hashref of sequence sizes indexed by sequence ID, # number of nucleotides, # length of smallest sequence # hashref of sequence names indexed by sequence ID my $fasta = shift; my ($sizes, $nof_bp, $min_length, $names) = ({}, 0, undef, {}); my ($id, $name, $length) = (undef, '', 0); if (not -f $fasta) { die "Error: '$fasta' does not seem to be a valid file\n"; } open(FASTAIN, $fasta) || die("Error: could not read file '$fasta': $!"); while (my $line = ) { chomp $line; if ($line =~ m/^>(\S+)\s*(.*)$/) { # Save old sequence, start new sequence $id && _save_seq($sizes,\$nof_bp,\$min_length,$names,$id,$name,$length); ($id, $name, $length) = ($1, $2, 0); } elsif ($line =~ m/^\s*$/) { # Line to ignore next; } else { # Continuation of current sequence $length += length($line); } } # Save last sequence $id && _save_seq($sizes,\$nof_bp,\$min_length,$names,$id,$name,$length); close FASTAIN; return $sizes, $nof_bp, $min_length, $names; sub _save_seq { my ($sizes, $nof_bp, $min_length, $names, $id, $name, $length) = @_; $$sizes{$id} = $length; $$nof_bp += $length; $$min_length = $length if ((!defined $$min_length)||($length<$$min_length)); $$names{$id} = $name; } } sub read_rel_ab { my $rank_file = shift; open(IN, $rank_file) || die("Could not read file '$rank_file': $!"); my %rel_ab; for my $line () { if ($line =~ m/^#/) { # Comment line to ignore next; #} elsif ($line =~ m/^(\S+)\s+(\S+)\s+(\S+)$/) { } elsif ($line =~ m/^(.+)\t(.+)\t(.+)$/) { # Data to keep my $rank = $1; my $id = $2; my $ab = $3/100; # between 0 and 1 $rel_ab{$id} = $ab; } else { # Unknown format to ignore warn "Skipping unknown line format:\n$line"; next; } } close IN; return \%rel_ab; } sub avg_genome_size { my ($gen_rel_ab, $gen_size, $nof_gens) = @_; #my ($spectrum, $nof_hits) = @_; my $avg = 0; my $stdev = 0; my $stderr = 0; for my $genome (keys %$gen_size) { my $size = $$gen_size{$genome}; my $ab = $$gen_rel_ab{$genome}; next if not defined $ab; my $tmp = $ab * $size; $avg += $tmp; $stdev += $tmp * $size; } $stdev = sqrt($stdev - $avg**2); # sigma = sqrt( E(X^2) - E(X)^2 ) $stderr = $stdev; $stdev /= sqrt($nof_gens) unless ($nof_gens == 0); return $avg, $stdev, $stderr; } Grinder-0.5.4/utils/change_paired_read_orientation0000755000175000017500000000462012263016714022603 0ustar floflooofloflooo#! /usr/bin/env perl # This file is part of the Grinder package, copyright 2009-2012 # Florent Angly , under the GPLv3 license =head1 NAME change_paired_read_orientation - Change the orientation of paired-end reads in a FASTA file =head1 DESCRIPTION Reverse the orientation, i.e. reverse-complement each right-hand paired-end read (ID ending in /2) in a FASTA file. =head1 REQUIRED ARGUMENTS =over =item FASTA file containing the reads to re-orient. =for Euclid: in_fasta.type: readable =item Output FASTA file where to write the reads. =for Euclid: out_fasta.type: writable =back =head1 COPYRIGHT Copyright 2009-2012 Florent ANGLY Grinder is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Grinder. If not, see . =head1 BUGS All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: L Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge (L) and is under Git revision control. To get started with a patch, do: git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder =cut use strict; use warnings; use Getopt::Euclid qw( :minimal_keys ); use Bio::SeqIO; change_paired_read_orientation($ARGV{'in_fasta'}, $ARGV{'out_fasta'}); exit; sub change_paired_read_orientation { my ($in_fasta, $out_fasta) = @_; my $in = Bio::SeqIO->new( -file => "<$in_fasta" , -format => 'fasta' ); my $out = Bio::SeqIO->new( -file => ">$out_fasta", -format => 'fasta' ); while ( my $seq = $in->next_seq ) { if ($seq->id =~ m#/2$#) { $seq = $seq->revcom; } $out->write_seq($seq); } $in->close; $out->close; return 1; } Grinder-0.5.4/Makefile.PL0000644000175000017500000000610012647201157015306 0ustar floflooofloflooouse inc::Module::Install; # Package information name 'Grinder'; all_from 'lib/Grinder.pm'; license 'gpl3'; # Module::Install 1.04 does not parse the GPL version number resources homepage 'http://sourceforge.net/projects/biogrinder/'; bugtracker 'http://sourceforge.net/tracker/?group_id=244196&atid=1124737'; repository 'git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder'; # Dependencies for everyone build_requires 'Test::More' => 0; # first released with Perl v5.6.2 build_requires 'Test::Warn' => 0; requires 'Bio::Root::Version' => '1.006923'; # Bioperl v1.6.923 requires 'Bio::DB::Fasta' => 0; requires 'Bio::Location::Split' => 0; requires 'Bio::PrimarySeq' => 0; requires 'Bio::Root::Root' => 0; requires 'Bio::SeqIO' => 0; requires 'Bio::SeqFeature::SubSeq' => 0; requires 'Bio::Seq::SimulatedRead' => 0; requires 'Bio::Tools::AmpliconSearch' => 0; requires 'Getopt::Euclid' => '0.4.4'; requires 'List::Util' => 0; # first released with Perl v5.7.3 requires 'Math::Random::MT' => '1.16'; requires 'version' => '0.77'; # first released with Perl v5.9.0 # Dependencies for authors only author_requires 'Module::Install'; author_requires 'Module::Install::AuthorRequires'; author_requires 'Module::Install::AutoLicense'; author_requires 'Module::Install::PodFromEuclid'; author_requires 'Module::Install::ReadmeFromPod' => '0.14'; author_requires 'Module::Install::AutoManifest'; author_requires 'Statistics::R' => '0.32'; # Also install R and the fitdistrplus R library # Bundle dependencies # This system does not support Build.PL based dependencies #perl_version( '5.005' ); #auto_bundle_deps(); # Install dependencies auto_install; # Extra scripts to install install_script 'bin/grinder'; install_script 'utils/average_genome_size'; install_script 'utils/change_paired_read_orientation'; # Generate MANIFEST file auto_manifest(); # Generate Makefile and META.yml files WriteAll; # Generate the LICENSE file auto_license(); # Generate the README and manpage files from the POD docs auto_doc(); #--------- UTILS --------------------------------------------------------------# sub auto_doc { print "*** Building doc...\n"; pod_from 'bin/grinder'; my $grinder = 'bin/grinder.pod'; my $script1 = 'utils/average_genome_size'; my $script2 = 'utils/change_paired_read_orientation'; my $clean = 1; my $man_dir = 'man'; if (not -d $man_dir) { mkdir $man_dir or die "Could not write folder $man_dir:\n$!\n"; } readme_from $grinder, $clean, 'txt', 'README'; readme_from $grinder, $clean, 'htm', 'README.htm'; readme_from $grinder, $clean, 'man', "$man_dir/grinder.1"; readme_from $script1, $clean, 'man', "$man_dir/average_genome_size.1"; readme_from $script2, $clean, 'man', "$man_dir/change_paired_read_orientation.1"; return 1; } Grinder-0.5.4/lib/0000755000175000017500000000000012647202511014100 5ustar flofloooflofloooGrinder-0.5.4/lib/Grinder/0000755000175000017500000000000012647202511015472 5ustar flofloooflofloooGrinder-0.5.4/lib/Grinder/KmerCollection.pm0000644000175000017500000003112112636326131020743 0ustar floflooofloflooo# This file is part of the Grinder package, copyright 2009,2010,2011,2012 # Florent Angly , under the GPLv3 license package Grinder::KmerCollection; =head1 NAME Grinder::KmerCollection - A collection of kmers from sequences =head1 SYNOPSIS my $col = Grinder::KmerCollection->new( -k => 10, -file => 'seqs.fa' ); =head1 DESCRIPTION Manage a collection of kmers found in various sequences. Store information about what sequence a kmer was found in and its starting position on the sequence. =head1 AUTHOR Florent Angly =head1 APPENDIX The rest of the documentation details each of the object methods. Internal methods are usually preceded with a _ =cut use strict; use warnings; use Grinder; use Bio::SeqIO; use base qw(Bio::Root::Root); # using throw() and _rearrange() methods =head2 new Title : new Usage : my $col = Grinder::KmerCollection->new( -k => 10, -file => 'seqs.fa', -revcom => 1 ); Function: Build a new kmer collection Args : -k set the kmer length (default: 10 bp) -revcom count kmers before and after reverse-complementing sequences (default: 0) -seqs count kmers in the provided arrayref of sequences (Bio::Seq or Bio::SeqFeature objects) -ids if specified, index the sequences provided to -seq using the the IDs in this arrayref instead of using the sequences $seq->id() method -file count kmers in the provided file of sequences -weights if specified, assign the abundance of each sequence from the values in this arrayref Returns : Grinder::KmerCollection object =cut sub new { my ($class, @args) = @_; my $self = $class->SUPER::new(@args); my($k, $revcom, $seqs, $ids, $file, $weights) = $self->_rearrange([qw(K REVCOM SEQS IDS FILE WEIGHTS)], @args); $self->k( defined $k ? $k : 10 ); $self->weights($weights) if defined $weights; $self->add_seqs($seqs, $ids) if defined $seqs; $self->add_file($file) if defined $file; return $self; } =head2 k Usage : $col->k; Function: Get the length of the kmers Args : None Returns : Positive integer =cut sub k { my ($self, $val) = @_; if ($val) { if ($val < 1) { $self->throw("Error: The minimum kmer length is 1 but got $val\n"); } $self->{'k'} = $val; } return $self->{'k'}; } =head2 weights Usage : $col->weights({'seq1' => 3, 'seq10' => 0.45}); Function: Get or set the weight of each sequence. Each sequence is given a weight of 1 by default. Args : hashref where the keys are sequence IDs and the values are the weight of the corresponding (e.g. their relative abundance) Returns : Grinder::KmerCollection object =cut sub weights { my ($self, $val) = @_; if ($val) { $self->{'weights'} = $val; } return $self->{'weights'}; } =head2 collection_by_kmer Usage : $col->collection_by_kmer; Function: Get the collection of kmers, indexed by kmer Args : None Returns : A hashref of hashref of arrayref: hash->{kmer}->{ID of sequences with this kmer}->[starts of kmer on sequence] =cut sub collection_by_kmer { my ($self, $val) = @_; if ($val) { $self->{'collection_by_kmer'} = $val; } return $self->{'collection_by_kmer'}; } =head2 collection_by_seq Usage : $col->collection_by_seq; Function: Get the collection of kmers, indexed by sequence ID Args : None Returns : A hashref of hashref of arrayref: hash->{ID of sequences with this kmer}->{kmer}->[starts of kmer on sequence] =cut sub collection_by_seq { my ($self, $val) = @_; if ($val) { $self->{'collection_by_seq'} = $val; } return $self->{'collection_by_seq'}; } #==============================================================================# =head2 add_file Usage : $col->add_file('seqs.fa'); Function: Process the kmers in the given file of sequences. Args : filename Returns : Grinder::KmerCollection object =cut sub add_file { my ($self, $file) = @_; my $in = Bio::SeqIO->new( -file => $file ); while (my $seq = $in->next_seq) { $self->add_seqs([ $seq ]); } $in->close; return $self; } =head2 add_seqs Usage : $col->add_seqs([$seq1, $seq2]); Function: Process the kmers in the given sequences. Args : * arrayref of Bio::Seq or Bio::SeqFeature objects * arrayref of IDs to use for the indexing of the sequences Returns : Grinder::KmerCollection object =cut sub add_seqs { my ($self, $seqs, $ids) = @_; my $col_by_kmer = $self->collection_by_kmer || {}; my $col_by_seq = $self->collection_by_seq || {}; my $i = 0; for my $seq (@$seqs) { my $kmer_counts = $self->_find_kmers($seq); while ( my ($kmer, $positions) = each %$kmer_counts ) { my $seq_id; if (defined $ids) { $seq_id = $$ids[$i]; } else { $seq_id = $seq->id; } $col_by_kmer->{$kmer}->{$seq_id} = $positions; $col_by_seq->{$seq_id}->{$kmer} = $positions; } $i++; } $self->collection_by_kmer($col_by_kmer); $self->collection_by_seq($col_by_seq); return $self; } =head2 filter_rare Usage : $col->filter_rare( 2 ); Function: Remove kmers occurring at less than the (weighted) abundance specified Args : integer Returns : Grinder::KmerCollection object =cut sub filter_rare { my ($self, $min_num) = @_; my $changed = 0; my $col_by_kmer = $self->collection_by_kmer; my $col_by_seq = $self->collection_by_seq; while ( my ($kmer, $sources) = each %$col_by_kmer ) { my $count = $self->_sum_from_sources( $sources ); if ($count < $min_num) { # Remove this kmer $changed = 1; delete $col_by_kmer->{$kmer}; while ( my ($seq, $seq_kmers) = each %$col_by_seq ) { delete $seq_kmers->{$kmer}; delete $col_by_seq->{$seq} if keys %{$seq_kmers} == 0; } } } if ($changed) { $self->collection_by_kmer( $col_by_kmer ); $self->collection_by_seq( $col_by_seq ); } return $self; } =head2 filter_shared Usage : $col->filter_shared( 2 ); Function: Remove kmers occurring in less than the number of sequences specified Args : integer Returns : Grinder::KmerCollection object =cut sub filter_shared { my ($self, $min_shared) = @_; my $changed = 0; my $col_by_kmer = $self->collection_by_kmer; my $col_by_seq = $self->collection_by_seq; while ( my ($kmer, $sources) = each %$col_by_kmer ) { my $num_shared = scalar keys %$sources; if ($num_shared < $min_shared) { $changed = 1; delete $col_by_kmer->{$kmer}; while ( my ($seq, $seq_kmers) = each %$col_by_seq ) { delete $seq_kmers->{$kmer}; delete $col_by_seq->{$seq} if keys %{$seq_kmers} == 0; } } } if ($changed) { $self->collection_by_kmer( $col_by_kmer ); $self->collection_by_seq( $col_by_seq ); } return $self; } =head2 counts Usage : $col->counts Function: Calculate the total count of each kmer. Counts are affected by the weights given to the sequences. Args : * restrict sequences to search to specified sequence ID (optional) * starting position from which counting should start (optional) * 0 to report counts (default), 1 to report frequencies (normalize to 1) Returns : * arrayref of the different kmers * arrayref of the corresponding total counts =cut sub counts { my ($self, $id, $start, $freq) = @_; my $kmers; my $counts; my $total = 0; my $col_by_kmer = $self->collection_by_kmer; while ( my ($kmer, $sources) = each %$col_by_kmer ) { my $count = $self->_sum_from_sources( $sources, $id, $start ); if ($count > 0) { push @$kmers, $kmer; push @$counts, $count; $total += $count; } } if ($freq && $total) { $counts = Grinder::normalize($counts, $total); } return $kmers, $counts; } =head2 sources Usage : $col->sources() Function: Return the sources of a kmer and their (weighted) abundance. Args : * kmer to get the sources of * sources to exclude from the results (optional) * 0 to report counts (default), 1 to report frequencies (normalize to 1) Returns : * arrayref of the different sources * arrayref of the corresponding total counts If the kmer requested does not exist, the array will be empty. =cut sub sources { my ($self, $kmer, $excl, $freq) = @_; if (not defined $kmer) { die "Error: Need to provide a kmer to sources().\n"; } my $sources = []; my $counts = []; my $total = 0; my $kmer_sources = $self->collection_by_kmer->{$kmer}; if (defined $kmer_sources) { while ( my ($source, $positions) = each %$kmer_sources ) { if ( (defined $excl) && ($source eq $excl) ) { next; } push @$sources, $source; my $weight = (defined $self->weights) ? ($self->weights->{$source} || 0) : 1; my $count = $weight * scalar @$positions; push @$counts, $count; $total += $count; } if ($freq) { $counts = Grinder::normalize($counts, $total) if $total > 0; } } return $sources, $counts; } =head2 kmers Usage : $col->kmers('seq1'); Function: This is the inverse of sources(). Return the kmers found in a sequence (given its ID) and their (weighted) abundance. Args : * sequence ID to get the kmers of * 0 to report counts (default), 1 to report frequencies (normalize to 1) Returns : * arrayref of sequence IDs * arrayref of the corresponding total counts If the sequence ID requested does not exist, the arrays will be empty. =cut sub kmers { my ($self, $seq_id, $freq) = @_; my $kmers = []; my $counts = []; my $total = 0; my $seq_kmers = $self->collection_by_seq->{$seq_id}; if (defined $seq_kmers) { while ( my ($kmer, $positions) = each %$seq_kmers ) { push @$kmers, $kmer; my $weight = (defined $self->weights) ? ($self->weights->{$seq_id} || 0) : 1; my $count = $weight * scalar @$positions; push @$counts, $count; $total += $count; } $counts = Grinder::normalize($counts, $total) if $freq; } return $kmers, $counts; } =head2 positions Usage : $col->positions() Function: Return the positions of the given kmer on a given sequence. An error is reported if the kmer requested does not exist Args : * desired kmer * desired sequence with this kmer Returns : Arrayref of the different positions. The arrays will be empty if the desired combination of kmer and sequence was not found. =cut sub positions { my ($self, $kmer, $source) = @_; my $kmer_positions = []; my $kmer_sources = $self->collection_by_kmer->{$kmer}; if (defined $kmer_sources) { $kmer_positions = $kmer_sources->{$source} || []; } return $kmer_positions; } #======== Internals ===========================================================# sub _find_kmers { # Find all kmers of size k in a sequence (Bio::Seq or Bio::SeqFeature) and # return a hashref where the keys are the kmers and the values are the # positions of the kmers in the sequences. my ($self, $seq) = @_; my $k = $self->k; my $seq_str; if ($seq->isa('Bio::PrimarySeqI')) { $seq_str = $seq->seq; } elsif ($seq->isa('Bio::SeqFeatureI')) { $seq_str = $seq->seq->seq; } else { $self->throw('Error: Input sequence is not a Bio::SeqI or Bio::SeqFeatureI'. ' compliant object'); } $seq_str = uc $seq_str; # case-insensitive my $seq_len = length $seq_str; my $hash = {}; for (my $i = 0; $i <= $seq_len - $k ; $i++) { my $kmer = substr $seq_str, $i, $k; push @{$hash->{$kmer}}, $i + 1; } return $hash; } sub _sum_from_sources { # Calculate the number of (weighted) occurences of a kmer. An optional # sequence ID and start position to restrict the kmers can be specified. my ($self, $sources, $id, $start) = @_; $start ||= 1; my $count = 0; if (defined $id) { my $new_sources; $new_sources->{$id} = $sources->{$id}; $sources = $new_sources; } while ( my ($source, $positions) = each %$sources ) { for my $position (@$positions) { if ($position >= $start) { my $weight = (defined $self->weights) ? ($self->weights->{$source} || 0) : 1; $count += $weight; } } } return $count; } 1; Grinder-0.5.4/lib/Grinder/Database.pm0000644000175000017500000003600212265563324017545 0ustar floflooofloflooopackage Grinder::Database; use strict; use warnings; use Bio::DB::Fasta; use Bio::PrimarySeq; use base qw(Bio::Root::Root); # using throw() and _rearrange() methods sub new { my ($class, @args) = @_; my $self = $class->SUPER::new(@args); my ($fasta_file, $unidirectional, $primers, $abundance_file, $delete_chars, $minimum_length) = $self->_rearrange([qw(FASTA_FILE UNIDIRECTIONAL PRIMERS ABUNDANCE_FILE DELETE_CHARS MINIMUM_LENGTH)], @args); $minimum_length = 1 if not defined $minimum_length; $self->_set_minimum_length($minimum_length); $delete_chars = '' if not defined $delete_chars; $self->_set_delete_chars($delete_chars); # Index file, filter sequences and get IDs $self->_init_db($fasta_file, $abundance_file, $delete_chars, $minimum_length); $unidirectional = 0 if not defined $unidirectional; # bidirectional $self->_set_unidirectional($unidirectional); # Read amplicon primers $self->_set_primers($primers) if defined $primers; # Error if trying to reverse complement a protein database if ( ($self->get_alphabet eq 'protein') && ($self->get_unidirectional != 1) ) { $self->throw("Got = $unidirectional but can only use ". " = 1 with proteic reference sequences\n"); } return $self; } sub _init_db { # Read and import sequences # Parameters: # * FASTA file containing the sequences or '-' for stdin. REQUIRED # * Abundance file (optional): To avoid registering unwanted sequences # * Delete chars (optional): Characters to delete from the sequences. # * Minimum sequence size: Skip sequences smaller than that my ($self, $fasta_file, $abundance_file, $delete_chars, $min_len) = @_; # Get list of all IDs with a manually-specified abundance my %ids_to_keep; my $nof_ids_to_keep = 0; if ($abundance_file) { my ($ids) = community_read_abundances($abundance_file); for my $comm_num (0 .. $#$ids) { for my $gen_num ( 0 .. scalar @{$$ids[$comm_num]} - 1 ) { my $id = $$ids[$comm_num][$gen_num]; $ids_to_keep{$id} = undef; $nof_ids_to_keep++; } } } # Index input file my $db = Bio::DB::Fasta->new($fasta_file, -reindex => 1, -clean => 1); $self->_set_database($db); # List sequences that are ok to use my %seq_ids; my $nof_seqs; my %mol_types; my $stream = $db->get_PrimarySeq_stream; while (my $seq = $stream->next_seq) { # Skip empty sequences next if not $seq->seq; # Record molecule type $mol_types{$seq->alphabet}++; # Skip unwanted sequences my $seq_id = $seq->id; next if ($nof_ids_to_keep > 0) && (not exists $ids_to_keep{$seq_id}); # Remove specified characters $seq = $self->_remove_chars($seq, $delete_chars); # Skip sequence if is not empty next if not defined $seq; # Skip the sequence if it is too small next if $seq->length < $min_len; # Record this sequence $seq_ids{$seq->id} = undef; $nof_seqs++; } # Error if no usable sequences in the database if ($nof_seqs == 0) { $self->throw("No genome sequences could be used. If you specified a file ". "of abundances for the genome sequences, make sure that their ID match". " the ID in the FASTA file. If you specified amplicon primers, verify ". "that they match some genome sequences.\n"); } # Determine database type: dna, rna, protein my $db_alphabet = $self->_set_alphabet( $self->_get_mol_type(\%mol_types) ); # Record the sequence IDs $self->_set_ids( \%seq_ids ); return $db; } #sub get_primers { # my ($self) = @_; # return $self->{'primers'}; #} #sub _set_primers { # my ($self, $forward_reverse_primers) = @_; # # Read primer file and convert primers into regular expressions to catch # # amplicons present in the database # if (defined $forward_reverse_primers) { # # Read primers from FASTA file # my $primer_in = Bio::SeqIO->newFh( # -file => $forward_reverse_primers, # -format => 'fasta', # ); # # Mandatory first primer # my $primer = <$primer_in>; # if (not defined $primer) { # $self->throw("The file '$forward_reverse_primers' contains no primers\n"); # } # $primer->alphabet('dna'); # Force the alphabet since degenerate primers can look like protein sequences # $self->_set_forward_regexp( iupac_to_regexp($primer->seq) ); # $primer = undef; # # Take reverse-complement of optional reverse primers # $primer = <$primer_in>; # if (defined $primer) { # $primer->alphabet('dna'); # $primer = $primer->revcom; # $self->_set_reverse_regexp( iupac_to_regexp($primer->seq) ); # } # } # $self->{'primers'} = $forward_reverse_primers; # return $self->get_primers; #} #sub get_forward_regexp { # my ($self) = @_; # return $self->{'forward_regexp'}; #} #sub _set_forward_regexp { # my ($self, $val) = @_; # $self->{'forward_regexp'} = $val; # return $self->get_forward_regexp; #} #sub get_reverse_regexp { # my ($self) = @_; # return $self->{'reverse_regexp'}; #} #sub _set_reverse_regexp { # my ($self, $val) = @_; # $self->{'reverse_regexp'} = $val; # return $self->get_reverse_regexp; #} sub get_alphabet { my ($self) = @_; return $self->{'alphabet'}; } sub _set_alphabet { my ($self, $val) = @_; $self->{'alphabet'} = $val; return $self->get_alphabet; } sub get_ids { # Retrieve IDs from database, in no particular order my ($self) = @_; my @ids = keys %{$self->{'ids'}}; return \@ids; } sub _set_ids { my ($self, $val) = @_; $self->{'ids'} = $val; return $self->get_ids; } sub get_unidirectional { my ($self) = @_; return $self->{'unidirectional'}; } sub _set_unidirectional { my ($self, $val) = @_; # Error if using wrong direction on protein database if ( ($self->get_alphabet eq 'protein') && ($val != 1) ) { $self->throw("Got = $val but can only use ". " = 1 with proteic reference sequences\n"); } $self->{'unidirectional'} = $val; return $self->get_unidirectional; } sub get_minimum_length { my ($self) = @_; return $self->{'minimum_length'}; } sub _set_minimum_length { my ($self, $val) = @_; $self->{'minimum_length'} = $val; return $self->get_minimum_length; } sub get_delete_chars { my ($self) = @_; return $self->{'delete_chars'}; } sub _set_delete_chars { my ($self, $val) = @_; $self->{'delete_chars'} = $val; return $self->get_delete_chars; } sub get_database { my ($self) = @_; return $self->{'database'}; } sub _set_database { my ($self, $val) = @_; $self->{'database'} = $val; return $self->get_database; } sub get_seq { my ($self, $id) = @_; # Get a sequence from the database. The query format is: id:start..end/strand # Only the id is mandatory. Start and end default to the full-length sequence # and strand defaults to 1. # Extract id, start, stop, and strand $id =~ s/\/(.+)$//i; my $strand = $1 || 1; ($id =~ s/:(\d+)..(\d+)$//i); my ($start, $stop) = ($1, $2); # Check that sequence is allowed if (not exists $self->{'ids'}->{$id}) { return undef; } # Invert start and stop for sequences on reverse strand if ($start && $stop && ($strand < 0) ) { ($start, $stop) = ($stop, $start); } #### if forbidden chars, start and stop provided, probably need to remove #### forbidden chars first # Get sequence from database my $seq = Bio::PrimarySeq->new( -id => $id, -seq => $self->{'database'}->seq($id, $start, $stop), ); if ( ((not $start) || (not $stop)) && ($strand < 0) ) { $seq = $seq->revcom; } return $seq; } #sub next_seq { # my ($self) = @_; # # Get the database sequence stream, or set it the first time # my $stream = $self->get_stream || # $self->_set_stream($self->get_database->get_PrimarySeq_stream); # my $seq = $stream->next_seq; # if (not defined $seq) { # # End of stream # return undef; # } # # If we are sequencing from the reverse strand, reverse complement now # if ($self->get_unidirectional == -1) { # $seq = $seq->revcom; # } # # then delete chars # #my $delete_chars = $self->get_delete_chars; # # then fetch amplicons # # finally remove seqs < min_len # # # Extract amplicons if needed ## my $amp_seqs; ## if (defined $self->get_forward_regexp) { ## $amp_seqs = $self->database_extract_amplicons($seq, $self->get_forward_regexp, ## $self->get_reverse_regexp, \%ids_to_keep); ## next if scalar @$amp_seqs == 0; ## } else { ## $amp_seqs = [$seq]; ## } ## for my $amp_seq (@$amp_seqs) { ## # Remove forbidden chars ## if ( (defined $delete_chars) && (not $delete_chars eq '') ) { ## my $clean_seq = $amp_seq->seq; ## $clean_seq =~ s/[$delete_chars]//gi; ## $amp_seq->seq($clean_seq); ## } ## # Skip the sequence if it is too small ## next if $amp_seq->length < $min_len; ## # Save amplicon sequence and identify them by their unique object reference ## $seq_db{$amp_seq} = $amp_seq; ## $seq_ids{$ref_seq_id}{$amp_seq} = undef; ## } #} sub _remove_chars { # Remove forbidden chars my ($self, $seq, $chars) = @_; if ( defined($chars) && not($chars eq '') ) { my $seq_string = $seq->seq; my $count = ($seq_string =~ s/[$chars]//gi); if ( length $seq_string == 0 ) { # All characters were removed $seq = undef; } else { if ($count > 0) { # Some characters were removed. # Cannot modify a sequence from Bio::DB::Fasta. Create a new one if needed. $seq = Bio::PrimarySeq->new( -id => $seq->id, -seq => $seq_string, ); } } } return $seq; } sub _get_mol_type { # Given a count of the different molecule types in the database, determine # what molecule type it is. my ($self, $mol_types) = @_; my $max_count = 0; my $max_type = ''; while (my ($type, $count) = each %$mol_types) { if ($count > $max_count) { $max_count = $count; $max_type = $type; } } my $other_count = 0; while (my ($type, $count) = each %$mol_types) { if (not $type eq $max_type) { $other_count += $count; } } if ($max_count < $other_count) { $self->throw("Cannot determine to what type of molecules the reference ". "sequences belong. Got $max_count sequences of type '$max_type' and ". "$other_count others.\n"); } if ( (not $max_type eq 'dna') && (not $max_type eq 'rna') && (not $max_type eq 'protein') ) { $self->throw("Reference sequences have an unknown alphabet '$max_type'.\n"); } return $max_type; } ###sub _extract_amplicons { ### my ($self, $seq, $forward_regexp, $reverse_regexp, $ids_to_keep) = @_; ### # A database sequence can have several amplicons, e.g. a genome can have ### # several 16S rRNA genes. Extract all amplicons from a sequence (both strands) ### # but take only the shortest when amplicons are nested. ### # Fetch amplicons from both strands ### # Get amplicons from forward and reverse strand ### my $fwd_amplicons = _extract_amplicons_from_strand($seq, $forward_regexp, $reverse_regexp, 1); ### my $rev_amplicons = _extract_amplicons_from_strand($seq, $forward_regexp, $reverse_regexp, -1); ### # Deal with nested amplicons by removing the longest of the two ### my $re = qr/(\d+)\.\.(\d+)/; ### for (my $rev = 0; $rev < scalar @$rev_amplicons; $rev++) { ### my ($rev_start, $rev_end) = ( $rev_amplicons->[$rev]->{_amplicon} =~ m/$re/ ); ### for (my $fwd = 0; $fwd < scalar @$fwd_amplicons; $fwd++) { ### my ($fwd_start, $fwd_end) = ( $fwd_amplicons->[$fwd]->{_amplicon} =~ m/$re/ ); ### if ( ($fwd_start < $rev_start) && ($rev_end < $fwd_end) ) { ### splice @$fwd_amplicons, $fwd, 1; # Remove forward amplicon ### $fwd--; ### next; ### } ### if ( ($rev_start < $fwd_start) && ($fwd_end < $rev_end) ) { ### splice @$rev_amplicons, $rev, 1; # Remove reverse amplicon ### $rev--; ### } ### } ### } ### ### my $amplicons = [ @$fwd_amplicons, @$rev_amplicons ]; ### # Complain if primers did not match explicitly specified reference sequence ### my $seqid = $seq->id; ### if ( (scalar keys %{$ids_to_keep} > 0) && ### (exists $$ids_to_keep{$seqid} ) && ### (scalar @$amplicons == 0 ) ) { ### die "Error: Requested sequence $seqid did not match the specified forward primer.\n"; ### } ### return $amplicons; ###} ###sub _extract_amplicons_from_strand { ### # Get amplicons from the given strand (orientation) of the given sequence. ### # For nested amplicons, only the shortest is returned to mimic PCR. ### my ($seq, $forward_regexp, $reverse_regexp, $orientation) = @_; ### # Reverse-complement sequence if looking at a -1 orientation ### my $seqstr; ### if ($orientation == 1) { ### $seqstr = $seq->seq; ### } elsif ($orientation == -1) { ### $seqstr = $seq->revcom->seq; ### } else { ### die "Error: Invalid orientation '$orientation'\n"; ### } ### # Get amplicons from sequence string ### my $amplicons = []; ### if ( (defined $forward_regexp) && (not defined $reverse_regexp) ) { ### while ( $seqstr =~ m/($forward_regexp)/g ) { ### my $start = pos($seqstr) - length($1) + 1; ### my $end = $seq->length; ### push @$amplicons, _create_amplicon($seq, $start, $end, $orientation); ### } ### } elsif ( (defined $forward_regexp) && (defined $reverse_regexp) ) { ### while ( $seqstr =~ m/($forward_regexp.*?$reverse_regexp)/g ) { ### my $end = pos($seqstr); ### my $start = $end - length($1) + 1; ### # Now trim the left end to obtain the shortest amplicon ### my $ampliconstr = substr $seqstr, $start - 1, $end - $start + 1; ### if ($ampliconstr =~ m/$forward_regexp.*($forward_regexp)/g) { ### $start += pos($ampliconstr) - length($1); ### } ### push @$amplicons, _create_amplicon($seq, $start, $end, $orientation); ### } ### } else { ### die "Error: Need to provide at least a forward primer\n"; ### } ### return $amplicons; ###} ###sub _create_amplicon { ### # Create an amplicon sequence and register its coordinates ### my ($seq, $start, $end, $orientation) = @_; ### my $amplicon; ### my $coord; ### if ($orientation == -1) { ### # Calculate coordinates relative to forward strand. For example, given a ### # read starting at 10 and ending at 23 on the reverse complement of a 100 bp ### # sequence, return complement(77..90). ### $amplicon = $seq->revcom->trunc($start, $end); ### my $seq_len = $seq->length; ### $start = $seq_len - $start + 1; ### $end = $seq_len - $end + 1; ### ($start, $end) = ($end, $start); ### $coord = "complement($start..$end)"; ### } else { ### $amplicon = $seq->trunc($start, $end); ### $coord = "$start..$end"; ### } ### $amplicon->{_amplicon} = $coord; ### return $amplicon ###} #### #sub DESTROY { # remove indexed files #} #### 1; Grinder-0.5.4/lib/Grinder.pm0000644000175000017500000033652712647200516016053 0ustar floflooofloflooo# This file is part of the Grinder package, copyright 2009-2013 # Florent Angly , under the GPLv3 license package Grinder; use 5.006; use strict; use warnings; use File::Spec; use List::Util qw(max); use Bio::SeqIO; use Grinder::KmerCollection; use Bio::Location::Split; use Bio::Seq::SimulatedRead; use Bio::SeqFeature::SubSeq; use Bio::Tools::AmpliconSearch; use Math::Random::MT qw(srand rand); use Getopt::Euclid qw(:minimal_keys :defer); use version; our $VERSION = version->declare('0.5.4'); #---------- GRINDER POD DOC ---------------------------------------------------# =head1 NAME Grinder - A versatile omics shotgun and amplicon sequencing read simulator =head1 DESCRIPTION Grinder is a versatile program to create random shotgun and amplicon sequence libraries based on DNA, RNA or proteic reference sequences provided in a FASTA file. Grinder can produce genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, metaproteomic shotgun and amplicon datasets from various sequencing technologies such as Sanger, 454, Illumina. These simulated datasets can be used to test the accuracy of bioinformatic tools under specific hypothesis, e.g. with or without sequencing errors, or with low or high community diversity. Grinder may also be used to help decide between alternative sequencing methods for a sequence-based project, e.g. should the library be paired-end or not, how many reads should be sequenced. Grinder features include: =over =item * shotgun or amplicon read libraries =item * omics support to generate genomic, transcriptomic, proteomic, metagenomic, metatranscriptomic or metaproteomic datasets =item * arbitrary read length distribution and number of reads =item * simulation of PCR and sequencing errors (chimeras, point mutations, homopolymers) =item * support for paired-end (mate pair) datasets =item * specific rank-abundance settings or manually given abundance for each genome, gene or protein =item * creation of datasets with a given richness (alpha diversity) =item * independent datasets can share a variable number of genomes (beta diversity) =item * modeling of the bias created by varying genome lengths or gene copy number =item * profile mechanism to store preferred options =item * available to biologists or power users through multiple interfaces: GUI, CLI and API =back Briefly, given a FASTA file containing reference sequence (genomes, genes, transcripts or proteins), Grinder performs the following steps: =over =item 1. Read the reference sequences, and for amplicon datasets, extracts full-length reference PCR amplicons using the provided degenerate PCR primers. =item 2. Determine the community structure based on the provided alpha diversity (number of reference sequences in the library), beta diversity (number of reference sequences in common between several independent libraries) and specified rank- abundance model. =item 3. Take shotgun reads from the reference sequences or amplicon reads from the full- length reference PCR amplicons. The reads may be paired-end reads when an insert size distribution is specified. The length of the reads depends on the provided read length distribution and their abundance depends on the relative abundance in the community structure. Genome length may also biases the number of reads to take for shotgun datasets at this step. Similarly, for amplicon datasets, the number of copies of the target gene in the reference genomes may bias the number of reads to take. =item 4. Alter reads by inserting sequencing errors (indels, substitutions and homopolymer errors) following a position-specific model to simulate reads created by current sequencing technologies (Sanger, 454, Illumina). Write the reads and their quality scores in FASTA, QUAL and FASTQ files. =back =head1 CITATION If you use Grinder in your research, please cite: Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW (2012), Grinder: a versatile amplicon and shotgun sequence simulator, Nucleic Acids Reseach Available from L. =head1 VERSION 0.5.4 =head1 AUTHOR Florent Angly =head1 INSTALLATION =head2 Dependencies You need to install these dependencies first: =over =item * Perl (>= 5.6) L =item * make Many systems have make installed by default. If your system does not, you should install the implementation of make of your choice, e.g. GNU make: L =back The following CPAN Perl modules are dependencies that will be installed automatically for you: =over =item * Bioperl modules (>=1.6.923) =item * Getopt::Euclid (>= 0.4.4) =item * List::Util First released with Perl v5.7.3 =item * Math::Random::MT (>= 1.16) =item * version (>= 0.77) First released with Perl v5.9.0 =back =head2 Extra dependencies for Grinder development only Perl modules: =over =item * Module::Install =item * Module::Install::AuthorRequires =item * Module::Install::AutoLicense =item * Module::Install::PodFromEuclid =item * Module::Install::ReadmeFromPod (>= 0.14) =item * Module::Install::AutoManifest =item * Statistics::R (>= 0.32) =back The R interpreter (L) and the following R library: =over =item * fitdistrplus =back When running R, install the library with this command: install.packages("fitdistrplus") =head2 Procedure To install Grinder globally on your system, run the following commands in a terminal or command prompt: On Linux, Unix, MacOS: perl Makefile.PL make And finally, with administrator privileges: make install On Windows, run the same commands but with nmake instead of make. =head2 No administrator privileges? If you do not have administrator privileges, Grinder needs to be installed in your home directory. First, follow the instructions to install local::lib at L. After local::lib is installed, every Perl module that you install manually or through the CPAN command-line application will be installed in your home directory. Then, install Grinder by following the instructions detailed in the "Procedure" section. =head1 RUNNING GRINDER After installation, you can run Grinder using a command-line interface (CLI), an application programming interface (API) or a graphical user interface (GUI) in Galaxy. To get the usage of the CLI, type: grinder --help More information, including the documentation of the Grinder API, which allows you to run Grinder from within other Perl programs, is available by typing: perldoc Grinder To run the GUI, refer to the Galaxy documentation at L. The 'utils' folder included in the Grinder package contains some utilities: =over =item average genome size: This calculates the average genome size (in bp) of a simulated random library produced by Grinder. =item change_paired_read_orientation: This reverses the orientation of each second mate-pair read (ID ending in /2) in a FASTA file. =back =head1 REFERENCE SEQUENCE DATABASE A variety of FASTA databases can be used as input for Grinder. For example, the GreenGenes database (L) contains over 180,000 16S rRNA clone sequences from various species which would be appropriate to produce a 16S rRNA amplicon dataset. A set of over 41,000 OTU representative sequences and their affiliation in seven different taxonomic sytems can also be used for the same purpose (L and L). The RDP (L) and Silva (L) databases also provide many 16S rRNA sequences and Silva includes eukaryotic sequences. While 16S rRNA is a popular gene, datasets containing any type of gene could be used in the same fashion to generate simulated amplicon datasets, provided appropriate primers are used. The >2,400 curated microbial genome sequences in the NCBI RefSeq collection (L) would also be suitable for producing 16S rRNA simulated datasets (using the adequate primers). However, the lower diversity of this database compared to the previous two makes it more appropriate for producing artificial microbial metagenomes. Individual genomes from this database are also very suitable for the simulation of single or double-barreled shotgun libraries. Similarly, the RefSeq database contains over 3,100 curated viral sequences (L) which can be used to produce artificial viral metagenomes. Quite a few eukaryotic organisms have been sequenced and their genome or genes can be the basis for simulating genomic, transcriptomic (RNA-seq) or proteomic datasets. For example, you can use the human genome available at L, the human transcripts downloadable from L or the human proteome at L. =head1 CLI EXAMPLES Here are a few examples that illustrate the use of Grinder in a terminal: =over =item 1. A shotgun DNA library with a coverage of 0.1X grinder -reference_file genomes.fna -coverage_fold 0.1 =item 2. Same thing but save the result files in a specific folder and with a specific name grinder -reference_file genomes.fna -coverage_fold 0.1 -base_name my_name -output_dir my_dir =item 3. A DNA shotgun library with 1000 reads grinder -reference_file genomes.fna -total_reads 1000 =item 4. A DNA shotgun library where species are distributed according to a power law grinder -reference_file genomes.fna -abundance_model powerlaw 0.1 =item 5. A DNA shotgun library with 123 genomes taken random from the given genomes grinder -reference_file genomes.fna -diversity 123 =item 6. Two DNA shotgun libraries that have 50% of the species in common grinder -reference_file genomes.fna -num_libraries 2 -shared_perc 50 =item 7. Two DNA shotgun library with no species in common and distributed according to a exponential rank-abundance model. Note that because the parameter value for the exponential model is omitted, each library uses a different randomly chosen value: grinder -reference_file genomes.fna -num_libraries 2 -abundance_model exponential =item 8. A DNA shotgun library where species relative abundances are manually specified grinder -reference_file genomes.fna -abundance_file my_abundances.txt =item 9. A DNA shotgun library with Sanger reads grinder -reference_file genomes.fna -read_dist 800 -mutation_dist linear 1 2 -mutation_ratio 80 20 =item 10. A DNA shotgun library with first-generation 454 reads grinder -reference_file genomes.fna -read_dist 100 normal 10 -homopolymer_dist balzer =item 11. A paired-end DNA shotgun library, where the insert size is normally distributed around 2.5 kbp and has 0.2 kbp standard deviation grinder -reference_file genomes.fna -insert_dist 2500 normal 200 =item 12. A transcriptomic dataset grinder -reference_file transcripts.fna =item 13. A unidirectional transcriptomic dataset grinder -reference_file transcripts.fna -unidirectional 1 Note the use of -unidirectional 1 to prevent reads to be taken from the reverse- complement of the reference sequences. =item 14. A proteomic dataset grinder -reference_file proteins.faa -unidirectional 1 =item 15. A 16S rRNA amplicon library grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 Note the use of -length_bias 0 because reference sequence length should not affect the relative abundance of amplicons. =item 16. The same amplicon library with 20% of chimeric reads (90% bimera, 10% trimera) grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -chimera_perc 20 -chimera_dist 90 10 =item 17. Three 16S rRNA amplicon libraries with specified MIDs and no reference sequences in common grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -num_libraries 3 -multiplex_ids MIDs.fna =item 18. Reading reference sequences from the standard input, which allows you to decompress FASTA files on the fly: zcat microbial_db.fna.gz | grinder -reference_file - -total_reads 100 =back =head1 CLI REQUIRED ARGUMENTS =over =item -rf | -reference_file | -gf | -genome_file FASTA file that contains the input reference sequences (full genomes, 16S rRNA genes, transcripts, proteins...) or '-' to read them from the standard input. See the README file for examples of databases you can use and where to get them from. Default: reference_file.default =for Euclid: reference_file.type: readable reference_file.default: '-' =back =head1 CLI OPTIONAL ARGUMENTS Basic parameters =over =item -tr | -total_reads Number of shotgun or amplicon reads to generate for each library. Do not specify this if you specify the fold coverage. Default: total_reads.default =for Euclid: total_reads.type: +integer total_reads.default: 100 =item -cf | -coverage_fold Desired fold coverage of the input reference sequences (the output FASTA length divided by the input FASTA length). Do not specify this if you specify the number of reads directly. =for Euclid: coverage_fold.type: +number coverage_fold.excludes: total_reads =back Advanced shotgun and amplicon parameters =over =item -rd ... | -read_dist ... Desired shotgun or amplicon read length distribution specified as: average length, distribution ('uniform' or 'normal') and standard deviation. Only the first element is required. Examples: All reads exactly 101 bp long (Illumina GA 2x): 101 Uniform read distribution around 100+-10 bp: 100 uniform 10 Reads normally distributed with an average of 800 and a standard deviation of 100 bp (Sanger reads): 800 normal 100 Reads normally distributed with an average of 450 and a standard deviation of 50 bp (454 GS-FLX Ti): 450 normal 50 Reference sequences smaller than the specified read length are not used. Default: read_dist.default =for Euclid: read_dist.type: string read_dist.default: [100] =item -id ... | -insert_dist ... Create paired-end or mate-pair reads spanning the given insert length. Important: the insert is defined in the biological sense, i.e. its length includes the length of both reads and of the stretch of DNA between them: 0 : off, or: insert size distribution in bp, in the same format as the read length distribution (a typical value is 2,500 bp for mate pairs) Two distinct reads are generated whether or not the mate pair overlaps. Default: insert_dist.default =for Euclid: insert_dist.type: string insert_dist.default: [0] =item -mo | -mate_orientation When generating paired-end or mate-pair reads (see ), specify the orientation of the reads (F: forward, R: reverse): FR: ---> <--- e.g. Sanger, Illumina paired-end, IonTorrent mate-pair FF: ---> ---> e.g. 454 RF: <--- ---> e.g. Illumina mate-pair RR: <--- <--- Default: mate_orientation.default =for Euclid: mate_orientation.type: string, mate_orientation eq 'FF' || mate_orientation eq 'FR' || mate_orientation eq 'RF' || mate_orientation eq 'RR' mate_orientation.type.error: must be FR, FF, RF or RR (not mate_orientation) mate_orientation.default: 'FR' =item -ec | -exclude_chars Do not create reads containing any of the specified characters (case insensitive). For example, use 'NX' to prevent reads with ambiguities (N or X). Grinder will error if it fails to find a suitable read (or pair of reads) after 10 attempts. Consider using , which may be more appropriate for your case. Default: 'exclude_chars.default' =for Euclid: exclude_chars.type: string exclude_chars.default: '' =item -dc | -delete_chars Remove the specified characters from the reference sequences (case-insensitive), e.g. '-~*' to remove gaps (- or ~) or terminator (*). Removing these characters is done once, when reading the reference sequences, prior to taking reads. Hence it is more efficient than . Default: delete_chars.default =for Euclid: delete_chars.type: string delete_chars.default: '' =item -fr | -forward_reverse Use DNA amplicon sequencing using a forward and reverse PCR primer sequence provided in a FASTA file. The reference sequences and their reverse complement will be searched for PCR primer matches. The primer sequences should use the IUPAC convention for degenerate residues and the reference sequences that that do not match the specified primers are excluded. If your reference sequences are full genomes, it is recommended to use = 1 and = 0 to generate amplicon reads. To sequence from the forward strand, set to 1 and put the forward primer first and reverse primer second in the FASTA file. To sequence from the reverse strand, invert the primers in the FASTA file and use = -1. The second primer sequence in the FASTA file is always optional. Example: AAACTYAAAKGAATTGRCGG and ACGGGCGGTGTGTRC for the 926F and 1392R primers that target the V6 to V9 region of the 16S rRNA gene. =for Euclid: forward_reverse.type: readable =item -un | -unidirectional Instead of producing reads bidirectionally, from the reference strand and its reverse complement, proceed unidirectionally, from one strand only (forward or reverse). Values: 0 (off, i.e. bidirectional), 1 (forward), -1 (reverse). Use = 1 for amplicon and strand-specific transcriptomic or proteomic datasets. Default: unidirectional.default =for Euclid: unidirectional.type: integer, unidirectional >= -1 && unidirectional <= 1 unidirectional.type.error: must be 0, 1 or -1 (not unidirectional) unidirectional.default: 0 =item -lb | -length_bias In shotgun libraries, sample reference sequences proportionally to their length. For example, in simulated microbial datasets, this means that at the same relative abundance, larger genomes contribute more reads than smaller genomes (and all genomes have the same fold coverage). 0 = no, 1 = yes. Default: length_bias.default =for Euclid: length_bias.type: integer, length_bias == 0 || length_bias == 1 length_bias.type.error: must be 0 or 1 (not length_bias) length_bias.default: 1 =item -cb | -copy_bias In amplicon libraries where full genomes are used as input, sample species proportionally to the number of copies of the target gene: at equal relative abundance, genomes that have multiple copies of the target gene contribute more amplicon reads than genomes that have a single copy. 0 = no, 1 = yes. Default: copy_bias.default =for Euclid: copy_bias.type: integer, copy_bias == 0 || copy_bias == 1 copy_bias.type.error: must be 0 or 1 (not copy_bias) copy_bias.default: 1 =back Aberrations and sequencing errors =over =item -md ... | -mutation_dist ... Introduce sequencing errors in the reads, under the form of mutations (substitutions, insertions and deletions) at positions that follow a specified distribution (with replacement): model (uniform, linear, poly4), model parameters. For example, for a uniform 0.1% error rate, use: uniform 0.1. To simulate Sanger errors, use a linear model where the errror rate is 1% at the 5' end of reads and 2% at the 3' end: linear 1 2. To model Illumina errors using the 4th degree polynome 3e-3 + 3.3e-8 * i^4 (Korbel et al 2009), use: poly4 3e-3 3.3e-8. Use the option to alter how many of these mutations are substitutions or indels. Default: mutation_dist.default =for Euclid: mutation_dist.type: string mutation_dist.default: ['uniform', 0, 0] =item -mr ... | -mutation_ratio ... Indicate the percentage of substitutions and the number of indels (insertions and deletions). For example, use '80 20' (4 substitutions for each indel) for Sanger reads. Note that this parameter has no effect unless you specify the option. Default: mutation_ratio.default =for Euclid: mutation_ratio.type: num, mutation_ratio >= 0 mutation_ratio.default: [80, 20] =item -hd | -homopolymer_dist Introduce sequencing errors in the reads under the form of homopolymeric stretches (e.g. AAA, CCCCC) using a specified model where the homopolymer length follows a normal distribution N(mean, standard deviation) that is function of the homopolymer length n: Margulies: N(n, 0.15 * n) , Margulies et al. 2005. Richter : N(n, 0.15 * sqrt(n)) , Richter et al. 2008. Balzer : N(n, 0.03494 + n * 0.06856) , Balzer et al. 2010. Default: homopolymer_dist.default =for Euclid: homopolymer_dist.type: string homopolymer_dist.default: 0 =item -cp | -chimera_perc Specify the percent of reads in amplicon libraries that should be chimeric sequences. The 'reference' field in the description of chimeric reads will contain the ID of all the reference sequences forming the chimeric template. A typical value is 10% for amplicons. This option can be used to generate chimeric shotgun reads as well. Default: chimera_perc.default % =for Euclid: chimera_perc.type: number, chimera_perc >= 0 && chimera_perc <= 100 chimera_perc.type.error: must be a number between 0 and 100 (not chimera_perc) chimera_perc.default: 0 =item -cd ... | -chimera_dist ... Specify the distribution of chimeras: bimeras, trimeras, quadrameras and multimeras of higher order. The default is the average values from Quince et al. 2011: '314 38 1', which corresponds to 89% of bimeras, 11% of trimeras and 0.3% of quadrameras. Note that this option only takes effect when you request the generation of chimeras with the option. Default: chimera_dist.default =for Euclid: chimera_dist.type: number, chimera_dist >= 0 chimera_dist.type.error: must be a positive number (not chimera_dist) chimera_dist.default: [314, 38, 1] =item -ck | -chimera_kmer Activate a method to form chimeras by picking breakpoints at places where k-mers are shared between sequences. represents k, the length of the k-mers (in bp). The longer the kmer, the more similar the sequences have to be to be eligible to form chimeras. The more frequent a k-mer is in the pool of reference sequences (taking into account their relative abundance), the more often this k-mer will be chosen. For example, CHSIM (Edgar et al. 2011) uses this method with a k-mer length of 10 bp. If you do not want to use k-mer information to form chimeras, use 0, which will result in the reference sequences and breakpoints to be taken randomly on the "aligned" reference sequences. Note that this option only takes effect when you request the generation of chimeras with the option. Also, this options is quite memory intensive, so you should probably limit yourself to a relatively small number of reference sequences if you want to use it. Default: chimera_kmer.default bp =for Euclid: chimera_kmer.type: number, chimera_kmer == 0 || chimera_kmer >= 2 chimera_kmer.type.error: must be 0 or an integer larger than 1 (not chimera_kmer) chimera_kmer.default: 10 =back Community structure and diversity =over =item -af | -abundance_file Specify the relative abundance of the reference sequences manually in an input file. Each line of the file should contain a sequence name and its relative abundance (%), e.g. 'seqABC 82.1' or 'seqABC 82.1 10.2' if you are specifying two different libraries. =for Euclid: abundance_file.type: readable =item -am ... | -abundance_model ... Relative abundance model for the input reference sequences: uniform, linear, powerlaw, logarithmic or exponential. The uniform and linear models do not require a parameter, but the other models take a parameter in the range [0, infinity). If this parameter is not specified, then it is randomly chosen. Examples: uniform distribution: uniform powerlaw distribution with parameter 0.1: powerlaw 0.1 exponential distribution with automatically chosen parameter: exponential Default: abundance_model.default =for Euclid: abundance_model.type: string abundance_model.default: ['uniform', 1] =item -nl | -num_libraries Number of independent libraries to create. Specify how diverse and similar they should be with , and . Assign them different MID tags with . Default: num_libraries.default =for Euclid: num_libraries.type: +integer num_libraries.default: 1 =item -mi | -multiplex_ids Specify an optional FASTA file that contains multiplex sequence identifiers (a.k.a MIDs or barcodes) to add to the sequences (one sequence per library, in the order given). The MIDs are included in the length specified with the -read_dist option and can be altered by sequencing errors. See the MIDesigner or BarCrawl programs to generate MID sequences. =for Euclid: multiplex_ids.type: readable =item -di ... | -diversity ... This option specifies alpha diversity, specifically the richness, i.e. number of reference sequences to take randomly and include in each library. Use 0 for the maximum richness possible (based on the number of reference sequences available). Provide one value to make all libraries have the same diversity, or one richness value per library otherwise. Default: diversity.default =for Euclid: diversity.type: 0+integer diversity.default: [ 0 ] =item -sp | -shared_perc This option controls an aspect of beta-diversity. When creating multiple libraries, specify the percent of reference sequences they should have in common (relative to the diversity of the least diverse library). Default: shared_perc.default % =for Euclid: shared_perc.type: number, shared_perc >= 0 && shared_perc <= 100 shared_perc.type.error: must be a number between 0 and 100 (not shared_perc) shared_perc.default: 0 =item -pp | -permuted_perc This option controls another aspect of beta-diversity. For multiple libraries, choose the percent of the most-abundant reference sequences to permute (randomly shuffle) the rank-abundance of. Default: permuted_perc.default % =for Euclid: permuted_perc.type: number, permuted_perc >= 0 && permuted_perc <= 100 permuted_perc.type.error: must be a number between 0 and 100 (not permuted_perc) permuted_perc.default: 100 =back Miscellaneous =over =item -rs | -random_seed Seed number to use for the pseudo-random number generator. =for Euclid: random_seed.type: +integer =item -dt | -desc_track Track read information (reference sequence, position, errors, ...) by writing it in the read description. Default: desc_track.default =for Euclid: desc_track.type: integer, desc_track == 0 || desc_track == 1 desc_track.type.error: must be 0 or 1 (not desc_track) desc_track.default: 1 =item -ql ... | -qual_levels ... Generate basic quality scores for the simulated reads. Good residues are given a specified good score (e.g. 30) and residues that are the result of an insertion or substitution are given a specified bad score (e.g. 10). Specify first the good score and then the bad score on the command-line, e.g.: 30 10. Default: qual_levels.default =for Euclid: qual_levels.type: 0+integer qual_levels.default: [ ] =item -fq | -fastq_output Whether to write the generated reads in FASTQ format (with Sanger-encoded quality scores) instead of FASTA and QUAL or not (1: yes, 0: no). need to be specified for this option to be effective. Default: fastq_output.default =for Euclid: fastq_output.type: integer, fastq_output == 0 || fastq_output == 1 fastq_output.type.error: must be 0 or 1 (not fastq_output) fastq_output.default: 0 =item -bn | -base_name Prefix of the output files. Default: base_name.default =for Euclid: base_name.type: string base_name.default: 'grinder' =item -od | -output_dir Directory where the results should be written. This folder will be created if needed. Default: output_dir.default =for Euclid: output_dir.type: writable output_dir.default: '.' =item -pf | -profile_file A file that contains Grinder arguments. This is useful if you use many options or often use the same options. Lines with comments (#) are ignored. Consider the profile file, 'simple_profile.txt': # A simple Grinder profile -read_dist 105 normal 12 -total_reads 1000 Running: grinder -reference_file viral_genomes.fa -profile_file simple_profile.txt Translates into: grinder -reference_file viral_genomes.fa -read_dist 105 normal 12 -total_reads 1000 Note that the arguments specified in the profile should not be specified again on the command line. =back =head1 CLI OUTPUT For each shotgun or amplicon read library requested, the following files are generated: =over =item * A rank-abundance file, tab-delimited, that shows the relative abundance of the different reference sequences =item * A file containing the read sequences in FASTA format. The read headers contain information necessary to track from which reference sequence each read was taken and what errors it contains. This file is not generated if option was provided. =item * If the option was specified, a file containing the quality scores of the reads (in QUAL format). =item * If the option was provided, a file containing the read sequences in FASTQ format. =back =head1 API EXAMPLES The Grinder API allows to conveniently use Grinder within Perl scripts. The same options as the CLI apply, but when passing multiple values to an options, you will need to pass them as an array (not a scalar or arrayref). Here is a example: use Grinder; # Set up a new factory my $factory = Grinder->new( -reference_file => 'genomes.fna', -read_dist => (100, 'uniform', 10) ); # Process all shotgun libraries requested while ( my $struct = $factory->next_lib ) { # The ID and abundance of the 3rd most abundant genome in this community my $id = $struct->{ids}->[2]; my $ab = $struct->{abs}->[2]; # Create shotgun reads while ( my $read = $factory->next_read) { # The read is a Bioperl sequence object with these properties: my $read_id = $read->id; # read ID given by Grinder my $read_seq = $read->seq; # nucleotide sequence my $read_mid = $read->mid; # MID or tag attached to the read my $read_errors = $read->errors; # errors that the read contains # Where was the read taken from? The reference sequence refers to the # database sequence for shotgun libraries, amplicon obtained from the # database sequence, or could even be a chimeric sequence my $ref_id = $read->reference->id; # ID of the reference sequence my $ref_start = $read->start; # start of the read on the reference my $ref_end = $read->end; # end of the read on the reference my $ref_strand = $read->strand; # strand of the reference } } # Similarly, for shotgun mate pairs my $factory = Grinder->new( -reference_file => 'genomes.fna', -insert_dist => 250 ); while ( $factory->next_lib ) { while ( my $read = $factory->next_read ) { # The first read is the first mate of the mate pair # The second read is the second mate of the mate pair # The third read is the first mate of the next mate pair # ... } } # To generate an amplicon library my $factory = Grinder->new( -reference_file => 'genomes.fna', -forward_reverse => '16Sgenes.fna', -length_bias => 0, -unidirectional => 1 ); while ( $factory->next_lib ) { while ( my $read = $factory->next_read) { # ... } } =head1 API METHODS The rest of the documentation details the available Grinder API methods. =head2 new Title : new Function: Create a new Grinder factory initialized with the passed arguments. Available parameters described in the OPTIONS section. Usage : my $factory = Grinder->new( -reference_file => 'genomes.fna' ); Returns : a new Grinder object =head2 next_lib Title : next_lib Function: Go to the next shotgun library to process. Usage : my $struct = $factory->next_lib; Returns : Community structure to be used for this library, where $struct->{ids} is an array reference containing the IDs of the genome making up the community (sorted by decreasing relative abundance) and $struct->{abs} is an array reference of the genome abundances (in the same order as the IDs). =head2 next_read Title : next_read Function: Create an amplicon or shotgun read for the current library. Usage : my $read = $factory->next_read; # for single read my $mate1 = $factory->next_read; # for mate pairs my $mate2 = $factory->next_read; Returns : A sequence represented as a Bio::Seq::SimulatedRead object =head2 get_random_seed Title : get_random_seed Function: Return the number used to seed the pseudo-random number generator Usage : my $seed = $factory->get_random_seed; Returns : seed number =head1 COPYRIGHT Copyright 2009-2013 Florent ANGLY Grinder is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Grinder. If not, see . =head1 BUGS All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: L Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge (L) and is under Git revision control. To get started with a patch, do: git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder =cut #---------- GRINDER FUNCTIONAL API --------------------------------------------# sub Grinder { # This is the main function and is called by the script 'grinder' my (@args) = @_; # Create Grinder object my $factory = Grinder->new(@args); # Print diversity and percent shared and permuted diversity_report( $factory->{num_libraries}, $factory->{shared_perc}, $factory->{permuted_perc}, $factory->{overall_diversity} ); # Create the output directory if needed if ( not -d $factory->{output_dir} ) { mkdir $factory->{output_dir} or die "Error: Could not create output folder ". $factory->{output_dir}."\n$!\n"; } # Generate sequences while ( my $c_struct = $factory->next_lib ) { my $cur_lib = $factory->{cur_lib}; # Output filenames my $lib_str = ''; if ($factory->{num_libraries} > 1) { $lib_str = '-'.sprintf('%0'.length($factory->{num_libraries}).'d', $cur_lib); } my $out_reads_basename = File::Spec->catfile($factory->{output_dir}, $factory->{base_name}.$lib_str.'-reads.'); my $out_fasta_file; my $out_qual_file; my $out_fastq_file; if ( $factory->{fastq_output} ) { $out_fastq_file = $out_reads_basename . 'fastq'; } else { $out_fasta_file = $out_reads_basename . 'fa'; if (scalar @{$factory->{qual_levels}} > 0) { $out_qual_file = $out_reads_basename . 'qual'; } } my $out_ranks_file = File::Spec->catfile($factory->{output_dir}, $factory->{base_name}.$lib_str."-ranks.txt"); # Write community structure file $factory->write_community_structure($c_struct, $out_ranks_file); # Prepare output FASTA file my $out_fastq; if ( defined $out_fastq_file ) { $out_fastq = Bio::SeqIO->new( -format => 'fastq', -variant => 'sanger', -flush => 0, -file => ">$out_fastq_file" ); } my $out_fasta; if ( defined $out_fasta_file ) { $out_fasta = Bio::SeqIO->new( -format => 'fasta', -flush => 0, -file => ">$out_fasta_file" ); } my $out_qual; if ( defined $out_qual_file ) { $out_qual = Bio::SeqIO->new( -format => 'qual', -flush => 0, -file => ">$out_qual_file" ); } # Library report my $diversity = $factory->{diversity}[$cur_lib-1]; library_report( $cur_lib, $factory->{alphabet}, $factory->{forward_reverse}, $out_ranks_file, $out_fastq_file, $out_fasta_file, $out_qual_file, $factory->{cur_coverage_fold}, $factory->{cur_total_reads}, $diversity); # Generate shotgun or amplicon reads and write them to a file while ( my $read = $factory->next_read ) { $out_fastq->write_seq($read) if defined $out_fastq; $out_fasta->write_seq($read) if defined $out_fasta; $out_qual->write_seq($read) if defined $out_qual } $out_fastq->close if defined $out_fastq; $out_fasta->close if defined $out_fasta; $out_qual->close if defined $out_qual; } return 1; } sub diversity_report { my ($num_libraries, $perc_shared, $perc_permuted, $overall_diversity) = @_; my $format = '%.1f'; print "Overall diversity = $overall_diversity genomes\n"; if ($num_libraries > 1) { my $nof_shared = $perc_shared / 100 * $overall_diversity; $perc_shared = sprintf($format, $perc_shared); print "Percent shared = $perc_shared % ($nof_shared genomes)\n"; my $nof_permuted = $perc_permuted / 100 * $overall_diversity; $perc_permuted = sprintf($format, $perc_permuted); print "Percent permuted = $perc_permuted % ($nof_permuted top genomes)\n"; } return 1; } sub write_community_structure { my ($self, $c_struct, $filename) = @_; open(OUT, ">$filename") || die("Error: Could not write in file $filename: $!\n"); print OUT "# rank\tseq_id\trel_abund_perc\n"; my $diversity = scalar @{$c_struct->{ids}}; my %species_abs; for my $rank ( 1 .. $diversity ) { my $oid = $c_struct->{'ids'}->[$rank-1]; my $species_id = $self->database_get_parent_id($oid); my $seq_ab = $c_struct->{'abs'}->[$rank-1]; $species_abs{$species_id} += $seq_ab; } my $rank = 0; for my $species_id ( sort { $species_abs{$b} <=> $species_abs{$a} } keys %species_abs ) { $rank++; my $species_ab = $species_abs{$species_id}; $species_ab *= 100; # in percentage print OUT "$rank\t$species_id\t$species_ab\n"; } close OUT; return 1; } sub library_report { my ($cur_lib, $alphabet, $forward_reverse, $ranks_file, $fastq_file, $fasta_file, $qual_file, $coverage, $nof_seqs, $diversity) = @_; my $format = '%.3f'; $coverage = sprintf($format, $coverage); my $lib_alphabet = uc $alphabet; $lib_alphabet =~ s/protein/Proteic/i; my $lib_type = defined $forward_reverse ? 'amplicon' : 'shotgun'; print "$lib_alphabet $lib_type library $cur_lib:\n"; print " Community structure = $ranks_file\n"; print " FASTQ file = $fastq_file\n" if defined $fastq_file; print " FASTA file = $fasta_file\n" if defined $fasta_file; print " QUAL file = $qual_file\n" if defined $qual_file; print " Library coverage = $coverage x\n"; print " Number of reads = $nof_seqs\n"; print " Diversity (richness) = $diversity\n"; return 1; } #---------- GRINDER OO API ----------------------------------------------------# sub new { my ($class, @args) = @_; my $self = {}; bless $self, ref($class) || $class; $self->argparse(\@args); $self->initialize(); return $self; } sub next_lib { my ($self) = @_; $self->{cur_lib}++; $self->{cur_read} = 0; $self->{cur_total_reads} = 0; $self->{cur_coverage_fold} = 0; $self->{next_mate} = undef; $self->{positions} = undef; # Prepare sampling from this community my $c_struct = $self->{c_structs}[$self->{cur_lib}-1]; if ( defined $c_struct ) { # Create probabilities of picking genomes from community structure $self->{positions} = $self->proba_create($c_struct, $self->{length_bias}, $self->{copy_bias}); # Calculate needed number of sequences based on desired coverage ($self->{cur_total_reads}, $self->{cur_coverage_fold}) = $self->lib_coverage($c_struct); # If chimeras are needed, update the kmer collection with sequence abundance my $kmer_col = $self->{chimera_kmer_col}; if ($kmer_col) { my $weights; for (my $i = 0; $i < scalar @{$c_struct->{'ids'}}; $i++) { my $id = $c_struct->{'ids'}->[$i]; my $weight = $c_struct->{'abs'}->[$i]; $weights->{$id} = $weight; } $kmer_col->weights($weights); my ($kmers, $freqs) = $kmer_col->counts(undef, undef, 1); $self->{chimera_kmer_arr} = $kmers; $self->{chimera_kmer_cdf} = $self->proba_cumul($freqs); } } return $c_struct; } sub next_read { my ($self) = @_; $self->next_lib if not $self->{cur_lib}; $self->{cur_read}++; my $read; if ( $self->{cur_read} <= $self->{cur_total_reads} ) { # Generate the next read if ($self->{mate_length}) { # Generate a mate pair read if ( not $self->{next_mate} ) { # Generate a new pair of reads ($read, my $read2) = $self->next_mate_pair( ); # Save second read of the pair for later $self->{next_mate} = $read2; } else { # Use saved read $read = $self->{next_mate}; $self->{next_mate} = undef; } } else { # Generate a single shotgun or amplicon read $read = $self->next_single_read( ); } } return $read; } sub get_random_seed { my ($self) = @_; return $self->{random_seed}; } #---------- GRINDER INTERNALS -------------------------------------------------# sub argparse { # Process arguments my ($self, $args) = @_; my @old_args = @$args; # Read profile file $args = process_profile_file($args); # Parse and validate arguments with Getopt::Euclid Getopt::Euclid->process_args($args); # Check that Euclid worked, i.e. that there is at least one parameter in %ARGV if ( scalar keys %ARGV == 0 ) { die "Error: the command line arguments could not be parsed because of an ". "internal problem\n"; } # Get parsed arguments from %ARGV and put them in $self while (my ($arg, $val) = each %ARGV) { # Skip short argument names (they are also represented with long names) next if length($arg) <= 2; # Process long argument names. Copy their value into $self my $ref = ref($val); if (not $ref) { $self->{$arg} = $val; } elsif ($ref eq 'ARRAY') { @{$self->{$arg}} = @{$val}; } else { die "Error: unsupported operation on argument '$arg' which is a reference". "of type $ref\n"; } } return 1; } sub process_profile_file { # Find profile file in arguments and read the profiles. The profile file # only contains Grinder arguments, and lines starting with a '#' are comments. my ($args) = @_; my $file; for (my $i = 0; $i < scalar @$args; $i++) { my $arg = $$args[$i]; if ($arg =~ m/^-profile_file/ || $arg =~ m/-pf/) { $file = $$args[$i+1]; if ( (not defined $file) || ($file =~ m/^-/) ) { die "Error: no value was given to --profile_file\n"; } } } if (defined $file) { open my $in, '<', $file or die "Error: Could not read file '$file'\n$!\n"; my $profile = ''; while (my $line = <$in>) { chomp $line; next if $line =~ m/^\s*$/; next if $line =~ m/^\s*#/; $profile .= "$line "; } close $in; push @$args, split /\s+/, $profile; } return $args; } sub initialize { my ($self) = @_; # Parameter processing: read length distribution if ( (not ref $self->{read_dist}) or (ref $self->{read_dist} eq 'SCALAR') ){ $self->{read_dist} = [$self->{read_dist}]; } $self->{read_length} = $self->{read_dist}[0] || 100; $self->{read_model} = $self->{read_dist}[1] || 'uniform'; $self->{read_delta} = $self->{read_dist}[2] || 0; delete $self->{read_dist}; # Parameter processing: mate insert length distribution if ( (not ref $self->{insert_dist}) or (ref $self->{insert_dist} eq 'SCALAR') ){ $self->{insert_dist} = [$self->{insert_dist}]; } $self->{mate_length} = $self->{insert_dist}[0] || 0; $self->{mate_model} = $self->{insert_dist}[1] || 'uniform'; $self->{mate_delta} = $self->{insert_dist}[2] || 0; delete $self->{insert_dist}; # Parameter processing: genome abundance distribution if ( (not ref $self->{abundance_model}) or (ref $self->{abundance_model} eq 'SCALAR') ){ $self->{abundance_model} = [$self->{abundance_model}]; } $self->{distrib} = $self->{abundance_model}[0] || 'uniform'; $self->{param} = $self->{abundance_model}[1]; delete $self->{abundance_model}; # Parameter processing: point sequencing error distribution if ( (not ref $self->{mutation_dist}) or (ref $self->{mutation_dist} eq 'SCALAR') ) { $self->{mutation_dist} = [$self->{mutation_dist}]; } $self->{mutation_model} = $self->{mutation_dist}[0] || 'uniform'; $self->{mutation_para1} = $self->{mutation_dist}[1] || 0; $self->{mutation_para2} = $self->{mutation_dist}[2] || 0; delete $self->{mutation_dist}; # Parameter processing: mutation ratio $self->{mutation_ratio}[0] = $self->{mutation_ratio}[0] || 0; $self->{mutation_ratio}[1] = $self->{mutation_ratio}[1] || 0; my $sum = $self->{mutation_ratio}[0] + $self->{mutation_ratio}[1]; if ($sum == 0) { $self->{mutation_ratio}[0] = 50; $self->{mutation_ratio}[1] = 50; } else { $self->{mutation_ratio}[0] = $self->{mutation_ratio}[0] *100 / $sum; $self->{mutation_ratio}[1] = $self->{mutation_ratio}[1] *100 / $sum; } # Parameter processing: homopolymer model $self->{homopolymer_dist} = lc $self->{homopolymer_dist} if defined $self->{homopolymer_dist}; # Parameter processing: chimera_distribution if ( (not ref $self->{chimera_dist}) or (ref $self->{chimera_dist} eq 'SCALAR') ) { $self->{chimera_dist} = [$self->{chimera_dist}]; } if ($self->{chimera_dist}) { # Normalize to 1 my $total = 0; for my $multimera_abundance (@{$self->{chimera_dist}}) { $total += $multimera_abundance; } $self->{chimera_dist} = undef if $total == 0; $self->{chimera_dist} = normalize($self->{chimera_dist}, $total); # Calculate cdf if ($self->{chimera_perc}) { $self->{chimera_dist_cdf} = $self->proba_cumul( $self->{chimera_dist} ); } } # Parameter processing: fastq_output requires qual_levels if ( ($self->{fastq_output}) && (not scalar @{$self->{qual_levels}} > 0) ) { die "Error: needs to be specified to output FASTQ reads\n"; } # Random number generator: seed or be auto-seeded if (defined $self->{random_seed}) { srand( $self->{random_seed} ); } else { $self->{random_seed} = srand( ); } # Sequence length check my $max_read_length = $self->{read_length} + $self->{read_delta}; # approximation if ($self->{mate_length}) { my $min_mate_length = $self->{mate_length} - $self->{mate_delta}; if ($max_read_length > $min_mate_length) { die("Error: The mate insert length cannot be smaller than read length. ". "Try increasing the mate insert length or decreasing the read length\n"); } } # Pre-compile regular expression to check if reads are valid if ( (defined $self->{exclude_chars}) && (not $self->{exclude_chars} eq '') ) { $self->{exclude_re} = qr/[${\$self->{exclude_chars}}]/i; # Match any of the chars } # Read MIDs $self->{multiplex_ids} = $self->read_multiplex_id_file($self->{multiplex_ids}, $self->{num_libraries}) if defined $self->{multiplex_ids}; # Import reference sequences my $min_seq_len; if ($self->{chimera_dist_cdf}) { # Each chimera needs >= 1 bp. Use # sequences required by largest chimera. $min_seq_len = scalar @{$self->{chimera_dist}} + 1; } else { $min_seq_len = 1; } $self->{database} = $self->database_create( $self->{reference_file}, $self->{unidirectional}, $self->{forward_reverse}, $self->{abundance_file}, $self->{delete_chars}, $min_seq_len ); $self->initialize_alphabet; if ( ($self->{alphabet} eq 'protein') && ($self->{mate_length} != 0) && (not $self->{mate_orientation} eq 'FF') ) { die "Error: Can only use FF with proteic reference sequences\n"; } # Genome relative abundance in the different independent libraries to create $self->{c_structs} = $self->community_structures( $self->{database}->{ids}, $self->{abundance_file}, $self->{distrib}, $self->{param}, $self->{num_libraries}, $self->{shared_perc}, $self->{permuted_perc}, $self->{diversity}, $self->{forward_reverse} ); # Count kmers in the database if we need to form kmer-based chimeras if ($self->{chimera_perc} && $self->{chimera_kmer}) { # Get all wanted sequences (not all the sequences in the database) my %ids_hash; my @ids; my @seqs; for my $c_struct ( @{ $self->{c_structs} } ) { for my $id (@{$c_struct->{ids}}) { if (not exists $ids_hash{$id}) { $ids_hash{$id} = undef; push @ids, $id; push @seqs, $self->database_get_seq($id); } } } %ids_hash = (); # Now create a collection of kmers $self->{chimera_kmer_col} = Grinder::KmerCollection->new( -k => $self->{chimera_kmer}, -seqs => \@seqs, -ids => \@ids, )->filter_shared(2); } # Markers to keep track of computation progress $self->{cur_lib} = 0; $self->{cur_read} = 0; return $self; } sub initialize_alphabet { # Store the characters of the alphabet to use and calculate their cdf so that # we can easily pick them at random later my ($self) = @_; my $alphabet = $self->{alphabet}; # Characters available in alphabet my %alphabet_hash; if ($alphabet eq 'dna') { %alphabet_hash = ( 'A' => undef, 'C' => undef, 'G' => undef, 'T' => undef, ); } elsif ($alphabet eq 'rna') { %alphabet_hash = ( 'A' => undef, 'C' => undef, 'G' => undef, 'U' => undef, ); } elsif ($alphabet eq 'protein') { %alphabet_hash = ( 'A' => undef, 'R' => undef, 'N' => undef, 'D' => undef, 'C' => undef, 'Q' => undef, 'E' => undef, 'G' => undef, 'H' => undef, 'I' => undef, 'L' => undef, 'K' => undef, 'M' => undef, 'F' => undef, 'P' => undef, 'S' => undef, 'T' => undef, 'W' => undef, 'Y' => undef, 'V' => undef, #'B' => undef, # D or N #'Z' => undef, # Q or E #'X' => undef, # any amino-acid # J, O and U are the only unused letters ); } else { die "Error: unknown alphabet '$alphabet'\n"; } my $num_chars = scalar keys %alphabet_hash; $self->{alphabet_hash} = \%alphabet_hash; $self->{alphabet_arr} = [sort keys %alphabet_hash]; # CDF for this alphabet $self->{alphabet_complete_cdf} = $self->proba_cumul([(1/$num_chars) x $num_chars]); $self->{alphabet_truncated_cdf} = $self->proba_cumul([(1/($num_chars-1)) x ($num_chars-1)]); return 1; } sub read_multiplex_id_file { my ($self, $file, $nof_indep) = @_; my @mids; # Read FASTA file containing the MIDs my $in = Bio::SeqIO->newFh( -file => $file, -format => 'fasta', ); while (my $mid = <$in>) { push @mids, $mid->seq; } undef $in; # Sanity check my $nof_mids = scalar @mids; if ($nof_mids < $nof_indep) { die "Error: $nof_indep communities were requested but the MID file ". "had only $nof_mids sequences.\n"; } elsif ($nof_mids > $nof_indep) { warn "Warning: $nof_indep communities were requested but the MID file ". "contained $nof_mids sequences. Ignoring extraneous MIDs...\n"; } return \@mids; } sub community_structures { # Create communities with a specified structure, alpha and beta-diversity my ($self, $seq_ids, $abundance_file, $distrib, $param, $nof_indep, $perc_shared, $perc_permuted, $diversities, $forward_reverse) = @_; # Calculate community structures my $c_structs; if ($abundance_file) { # Sanity check if ( (scalar @$diversities > 1) || $$diversities[0] ) { warn "Warning: Diversity cannot be specified when an abundance file is specified. Ignoring it...\n"; } if ( ($perc_shared > 0) || ($perc_permuted < 100) ) { warn "Warning: Percent shared and percent permuted cannot be specified when an abundance file is specified. Ignoring them...\n"; } # One or several communities with specified rank-abundances $c_structs = community_given_abundances($abundance_file, $seq_ids); # Calculate number of libraries my $got_indep = scalar @$c_structs; if ($nof_indep != 1) { # 1 is the default value if ($nof_indep > $got_indep) { die "Error: $nof_indep communities were requested but the abundance file". " specified the abundances for only $got_indep.\n"; } elsif ($nof_indep < $got_indep) { warn "Warning: $nof_indep communities were requested by the abundance ". "file specified the abundances for $got_indep. Ignoring extraneous ". "communities specified in the file.\n"; } } $nof_indep = $got_indep; $self->{num_libraries} = $nof_indep; # Calculate diversities based on given community abundances ($self->{diversity}, $self->{overall_diversity}, $self->{shared_perc}, $self->{permuted_perc}) = community_calculate_diversities($c_structs); } else { # One or several communities with rank-abundance to be calculated # Sanity check if ($nof_indep == 1) { # 1 is the default value $nof_indep = scalar @$diversities; } if ($nof_indep != scalar @$diversities) { if (scalar @$diversities == 1) { # Use same diversity for all libraries my $diversity = $$diversities[0]; for my $i (1 .. $nof_indep-1) { push @$diversities, $diversity; } } else { die "Error: The number of richness values provided (".(scalar @$diversities). ") did not match the requested number of libraries ($nof_indep).\n"; } } $self->{num_libraries} = $nof_indep; # Select shared species my $c_ids; my $overall_diversity = 0; ($c_ids, $overall_diversity, $diversities, $perc_shared) = community_shared( $seq_ids, $nof_indep, $perc_shared, $diversities ); # Shuffle the abundance-ranks of the most abundant genomes ($c_ids, $perc_permuted) = community_permuted($c_ids, $perc_permuted); # Update values in $self object $self->{overall_diversity} = $overall_diversity; $self->{diversity} = $diversities; $self->{shared_perc} = $perc_shared; $self->{permuted_perc} = $perc_permuted; # Put results in a community structure "object" for my $c (1 .. $nof_indep) { # Assign a random parameter if needed my $comm_param = defined $param ? $param : randig(1,0.05); # Calculate relative abundance of the community members my $diversity = $self->{diversity}[$c-1]; my $c_abs = community_calculate_species_abundance($distrib, $comm_param, $diversity); my $c_ids = $$c_ids[$c-1]; my $c_struct; $c_struct->{'ids'} = $c_ids; $c_struct->{'abs'} = $c_abs; $c_struct->{'param'} = $comm_param; $c_struct->{'model'} = $distrib; push @$c_structs, $c_struct; } } # Convert sequence IDs to object IDs for my $c_struct (@$c_structs) { ($c_struct->{'abs'}, $c_struct->{'ids'}) = community_calculate_amplicon_abundance( $c_struct->{'abs'}, $c_struct->{'ids'}, $seq_ids ); } return $c_structs; } sub community_calculate_diversities { my ($c_structs) = @_; my ($diversities, $overall_diversity, $perc_shared, $perc_permuted) = (0, 0, 0, 0); # Calculate diversity (richness) based on given community abundances my $nof_libs = scalar @$c_structs; my %all_ids; my @richnesses; for my $c_struct (@$c_structs) { my $richness = 0; for my $i (0 .. scalar @{$$c_struct{ids}} - 1) { my $id = $$c_struct{ids}[$i]; my $ab = $$c_struct{abs}[$i]; next if not $ab; $richness++; if (defined $all_ids{$id}) { $all_ids{$id}++; } else { $all_ids{$id} = 1; } } push @richnesses, $richness; } $overall_diversity = scalar keys %all_ids; # Calculate percent shared my $nof_non_shared = 0; while (my ($id, $nof_samples) = each %all_ids) { if ($nof_samples < $nof_libs) { $nof_non_shared++; } } $perc_shared = ($overall_diversity - $nof_non_shared) * 100 / $overall_diversity; # TODO: Could calculate percent permuted return \@richnesses, $overall_diversity, $perc_shared, $perc_permuted; } sub community_given_abundances { # Read a file of genome abundances. The file should be space or tab-delimited. # The first column should be the IDs of genomes, and the subsequent columns is # for their relative abundance in different communities. An optional list of # valid IDs can be provided. Then the abundances are normalized so that their # sum is 1. my ($file, $seq_ids) = @_; # Read abundances my ($ids, $abs) = community_read_abundances($file); # Remove genomes with unknown IDs and calculate cumulative abundance my $totals; for my $comm_num (0 .. $#$ids) { my $i = 0; while ( $i < scalar @{$$ids[$comm_num]} ) { my $id = $$ids[$comm_num][$i]; my $ab = $$abs[$comm_num][$i]; if ( (scalar keys %$seq_ids == 0) || (exists $$seq_ids{$id}) ) { $$totals[$comm_num] += $ab; $i++; } else { die "Error: Requested reference sequence '$id' in file '$file' does not". " exist in the input database.\n"; splice @{$$ids[$comm_num]}, $i, 1; splice @{$$abs[$comm_num]}, $i, 1; } } } # Process the communities my @c_structs; for my $comm_num (0 .. scalar @$ids - 1) { my $comm_ids = $$ids[$comm_num]; my $comm_abs = $$abs[$comm_num]; my $comm_total = $$totals[$comm_num]; if ($comm_total == 0) { warn "Warning: The abundance of all the genomes for community ".($comm_num+1)." was zero. Skipping this community...\n"; next; } # Normalize the abundances $comm_abs = normalize($comm_abs, $comm_total); # Sort relative abundances by decreasing ($comm_abs, $comm_ids) = two_array_sort($comm_abs, $comm_ids); $comm_abs = [reverse(@$comm_abs)]; $comm_ids = [reverse(@$comm_ids)]; # Save community structure my $c_struct = { 'ids' => $comm_ids, 'abs' => $comm_abs }; push @c_structs, $c_struct; } return \@c_structs; } sub community_read_abundances { my ($file) = @_; # Read abundances of genomes from a file my $ids; # genome IDs my $abs; # genome relative abundance open my $io, '<', $file or die "Error: Could not read file '$file'\n$!\n"; while ( my $line = <$io> ) { # Ignore comment or empty lines if ( $line =~ m/^\s*$/ || $line =~ m/^#/ ) { next; } # Read abundance info from line my ($id, @rel_abs) = ($line =~ m/(\S+)/g); if (defined $id) { for my $comm_num (0 .. $#rel_abs) { my $rel_ab = $rel_abs[$comm_num]; push @{$$ids[$comm_num]}, $id; push @{$$abs[$comm_num]}, $rel_ab; } } else { warn "Warning: Line $. of file '$file' has an unknown format. Skipping it...\n"; } } close $io; return $ids, $abs; } sub community_permuted { # Change the abundance rank of species in all but the first community. # The number of species changed in abundance is determined by the percent # permuted, i.e. a given percentage of the most abundant species in this community. my ($c_ids, $perc_permuted) = @_; my $nof_indep = scalar @$c_ids; # Leave the first community alone, but permute the ones after for my $c ( 2 .. $nof_indep ) { my $ids = $$c_ids[$c-1]; my $diversity = scalar @$ids; # Number of top genomes to permute # Percent permuted is relative to diversity in this community my $nof_permuted = $perc_permuted / 100 * $diversity; $nof_permuted = int($nof_permuted + 0.5); # round number # Method published in Angly et al 2006 PLOS Biology # Take the $nof_permuted first ranks (most abundant genomes) and shuffle # (permute) their ranks amongst the $nof_permuted first ranks. # Caveat: cannot permute only 1 genome my $idxs; if ($nof_permuted > 0) { # Add shuffled top genomes my $permuted_idxs = randomize( [0 .. $nof_permuted-1] ); push @$idxs, @$permuted_idxs; } if ($diversity - $nof_permuted > 0) { # Add other genomes in same order my $non_permuted_idxs = [$nof_permuted .. $diversity-1]; push @$idxs, @$non_permuted_idxs; } @$ids = @$ids [ @$idxs ]; } return $c_ids, $perc_permuted; } sub community_shared { # Randomly split a library of sequences into a given number of groups that # share a specified percent of their genomes. # The % shared is the number of species shared / the total diversity in all communities # Input: arrayref of sequence ids # number of communities to produce # percentage of genomes shared between the communities # diversity (optional, will use all genomes if not specified) # Return: arrayref of IDs that are shared # arrayref of arrayref with the unique IDs for each community my ($seq_ids, $nof_indep, $perc_shared, $diversities) = @_; # If diversity is not specified (is '0'), use the maximum value possible my $nof_refs = scalar keys %$seq_ids; my $min_diversity = 1E99; for my $i (0 .. scalar @$diversities - 1) { if ($$diversities[$i] == 0) { $$diversities[$i] = $nof_refs / ( $perc_shared/100 + $nof_indep*(1-$perc_shared/100) ); $$diversities[$i] = int( $$diversities[$i] ); if ( ($i > 0) && ($$diversities[$i-1] != $$diversities[$i]) ) { die "Error: Define either all the diversities or none.\n"; } } if ($$diversities[$i] < $min_diversity) { $min_diversity = $$diversities[$i]; } } if ($min_diversity == 0) { die "Error: Cannot make $nof_indep libraries sharing $perc_shared % species". " from $nof_refs references\n"; } # Calculate the number of sequences to share, noting that the percent shared # is relative to the diversity of the least abundant library my $nof_shared = int($min_diversity * $perc_shared / 100); $perc_shared = $nof_shared * 100 / $min_diversity; # Unique sequences my @nof_uniques; my $sum_not_uniques = 0; for my $diversity (@$diversities) { my $nof_unique = $diversity - $nof_shared; $sum_not_uniques += $nof_unique; push @nof_uniques, $nof_unique; } # Overall diversity my $overall_diversity = $nof_shared + $sum_not_uniques; if ($nof_refs < $overall_diversity) { die "Error: The number of reference sequences available ($nof_refs) is not". " large enough to support the requested diversity ($overall_diversity ". "genomes overall with $perc_shared % genomes shared between $nof_indep ". "libraries)\n"; } # Add shared sequences my @ids = sort keys %$seq_ids; my @shared_ids; for (0 .. $nof_shared - 1) { # Pick a random sequence my $rand_offset = int(rand($nof_refs)); my $rand_id = splice @ids, $rand_offset, 1; $nof_refs = scalar(@ids); # Add this sequence in all independent libraries push @shared_ids, $rand_id; } # Add sequences not shared my @unique_ids; for my $lib_num (0 .. $nof_indep-1) { my $nof_unique = $nof_uniques[$lib_num]; for (0 .. $nof_unique - 1) { # Pick a random sequence my $rand_offset = int(rand($nof_refs)); my $rand_id = splice @ids, $rand_offset, 1; $nof_refs = scalar(@ids); # Add this sequence in this independent library only push @{$unique_ids[$lib_num]}, $rand_id; } } # Randomly pick the rank of the shared IDs my $shared_ranks = randomize( [1 .. $min_diversity] ); @$shared_ranks = splice @$shared_ranks, 0, $nof_shared; # Construct community ranks my @c_ranks; for my $lib_num (0 .. $nof_indep-1) { my $diversity = $$diversities[$lib_num]; my @ranks = (undef) x $diversity; # Add shared IDs for my $i (0 .. $nof_shared-1) { my $id = $shared_ids[$i]; my $rank = $$shared_ranks[$i]; $ranks[$rank-1] = $id; } # Add unique IDs my $ids = $unique_ids[$lib_num]; for my $rank (1 .. $diversity) { next if defined $ranks[$rank-1]; $ranks[$rank-1] = pop @$ids; } push @c_ranks, \@ranks; } return \@c_ranks, $overall_diversity, $diversities, $perc_shared; } sub community_calculate_species_abundance { # Calculate relative abundance based on a distribution and its parameters. # Input is a model, its 2 parameters, and the number of values to generate # Output is a reference to a list of relative abundance. The abundance adds up # to 1 my ($distrib, $param, $diversity) = @_; # First calculate rank-abundance values my $rel_ab; my $total = 0; if ($distrib eq 'uniform') { # no parameter my $val = 1 / $diversity; for (my $index = 0 ; $index < $diversity ; $index++) { $$rel_ab[$index] = $val; } $total = 1; } elsif ($distrib eq 'linear') { # no parameter my $slope = 1 / $diversity; for (my $index = 0 ; $index < $diversity ; $index++) { $$rel_ab[$index] = 1 - $slope * $index; $total += $$rel_ab[$index]; } } elsif ($distrib eq 'powerlaw') { # 1 parameter die "Error: The powerlaw model requires an input parameter (-p option)\n" if not defined $param; for (my $index = 0 ; $index < $diversity ; $index++) { $$rel_ab[$index] = ($index+1)**-$param; $total += $$rel_ab[$index]; } } elsif ($distrib eq 'logarithmic') { # 1 parameter die "Error: The logarithmic model requires an input parameter (-p option)\n" if not defined $param; for (my $index = 0 ; $index < $diversity ; $index++) { $$rel_ab[$index] = log($index+2)**-$param; $total += $$rel_ab[$index]; } } elsif ($distrib eq 'exponential') { # 1 parameter die "Error: The exponential model requires an input parameter (-p option)\n" if not defined $param; for (my $index = 0 ; $index < $diversity ; $index++) { $$rel_ab[$index] = exp(-($index+1)*$param); $total += $$rel_ab[$index]; } } else { die "Error: $distrib is not a valid rank-abundance distribution\n"; } # Normalize to 1 if needed if ($total != 1) { $rel_ab = normalize($rel_ab, $total); } return $rel_ab; } sub community_calculate_amplicon_abundance { my ($r_spp_abs, $r_spp_ids, $seq_ids) = @_; # Convert abundance of species into abundance of their amplicons because there # can be multiple amplicon per species and the amplicons have a different ID # from the species. The r_spp_ids and r_spp_abs arrays are the ID and abundance # of the species, sorted by decreasing abundance. # Give amplicons from the same species the same sampling probability for (my $i = 0; $i < scalar @$r_spp_ids; $i++) { my $species_ab = $$r_spp_abs[$i]; my $species_id = $$r_spp_ids[$i]; my @amplicon_ids = keys %{$seq_ids->{$species_id}}; my $nof_amplicons = scalar @amplicon_ids; my @amplicon_abs = ($species_ab / $nof_amplicons) x $nof_amplicons; splice @$r_spp_abs, $i, 1, @amplicon_abs; splice @$r_spp_ids, $i, 1, @amplicon_ids; $i += $nof_amplicons - 1; } return $r_spp_abs, $r_spp_ids; } sub next_single_read { # Generate a single shotgun or amplicon read my ($self) = @_; my $oids = $self->{c_structs}->[$self->{cur_lib}-1]->{ids}; my $mid = $self->{multiplex_ids}->[$self->{cur_lib}-1] || ''; my $lib_num = $self->{num_libraries} > 1 ? $self->{cur_lib} : undef; my $max_nof_tries = $self->{forward_reverse} ? 1 : 10; # Choose a random genome or amplicon my $genome = $self->rand_seq($self->{positions}, $oids); my $nof_tries = 0; my $shotgun_seq; do { # Error if we have exceeded the maximum number attempts $nof_tries++; if ($nof_tries > $max_nof_tries) { my $message = "Error: Could not take a random shotgun read without ". "forbidden characters from reference sequence ".$genome->seq->id; $message .= " ($max_nof_tries attempts made)" if ($max_nof_tries > 1); $message .= ".\n"; die $message; } # Chimerize the template sequence if needed $genome = $self->rand_seq_chimera($genome, $self->{chimera_perc}, $self->{positions}, $oids) if $self->{chimera_perc}; # Take a random orientation if needed my $orientation = ($self->{unidirectional} != 0) ? 1 : rand_seq_orientation(); # Choose a read size according to the specified distribution my $length = rand_seq_length($self->{read_length}, $self->{read_model}, $self->{read_delta}); # Shorten read length if too long my $max_length = $genome->length + length($mid); if ( $length > $max_length) { $length = $max_length; } # Read position on genome or amplicon my ($start, $end) = rand_seq_pos($genome, $length, $self->{forward_reverse}, $mid); # New sequence object $shotgun_seq = new_subseq($self->{cur_read}, $genome, $self->{unidirectional}, $orientation, $start, $end, $mid, undef, $lib_num, $self->{desc_track}, $self->{qual_levels}); # Simulate sequence aberrations and sequencing error if needed $shotgun_seq = $self->rand_seq_errors($shotgun_seq) if ($self->{homopolymer_dist} || $self->{mutation_para1}); } while ( $self->{exclude_re} && not $self->is_valid($shotgun_seq) ); return $shotgun_seq; } sub next_mate_pair { # Generate a shotgun mate pair my ($self) = @_; my $oids = $self->{c_structs}->[$self->{cur_lib}-1]->{ids}; my $mid = $self->{multiplex_ids}->[$self->{cur_lib}-1] || ''; my $lib_num = $self->{num_libraries} > 1 ? $self->{cur_lib} : undef; my $pair_num = int( $self->{cur_read} / 2 + 0.5 ); my $max_nof_tries = $self->{forward_reverse} ? 1 : 10; # Deal with mate orientation my @mate_orientations = split('', $self->{mate_orientation} ); my $mate_1_orientation = $mate_orientations[0] eq 'F' ? 1 : -1; my $mate_2_orientation = $mate_orientations[1] eq 'F' ? 1 : -1; # Choose a random genome my $genome = $self->rand_seq($self->{positions}, $oids); my $nof_tries = 0; my ($shotgun_seq_1, $shotgun_seq_2); while (1) { # Error if we have exceeded the maximum number of attempts $nof_tries++; if ($nof_tries > $max_nof_tries) { my $message = "Error: Could not take a pair of random shotgun read ". "without forbidden characters from reference sequence ".$genome->seq->id; $message .= " ($max_nof_tries attempts made)" if ($max_nof_tries > 1); $message .= ".\n"; die $message; } # Chimerize the template sequence if needed $genome = $self->rand_seq_chimera($genome, $self->{chimera_perc}, $self->{positions}, $oids) if $self->{chimera_perc}; # Take from a random strand if needed my $orientation = ($self->{unidirectional} != 0) ? 1 : rand_seq_orientation(); # Choose a mate pair length according to the specified distribution my $mate_length = rand_seq_length($self->{mate_length}, $self->{mate_model}, $self->{mate_delta}); # Shorten mate length if too long my $max_length = $genome->length + length($mid); if ( $mate_length > $max_length) { $mate_length = $max_length; } # Mate position on genome or amplicon my ($mate_start, $mate_end) = rand_seq_pos($genome, $mate_length, $self->{forward_reverse}, $mid); # Determine mate-pair position my $read_length = rand_seq_length($self->{read_length}, $self->{read_model}, $self->{read_delta}); my $seq_1_start = $mate_start; my $seq_1_end = $mate_start + $read_length - 1; $read_length = rand_seq_length($self->{read_length}, $self->{read_model}, $self->{read_delta}); my $seq_2_start = $mate_end - $read_length + 1; my $seq_2_end = $mate_end; if ($orientation == -1) { $mate_1_orientation = $orientation * $mate_1_orientation; $mate_2_orientation = $orientation * $mate_2_orientation; ($seq_1_start, $seq_2_start) = ($seq_2_start, $seq_1_start); ($seq_1_end , $seq_2_end ) = ($seq_2_end , $seq_1_end ); } # Generate first mate read $shotgun_seq_1 = new_subseq($pair_num, $genome, $self->{unidirectional}, $mate_1_orientation, $seq_1_start, $seq_1_end, $mid, '1', $lib_num, $self->{desc_track}, $self->{qual_levels}); $shotgun_seq_1 = $self->rand_seq_errors($shotgun_seq_1) if ($self->{homopolymer_dist} || $self->{mutation_para1}); if ($self->{exclude_re} && not $self->is_valid($shotgun_seq_1)) { next; } # Generate second mate read $shotgun_seq_2 = new_subseq($pair_num, $genome, $self->{unidirectional}, $mate_2_orientation, $seq_2_start, $seq_2_end, $mid, '2', $lib_num, $self->{desc_track}, $self->{qual_levels}); $shotgun_seq_2 = $self->rand_seq_errors($shotgun_seq_2) if ($self->{homopolymer_dist} || $self->{mutation_para1}); if ($self->{exclude_re} && not $self->is_valid($shotgun_seq_2)) { next; } # Both shotgun reads were valid last; } return $shotgun_seq_1, $shotgun_seq_2; } sub is_valid { # Return 1 if the sequence object is valid (is not empty and does not have any # of the specified forbidden characters), 0 otherwise. Specify the forbidden # characters as a single string, e.g. 'N-' to prevent any reads to have 'N' or # '-'. The search is case-insensitive. my ($self, $seq) = @_; if ($seq->seq =~ $self->{exclude_re}) { return 0; } return 1; } sub proba_create { my ($self, $c_struct, $size_dep, $copy_bias) = @_; # 1/ Calculate size-dependent, copy number-dependent probabilities my $probas = $self->proba_bias_dependency($c_struct, $size_dep, $copy_bias); # 2/ Generate proba starting position my $positions = $self->proba_cumul($probas); return $positions; } sub proba_bias_dependency { # Affect probability of picking a species by considering genome length or gene # copy number bias my ($self, $c_struct, $size_dep, $copy_bias) = @_; # Calculate probability my $probas; my $totproba = 0; my $diversity = scalar @{$c_struct->{'ids'}}; for my $i (0 .. scalar $diversity - 1) { my $proba = $c_struct->{'abs'}[$i]; if ( defined $self->{forward_reverse} ) { # Gene copy number bias if ($copy_bias) { my $refseq_id = $self->database_get_parent_id($c_struct->{'ids'}[$i]); my $nof_amplicons = scalar @{ $self->database_get_children_seq($refseq_id) }; $proba *= $nof_amplicons; } } else { # Genome length bias if ($size_dep) { my $id = $c_struct->{'ids'}[$i]; my $seq = $self->database_get_seq($id); my $len = $seq->length; $proba *= $len; } } push @$probas, $proba; $totproba += $proba; } # Normalize if necessary if ($totproba != 1) { $probas = normalize($probas, $totproba); } return $probas; } sub proba_cumul { # Put the probas end to end on a line and generate their start position on the # line (cumulative distribution). This will help with picking genomes or # nucleotides at random using the rand_weighted() subroutine. my ($self, $probas) = @_; my $sum = 0; return [ 0, map { $sum += $_ } @$probas ]; } sub rand_weighted { # Pick a random number based on the given cumulative probabilities. # Cumulative weights can be obtained from the proba_cumul() subroutine. my ($cum_probas, $pick, $index) = (shift, rand, -1); map { $pick >= $_ ? $index++ : return $index } @$cum_probas; } sub rand_seq { # Choose a sequence object randomly using a probability distribution my ($self, $positions, $oids) = @_; return $self->database_get_seq( $$oids[rand_weighted($positions)] ); } sub rand_seq_chimera { my ($self, $sequence, $chimera_perc, $positions, $oids) = @_; # Produce an amplicon that is a chimera of multiple sequences my $chimera; # Sanity check if ( (scalar @$oids < 2) && ($chimera_perc > 0) ) { die "Error: Not enough sequences to produce chimeras\n"; } # Fate now decides to produce a chimera or not if ( rand(100) <= $chimera_perc ) { # Pick multimera size my $m = $self->rand_chimera_size(); # Pick chimera fragments my @pos; if ($self->{chimera_kmer}) { @pos = $self->kmer_chimera_fragments($m); } else { # TODO: try to not provide $positions and $oids @pos = $self->rand_chimera_fragments($m, $sequence, $positions, $oids); } # Join chimera fragments $chimera = assemble_chimera(@pos); } else { # No chimera needed $chimera = $sequence; } return $chimera; } sub rand_chimera_size { # Decide of the number of sequences that the chimera will have, based on the # user-defined chimera distribution my ($self) = @_; return rand_weighted( $self->{chimera_dist_cdf} ) + 2; } sub kmer_chimera_fragments { # Return a kmer-based chimera of the required size. It is impossible to # randomly make one that will meet the required size. So, make multiple # attempts and save failed attempts in a pool for later reuse. my ($self, $m) = @_; my $frags; my $pool = $self->{chimera_kmer_pool}->{$m}; if ( (defined $pool) && (scalar @$pool > 0) ) { # Pick a chimera from the pool if possible $frags = shift @$pool; } else { # Attempt multiple times to generate a suitable chimera my $actual_m = 0; my $nof_tries = 0; my $max_nof_tries = 100; while ( ($actual_m < $m) && ($nof_tries <= $max_nof_tries) ) { $nof_tries++; $frags = [ $self->kmer_chimera_fragments_backend($m) ]; my $actual_m = scalar @$frags / 3; if ($nof_tries >= $max_nof_tries) { # Could not make a suitable chimera, accept the current chimera warn "Warning: Could not make a chimera of $m sequences after ". "$max_nof_tries attempts. Accepting a chimera of $actual_m sequences". " instead...\n"; $actual_m = $m; } if ($actual_m < $m) { # Add unsuitable chimera to the pool $pool = $self->{chimera_kmer_pool}->{$actual_m}; push @$pool, $frags; # Prevent the pool from growing too big my $max_pool_size = 100; shift @$pool if scalar @$pool > $max_pool_size; } else { # We got a suitable chimera... done last; } } } return @$frags; } sub kmer_chimera_fragments_backend { # Pick sequence fragments for multimeras where breakpoints are located on # shared kmers. A smaller chimera than requested may be returned. my ($self, $m) = @_; # Initial pair of fragments my @pos = $self->rand_kmer_chimera_initial(); # Append sequence to chimera for my $i (3 .. $m) { my ($seqid1, $start1, $end1, $seqid2, $start2, $end2) = $self->rand_kmer_chimera_extend($pos[-3], $pos[-2], $pos[-1]); if (not defined $seqid2) { # Could not find a sequence that shared a suitable kmer last; } @pos[-3..-1] = ($seqid1, $start1, $end1); push @pos, ($seqid2, $start2, $end2); } # Put sequence objects instead of sequence IDs for (my $i = 0; $i < scalar @pos; $i = $i+3) { my $seqid = $pos[$i]; my $seq = $self->database_get_seq($seqid); $pos[$i] = $seq; } return @pos; } sub rand_kmer_chimera_extend { # Pick another fragment to add to a kmer-based chimera. Return undef if none # can be found my ($self, $seqid1, $start1, $end1) = @_; my ($seqid2, $start2, $end2); # Get kmer frequencies in the end part of sequence 1 my ($kmer_arr, $freqs) = $self->{chimera_kmer_col}->counts($seqid1, $start1, 1); if (defined $kmer_arr) { # Pick a random kmer my $kmer_cdf = $self->proba_cumul($freqs); my $kmer = $self->rand_kmer_from_collection($kmer_arr, $kmer_cdf); # Get a sequence that has the same kmer as the first but is not the first $seqid2 = $self->rand_seq_with_kmer( $kmer, $seqid1 ); # Pick a suitable kmer start on that sequence if (defined $seqid2) { # Pick a random breakpoint # TODO: can we prefer a position not too crazy? my $pos1 = $self->rand_kmer_start( $kmer, $seqid1, $start1 ); my $pos2 = $self->rand_kmer_start( $kmer, $seqid2 ); # Place breakpoint about the middle of the kmer (kmers are at least 2 bp long) my $middle = int($self->{chimera_kmer} / 2); #$start1 = $start1; $end1 = $pos1 + $middle - 1; $start2 = $pos2 + $middle; $end2 = $self->database_get_seq($seqid2)->length; } } return $seqid1, $start1, $end1, $seqid2, $start2, $end2; } sub rand_kmer_chimera_initial { # Pick two sequences and start points to assemble a kmer-based bimera. # An optional starting sequence can be provided. my ($self, $seqid1) = @_; my $kmer; if (defined $seqid1) { # Try to pick a kmer from the requested sequence $kmer = $self->rand_kmer_of_seq( $seqid1 ); if (not defined $kmer) { die "Error: Sequence $seqid1 did not contain a suitable kmer\n"; } } else { # Pick a random kmer and sequence containing that kmer $kmer = $self->rand_kmer_from_collection(); $seqid1 = $self->rand_seq_with_kmer( $kmer ); } # Get a sequence that has the same kmer as the first but is not the first my $seqid2 = $self->rand_seq_with_kmer( $kmer, $seqid1 ); if (not defined $seqid2) { die "Error: Could not find another sequence that contains kmer $kmer\n"; } # Pick random breakpoint positions my $pos1 = $self->rand_kmer_start( $kmer, $seqid1 ); my $pos2 = $self->rand_kmer_start( $kmer, $seqid2 ); # Swap sequences so that pos1 < pos2 if ($pos1 > $pos2) { ($seqid1, $seqid2) = ($seqid2, $seqid1); ($pos1, $pos2) = ($pos2, $pos1); } # Place breakpoint about the middle of the kmer (kmers are at least 2 bp long) my $middle = int($self->{chimera_kmer} / 2); my $start1 = 1; my $end1 = $pos1 + $middle - 1; my $start2 = $pos2 + $middle; my $end2 = $self->database_get_seq($seqid2)->length; return $seqid1, $start1, $end1, $seqid2, $start2, $end2; } sub rand_kmer_from_collection { # Pick a kmer at random amongst all possible kmers in the collection my ($self, $kmer_arr, $kmer_cdf) = @_; my $kmers = defined $kmer_arr ? $kmer_arr : $self->{chimera_kmer_arr}; my $cdf = defined $kmer_cdf ? $kmer_cdf : $self->{chimera_kmer_cdf}; my $kmer = $$kmers[rand_weighted($cdf)]; return $kmer; } sub rand_seq_with_kmer { # Pick a random sequence ID that contains the given kmer. An optional sequence # ID to exclude can be provided. my ($self, $kmer, $excl) = @_; my $source; my ($sources, $freqs) = $self->{chimera_kmer_col}->sources($kmer, $excl, 1); my $num_sources = scalar @$sources; if ($num_sources > 0) { my $cdf = $self->proba_cumul($freqs); $source = $$sources[rand_weighted($cdf)]; } return $source; } sub rand_kmer_of_seq { # Pick a kmer amongst the possible kmers of the given sequence my ($self, $seqid) = @_; my $kmer; my ($kmers, $freqs) = $self->{chimera_kmer_col}->kmers($seqid, 1); if (scalar @$kmers > 0) { my $cdf = $self->proba_cumul($freqs); $kmer = $$kmers[rand_weighted($cdf)]; } return $kmer; } sub rand_kmer_start { # Pick a kmer starting position at random for the given kmer and sequence ID. # An optional minimum start position can be given. my ($self, $kmer, $source, $min_start) = @_; my $start; $min_start ||= 1; my $kmer_col = $self->{chimera_kmer_col}; my $kmer_starts = $kmer_col->positions($kmer, $source); # Find index of first index min_idx where position respects min_start my $min_idx; for (my $i = 0; $i < scalar @$kmer_starts; $i++) { my $start = $kmer_starts->[$i]; if ($start >= $min_start) { $min_idx = $i; last; } } if (defined $min_idx) { # Get a random index between min_idx and the end of the array my $rand_idx = $min_idx + int rand (scalar @$kmer_starts - $min_idx); # Get the value for this random index $start = $kmer_starts->[ $rand_idx ]; } return $start; } sub rand_chimera_fragments { # Pick which sequences and breakpoints to use to form a chimera my ($self, $m, $sequence, $positions, $oids) = @_; # Pick random sequences my @seqs = ($sequence); my $min_len = $sequence->length; for (my $i = 2; $i <= $m; $i++) { my $prev_seq = $seqs[-1]; my $seq; do { $seq = $self->rand_seq($positions, $oids); } while ($seq->seq->id eq $prev_seq->seq->id); push @seqs, $seq; my $seq_len = $seq->length; if ( (not defined $min_len) || ($seq_len < $min_len) ) { $min_len = $seq_len; } } # Pick random breakpoints my $nof_breaks = $m - 1; my %breaks = (); while ( scalar keys %breaks < $nof_breaks ) { # pick a random break my $rand_pos = 1 + int( rand($min_len - 1) ); $breaks{$rand_pos} = undef; } my @breaks = (1, sort {$a <=> $b} (keys %breaks)); undef %breaks; # Assemble the positional array my @pos; for (my $i = 1; $i <= $m; $i++) { my $seq = $seqs[$i-1]; my $start = shift @breaks; my $end = $breaks[0] || $seq->length; $breaks[0]++; push @pos, ($seq, $start, $end); } return @pos; } sub assemble_chimera { # Create a chimera sequence object based on positional information: # seq1, start1, end1, seq2, start2, end2, ... my (@pos) = @_; # Create the ID, sequence and split location my ($chimera_id, $chimera_seq); my $chimera_loc = Bio::Location::Split->new(); while ( my ($seq, $start, $end) = splice @pos, 0, 3 ) { # Add amplicon position $chimera_loc->add_sub_Location( $seq->location ); # Add amplicon ID if (defined $chimera_id) { $chimera_id .= ','; } # Add subsequence my $chimera = $seq->seq; $chimera_id .= $chimera->id; $chimera_seq .= $chimera->subseq($start, $end); } # Create a sequence object my $chimera = Bio::SeqFeature::SubSeq->new( -seq => Bio::PrimarySeq->new( -id => $chimera_id, -seq => $chimera_seq ), ); # Save split location object (a bit hackish) $chimera->{_chimera} = $chimera_loc; return $chimera; } sub rand_seq_orientation { # Return a random read orientation: 1 for uncomplemented, or -1 for complemented return int(rand()+0.5) ? 1 : -1; } sub rand_seq_errors { # Introduce sequencing errors (point mutations, homopolymers) in a sequence # based on error models my ($self, $seq) = @_; my $seq_str = $seq->seq(); my $error_specs = {}; # Error specifications # First, specify errors in homopolymeric stretches $error_specs = $self->rand_homopolymer_errors($seq_str, $error_specs) if $self->{homopolymer_dist}; # Then, specify point sequencing errors: substitutions, insertions, deletions $error_specs = $self->rand_point_errors($seq_str, $error_specs) if $self->{mutation_para1}; # Finally, actually implement the errors as per the specifications $seq->errors($error_specs) if (scalar keys %$error_specs > 0); return $seq; } sub rand_homopolymer_errors { # Specify sequencing errors in a sequence's homopolymeric stretches my ($self, $seq_str, $error_specs) = @_; while ( $seq_str =~ m/(.)(\1+)/g ) { # Found a homopolymer my $res = $1; # residue in homopolymer my $len = length($2) + 1; # length of the homopolymer my $pos = pos($seq_str) - $len + 1; # start of the homopolymer (residue no.) # Apply homopolymer model based on normal distribution N(mean, standard deviation) # Balzer: N(n, 0.03494 + n * 0.06856) Balzer et al. 2010 # Richter: N(n, 0.15 * sqrt(n)) Richter et al. 2008 # Margulies: N(n, 0.15 * n) Margulies et al. 2005 my ($stddev, $new_len, $diff) = (0, 0, 0); if ( $self->{homopolymer_dist} eq 'balzer' ) { $stddev = 0.03494 + $len * 0.06856; } elsif ($self->{homopolymer_dist} eq 'richter') { $stddev = 0.15 * sqrt($len); } elsif ($self->{homopolymer_dist} eq 'margulies') { $stddev = 0.15 * $len; } else { die "Error: Unknown homopolymer distribution '".$self->{homopolymer_dist}."'\n"; } $new_len = int( $len + $stddev * randn() + 0.5 ); $new_len = 0 if $new_len < 0; # We're done if no error was introduced $diff = $new_len - $len; next unless $diff; # Otherwise, track the error generated if ($diff > 0) { # Homopolymer extension push @{$$error_specs{$pos}{'+'}}, ($res) x $diff; } elsif ($diff < 0) { # Homopolymer shrinkage for my $offset ( 0 .. abs($diff)-1 ) { push @{$$error_specs{$pos+$offset}{'-'}}, undef; } } } return $error_specs; } sub rand_point_errors { # Do some random point sequencing errors on a sequence based on a model my ($self, $seq_str, $error_specs) = @_; # Mutation cumulative density functions (cdf) for this sequence length my $seq_len = length $seq_str; if ( not defined $self->{mutation_cdf}->{$seq_len} ) { my $mut_pdf = []; # probability density function my $mut_freq = 0; # average my $mut_sum = 0; if ($self->{mutation_model} eq 'uniform') { # Uniform error model # para1 is the average mutation frequency my $proba = 1 / $seq_len; $mut_pdf = [ map { $proba } (1 .. $seq_len) ]; $mut_freq = $self->{mutation_para1}; $mut_sum = 1; } elsif ($self->{mutation_model} eq 'linear') { # Linear error model # para 1 is the error rate at the 5' end of the read # para 2 is the error rate at the 3' end $mut_freq = abs( $self->{mutation_para2} + $self->{mutation_para1} ) / 2; if ($seq_len == 1) { $$mut_pdf[0] = $mut_freq; $mut_sum = $mut_freq } elsif ($seq_len > 1) { my $slope = ($self->{mutation_para2} - $self->{mutation_para1}) / ($seq_len-1); for my $i (0 .. $seq_len-1) { my $val = $self->{mutation_para1} + $i * $slope; $mut_sum += $val; $$mut_pdf[$i] = $val; } } } elsif ($self->{mutation_model} eq 'poly4') { # Fourth degree polynomial error model: e = para1 + para2 * i**4 for my $i (0 .. $seq_len-1) { my $val = $self->{mutation_para1} + $self->{mutation_para2} * ($i+1)**4; $mut_sum += $val; $$mut_pdf[$i] = $val; } $mut_freq = $mut_sum / $seq_len; } else { die "Error: '".$self->{mutation_model}."' is not a supported error distribution\n"; } # Normalize to 1 if needed if ($mut_sum != 1) { $mut_pdf = normalize($mut_pdf, $mut_sum); } # TODO: Could have sanity checks so that mut_pdf should have no values < 0 or > 100 $self->{mutation_cdf}->{$seq_len} = $self->proba_cumul($mut_pdf); $self->{mutation_avg}->{$seq_len} = $mut_freq; } my $mut_cdf = $self->{mutation_cdf}->{$seq_len}; my $mut_avg = $self->{mutation_avg}->{$seq_len}; # Number of mutations to make in this sequence is assumed to follow a Normal # distribution N( mutation_freq, 0.3 * mutation_freq ) my $read_mutation_freq = $mut_avg + 0.3 * $mut_avg * randn(); my $nof_mutations = $seq_len * $read_mutation_freq / 100; my $int_part = int $nof_mutations; my $dec_part = rand(1) < ($nof_mutations - $int_part); $nof_mutations = $int_part + $dec_part; # Exit without doing anything if there are no mutations to do return $error_specs if $nof_mutations == 0; # Make as many mutations in read as needed based on model my $subst_frac = $self->{mutation_ratio}->[0] / 100; for ( 1 .. $nof_mutations ) { # Position to mutate my $idx = rand_weighted( $mut_cdf ); # Do a substitution or indel if ( rand() <= $subst_frac ) { # Substitute at given position by a random replacement nucleotide push @{$$error_specs{$idx+1}{'%'}}, $self->rand_res( substr($seq_str, $idx, 1) ); } else { # Equiprobably insert or delete if ( rand() < 0.5 ) { # Insertion after given position push @{$$error_specs{$idx+1}{'+'}}, $self->rand_res(); } else { # Make a deletion at given position next if length($seq_str) == 1; # skip this deletion to avoid a 0 length push @{$$error_specs{$idx+1}{'-'}}, undef; } } } return $error_specs; } sub rand_res { # Pick a residue at random from the stored alphabet (dna, rna or protein). # An optional residue to exclude can be specified. my ($self, $not_nuc) = @_; my $cdf; my @res; if (not defined $not_nuc) { # Use complete alphabet @res = @{$self->{alphabet_arr}}; $cdf = $self->{alphabet_complete_cdf}; } else { # Remove non-desired residue from alphabet my %res = %{$self->{alphabet_hash}}; delete $res{uc($not_nuc)}; @res = sort keys %res; $cdf = $self->{alphabet_truncated_cdf}; } my $res = $res[rand_weighted($cdf)]; return $res; } sub rand_seq_length { # Choose the sequence length following a given probability distribution my($avg, $model, $stddev) = @_; my $length; if (not $model) { # No specified distribution: all the sequences have the length of the average $length = $avg; } else { if ($model eq 'uniform') { # Uniform distribution: integers uniformly distributed in [min, max] my ($min, $max) = ($avg - $stddev, $avg + $stddev); $length = $min + int( rand( $max - $min + 1 ) ); } elsif ($model eq 'normal') { # Gaussian distribution: decimal number normally distribution in N(avg,stddev) $length = $avg + $stddev * randn(); $length = int( $length + 0.5 ); } else { die "Error: '$model' is not a supported read or insert length distribution\n"; } } $length = 1 if ($length < 1); return $length; } sub rand_seq_pos { # Pick the coordinates (start and end) of an amplicon or random shotgun read. # Coordinate system: the first base is 1 and the number is inclusive, ie 1-2 # are the first two bases of the sequence my ($seq_obj, $read_length, $amplicon, $mid) = @_; # Read length includes the MID my $length = $read_length - length($mid); # Pick starting position my $start; if (defined $amplicon) { # Amplicon always start at first position of amplicon $start = 1; } else { # Shotgun reads start at a random position in genome $start = int( rand($seq_obj->length - $length + 1) ) + 1; } # End position my $end = $start + $length - 1; return $start, $end; } sub randn { # Normally distributed random value (mean 0 and standard deviation 1) using # the Box-Mueller transformation method, adapted from the Perl Cookbook my ($g1, $g2, $w); do { $g1 = 2 * rand() - 1; # uniformly distributed $g2 = 2 * rand() - 1; $w = $g1**2 + $g2**2; # variance } while ( $w >= 1 ); $w = sqrt( (-2 * log($w)) / $w ); # weight $g1 *= $w; # gaussian-distributed if ( wantarray ) { $g2 *= $w; return ($g1, $g2); } else { return $g1; } } sub randig { # Random value sampled from the inverse gaussian (a.k.a. Wald) distribution, # using the method at http://en.wikipedia.org/wiki/Inverse_Gaussian_distribution my ($mu, $lambda) = @_; my $y = randn()**2; my $x = $mu + ($mu**2 * $y)/(2 * $lambda) - $mu / (2 * $lambda) * sqrt(4 * $mu * $lambda * $y + $mu**2 * $y**2); if ( rand() <= $mu / ($mu + $x) ) { $y = $x; } else { $y = $mu**2 / $x; } return $y; } sub randomize { # Randomize an array using the Fisher-Yates shuffle described in the Perl # cookbook. my ($array) = @_; my $i; for ($i = @$array; --$i; ) { my $j = int rand($i+1); next if $i == $j; @$array[$i,$j] = @$array[$j,$i]; } return $array; } sub database_create { # Read and import sequences # Parameters: # * FASTA file containing the sequences or '-' for stdin. REQUIRED # * Sequencing unidirectionally? 0: no, 1: yes forward, -1: yes reverse # * Amplicon PCR primers (optional): Should be provided in a FASTA file and # use the IUPAC convention. If a primer sequence is given, any sequence # that does not contain the primer (or its reverse complement for the # reverse primer) is skipped, while any sequence that matches is trimmed # so that it is flush with the primer sequence # * Abundance file (optional): To avoid registering sequences in the database # unless they are needed # * Delete chars (optional): Characters to delete form the sequences. # * Minimum sequence size: Skip sequences smaller than that my ($self, $fasta_file, $unidirectional, $forward_reverse_primers, $abundance_file, $delete_chars, $min_len) = @_; $min_len = 1 if not defined $min_len; # Input filehandle if (not defined $fasta_file) { die "Error: No reference sequences provided\n"; } my $in; if ($fasta_file eq '-') { $in = Bio::SeqIO->newFh( -fh => \*STDIN, -format => 'fasta', ); } else { $in = Bio::SeqIO->newFh( -file => $fasta_file, -format => 'fasta', ); } # Get list of all IDs with a manually-specified abundance my %ids_to_keep; if ($abundance_file) { my ($ids) = community_read_abundances($abundance_file); for my $comm_num (0 .. $#$ids) { for my $gen_num ( 0 .. scalar @{$$ids[$comm_num]} - 1 ) { my $id = $$ids[$comm_num][$gen_num]; $ids_to_keep{$id} = undef; } } } # Initialize search for amplicons my $amplicon_search; if (defined $forward_reverse_primers) { $amplicon_search = Bio::Tools::AmpliconSearch->new( -primer_file => $forward_reverse_primers, ); } # Process database sequences my %seq_db; # hash of BioPerl sequence objects (all amplicons) my %seq_ids; # hash of reference sequence IDs and IDs of their amplicons my %mol_types; # hash of count of molecule types (dna, rna, protein) while ( my $ref_seq = <$in> ) { # Skip empty sequences next if not $ref_seq->seq; # Record molecule type $mol_types{$ref_seq->alphabet}++; # Skip unwanted sequences my $ref_seq_id = $ref_seq->id; next if (scalar keys %ids_to_keep > 0) && (not exists $ids_to_keep{$ref_seq_id}); # If we are sequencing from the reverse strand, reverse complement now if ($unidirectional == -1) { $ref_seq = $ref_seq->revcom; } # Extract amplicons if needed my $amp_seqs; if (defined $amplicon_search) { $amplicon_search->template($ref_seq); while (my $amp_seq = $amplicon_search->next_amplicon) { push @$amp_seqs, $amp_seq; } next if not defined $amp_seqs; } else { $amp_seqs = [ Bio::SeqFeature::SubSeq->new( -start => 1, -end => $ref_seq->length, -template => $ref_seq, ) ]; } for my $amp_seq (@$amp_seqs) { # Remove forbidden chars if ( (defined $delete_chars) && (not $delete_chars eq '') ) { ### TODO: Use Bio::Location::Split here as well? my $clean_seq = $amp_seq->seq; my $clean_seqstr = $clean_seq->seq; my $dirty_length = length $clean_seqstr; $clean_seqstr =~ s/[$delete_chars]//gi; my $num_dels = $dirty_length - length $clean_seqstr; if ($num_dels > 0) { # Update sequence with cleaned sequence string $clean_seq->seq($clean_seqstr); $amp_seq->seq($clean_seq); # Adjust (decrease) end of feature $amp_seq->end( $amp_seq->end - $num_dels ); } } # Skip the sequence if it is too small next if $amp_seq->length < $min_len; # Save amplicon sequence and create a barcode that identifies it my $amp_bc = create_amp_barcode($amp_seq, $ref_seq_id); $seq_db{$amp_bc} = $amp_seq; $seq_ids{$ref_seq_id}{$amp_bc} = undef; } } undef $in; # close the filehandle (maybe?!) # Error if no usable sequences in the database if (scalar keys %seq_ids == 0) { die "Error: No genome sequences could be used. If you specified a file of". " abundances for the genome sequences, make sure that their ID match the". " ID in the FASTA file. If you specified amplicon primers, verify that ". "they match some genome sequences.\n"; } # Determine database type: dna, rna, protein my $db_alphabet = $self->database_get_mol_type(\%mol_types); $self->{alphabet} = $db_alphabet; # Error if using amplicon on protein database if ( ($db_alphabet eq 'protein') && (defined $forward_reverse_primers) ) { die "Error: Cannot use amplicon primers with proteic reference sequences\n"; } # Error if using wrong direction on protein database if ( ($db_alphabet eq 'protein') && ($unidirectional != 1) ) { die "Error: Got = $unidirectional but can only use ". " = 1 with proteic reference sequences\n"; } my $database = { 'db' => \%seq_db, 'ids' => \%seq_ids }; return $database; } sub create_amp_barcode { # Create a barcode that is unique for each amplicon, store it and return it my ($amp_sf, $ref_seq_id) = @_; my $sep = '/'; my @elems = ($ref_seq_id, $amp_sf->start, $amp_sf->end, $amp_sf->strand || 1); #### TODO: follow the spec: id:start..end/strand my $barcode = join $sep, @elems; $amp_sf->{_barcode} = $barcode; return $barcode; } sub get_amp_barcode { # Get the amplicon barcode my ($amp_sf) = @_; return $amp_sf->{_barcode}; } sub database_get_mol_type { # Given a count of the different molecule types in the database, determine # what molecule type it is. my ($self, $mol_types) = @_; my $max_count = 0; my $max_type = ''; while (my ($type, $count) = each %$mol_types) { if ($count > $max_count) { $max_count = $count; $max_type = $type; } } my $other_count = 0; while (my ($type, $count) = each %$mol_types) { if (not $type eq $max_type) { $other_count += $count; } } if ($max_count < $other_count) { die "Error: Cannot determine what type of molecules the reference sequences". " are. Got $max_count sequences of type '$max_type' and $other_count ". "others.\n"; } if ( (not $max_type eq 'dna') && (not $max_type eq 'rna') && (not $max_type eq 'protein') ) { die "Error: Reference sequences are in an unknown alphabet '$max_type'\n"; } return $max_type; } sub database_get_all_oids { # Retrieve all object IDs from the database. These OIDs match the output of # the database_get_all_seqs method. my ($self) = @_; my @oids; while ( my ($oid, undef) = each %{$self->{database}->{db}} ) { push @oids, $oid; } return \@oids; } sub database_get_all_seqs { # Retrieve all sequence objects from the database. These sequence objects match # the output of the database_get_all_oids method. my ($self) = @_; my @seqs; while ( my (undef, $seq) = each %{$self->{database}->{db}} ) { push @seqs, $seq; } return \@seqs; } sub database_get_seq { # Retrieve a sequence object from the database based on its object ID my ($self, $oid) = @_; my $db = $self->{database}->{db}; my $seq_obj; if (not exists $$db{$oid}) { warn "Warning: Could not find sequence with object ID '$oid' in the database\n"; } $seq_obj = $$db{$oid}; return $seq_obj; } sub database_get_children_seq { # Retrieve all the sequences object made from a reference sequence based on the # ID of the reference sequence my ($self, $refseqid) = @_; my @children; while ( my ($child_oid, undef) = each %{$self->{database}->{ids}->{$refseqid}} ) { push @children, $self->database_get_seq($child_oid); } return \@children; } sub database_get_parent_id { # Based on a sequence object ID, retrieve the ID of the reference sequence it # came from my ($self, $oid) = @_; my $seq_id = $self->database_get_seq($oid)->seq->id; return $seq_id; } sub iupac_to_regexp { # Create a regular expression to match a nucleotide sequence that contain # degeneracies (in IUPAC standard) my ($seq) = @_; # Basic IUPAC code #my %iupac = ( # 'A' => ['A'], # 'C' => ['C'], # 'G' => ['G'], # 'T' => ['T'], # 'U' => ['U'], # 'R' => ['G', 'A'], # 'Y' => ['T', 'C'], # 'K' => ['G', 'T'], # 'M' => ['A', 'C'], # 'S' => ['G', 'C'], # 'W' => ['A', 'T'], # 'B' => ['G', 'T', 'C'], # 'D' => ['G', 'A', 'T'], # 'H' => ['A', 'C', 'T'], # 'V' => ['G', 'C', 'A'], # 'N' => ['A', 'G', 'C', 'T'], #); # IUPAC code # + degenerate primer residues matching ambiguous template residues # + degenerate primer residues matching uracil U my %iupac = ( 'A' => ['A'], 'C' => ['C'], 'G' => ['G'], 'T' => ['T'], 'U' => ['U'], 'R' => ['G', 'A', 'R'], 'Y' => ['T', 'U', 'C', 'Y'], 'K' => ['G', 'T', 'U', 'K'], 'M' => ['A', 'C', 'M'], 'S' => ['G', 'C', 'S'], 'W' => ['A', 'T', 'U', 'W'], 'B' => ['G', 'T', 'U', 'C', 'Y', 'K', 'S', 'B'], 'D' => ['G', 'A', 'T', 'U', 'R', 'K', 'W', 'D'], 'H' => ['A', 'C', 'T', 'U', 'Y', 'M', 'W', 'H'], 'V' => ['G', 'C', 'A', 'R', 'M', 'S', 'V'], 'N' => ['A', 'G', 'C', 'T', 'U', 'R', 'Y', 'K', 'M', 'S', 'W', 'B', 'D', 'H', 'V', 'N'], ); # Regular expression to catch this sequence my $regexp; for my $pos (0 .. length($seq)-1) { my $res = substr $seq, $pos, 1; my $iupacs = $iupac{$res}; if (not defined $iupacs) { die "Error: Primer sequence '$seq' is not a valid IUPAC sequence. ". "Offending character is '$res'.\n"; } if (scalar @$iupacs > 1) { $regexp .= '['.join('',@$iupacs).']'; } else { $regexp .= $$iupacs[0]; } } $regexp = qr/$regexp/i; return $regexp; } sub lib_coverage { # Calculate number of sequences needed to reach a given coverage. If the # number of sequences is provided, calculate the coverage my ($self, $c_struct) = @_; my $coverage = $self->{coverage_fold}; my $nof_seqs = $self->{total_reads}; my $read_length = $self->{read_length}; # 1/ Calculate library length and size my $ref_ids = $c_struct->{'ids'}; my $diversity = scalar @$ref_ids; my $lib_length = 0; for my $ref_id (@$ref_ids) { my $seqobj = $self->database_get_seq($ref_id); my $seqlen = $seqobj->length; $lib_length += $seqlen; } # 2/ Calculate number of sequences to generate based on desired coverage. If # both number of reads and coverage fold were given, coverage has precedence. if ($coverage) { $nof_seqs = ($coverage * $lib_length) / $read_length; if ( int($nof_seqs) < $nof_seqs ){ $nof_seqs = int($nof_seqs + 1); # ceiling } } # Make sure the last mate pair is always complete if ( $self->{mate_length} && ($nof_seqs % 2)) { $nof_seqs++; if (not $coverage) { warn "Warning: Added a read to make the last mate pair complete.\n" } } $coverage = ($nof_seqs * $read_length) / $lib_length; # 3/ Sanity check # TODO: Warn only if diversity was explicitely specified on the command line if ( $nof_seqs < $diversity) { warn "Warning: The number of reads to produce is lower than the required ". "diversity. Increase the coverage or number of reads to achieve this ". "diversity.\n"; $self->{diversity}->[$self->{cur_lib}-1] = $nof_seqs; } return $nof_seqs, $coverage; } sub new_subseq { # Create a new sequence object as a subsequence of another one and name it so # we can trace back where it came from my ($fragnum, $seq_feat, $unidirectional, $orientation, $start, $end, $mid, $mate_number, $lib_number, $tracking, $qual_levels) = @_; # If the length is too short for this read, no choice but to decrease it. $start = 1 if $start < 1; $end = $seq_feat->length if $end > $seq_feat->length; # Build the sequence ID my $name_sep = '_'; my $field_sep = ' '; my $mate_sep = '/'; # mate pair indicator, by convention my $newid = $fragnum; if (defined $lib_number) { $newid = $lib_number.$name_sep.$newid; } if (defined $mate_number) { $newid .= $mate_sep.$mate_number; } # Create a new simulated read object my $newseq = Bio::Seq::SimulatedRead->new( -id => $newid, -reference => $seq_feat->seq, -start => $start, -end => $end, -strand => $orientation, -mid => $mid, -track => $tracking, -coord_style => 'genbank', -qual_levels => $qual_levels, ); # Record location of amplicon on reference sequence in the sequence description if ( $seq_feat->isa('Bio::SeqFeature::Amplicon') || exists($seq_feat->{_chimera}) ) { my $amplicon_desc = gen_subseq_desc($seq_feat); my $desc = $newseq->desc; $desc =~ s/(reference=\S+)/$1 $amplicon_desc/; $newseq->desc($desc); } # Database sequences were already reverse-complemented if reverse sequencing # was requested if ($unidirectional == -1) { $orientation *= -1; $newseq = set_read_orientation($newseq, $orientation); } return $newseq; } sub gen_subseq_desc { my ($seq_feat) = @_; # Chimeras have several locations (a Bio::Location::Split object) my @locations; if (exists $seq_feat->{_chimera}) { @locations = $seq_feat->{_chimera}->sub_Location(); } else { @locations = ( $seq_feat->location ); } for (my $i = 0; $i <= scalar @locations - 1; $i++) { my $location = $locations[$i]; my $strand = $location->strand || 1; if ($strand == 1) { $location = $location->start.'..'.$location->end; } elsif ($strand == -1) { $location = 'complement('.$location->start.'..'.$location->end.')'; } else { die "Error: Strand should be -1 or 1, but got '".$location."'\n"; } $locations[$i] = $location; } my $desc = 'amplicon='.join(',', @locations); return $desc; } sub set_read_orientation { # Set read orientation and change its description accordingly my ($seq, $new_orientation) = @_; $seq->strand($new_orientation); my $desc = $seq->desc; $desc =~ s/position=(complement\()?(\d+)\.\.(\d+)(\))?/position=/; my ($start, $end) = ($2, $3); if ($new_orientation == -1) { $desc =~ s/position=/position=complement($start\.\.$end)/; } else { $desc =~ s/position=/position=$start\.\.$end/; } $seq->desc( $desc ); return $seq; } sub two_array_sort { # Sort 2 arrays by taking the numeric sort of the first one and keeping the # element of the second one match those of the first one my ($l1, $l2) = @_; my @ids = map { [ $$l1[$_], $$l2[$_] ] } (0..$#$l1); @ids = sort { $a->[0] <=> $b->[0] } @ids; my @k1; my @k2; for (my $i = 0; $i < scalar @ids; $i++) { $k1[$i] = $ids[$i][0]; $k2[$i] = $ids[$i][1]; } return \@k1, \@k2; } sub normalize { # Normalize an arrayref to 1. my ($arr, $total) = @_; if (not $total) { # total undef or 0 die "Error: Need to provide a valid total\n"; } $arr = [ map {$_ / $total} @$arr ]; return $arr; } 1; Grinder-0.5.4/CHANGES0000644000175000017500000003533712647177557014365 0ustar flofloooflofloooRevision history for Grinder 0.5.4 18-Jan-2016 Fixed bug causing the last mate pair to sometimes miss its second read (bug #13) Improved Grinder's test suite with respect to Perl's hash randomization (contributions from Francisco J. Ossandón) 0.5.3 30-May-2013 Completed fix for bug #6, multiplexed read close to length of reference (reported by Ali May). When generating multiple libraries, default is now to use 100% permuted to have dissimilar communities (consistent with 0% shared as default). 0.5.2 26-Apr-2013 Fixed bug causing reads too short when using MIDs and asking for a read length close to that of their reference (bug #6, reported by Ali May). 0.5.1 19-Apr-2013 Fixed bug preventing the insertion of very low frequency sequencing errors (bug #5). Updated average_genome_size script to use percentage in Grinder rank file instead of fractional numbers. 0.5.0 14-Jan-2013 Removed the =encoding statement which was breaking Pod::PlainText (reported by Lauren Bragg) Precompile regular expression 0.4.9 20-Nov-2012 Significant speedup by using improved version of Bioperl modules (reported by Ben Woodcroft). Fixed bug in RF and FR -oriented mates produced from the reverse- complement of the reference sequence (reported by Mike Imelfort). Mate orientation documented for IonTorrent (reported by Mike Imelfort). The relative abundances reported by Grinder in the rank file are now expressed as percentage instead of fractional for consistency. Updated dependencies to satisfy older Perl (reported by Stephen Turner). Build the documentation on author-side, not user side (reported by Stephen Turner). 0.4.8 10-Oct-2012 Fixed bug when making amplicon reads using specified relative abundances based on genomes with multiple amplicons (reported by Bertrand Bonnaud). Usage message improvements (reported by Xiao Yang). Delegated some operations to dedicated modules. 0.4.7 27-May-2012 Requiring Math::Random::MT version 1.14 should fix issues that Windows users are having (reported by David Koslicki). 0.4.6 27-May-2012 When generating kmer-based chimeras, save resources by only calculating the kmers of the reference sequences that are going to be used (improvement suggested by David Koslicki). Fixed an "undefined value" error when using kmer-based chimeras (reported by David Koslicki). Fixed an error when using kmer-based chimeras but not using all the reference sequences (reported by David Koslicki). 0.4.5 27-Jan-2012 Fixed bug when adding mutations linearly to a 1 bp read (reported by Robert Schmieder). Better handling of 0 bp reference sequences. Fixed bug when looking for amplicons on the reverse complement of a reference sequence. Properly remove the shortest of two amplicons, even if they are on different strands. 0.4.4 20-Jan-2012 Dependencies update: no need for Math::Random::MT::Perl anymore. 0.4.3 18-Jan-2012 Implemented multimeras, i.e. chimeras from more than two reference sequences (suggested by anonymous reviewer). See . Implemented chimeras where the breakpoints correspond to k-mers shared by the reference sequences (suggested by anonymous reviewer). See . 0.4.2 15-Dec-2011 Fixed incorrectly calculated relative abundances when using length bias (reported by Mike Imelfort and Mohamed Fauzi Haroon). 0.4.1 25-Nov-2011 The keyword 'strand' is not used anymore in the description of reads. Read coordinates are now reported like in the Genbank format: "position=complement(1..20)" instead of "position=1-20 strand=-1" Fixed bug reported by Dana Willner: when looking for full-length amplicon matches based on PCR primers, matches are now sought in the reference sequences but also in their reverse-complement Better handling of discrepancies between the number of libraries specified with the num_libraries option and in the abundance_file (reported by Dana Willner). 0.4.0 04-Nov-2011 Support for DNA, RNA and proteic reference sequences to produce genomic metagenomic, transcriptomic, metatranscriptomic, proteomic and metaproteomic datasets New error model suitable to simulate Illumina reads: 4th degree polynome Change in error model (mutation_distribution) parameter: - general syntax is now model_name, model_parameters... - the first parameter for the linear model is now the error rate at the 3' end of the reads, not the average error rate Speed improvement for position-specific error models Galaxy GUI fix so that the output is fastqsanger, not just fastq The reference_file parameter is now a required argument, so that running grinder without arguments displays the help (reported by Robert Schmieder) Fixed a bug that caused a crash when using an indel model and a homopolymer model simultaneously (reported by Robert Schmieder) Information displayed on screen now reports whether the library is a shotgun or amplicon library 0.3.9 18-Oct-2011 New option to select orientation of mate pairs New default for mate orientation: forward-reverse instead of forward-forward Handle empty reference sequence description more gracefully Galaxy GUI compatible with workflows and new tool shed 0.3.8 04-Oct-2011 Graphical interface for the Galaxy project Support for writing the output reads in FASTQ format (Sanger variant) Support for nested and overlapping amplicons Tests do not fail if the optional dependency Statistics::R is not installed Tested that Grinder works 100% on Windows Generating 100 reads by default instead of coverage 0.1x Fixed bug where read description was not created if unidirectional was set to -1 0.3.7 13-Sep-2011 Fixed bug in richter and margulies homopolymer error models Fixed bug so that output rank file now collapses amplicon by species The Grinder CLI script is now called 'grinder' (all lowercase) Option mutation_ratio has changed so that it is possible to specify indels without substitutions Location of amplicon relative to the reference sequence is now recorded in the read description using the 'amplicon' field Better reporting of chimeras in read descriptions using a comma-separated list for the 'amplicon' and 'reference' field Redundant sequencing errors (multiple errors at the same position) are now tracked in read descriptions New dependency: using Math::Random::MT Perl module for added speed Improved build and test mechanics Added tests for chimeras, indels, substitutions and homopolymers More comprehensive tests for seeding and random number generation 0.3.6 03-Aug-2011 Support for reference sequences that contain several amplicons Implemented a gene copy bias option for amplicon libraries Primers can now match RNA sequences or ambiguous residues of the reference sequence Automatic community structure parameter value picking when none is provided Fixed uniform insert and read length distribution Fixed quality scores, which were generated but never written to disk Write on screen when QUAL files are generated Added links to example databases that users can use as Grinder input Specified the URL where to report bugs More unit tests: community structure, read and insert length distributions amplicons with specified genome abundance 0.3.5 21-Jul-2011 Implemented a profile mechanisms to store user's preferred options Added a script to reverse the orientation of right-hand mates Fixed issues with reads with MIDs (in Bio::Seq::SimulatedRead) Library number in ID of first sequence in libraries with even number was wrong when mate pair was used Number of the pair in mate pair IDs was wrong Grinder development put under Git versioning control on SourceForge More unit tests Versioning fix 0.3.4 23-Jun-2011 New option to generate basic quality scores if desired (-qual_levels) New option to not track the read info in the read description (-desc_track) Objects returned by Grinder are now Bio::Seq::SimulatedRead Bioperl objects Double-quotes in read description are now escaped, i.e. '"' becomes '\"' Now using 'reference' instead of 'source' in read tracking description Changes in the defaults: uniform community structure instead of power law uniform read distribution instead of normally distributed 0.3.3 03-Mar-2011 New option to sequence from the reverse strand: see (suggested by Barry Cayford). Output FASTA files now named *reads* instead of *shotgun* because libraries can be amplicon too. Output file names now use numbers padded with zeroes so that, e.g. if 123 libraries were requested, their name is in 001, 002 ... 123. Output folder is now created automatically if it does not already exist. The next_read() method now returns only one read, even for mate pairs. Force the alphabet to DNA when reading the primer sequence file since degenerate primers can look like protein sequences. Fixed bug where Grinder sometimes created libraries even though there were not enough sequences to do it safely (reported by Dana Willner). When the number of reads to generate is smaller than the required diversity, the actual diversity reported reflects this now. Not reporting errors "Not enough sequences for chimera..." when there is less than 2 reads and chimera_perc is 0. Fixed bug in argument processing by Getopt::Euclid that affected repeated calls to the new() method. Fixed calculation of number of genomes shared. Clearly specified in the documentation that the percent shared is relative to the diversity of the least abundant library (reported by Dana Willner). Fixed calculation of the total library diversity. Many more Grinder test cases. 0.3.2 11-Feb-2011 New feature to specify specific characters to delete (N, -, ...) (suggested by Mike Imelfort) New method to retrieve the seed number used for the computation: $factory->get_random_seed When excluding specific characters, an amplicon read is attempted only once now More robust parsing of abundance file It is now a fatal error if sequences requested in an abundance file are not found in the genome file Small optimizations 0.3.1 08-Feb-2011 Support for making multiple libraries with different richness (diversity) values Fixed bug for communities with specified relative abundances (reported by Mike Imelfort) Better error messages for sequences that have a specified abundance 0.3.0 12-Jan-2011 Command-line arguments have changed; all have a short and long version Grinder API to allow to run Grinder inside Perl programs Support for amplicon sequencing For amplicon simulation, a forward and optional reverse primer (in IUPAC) can be specified Amplicon can be given multiplex identifiers (MIDs) Support for a generating chimeras Homopolymer error simulation More error models for point mutations (uniform and linear) Read error tracking in the sequence description New default is to produce reads with no errors New FASTA read description that specifies its source, position, strand, description and errors Option to take shotgun reads from reverse complement Support for specifying the structure of several communities manually Speed improvements 0.2.0 22-Sep-2010 New options available when generating multiple shotgun libraries. Alpha and beta diversity can be specified: * richness * percentage of genomes shared between libraries * percentage of the top genomes with a different abundance rank Revised way that mate pair reads are named. Example: >1000/1 seq3|31-60 >1000/2 seq3|41-70 Added utility to calculate average genome length from Grinder rank file 0.1.9 24-Jun-2010 Thanks to Ramsi Temanni for his suggestions and feedback regarding forbidden characters. Support for characters forbidden in the shotgun reads Little bugfix regarding default values for arguments that take a list of values 0.1.8 22-Apr-2010 Thanks to Albert Villela for his suggestions and feedback regarding paired reads. Changes in command-line options to accomodate new features Support for inputting a file specifying the abundance of the different genomes Support for mate pairs / paired end reads Support for uniform or normal distribution of read lengths and mate pair insert lengths Fixed bug causing an error when the number of reads in the input file cannot be divided by the number of independent libraries required Changed output sequence ID to a more consistent scheme 0.1.7 15-Feb-2010 Not keeping the sequences in memory anymore to preserve resources Really using the Math::Random::MT::Perl seeding facility 0.1.6 07-Dec-2009 Now using the Math::Random::MT::Perl seeding facility 0.1.5 24-Feb-2009 Grinder now has a proper installer (Perl module style) 0.1.4 Added basic report on libraries produced Fixed bug in number of sequences created when using independent libraries 0.1.3 Ability to generate several random shotgun libraries at once that do not contain any genome in common 0.1.2 Correction in the code to generate mutations Changed the defaults to use a powerlaw model and the size-dependent option The main module function now returns a hashref of rank-abundances 0.1.1 Introduction of the simulation of sequencing errors (substitutions and indels) Modified the way the random number generation is handled The main module function now returns an arrayref of Bio::Seq objects 0.1.0 Initial release Grinder-0.5.4/MANIFEST0000644000175000017500000000454712647202511014475 0ustar floflooofloflooobin/grinder bin/grinder.pod CHANGES galaxy/all_fasta.loc.sample galaxy/Galaxy_readme.txt galaxy/grinder.xml galaxy/stderr_wrapper.py galaxy/tool_data_table_conf.xml.sample inc/Module/AutoInstall.pm inc/Module/Install.pm inc/Module/Install/AuthorRequires.pm inc/Module/Install/AutoInstall.pm inc/Module/Install/AutoLicense.pm inc/Module/Install/AutoManifest.pm inc/Module/Install/Base.pm inc/Module/Install/Can.pm inc/Module/Install/Fetch.pm inc/Module/Install/Include.pm inc/Module/Install/Makefile.pm inc/Module/Install/Metadata.pm inc/Module/Install/PodFromEuclid.pm inc/Module/Install/ReadmeFromPod.pm inc/Module/Install/Scripts.pm inc/Module/Install/Win32.pm inc/Module/Install/WriteAll.pm lib/Grinder.pm lib/Grinder/Database.pm lib/Grinder/KmerCollection.pm LICENSE Makefile.PL man/average_genome_size.1 man/change_paired_read_orientation.1 man/grinder.1 MANIFEST This list of files META.yml MYMETA.json MYMETA.yml README README.htm t/00-load.t t/01-shotgun.t t/02-mates.t t/03-amplicon.t t/04-abundances.t t/05-forbidden.t t/06-seed.t t/07-diversity.t t/08-shared.t t/09-permuted.t t/10-quality.t t/11-tracking.t t/12-read-length.t t/13-insert-length.t t/14-genome-length-bias.t t/15-multiplex.t t/16-profile.t t/17-libraries.t t/18-amplicon-multiple.t t/19-gene-copy-bias.t t/20-community-structure.t t/21-errors.t t/22-homopolymers.t t/23-chimeras.t t/24-mate-orientation.t t/25-molecule-type.t t/26-combined-errors.t t/27-stdin.t t/28-revcom-amplicon.t t/29-kmer-collection.t t/30-kmer-chimeras.t t/31-shotgun-chimeras.t t/32-database.t t/data/abundance_kmers.txt t/data/abundances.txt t/data/abundances2.txt t/data/abundances_multiple.txt t/data/amplicon_database.fa t/data/database_dna.fa t/data/database_mixed.fa t/data/database_protein.fa t/data/database_rna.fa t/data/dirty_database.fa t/data/forward_primer.fa t/data/forward_reverse_primers.fa t/data/homopolymer_database.fa t/data/kmers.fa t/data/kmers2.fa t/data/mids.fa t/data/multiple_amplicon_database.fa t/data/nested_amplicon_database.fa t/data/oriented_database.fa t/data/profile.txt t/data/revcom_amplicon_database.fa t/data/reverse_forward_primers.fa t/data/reverse_primer.fa t/data/shotgun_database.fa t/data/shotgun_database_extended.fa t/data/shotgun_database_shared_kmers.fa t/data/single_amplicon_database.fa t/data/single_seq_database.fa t/pod.t t/TestUtils.pm utils/average_genome_size utils/change_paired_read_orientation Grinder-0.5.4/man/0000755000175000017500000000000012647202511014105 5ustar flofloooflofloooGrinder-0.5.4/man/change_paired_read_orientation.10000644000175000017500000001341612647202457022364 0ustar floflooofloflooo.\" Automatically generated by Pod::Man 2.28 (Pod::Simple 3.29) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{ . if \nF \{ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "CHANGE_PAIRED_READ_ORIENTATION 1" .TH CHANGE_PAIRED_READ_ORIENTATION 1 "2014-01-07" "perl v5.22.1" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" change_paired_read_orientation \- Change the orientation of paired\-end reads in a FASTA file .SH "DESCRIPTION" .IX Header "DESCRIPTION" Reverse the orientation, i.e. reverse-complement each right-hand paired-end read (\s-1ID\s0 ending in /2) in a \s-1FASTA\s0 file. .SH "REQUIRED ARGUMENTS" .IX Header "REQUIRED ARGUMENTS" .IP "" 4 .IX Item "" \&\s-1FASTA\s0 file containing the reads to re-orient. .IP "" 4 .IX Item "" Output \s-1FASTA\s0 file where to write the reads. .SH "COPYRIGHT" .IX Header "COPYRIGHT" Copyright 2009\-2012 Florent \s-1ANGLY\s0 .PP Grinder is free software: you can redistribute it and/or modify it under the terms of the \s-1GNU\s0 General Public License (\s-1GPL\s0) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but \s-1WITHOUT ANY WARRANTY\s0; without even the implied warranty of \&\s-1MERCHANTABILITY\s0 or \s-1FITNESS FOR A PARTICULAR PURPOSE. \s0 See the \&\s-1GNU\s0 General Public License for more details. You should have received a copy of the \s-1GNU\s0 General Public License along with Grinder. If not, see . .SH "BUGS" .IX Header "BUGS" All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: .PP Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge () and is under Git revision control. To get started with a patch, do: .PP .Vb 1 \& git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder .Ve Grinder-0.5.4/man/grinder.10000644000175000017500000011402112647202457015631 0ustar floflooofloflooo.\" Automatically generated by Pod::Man 2.28 (Pod::Simple 3.29) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{ . if \nF \{ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "GRINDER 1" .TH GRINDER 1 "2016-01-18" "perl v5.22.1" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" grinder \- A versatile omics shotgun and amplicon sequencing read simulator .SH "DESCRIPTION" .IX Header "DESCRIPTION" Grinder is a versatile program to create random shotgun and amplicon sequence libraries based on \s-1DNA, RNA\s0 or proteic reference sequences provided in a \s-1FASTA\s0 file. .PP Grinder can produce genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, metaproteomic shotgun and amplicon datasets from current sequencing technologies such as Sanger, 454, Illumina. These simulated datasets can be used to test the accuracy of bioinformatic tools under specific hypothesis, e.g. with or without sequencing errors, or with low or high community diversity. Grinder may also be used to help decide between alternative sequencing methods for a sequence-based project, e.g. should the library be paired-end or not, how many reads should be sequenced. .PP Grinder features include: .IP "\(bu" 4 shotgun or amplicon read libraries .IP "\(bu" 4 omics support to generate genomic, transcriptomic, proteomic, metagenomic, metatranscriptomic or metaproteomic datasets .IP "\(bu" 4 arbitrary read length distribution and number of reads .IP "\(bu" 4 simulation of \s-1PCR\s0 and sequencing errors (chimeras, point mutations, homopolymers) .IP "\(bu" 4 support for paired-end (mate pair) datasets .IP "\(bu" 4 specific rank-abundance settings or manually given abundance for each genome, gene or protein .IP "\(bu" 4 creation of datasets with a given richness (alpha diversity) .IP "\(bu" 4 independent datasets can share a variable number of genomes (beta diversity) .IP "\(bu" 4 modeling of the bias created by varying genome lengths or gene copy number .IP "\(bu" 4 profile mechanism to store preferred options .IP "\(bu" 4 available to biologists or power users through multiple interfaces: \s-1GUI, CLI\s0 and \s-1API\s0 .PP Briefly, given a \s-1FASTA\s0 file containing reference sequence (genomes, genes, transcripts or proteins), Grinder performs the following steps: .IP "1." 4 Read the reference sequences, and for amplicon datasets, extracts full-length reference \s-1PCR\s0 amplicons using the provided degenerate \s-1PCR\s0 primers. .IP "2." 4 Determine the community structure based on the provided alpha diversity (number of reference sequences in the library), beta diversity (number of reference sequences in common between several independent libraries) and specified rank\- abundance model. .IP "3." 4 Take shotgun reads from the reference sequences or amplicon reads from the full\- length reference \s-1PCR\s0 amplicons. The reads may be paired-end reads when an insert size distribution is specified. The length of the reads depends on the provided read length distribution and their abundance depends on the relative abundance in the community structure. Genome length may also biases the number of reads to take for shotgun datasets at this step. Similarly, for amplicon datasets, the number of copies of the target gene in the reference genomes may bias the number of reads to take. .IP "4." 4 Alter reads by inserting sequencing errors (indels, substitutions and homopolymer errors) following a position-specific model to simulate reads created by current sequencing technologies (Sanger, 454, Illumina). Write the reads and their quality scores in \s-1FASTA, QUAL\s0 and \s-1FASTQ\s0 files. .SH "CITATION" .IX Header "CITATION" If you use Grinder in your research, please cite: .PP .Vb 2 \& Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW (2012), Grinder: a \& versatile amplicon and shotgun sequence simulator, Nucleic Acids Reseach .Ve .PP Available from . .SH "VERSION" .IX Header "VERSION" This document refers to grinder version 0.5.3 .SH "AUTHOR" .IX Header "AUTHOR" Florent Angly .SH "INSTALLATION" .IX Header "INSTALLATION" .SS "Dependencies" .IX Subsection "Dependencies" You need to install these dependencies first: .IP "\(bu" 4 Perl (>= 5.6) .Sp .IP "\(bu" 4 make .Sp Many systems have make installed by default. If your system does not, you should install the implementation of make of your choice, e.g. \s-1GNU\s0 make: .PP The following \s-1CPAN\s0 Perl modules are dependencies that will be installed automatically for you: .IP "\(bu" 4 Bioperl modules (>=1.6.901). .Sp Note that some unreleased Bioperl modules have been included in Grinder. .IP "\(bu" 4 Getopt::Euclid (>= 0.3.4) .IP "\(bu" 4 List::Util .Sp First released with Perl v5.7.3 .IP "\(bu" 4 Math::Random::MT (>= 1.13) .IP "\(bu" 4 version (>= 0.77) .Sp First released with Perl v5.9.0 .SS "Procedure" .IX Subsection "Procedure" To install Grinder globally on your system, run the following commands in a terminal or command prompt: .PP On Linux, Unix, MacOS: .PP .Vb 2 \& perl Makefile.PL \& make .Ve .PP And finally, with administrator privileges: .PP .Vb 1 \& make install .Ve .PP On Windows, run the same commands but with nmake instead of make. .SS "No administrator privileges?" .IX Subsection "No administrator privileges?" If you do not have administrator privileges, Grinder needs to be installed in your home directory. .PP First, follow the instructions to install local::lib at . After local::lib is installed, every Perl module that you install manually or through the \s-1CPAN\s0 command-line application will be installed in your home directory. .PP Then, install Grinder by following the instructions detailed in the \*(L"Procedure\*(R" section. .SH "RUNNING GRINDER" .IX Header "RUNNING GRINDER" After installation, you can run Grinder using a command-line interface (\s-1CLI\s0), an application programming interface (\s-1API\s0) or a graphical user interface (\s-1GUI\s0) in Galaxy. .PP To get the usage of the \s-1CLI,\s0 type: .PP .Vb 1 \& grinder \-\-help .Ve .PP More information, including the documentation of the Grinder \s-1API,\s0 which allows you to run Grinder from within other Perl programs, is available by typing: .PP .Vb 1 \& perldoc Grinder .Ve .PP To run the \s-1GUI,\s0 refer to the Galaxy documentation at . .PP The 'utils' folder included in the Grinder package contains some utilities: .IP "average genome size:" 4 .IX Item "average genome size:" This calculates the average genome size (in bp) of a simulated random library produced by Grinder. .IP "change_paired_read_orientation:" 4 .IX Item "change_paired_read_orientation:" This reverses the orientation of each second mate-pair read (\s-1ID\s0 ending in /2) in a \s-1FASTA\s0 file. .SH "REFERENCE SEQUENCE DATABASE" .IX Header "REFERENCE SEQUENCE DATABASE" A variety of \s-1FASTA\s0 databases can be used as input for Grinder. For example, the GreenGenes database () contains over 180,000 16S rRNA clone sequences from various species which would be appropriate to produce a 16S rRNA amplicon dataset. A set of over 41,000 \s-1OTU\s0 representative sequences and their affiliation in seven different taxonomic sytems can also be used for the same purpose ( and ). The \&\s-1RDP \s0() and Silva () databases also provide many 16S rRNA sequences and Silva includes eukaryotic sequences. While 16S rRNA is a popular gene, datasets containing any type of gene could be used in the same fashion to generate simulated amplicon datasets, provided appropriate primers are used. .PP The >2,400 curated microbial genome sequences in the \s-1NCBI\s0 RefSeq collection () would also be suitable for producing 16S rRNA simulated datasets (using the adequate primers). However, the lower diversity of this database compared to the previous two makes it more appropriate for producing artificial microbial metagenomes. Individual genomes from this database are also very suitable for the simulation of single or double-barreled shotgun libraries. Similarly, the RefSeq database contains over 3,100 curated viral sequences () which can be used to produce artificial viral metagenomes. .PP Quite a few eukaryotic organisms have been sequenced and their genome or genes can be the basis for simulating genomic, transcriptomic (RNA-seq) or proteomic datasets. For example, you can use the human genome available at , the human transcripts downloadable from or the human proteome at . .SH "CLI EXAMPLES" .IX Header "CLI EXAMPLES" Here are a few examples that illustrate the use of Grinder in a terminal: .IP "1." 4 A shotgun \s-1DNA\s0 library with a coverage of 0.1X .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-coverage_fold 0.1 .Ve .IP "2." 4 Same thing but save the result files in a specific folder and with a specific name .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-coverage_fold 0.1 \-base_name my_name \-output_dir my_dir .Ve .IP "3." 4 A \s-1DNA\s0 shotgun library with 1000 reads .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-total_reads 1000 .Ve .IP "4." 4 A \s-1DNA\s0 shotgun library where species are distributed according to a power law .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-abundance_model powerlaw 0.1 .Ve .IP "5." 4 A \s-1DNA\s0 shotgun library with 123 genomes taken random from the given genomes .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-diversity 123 .Ve .IP "6." 4 Two \s-1DNA\s0 shotgun libraries that have 50% of the species in common .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-num_libraries 2 \-shared_perc 50 .Ve .IP "7." 4 Two \s-1DNA\s0 shotgun library with no species in common and distributed according to a exponential rank-abundance model. Note that because the parameter value for the exponential model is omitted, each library uses a different randomly chosen value: .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-num_libraries 2 \-abundance_model exponential .Ve .IP "8." 4 A \s-1DNA\s0 shotgun library where species relative abundances are manually specified .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-abundance_file my_abundances.txt .Ve .IP "9." 4 A \s-1DNA\s0 shotgun library with Sanger reads .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-read_dist 800 \-mutation_dist linear 1 2 \-mutation_ratio 80 20 .Ve .IP "10." 4 A \s-1DNA\s0 shotgun library with first-generation 454 reads .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-read_dist 100 normal 10 \-homopolymer_dist balzer .Ve .IP "11." 4 A paired-end \s-1DNA\s0 shotgun library, where the insert size is normally distributed around 2.5 kbp and has 0.2 kbp standard deviation .Sp .Vb 1 \& grinder \-reference_file genomes.fna \-insert_dist 2500 normal 200 .Ve .IP "12." 4 A transcriptomic dataset .Sp .Vb 1 \& grinder \-reference_file transcripts.fna .Ve .IP "13." 4 A unidirectional transcriptomic dataset .Sp .Vb 1 \& grinder \-reference_file transcripts.fna \-unidirectional 1 .Ve .Sp Note the use of \-unidirectional 1 to prevent reads to be taken from the reverse\- complement of the reference sequences. .IP "14." 4 A proteomic dataset .Sp .Vb 1 \& grinder \-reference_file proteins.faa \-unidirectional 1 .Ve .IP "15." 4 A 16S rRNA amplicon library .Sp .Vb 1 \& grinder \-reference_file 16Sgenes.fna \-forward_reverse 16Sprimers.fna \-length_bias 0 \-unidirectional 1 .Ve .Sp Note the use of \-length_bias 0 because reference sequence length should not affect the relative abundance of amplicons. .IP "16." 4 The same amplicon library with 20% of chimeric reads (90% bimera, 10% trimera) .Sp .Vb 1 \& grinder \-reference_file 16Sgenes.fna \-forward_reverse 16Sprimers.fna \-length_bias 0 \-unidirectional 1 \-chimera_perc 20 \-chimera_dist 90 10 .Ve .IP "17." 4 Three 16S rRNA amplicon libraries with specified MIDs and no reference sequences in common .Sp .Vb 1 \& grinder \-reference_file 16Sgenes.fna \-forward_reverse 16Sprimers.fna \-length_bias 0 \-unidirectional 1 \-num_libraries 3 \-multiplex_ids MIDs.fna .Ve .IP "18." 4 Reading reference sequences from the standard input, which allows you to decompress \s-1FASTA\s0 files on the fly: .Sp .Vb 1 \& zcat microbial_db.fna.gz | grinder \-reference_file \- \-total_reads 100 .Ve .SH "CLI REQUIRED ARGUMENTS" .IX Header "CLI REQUIRED ARGUMENTS" .IP "\-rf | \-reference_file | \-gf | \-genome_file " 4 .IX Item "-rf | -reference_file | -gf | -genome_file " \&\s-1FASTA\s0 file that contains the input reference sequences (full genomes, 16S rRNA genes, transcripts, proteins...) or '\-' to read them from the standard input. See the \&\s-1README\s0 file for examples of databases you can use and where to get them from. Default: \- .SH "CLI OPTIONAL ARGUMENTS" .IX Header "CLI OPTIONAL ARGUMENTS" .IP "\-tr | \-total_reads " 4 .IX Item "-tr | -total_reads " Number of shotgun or amplicon reads to generate for each library. Do not specify this if you specify the fold coverage. Default: 100 .IP "\-cf | \-coverage_fold " 4 .IX Item "-cf | -coverage_fold " Desired fold coverage of the input reference sequences (the output \s-1FASTA\s0 length divided by the input \s-1FASTA\s0 length). Do not specify this if you specify the number of reads directly. .IP "\-rd ... | \-read_dist ..." 4 .IX Item "-rd ... | -read_dist ..." Desired shotgun or amplicon read length distribution specified as: average length, distribution ('uniform' or 'normal') and standard deviation. .Sp Only the first element is required. Examples: .Sp .Vb 6 \& All reads exactly 101 bp long (Illumina GA 2x): 101 \& Uniform read distribution around 100+\-10 bp: 100 uniform 10 \& Reads normally distributed with an average of 800 and a standard deviation of 100 \& bp (Sanger reads): 800 normal 100 \& Reads normally distributed with an average of 450 and a standard deviation of 50 \& bp (454 GS\-FLX Ti): 450 normal 50 .Ve .Sp Reference sequences smaller than the specified read length are not used. Default: 100 .IP "\-id ... | \-insert_dist ..." 4 .IX Item "-id ... | -insert_dist ..." Create paired-end or mate-pair reads spanning the given insert length. Important: the insert is defined in the biological sense, i.e. its length includes the length of both reads and of the stretch of \s-1DNA\s0 between them: 0 : off, or: insert size distribution in bp, in the same format as the read length distribution (a typical value is 2,500 bp for mate pairs) Two distinct reads are generated whether or not the mate pair overlaps. Default: 0 .IP "\-mo | \-mate_orientation " 4 .IX Item "-mo | -mate_orientation " When generating paired-end or mate-pair reads (see ), specify the orientation of the reads (F: forward, R: reverse): .Sp .Vb 4 \& FR: \-\-\-> <\-\-\- e.g. Sanger, Illumina paired\-end, IonTorrent mate\-pair \& FF: \-\-\-> \-\-\-> e.g. 454 \& RF: <\-\-\- \-\-\-> e.g. Illumina mate\-pair \& RR: <\-\-\- <\-\-\- .Ve .Sp Default: \s-1FR\s0 .IP "\-ec | \-exclude_chars " 4 .IX Item "-ec | -exclude_chars " Do not create reads containing any of the specified characters (case insensitive). For example, use '\s-1NX\s0' to prevent reads with ambiguities (N or X). Grinder will error if it fails to find a suitable read (or pair of reads) after 10 attempts. Consider using , which may be more appropriate for your case. Default: '' .IP "\-dc | \-delete_chars " 4 .IX Item "-dc | -delete_chars " Remove the specified characters from the reference sequences (case-insensitive), e.g. '\-~*' to remove gaps (\- or ~) or terminator (*). Removing these characters is done once, when reading the reference sequences, prior to taking reads. Hence it is more efficient than . Default: .IP "\-fr | \-forward_reverse " 4 .IX Item "-fr | -forward_reverse " Use \s-1DNA\s0 amplicon sequencing using a forward and reverse \s-1PCR\s0 primer sequence provided in a \s-1FASTA\s0 file. The reference sequences and their reverse complement will be searched for \s-1PCR\s0 primer matches. The primer sequences should use the \&\s-1IUPAC\s0 convention for degenerate residues and the reference sequences that that do not match the specified primers are excluded. If your reference sequences are full genomes, it is recommended to use = 1 and = 0 to generate amplicon reads. To sequence from the forward strand, set to 1 and put the forward primer first and reverse primer second in the \s-1FASTA\s0 file. To sequence from the reverse strand, invert the primers in the \s-1FASTA\s0 file and use = \-1. The second primer sequence in the \s-1FASTA\s0 file is always optional. Example: \s-1AAACTYAAAKGAATTGRCGG\s0 and \s-1ACGGGCGGTGTGTRC\s0 for the 926F and 1392R primers that target the V6 to V9 region of the 16S rRNA gene. .IP "\-un | \-unidirectional " 4 .IX Item "-un | -unidirectional " Instead of producing reads bidirectionally, from the reference strand and its reverse complement, proceed unidirectionally, from one strand only (forward or reverse). Values: 0 (off, i.e. bidirectional), 1 (forward), \-1 (reverse). Use = 1 for amplicon and strand-specific transcriptomic or proteomic datasets. Default: 0 .IP "\-lb | \-length_bias " 4 .IX Item "-lb | -length_bias " In shotgun libraries, sample reference sequences proportionally to their length. For example, in simulated microbial datasets, this means that at the same relative abundance, larger genomes contribute more reads than smaller genomes (and all genomes have the same fold coverage). 0 = no, 1 = yes. Default: 1 .IP "\-cb | \-copy_bias " 4 .IX Item "-cb | -copy_bias " In amplicon libraries where full genomes are used as input, sample species proportionally to the number of copies of the target gene: at equal relative abundance, genomes that have multiple copies of the target gene contribute more amplicon reads than genomes that have a single copy. 0 = no, 1 = yes. Default: 1 .IP "\-md ... | \-mutation_dist ..." 4 .IX Item "-md ... | -mutation_dist ..." Introduce sequencing errors in the reads, under the form of mutations (substitutions, insertions and deletions) at positions that follow a specified distribution (with replacement): model (uniform, linear, poly4), model parameters. For example, for a uniform 0.1% error rate, use: uniform 0.1. To simulate Sanger errors, use a linear model where the errror rate is 1% at the 5' end of reads and 2% at the 3' end: linear 1 2. To model Illumina errors using the 4th degree polynome 3e\-3 + 3.3e\-8 * i^4 (Korbel et al 2009), use: poly4 3e\-3 3.3e\-8. Use the option to alter how many of these mutations are substitutions or indels. Default: uniform 0 0 .IP "\-mr ... | \-mutation_ratio ..." 4 .IX Item "-mr ... | -mutation_ratio ..." Indicate the percentage of substitutions and the number of indels (insertions and deletions). For example, use '80 20' (4 substitutions for each indel) for Sanger reads. Note that this parameter has no effect unless you specify the option. Default: 80 20 .IP "\-hd | \-homopolymer_dist " 4 .IX Item "-hd | -homopolymer_dist " Introduce sequencing errors in the reads under the form of homopolymeric stretches (e.g. \s-1AAA, CCCCC\s0) using a specified model where the homopolymer length follows a normal distribution N(mean, standard deviation) that is function of the homopolymer length n: .Sp .Vb 3 \& Margulies: N(n, 0.15 * n) , Margulies et al. 2005. \& Richter : N(n, 0.15 * sqrt(n)) , Richter et al. 2008. \& Balzer : N(n, 0.03494 + n * 0.06856) , Balzer et al. 2010. .Ve .Sp Default: 0 .IP "\-cp | \-chimera_perc " 4 .IX Item "-cp | -chimera_perc " Specify the percent of reads in amplicon libraries that should be chimeric sequences. The 'reference' field in the description of chimeric reads will contain the \s-1ID\s0 of all the reference sequences forming the chimeric template. A typical value is 10% for amplicons. This option can be used to generate chimeric shotgun reads as well. Default: 0 % .IP "\-cd ... | \-chimera_dist ..." 4 .IX Item "-cd ... | -chimera_dist ..." Specify the distribution of chimeras: bimeras, trimeras, quadrameras and multimeras of higher order. The default is the average values from Quince et al. 2011: '314 38 1', which corresponds to 89% of bimeras, 11% of trimeras and 0.3% of quadrameras. Note that this option only takes effect when you request the generation of chimeras with the option. Default: 314 38 1 .IP "\-ck | \-chimera_kmer " 4 .IX Item "-ck | -chimera_kmer " Activate a method to form chimeras by picking breakpoints at places where k\-mers are shared between sequences. represents k, the length of the k\-mers (in bp). The longer the kmer, the more similar the sequences have to be to be eligible to form chimeras. The more frequent a k\-mer is in the pool of reference sequences (taking into account their relative abundance), the more often this k\-mer will be chosen. For example, \s-1CHSIM \s0(Edgar et al. 2011) uses this method with a k\-mer length of 10 bp. If you do not want to use k\-mer information to form chimeras, use 0, which will result in the reference sequences and breakpoints to be taken randomly on the \*(L"aligned\*(R" reference sequences. Note that this option only takes effect when you request the generation of chimeras with the option. Also, this options is quite memory intensive, so you should probably limit yourself to a relatively small number of reference sequences if you want to use it. Default: 10 bp .IP "\-af | \-abundance_file " 4 .IX Item "-af | -abundance_file " Specify the relative abundance of the reference sequences manually in an input file. Each line of the file should contain a sequence name and its relative abundance (%), e.g. 'seqABC 82.1' or 'seqABC 82.1 10.2' if you are specifying two different libraries. .IP "\-am ... | \-abundance_model ..." 4 .IX Item "-am ... | -abundance_model ..." Relative abundance model for the input reference sequences: uniform, linear, powerlaw, logarithmic or exponential. The uniform and linear models do not require a parameter, but the other models take a parameter in the range [0, infinity). If this parameter is not specified, then it is randomly chosen. Examples: .Sp .Vb 3 \& uniform distribution: uniform \& powerlaw distribution with parameter 0.1: powerlaw 0.1 \& exponential distribution with automatically chosen parameter: exponential .Ve .Sp Default: uniform 1 .IP "\-nl | \-num_libraries " 4 .IX Item "-nl | -num_libraries " Number of independent libraries to create. Specify how diverse and similar they should be with , and . Assign them different \s-1MID\s0 tags with . Default: 1 .IP "\-mi | \-multiplex_ids " 4 .IX Item "-mi | -multiplex_ids " Specify an optional \s-1FASTA\s0 file that contains multiplex sequence identifiers (a.k.a MIDs or barcodes) to add to the sequences (one sequence per library, in the order given). The MIDs are included in the length specified with the \&\-read_dist option and can be altered by sequencing errors. See the MIDesigner or BarCrawl programs to generate \s-1MID\s0 sequences. .IP "\-di ... | \-diversity ..." 4 .IX Item "-di ... | -diversity ..." This option specifies alpha diversity, specifically the richness, i.e. number of reference sequences to take randomly and include in each library. Use 0 for the maximum richness possible (based on the number of reference sequences available). Provide one value to make all libraries have the same diversity, or one richness value per library otherwise. Default: 0 .IP "\-sp | \-shared_perc " 4 .IX Item "-sp | -shared_perc " This option controls an aspect of beta-diversity. When creating multiple libraries, specify the percent of reference sequences they should have in common (relative to the diversity of the least diverse library). Default: 0 % .IP "\-pp | \-permuted_perc " 4 .IX Item "-pp | -permuted_perc " This option controls another aspect of beta-diversity. For multiple libraries, choose the percent of the most-abundant reference sequences to permute (randomly shuffle) the rank-abundance of. Default: 100 % .IP "\-rs | \-random_seed " 4 .IX Item "-rs | -random_seed " Seed number to use for the pseudo-random number generator. .IP "\-dt | \-desc_track " 4 .IX Item "-dt | -desc_track " Track read information (reference sequence, position, errors, ...) by writing it in the read description. Default: 1 .IP "\-ql ... | \-qual_levels ..." 4 .IX Item "-ql ... | -qual_levels ..." Generate basic quality scores for the simulated reads. Good residues are given a specified good score (e.g. 30) and residues that are the result of an insertion or substitution are given a specified bad score (e.g. 10). Specify first the good score and then the bad score on the command-line, e.g.: 30 10. Default: .IP "\-fq | \-fastq_output " 4 .IX Item "-fq | -fastq_output " Whether to write the generated reads in \s-1FASTQ\s0 format (with Sanger-encoded quality scores) instead of \s-1FASTA\s0 and \s-1QUAL\s0 or not (1: yes, 0: no). need to be specified for this option to be effective. Default: 0 .IP "\-bn | \-base_name " 4 .IX Item "-bn | -base_name " Prefix of the output files. Default: grinder .IP "\-od | \-output_dir " 4 .IX Item "-od | -output_dir " Directory where the results should be written. This folder will be created if needed. Default: . .IP "\-pf | \-profile_file " 4 .IX Item "-pf | -profile_file " A file that contains Grinder arguments. This is useful if you use many options or often use the same options. Lines with comments (#) are ignored. Consider the profile file, 'simple_profile.txt': .Sp .Vb 3 \& # A simple Grinder profile \& \-read_dist 105 normal 12 \& \-total_reads 1000 .Ve .Sp Running: grinder \-reference_file viral_genomes.fa \-profile_file simple_profile.txt .Sp Translates into: grinder \-reference_file viral_genomes.fa \-read_dist 105 normal 12 \-total_reads 1000 .Sp Note that the arguments specified in the profile should not be specified again on the command line. .SH "CLI OUTPUT" .IX Header "CLI OUTPUT" For each shotgun or amplicon read library requested, the following files are generated: .IP "\(bu" 4 A rank-abundance file, tab-delimited, that shows the relative abundance of the different reference sequences .IP "\(bu" 4 A file containing the read sequences in \s-1FASTA\s0 format. The read headers contain information necessary to track from which reference sequence each read was taken and what errors it contains. This file is not generated if option was provided. .IP "\(bu" 4 If the option was specified, a file containing the quality scores of the reads (in \s-1QUAL\s0 format). .IP "\(bu" 4 If the option was provided, a file containing the read sequences in \s-1FASTQ\s0 format. .SH "API EXAMPLES" .IX Header "API EXAMPLES" The Grinder \s-1API\s0 allows to conveniently use Grinder within Perl scripts. Here is a synopsis: .PP .Vb 1 \& use Grinder; \& \& # Set up a new factory (see the OPTIONS section for a complete list of parameters) \& my $factory = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq ); \& \& # Process all shotgun libraries requested \& while ( my $struct = $factory\->next_lib ) { \& \& # The ID and abundance of the 3rd most abundant genome in this community \& my $id = $struct\->{ids}\->[2]; \& my $ab = $struct\->{abs}\->[2]; \& \& # Create shotgun reads \& while ( my $read = $factory\->next_read) { \& \& # The read is a Bioperl sequence object with these properties: \& my $read_id = $read\->id; # read ID given by Grinder \& my $read_seq = $read\->seq; # nucleotide sequence \& my $read_mid = $read\->mid; # MID or tag attached to the read \& my $read_errors = $read\->errors; # errors that the read contains \& \& # Where was the read taken from? The reference sequence refers to the \& # database sequence for shotgun libraries, amplicon obtained from the \& # database sequence, or could even be a chimeric sequence \& my $ref_id = $read\->reference\->id; # ID of the reference sequence \& my $ref_start = $read\->start; # start of the read on the reference \& my $ref_end = $read\->end; # end of the read on the reference \& my $ref_strand = $read\->strand; # strand of the reference \& \& } \& } \& \& # Similarly, for shotgun mate pairs \& my $factory = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq, \& \-insert_dist => 250 ); \& while ( $factory\->next_lib ) { \& while ( my $read = $factory\->next_read ) { \& # The first read is the first mate of the mate pair \& # The second read is the second mate of the mate pair \& # The third read is the first mate of the next mate pair \& # ... \& } \& } \& \& # To generate an amplicon library \& my $factory = Grinder\->new( \-reference_file => \*(Aqgenomes.fna\*(Aq, \& \-forward_reverse => \*(Aq16Sgenes.fna\*(Aq, \& \-length_bias => 0, \& \-unidirectional => 1 ); \& while ( $factory\->next_lib ) { \& while ( my $read = $factory\->next_read) { \& # ... \& } \& } .Ve .SH "API METHODS" .IX Header "API METHODS" The rest of the documentation details the available Grinder \s-1API\s0 methods. .SS "new" .IX Subsection "new" Title : new .PP Function: Create a new Grinder factory initialized with the passed arguments. Available parameters described in the \s-1OPTIONS\s0 section. .PP Usage : my \f(CW$factory\fR = Grinder\->new( \-reference_file => 'genomes.fna' ); .PP Returns : a new Grinder object .SS "next_lib" .IX Subsection "next_lib" Title : next_lib .PP Function: Go to the next shotgun library to process. .PP Usage : my \f(CW$struct\fR = \f(CW$factory\fR\->next_lib; .PP Returns : Community structure to be used for this library, where \f(CW$struct\fR\->{ids} is an array reference containing the IDs of the genome making up the community (sorted by decreasing relative abundance) and \f(CW$struct\fR\->{abs} is an array reference of the genome abundances (in the same order as the IDs). .SS "next_read" .IX Subsection "next_read" Title : next_read .PP Function: Create an amplicon or shotgun read for the current library. .PP Usage : my \f(CW$read\fR = \f(CW$factory\fR\->next_read; # for single read my \f(CW$mate1\fR = \f(CW$factory\fR\->next_read; # for mate pairs my \f(CW$mate2\fR = \f(CW$factory\fR\->next_read; .PP Returns : A sequence represented as a Bio::Seq::SimulatedRead object .SS "get_random_seed" .IX Subsection "get_random_seed" Title : get_random_seed .PP Function: Return the number used to seed the pseudo-random number generator .PP Usage : my \f(CW$seed\fR = \f(CW$factory\fR\->get_random_seed; .PP Returns : seed number .SH "COPYRIGHT" .IX Header "COPYRIGHT" Copyright 2009\-2013 Florent \s-1ANGLY\s0 .PP Grinder is free software: you can redistribute it and/or modify it under the terms of the \s-1GNU\s0 General Public License (\s-1GPL\s0) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but \s-1WITHOUT ANY WARRANTY\s0; without even the implied warranty of \&\s-1MERCHANTABILITY\s0 or \s-1FITNESS FOR A PARTICULAR PURPOSE. \s0 See the \&\s-1GNU\s0 General Public License for more details. You should have received a copy of the \s-1GNU\s0 General Public License along with Grinder. If not, see . .SH "BUGS" .IX Header "BUGS" All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: .PP Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge () and is under Git revision control. To get started with a patch, do: .PP .Vb 1 \& git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder .Ve Grinder-0.5.4/man/average_genome_size.10000644000175000017500000001345712647202457020210 0ustar floflooofloflooo.\" Automatically generated by Pod::Man 2.28 (Pod::Simple 3.29) .\" .\" Standard preamble: .\" ======================================================================== .de Sp \" Vertical space (when we can't use .PP) .if t .sp .5v .if n .sp .. .de Vb \" Begin verbatim text .ft CW .nf .ne \\$1 .. .de Ve \" End verbatim text .ft R .fi .. .\" Set up some character translations and predefined strings. \*(-- will .\" give an unbreakable dash, \*(PI will give pi, \*(L" will give a left .\" double quote, and \*(R" will give a right double quote. \*(C+ will .\" give a nicer C++. Capital omega is used to do unbreakable dashes and .\" therefore won't be available. \*(C` and \*(C' expand to `' in nroff, .\" nothing in troff, for use with C<>. .tr \(*W- .ds C+ C\v'-.1v'\h'-1p'\s-2+\h'-1p'+\s0\v'.1v'\h'-1p' .ie n \{\ . ds -- \(*W- . ds PI pi . if (\n(.H=4u)&(1m=24u) .ds -- \(*W\h'-12u'\(*W\h'-12u'-\" diablo 10 pitch . if (\n(.H=4u)&(1m=20u) .ds -- \(*W\h'-12u'\(*W\h'-8u'-\" diablo 12 pitch . ds L" "" . ds R" "" . ds C` "" . ds C' "" 'br\} .el\{\ . ds -- \|\(em\| . ds PI \(*p . ds L" `` . ds R" '' . ds C` . ds C' 'br\} .\" .\" Escape single quotes in literal strings from groff's Unicode transform. .ie \n(.g .ds Aq \(aq .el .ds Aq ' .\" .\" If the F register is turned on, we'll generate index entries on stderr for .\" titles (.TH), headers (.SH), subsections (.SS), items (.Ip), and index .\" entries marked with X<> in POD. Of course, you'll have to process the .\" output yourself in some meaningful fashion. .\" .\" Avoid warning from groff about undefined register 'F'. .de IX .. .nr rF 0 .if \n(.g .if rF .nr rF 1 .if (\n(rF:(\n(.g==0)) \{ . if \nF \{ . de IX . tm Index:\\$1\t\\n%\t"\\$2" .. . if !\nF==2 \{ . nr % 0 . nr F 2 . \} . \} .\} .rr rF .\" .\" Accent mark definitions (@(#)ms.acc 1.5 88/02/08 SMI; from UCB 4.2). .\" Fear. Run. Save yourself. No user-serviceable parts. . \" fudge factors for nroff and troff .if n \{\ . ds #H 0 . ds #V .8m . ds #F .3m . ds #[ \f1 . ds #] \fP .\} .if t \{\ . ds #H ((1u-(\\\\n(.fu%2u))*.13m) . ds #V .6m . ds #F 0 . ds #[ \& . ds #] \& .\} . \" simple accents for nroff and troff .if n \{\ . ds ' \& . ds ` \& . ds ^ \& . ds , \& . ds ~ ~ . ds / .\} .if t \{\ . ds ' \\k:\h'-(\\n(.wu*8/10-\*(#H)'\'\h"|\\n:u" . ds ` \\k:\h'-(\\n(.wu*8/10-\*(#H)'\`\h'|\\n:u' . ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'^\h'|\\n:u' . ds , \\k:\h'-(\\n(.wu*8/10)',\h'|\\n:u' . ds ~ \\k:\h'-(\\n(.wu-\*(#H-.1m)'~\h'|\\n:u' . ds / \\k:\h'-(\\n(.wu*8/10-\*(#H)'\z\(sl\h'|\\n:u' .\} . \" troff and (daisy-wheel) nroff accents .ds : \\k:\h'-(\\n(.wu*8/10-\*(#H+.1m+\*(#F)'\v'-\*(#V'\z.\h'.2m+\*(#F'.\h'|\\n:u'\v'\*(#V' .ds 8 \h'\*(#H'\(*b\h'-\*(#H' .ds o \\k:\h'-(\\n(.wu+\w'\(de'u-\*(#H)/2u'\v'-.3n'\*(#[\z\(de\v'.3n'\h'|\\n:u'\*(#] .ds d- \h'\*(#H'\(pd\h'-\w'~'u'\v'-.25m'\f2\(hy\fP\v'.25m'\h'-\*(#H' .ds D- D\\k:\h'-\w'D'u'\v'-.11m'\z\(hy\v'.11m'\h'|\\n:u' .ds th \*(#[\v'.3m'\s+1I\s-1\v'-.3m'\h'-(\w'I'u*2/3)'\s-1o\s+1\*(#] .ds Th \*(#[\s+2I\s-2\h'-\w'I'u*3/5'\v'-.3m'o\v'.3m'\*(#] .ds ae a\h'-(\w'a'u*4/10)'e .ds Ae A\h'-(\w'A'u*4/10)'E . \" corrections for vroff .if v .ds ~ \\k:\h'-(\\n(.wu*9/10-\*(#H)'\s-2\u~\d\s+2\h'|\\n:u' .if v .ds ^ \\k:\h'-(\\n(.wu*10/11-\*(#H)'\v'-.4m'^\v'.4m'\h'|\\n:u' . \" for low resolution devices (crt and lpr) .if \n(.H>23 .if \n(.V>19 \ \{\ . ds : e . ds 8 ss . ds o a . ds d- d\h'-1'\(ga . ds D- D\h'-1'\(hy . ds th \o'bp' . ds Th \o'LP' . ds ae ae . ds Ae AE .\} .rm #[ #] #H #V #F C .\" ======================================================================== .\" .IX Title "AVERAGE_GENOME_SIZE 1" .TH AVERAGE_GENOME_SIZE 1 "2014-01-07" "perl v5.22.1" "User Contributed Perl Documentation" .\" For nroff, turn off justification. Always turn off hyphenation; it makes .\" way too many mistakes in technical documents. .if n .ad l .nh .SH "NAME" average_genome_size \- Calculate the average genome size (in bp) of species in a Grinder library .SH "DESCRIPTION" .IX Header "DESCRIPTION" Calculate the average genome size (in bp) of species in a Grinder library given the library composition and the full-genomes used to produce it. .SH "REQUIRED ARGUMENTS" .IX Header "REQUIRED ARGUMENTS" .IP "" 4 .IX Item "" \&\s-1FASTA\s0 file containing the full-genomes used to produce the Grinder library. .IP "" 4 .IX Item "" Grinder rank file that describes the library composition. .SH "COPYRIGHT" .IX Header "COPYRIGHT" Copyright 2009\-2012 Florent \s-1ANGLY\s0 .PP Grinder is free software: you can redistribute it and/or modify it under the terms of the \s-1GNU\s0 General Public License (\s-1GPL\s0) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but \s-1WITHOUT ANY WARRANTY\s0; without even the implied warranty of \&\s-1MERCHANTABILITY\s0 or \s-1FITNESS FOR A PARTICULAR PURPOSE. \s0 See the \&\s-1GNU\s0 General Public License for more details. You should have received a copy of the \s-1GNU\s0 General Public License along with Grinder. If not, see . .SH "BUGS" .IX Header "BUGS" All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: .PP Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge () and is under Git revision control. To get started with a patch, do: .PP .Vb 1 \& git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder .Ve Grinder-0.5.4/MYMETA.json0000644000175000017500000000367712647202457015247 0ustar floflooofloflooo{ "abstract" : "A versatile omics shotgun and amplicon sequencing read simulator", "author" : [ "Florent Angly " ], "dynamic_config" : 0, "generated_by" : "Module::Install version 1.16, CPAN::Meta::Converter version 2.150005", "license" : [ "unknown" ], "meta-spec" : { "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec", "version" : "2" }, "name" : "Grinder", "no_index" : { "directory" : [ "inc", "t" ] }, "prereqs" : { "build" : { "requires" : { "ExtUtils::MakeMaker" : "6.59", "Test::More" : "0", "Test::Warn" : "0" } }, "configure" : { "requires" : { "ExtUtils::MakeMaker" : "0" } }, "runtime" : { "requires" : { "Bio::DB::Fasta" : "0", "Bio::Location::Split" : "0", "Bio::PrimarySeq" : "0", "Bio::Root::Root" : "0", "Bio::Root::Version" : "1.006923", "Bio::Seq::SimulatedRead" : "0", "Bio::SeqFeature::SubSeq" : "0", "Bio::SeqIO" : "0", "Bio::Tools::AmpliconSearch" : "0", "Getopt::Euclid" : "v0.4.4", "List::Util" : "0", "Math::Random::MT" : "1.16", "perl" : "5.006", "version" : "0.77" } } }, "release_status" : "stable", "resources" : { "bugtracker" : { "web" : "http://sourceforge.net/tracker/?group_id=244196&atid=1124737" }, "homepage" : "http://sourceforge.net/projects/biogrinder/", "license" : [ "http://opensource.org/licenses/gpl-3.0.html" ], "repository" : { "type" : "git", "url" : "git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder" } }, "version" : "0.005004", "x_serialization_backend" : "JSON::PP version 2.27300" } Grinder-0.5.4/galaxy/0000755000175000017500000000000012647202511014617 5ustar flofloooflofloooGrinder-0.5.4/galaxy/stderr_wrapper.py0000755000175000017500000000322212263016714020240 0ustar floflooofloflooo#!/usr/bin/env python """ Wrapper that executes a program with its arguments but reports standard error messages only if the program exit status was not 0. This is useful to prevent Galaxy to interpret that there was an error if something was printed on stderr, e.g. if this was simply a warning. Example: ./stderr_wrapper.py myprog arg1 -f arg2 Author: Florent Angly """ import sys, subprocess assert sys.version_info[:2] >= ( 2, 4 ) def stop_err( msg ): sys.stderr.write( "%s\n" % msg ) sys.exit() def __main__(): # Get command-line arguments args = sys.argv # Remove name of calling program, i.e. ./stderr_wrapper.py args.pop(0) # If there are no arguments left, we're done if len(args) == 0: return # If one needs to silence stdout #args.append( ">" ) #args.append( "/dev/null" ) #cmdline = " ".join(args) #print cmdline try: # Run program proc = subprocess.Popen( args=args, shell=False, stderr=subprocess.PIPE ) returncode = proc.wait() # Capture stderr, allowing for case where it's very large stderr = '' buffsize = 1048576 try: while True: stderr += proc.stderr.read( buffsize ) if not stderr or len( stderr ) % buffsize != 0: break except OverflowError: pass # Running Grinder failed: write error message to stderr if returncode != 0: raise Exception, stderr except Exception, e: # Running Grinder failed: write error message to stderr stop_err( 'Error: ' + str( e ) ) if __name__ == "__main__": __main__() Grinder-0.5.4/galaxy/grinder.xml0000644000175000017500000007461712263016714017014 0ustar floflooofloflooo versatile omic shotgun and amplicon read simulator grinder grinder --version stderr_wrapper.py grinder #if $reference_file.specify == "builtin": -reference_file ${ filter( lambda x: str( x[0] ) == str( $reference_file.value ), $__app__.tool_data_tables[ 'all_fasta' ].get_fields() )[0][-1] } #else if $reference_file.specify == "uploaded": -reference_file $reference_file.value #end if #if str($coverage_fold): -coverage_fold $coverage_fold #end if #if str($total_reads): -total_reads $total_reads #end if #if str($read_dist): -read_dist $read_dist #end if #if str($insert_dist): -insert_dist $insert_dist #end if #if str($mate_orientation): -mate_orientation $mate_orientation #end if #if str($exclude_chars): -exclude_chars $exclude_chars #end if #if str($delete_chars): -delete_chars $delete_chars #end if #if str($forward_reverse) != "None": -forward_reverse $forward_reverse #end if #if str($unidirectional): -unidirectional $unidirectional #end if #if str($length_bias): -length_bias $length_bias #end if #if str($copy_bias): -copy_bias $copy_bias #end if #if str($mutation_dist): -mutation_dist $mutation_dist #end if #if str($mutation_ratio): -mutation_ratio $mutation_ratio #end if #if str($homopolymer_dist): -homopolymer_dist $homopolymer_dist #end if #if str($chimera_perc): -chimera_perc $chimera_perc #end if #if str($chimera_dist): -chimera_dist $chimera_dist #end if #if str($chimera_kmer): -chimera_kmer $chimera_kmer #end if #if str($abundance_file) != "None": -abundance_file $abundance_file #end if #if str($abundance_model): -abundance_model $abundance_model #end if #if str($num_libraries): -num_libraries $num_libraries #end if #if str($multiplex_ids) != "None": -multiplex_ids $multiplex_ids #end if #if str($diversity): -diversity $diversity #end if #if str($shared_perc): -shared_perc $shared_perc #end if #if str($permuted_perc): -permuted_perc $permuted_perc #end if #if str($random_seed): -random_seed $random_seed #end if #if str($permuted_perc): -desc_track $desc_track #end if #if str($qual_levels): -qual_levels $qual_levels #end if #if str($fastq_output) == '1': -fastq_output $fastq_output #end if #if str($profile_file) != "None": -profile_file $profile_file.value #end if int(str(num_libraries)) == 1 int(str(num_libraries)) == 1 and fastq_output == 0 int(str(num_libraries)) == 1 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) == 1 and fastq_output == 1 int(str(num_libraries)) >= 2 int(str(num_libraries)) >= 2 and fastq_output == 0 int(str(num_libraries)) >= 2 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 2 and fastq_output == 1 int(str(num_libraries)) >= 2 int(str(num_libraries)) >= 2 and fastq_output == 0 int(str(num_libraries)) >= 2 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 2 and fastq_output == 1 int(str(num_libraries)) >= 3 int(str(num_libraries)) >= 3 and fastq_output == 0 int(str(num_libraries)) >= 3 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 3 and fastq_output == 1 int(str(num_libraries)) >= 4 int(str(num_libraries)) >= 4 and fastq_output == 0 int(str(num_libraries)) >= 4 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 4 and fastq_output == 1 int(str(num_libraries)) >= 5 int(str(num_libraries)) >= 5 and fastq_output == 0 int(str(num_libraries)) >= 5 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 5 and fastq_output == 1 int(str(num_libraries)) >= 6 int(str(num_libraries)) >= 6 and fastq_output == 0 int(str(num_libraries)) >= 6 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 6 and fastq_output == 1 int(str(num_libraries)) >= 7 int(str(num_libraries)) >= 7 and fastq_output == 0 int(str(num_libraries)) >= 7 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 7 and fastq_output == 1 int(str(num_libraries)) >= 8 int(str(num_libraries)) >= 8 and fastq_output == 0 int(str(num_libraries)) >= 8 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 8 and fastq_output == 1 int(str(num_libraries)) >= 9 int(str(num_libraries)) >= 9 and fastq_output == 0 int(str(num_libraries)) >= 9 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 9 and fastq_output == 1 int(str(num_libraries)) >= 10 int(str(num_libraries)) >= 10 and fastq_output == 0 int(str(num_libraries)) >= 10 and str(qual_levels) and fastq_output == 0 int(str(num_libraries)) >= 10 and fastq_output == 1 **What it does** Grinder is a program to create random shotgun and amplicon sequence libraries based on reference sequences in a FASTA file. Features include: * omic support: genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic and metaproteomic * shotgun library or amplicon library * arbitrary read length distribution and number of reads * simulation of PCR and sequencing errors (chimeras, point mutations, homopolymers) * support for creating paired-end (mate pair) datasets * specific rank-abundance settings or manually given abundance for each genome * creation of datasets with a given richness (alpha diversity) * independent datasets can share a variable number of genomes (beta diversity) * modeling of the bias created by varying genome lengths or gene copy number * profile mechanism to store preferred options * API to automate the creation of a large number of simulated datasets **Input** A variety of FASTA databases containing genes or genomes can be used as input for Grinder, such as the NCBI RefSeq collection (ftp://ftp.ncbi.nih.gov/refseq/release/microbial/), the GreenGenes 16S rRNA database (http://greengenes.lbl.gov/Download/Sequence_Data/Fasta_data_files/Isolated_named_strains_16S_aligned.fasta), the human genome and transcriptome (ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/RefSeqGene/, ftp://ftp.ncbi.nih.gov/refseq/H_sapiens/mRNA_Prot/human.rna.fna.gz), ... These input files can either be provided as a Galaxy dataset, or can be uploaded by Galaxy users in their history. **Output** For each library requested, a first file contains the abundance of the species in the simulated community created, e.g.:: # rank seqID rel. abundance 1 86715_Lachnospiraceae 0.367936925098555 2 6439_Neisseria_polysaccharea 0.183968462549277 3 103712_Fusobacterium_nucleatum 0.122645641699518 4 103024_Frigoribacterium 0.0919842312746386 5 129066_Streptococcus_pyogenes 0.0735873850197109 6 106485_Pseudomonas_aeruginosa 0.0613228208497591 7 13824_Veillonella_criceti 0.0525624178712221 8 28044_Lactosphaera 0.0459921156373193 The second file is a FASTA file containing shotgun or amplicon reads, e.g.:: >1 reference=13824_Veillonella_criceti position=89-1088 strand=+ ACCAACCTGCCCTTCAGAGGGGGATAACAACGGGAAACCGTTGCTAATACCGCGTACGAA TGGACTTCGGCATCGGAGTTCATTGAAAGGTGGCCTCTATTTATAAGCTATCGCTGAAGG AGGGGGTTGCGTCTGATTAGCTAGTTGGAGGGGTAATGGCCCACCAAGGCAA >2 reference=103712_Fusobacterium_nucleatum position=2-1001 strand=+ TGAACGAAGAGTTTGATCCTGGCTCAGGATGAACGCTGACAGAATGCTTAACACATGCAA GTCAACTTGAATTTGGGTTTTTAACTTAGGTTTGGG If you specify the quality score levels option, a third file representing the quality scores of the reads is created:: >1 reference=103712_Fusobacterium_nucleatum position=2-1001 strand=+ 30 30 30 10 30 30 ... Grinder-0.5.4/galaxy/all_fasta.loc.sample0000644000175000017500000000152312263016714020527 0ustar floflooofloflooo#This file lists the locations and dbkeys of all the fasta files #under the "genome" directory (a directory that contains a directory #for each build). The script extract_fasta.py will generate the file #all_fasta.loc. #IMPORTANT: EACH LINE OF THIS FILE HAS TO BE TAB-DELIMITED! # # # #So, all_fasta.loc could look something like this: # #ncbi_refseq_complete_viruses ncbi_refseq_complete_viruses RefSeq complete viruses /path/to/ncbi_refseq_complete_viruses.fna #ncbi_refseq_complete_microbes ncbi_refseq_complete_microbes RefSeq complete microbes /path/to/ncbi_refseq_complete_microbes.fna #homo_sapiens_GRCh37 homo_sapiens_GRCh37 Homo sapiens genome /path/to/Homo_sapiens_GRCh37_reference.fna #gg_named_16S gg_named_16S GreenGenes named 16S strains /path/to/Isolated_named_strains_16S.fna Grinder-0.5.4/galaxy/tool_data_table_conf.xml.sample0000644000175000017500000000036312263016714022747 0ustar floflooofloflooo value, dbkey, name, path
Grinder-0.5.4/galaxy/Galaxy_readme.txt0000644000175000017500000000044012263016714020122 0ustar flofloooflofloooThis is an XML wrapper that provides a GUI for Grinder in Galaxy (http://galaxy.psu.edu/). Place these files in your Galaxy directory. More information at http://wiki.g2.bx.psu.edu/FrontPage. Note: The Grinder wrapper uses Galaxy builtin datasets located in the 'all_fasta' data table. Grinder-0.5.4/LICENSE0000644000175000017500000010477412647202457014365 0ustar flofloooflofloooThis software is Copyright (c) 2016 by Florent Angly . This is free software, licensed under: The GNU General Public License, Version 3, June 2007 GNU GENERAL PUBLIC LICENSE Version 3, 29 June 2007 Copyright (C) 2007 Free Software Foundation, Inc. Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The GNU General Public License is a free, copyleft license for software and other kinds of works. The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things. To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others. For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. "Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. Copyright (C) This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see . Also add information on how to contact you by electronic and paper mail. If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: Copyright (C) This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box". You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see . The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read . Grinder-0.5.4/t/0000755000175000017500000000000012647202511013575 5ustar flofloooflofloooGrinder-0.5.4/t/03-amplicon.t0000644000175000017500000000653712263016714016021 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read, $nof_reads); # Forward primer only, forward sequencing ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa'), -forward_reverse => data('forward_primer.fa') , -length_bias => 0 , -unidirectional => 1 , -read_dist => 48 , -total_reads => 100 , ), 'Forward primer only, forward sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, 1, $nof_reads); }; is $nof_reads, 100; # Forward and reverse primers ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -length_bias => 0 , -unidirectional => 1 , -read_dist => 48 , -total_reads => 100 , ), 'Forward then reverse primers, forward sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, 1, $nof_reads); }; is $nof_reads, 100; # Reverse primer only, reverse sequencing ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa'), -forward_reverse => data('reverse_primer.fa') , -length_bias => 0 , -unidirectional => -1 , -read_dist => 48 , -total_reads => 100 , ), 'Reverse primer only, reverse sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, -1, $nof_reads); }; is $nof_reads, 100; # Reverse and forward primers, reverse sequencing ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('reverse_forward_primers.fa'), -length_bias => 0 , -unidirectional => -1 , -read_dist => 48 , -total_reads => 100 , ), 'Reverse then forward primers, reverse sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, -1, $nof_reads); }; is $nof_reads, 100; done_testing(); sub ok_read { my ($read, $req_strand, $nof_reads) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; my $source = $read->reference->id; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $letters; if ( $source =~ m/^seq1/ ) { $letters = 'a'; } elsif ( $source =~ m/^seq2/ ) { $letters = 'c'; } elsif ( $source =~ m/^seq3/ ) { $letters = 'g'; } elsif ( $source =~ m/^seq4/ ) { $letters = 't'; } elsif ( $source =~ m/^seq5/ ) { $letters = 'atg'; } if ( $req_strand == -1 ) { # Take the reverse complement $letters = Bio::PrimarySeq->new( -seq => $letters )->revcom->seq; }; like $read->seq, qr/[$letters]+/; is $read->id, $nof_reads; is $read->length, 48; } Grinder-0.5.4/t/data/0000755000175000017500000000000012647202511014506 5ustar flofloooflofloooGrinder-0.5.4/t/data/database_protein.fa0000644000175000017500000000055112263016714020325 0ustar floflooofloflooo>gi|194473622|ref|NP_001123975.1| adenylosuccinate lyase [Rattus norvegicus] MAASGDPACAESYRSPLAARYASHEMCFLFSDRYKFQTWRQLWLWLAEAEQTLGLPITDEQIQEMRSNLS NIDFQMAAEEEKRLRHDVMAHVHTFGHCCPKAAGIIHLGATSCYVGDNTDLIILRNAFDLLLPKLARVIS RLADFAKERADLPTLGFTHFQPAQLTTVGKRCCLWIQDLCMDLQNLKRVRDELRFRGVKGTTGTQASFLQ LFEGDHQKVEQLDKMVTEKAGFKRAYIITGQTYTRKVDIEVLSVLASLGASVHKICTDIRLLANLKEMEE Grinder-0.5.4/t/data/oriented_database.fa0000644000175000017500000000037212263016714020457 0ustar floflooofloflooo>seq1 CCCaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaTTT Grinder-0.5.4/t/data/kmers.fa0000644000175000017500000000066712263016714016152 0ustar floflooofloflooo>seq1 AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA >seq2 CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCGGGGGGGGCCCCCCCC >seq3 TTTTTTTTGGGGGGGGTTTTTTTTGGGGGGGGTTTTTTTTGGGGGGGGTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTTT >seq4 AAAAAAAAGGGGGGGGAAAAAAAAGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA >seq5 ACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGTACGT Grinder-0.5.4/t/data/database_rna.fa0000644000175000017500000000060612263016714017426 0ustar floflooofloflooo>gi|352962148|ref|NM_001251825.1| Homo sapiens Sp1 UranscripUion facUor (SP1), UranscripU varianU 3, mRNA GUCCGGGUUCGCUUGCCUCGUCAGCGUCCGCGUUUUUCCCGGCCCCCCCCAACCCCCCCGGACAGGACCC CCUUGAGCUUGUCCCUCAGCUGCCACCAUGAGCGACCAAGAUCACUCCAUGGAUGAAAUGACAGCUGUGG UGAAAAUUGAAAAAGGAGUUGGUGGCAAUAAUGGGGGCAAUGGUAAUGGUGGUGGUGCCUUUUCACAGGC UCGAAGUAGCAGCACAGGCAGUAGCAGCAGCACUGGAGGAGGAGGGCAGGGUGCCAAUGGCUGGCAGAUC Grinder-0.5.4/t/data/abundance_kmers.txt0000644000175000017500000000014412263016714020371 0ustar floflooofloflooo# seq1 twice as abundant as seq3, which is three times as abundant as seq2 seq1 60 seq2 10 seq3 30 Grinder-0.5.4/t/data/reverse_primer.fa0000644000175000017500000000002712263016714020050 0ustar floflooofloflooo>1392R ACGGGCGGTGTGTRC Grinder-0.5.4/t/data/abundances_multiple.txt0000644000175000017500000000006612263016714021271 0ustar flofloooflofloooseq1 4 25 0 seq2 22 24 0 seq3 24 23 100 seq5 24 4 0 Grinder-0.5.4/t/data/abundances2.txt0000644000175000017500000000003012263016714017427 0ustar flofloooflofloooseq1 60 seq2 30 seq3 10 Grinder-0.5.4/t/data/single_seq_database.fa0000644000175000017500000000054512263016714021001 0ustar floflooofloflooo>seq1 this is the first sequence aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa Grinder-0.5.4/t/data/revcom_amplicon_database.fa0000644000175000017500000000060512263016714022022 0ustar floflooofloflooo>seq1 primer match is on the reverse-complement of this sequence ACGGGCGGTGTGTACttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt ttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttCCGTCAATTCCTTTAAGTTT Grinder-0.5.4/t/data/multiple_amplicon_database.fa0000644000175000017500000000312612263016714022363 0ustar floflooofloflooo>seq1 nof_amplicons=2 has a RNA-specific residue (U) and an ambiguous base (R) AAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa GTACACACCGCCCGTccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc AAACTTAAAGGAATTGRCGGtttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt GTACACACCGCCCGT >seq2 nof_amplicons=6 AAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa GTACACACCGCCCGTccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc AAACTTAAAGGAATTGRCGGtttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt GTACACACCGCCCGTgggggAAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaGTACACACCGCCCGTcccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccAAACTTAAAGGAATTGRCGGttttttttttttttttttttttttttttttttttttttttt tttttttttttttttttttGTACACACCGCCCGTgggggAAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGTccccccccccccccccccccccccccc ccccccccccccccccccccccccccccccccccccccAAACTTAAAGGAATTGRCGGtttttttttttttttttttttt ttttttttttttttttttttttttttttttttttttttGTACACACCGCCCGT >seq3 nof_amplicons=1 AAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa GTACACACCGCCCGTccccccccccccccccccccccccccccccccccccccccccccccccccc >seq4 nof_amplicons=2 one on each strand AAACTTAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa GTACACACCGCCCGTccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc ACGGGCGGTGTGTACtttttttttttttttttttttttttttttcccccccctttttttttttttttttttttttCCGTC AATTCCTTTAAGTTTccccccc Grinder-0.5.4/t/data/kmers2.fa0000644000175000017500000000113012263016714016216 0ustar floflooofloflooo>seq1 GGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCGGGGGGGGCCCCCCCCCCCCCCCCCCCCCCCC >seq2 AAAAAAAAGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGAAAAAAAAAAAAAAAAGGGGGGGGAAAAAAAAGGGGGGGGAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAGGGGGGGGAAAAAAAA >seq3 TTTTTTTTGGGGGGGGTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGTTTTTTTTGGGGGGGGTTTTTTTTGGGGGGGGTTTTTTTTGGGGGGGGTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGGTTTTTTTTTTTTTTTTTTTTTTTTGGGGGGGG Grinder-0.5.4/t/data/shotgun_database_extended.fa0000644000175000017500000000177312647156525022236 0ustar floflooofloflooo>seq1 this is the first sequence aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >seq2 cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc >seq3 gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg >seq4 tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt >seq5 last sequence, last comment aaaaaaaaaattttttttttttttttttttttttttttttttttttttttttttttttttttttttttttgggggggggg >seq6 0 bp sequence >seq7 1 bp sequence a Grinder-0.5.4/t/data/abundances.txt0000644000175000017500000000006312263016714017353 0ustar floflooofloflooo# Abundance file seq1 25 seq2 25 seq4 25 seq5 25 Grinder-0.5.4/t/data/nested_amplicon_database.fa0000644000175000017500000000102612263016714022007 0ustar floflooofloflooo>seq1 template of type FRFFFRR cccccAAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGTcccccAAACTUAAAGGAATTGACGGcccccAAACTUAAAGGAATTGACGGccccAAACTTAAAGGAATTGRCGGttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttGTACACACCGCCCGTccccGTACACACCGCCCGTcc >seq2 template FRFR: a short match on reverse strand and a long match on forward AAACTTAAAGGAATTGACGGaaaaaaaaaACGGGCGGTGTGTACccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccCCGTCAATTCCTTTAAGTTTaaaaaaaaaGTACACACCGCCCGT Grinder-0.5.4/t/data/single_amplicon_database.fa0000644000175000017500000000014712263016714022011 0ustar floflooofloflooo>seq3 nof_amplicons=1 aaaaaAAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGTaaaaa Grinder-0.5.4/t/data/database_mixed.fa0000644000175000017500000000302412263016714017751 0ustar floflooofloflooo>gi|352962132|ref|NG_030353.1| Homo sapiens sal-like 3 (Drosophila) (SALL3), RefSeqGene on chromosome 18 TAATAATCGTTTCGGCCTCCCTATAGGCAAGGAGTCAAAGTTTTAACTTGCTAGCATTATTTATGTAATC ATACATGCTGAAATGTCCCTCCTGGTCTACATGCAGCCCCGAGCCACAGTTCAGCCATCAGGAGAGAAGT ACTTCACCATCGTTTGCATCCCTCAGTGCGAAGACGACTGTGAGCTGATGTTTCTGTGTATGCCATAAAA AGCCACGGAATGTTTGCCTCTGATGGCTACGGTGAAGCTACACAGCGTCCTGGAATAAACACACAGGAAG >gi|352962148|ref|NM_001251825.1| Homo sapiens Sp1 UranscripUion facUor (SP1), UranscripU varianU 3, mRNA GUCCGGGUUCGCUUGCCUCGUCAGCGUCCGCGUUUUUCCCGGCCCCCCCCAACCCCCCCGGACAGGACCC CCUUGAGCUUGUCCCUCAGCUGCCACCAUGAGCGACCAAGAUCACUCCAUGGAUGAAAUGACAGCUGUGG UGAAAAUUGAAAAAGGAGUUGGUGGCAAUAAUGGGGGCAAUGGUAAUGGUGGUGGUGCCUUUUCACAGGC UCGAAGUAGCAGCACAGGCAGUAGCAGCAGCACUGGAGGAGGAGGGCAGGGUGCCAAUGGCUGGCAGAUC >gi|194473622|ref|NP_001123975.1| adenylosuccinate lyase [Rattus norvegicus] MAASGDPACAESYRSPLAARYASHEMCFLFSDRYKFQTWRQLWLWLAEAEQTLGLPITDEQIQEMRSNLS NIDFQMAAEEEKRLRHDVMAHVHTFGHCCPKAAGIIHLGATSCYVGDNTDLIILRNAFDLLLPKLARVIS RLADFAKERADLPTLGFTHFQPAQLTTVGKRCCLWIQDLCMDLQNLKRVRDELRFRGVKGTTGTQASFLQ LFEGDHQKVEQLDKMVTEKAGFKRAYIITGQTYTRKVDIEVLSVLASLGASVHKICTDIRLLANLKEMEE >gi|61679760|pdb|1Y4P|B Chain B, T-To-T(High) Quaternary Transitions In Human Hemoglobin: Betaw37e Deoxy Low-Salt (10 Test Sets) MHLTPEEKSAVTALWGKVNVDEVGGEALGRLLVVYPETQRFFESFGDLSTPDAVMGNPKVKAHGKKVLGA FSDGLAHLDNLKGTFATLSELHCDKLHVDPENFRLLGNVLVCVLAHHFGKEFTPPVQAAYQKVVAGVANA LAHKYHKERADLPTLGFTHFQPAQLTTVGKRCCLWIQDLCMDLQNLKRVRDELRFRGVKGTTGTQASFLQ LFEGDHQKVEQLDKMVTEKAGFKRAYIITGQTYTRKVDIEVLSVLASLGASVHKICTDIRLLANLKEMEE Grinder-0.5.4/t/data/dirty_database.fa0000644000175000017500000000040512263016714017776 0ustar floflooofloflooo>seq1 aaaaaaaaaattttttttttttttttttttNNNNNNNNNNttttttttttttttttttttttttttttttgggggggggg >seq2 aaaaaaaaaattttttttttttttttttttttttttttttnnnnnnnnnnttttttttttttttttttttgggggggggg >seq3 aaaaaaaaaatttttttttt----------ttttttttttttttttttttttttttttttttttttttttgggggggggg Grinder-0.5.4/t/data/amplicon_database.fa0000644000175000017500000000172012263016714020446 0ustar floflooofloflooo>seq1 this is the first sequence AAACTTAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGT >seq2 AAACTCaAAgGAAtTGACGGccccccccccccccccccccccccccccccccccccGTACACACCGCCCGTccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc >seq3 gggggggggggggggggggggggggAAACTTAAATGAATTGACGGggggggggggggggggggggggggggggggggggg gggggggggggggggggggggggggggggggggggggggggggggggggGCACACACCGCCCGTgggggggggggggggg >seq4 ttttttttttttttttttttttttttttttAAACTCAAATGAATTGACGGtttttttttttttttttttttttttttttt >seq5 last sequence, last comment aaaaaaaaaatttttttttttttttttttttttttttttttttttttttttttttGCACACACCGCCCGTgggggggggg Grinder-0.5.4/t/data/profile.txt0000644000175000017500000000025212263016714016710 0ustar floflooofloflooo # The profile file first -reference_file t/data/single_seq_database.fa # Now some read length specification -read_dist 50 -total_reads 100 -unidirectional 1 Grinder-0.5.4/t/data/shotgun_database.fa0000644000175000017500000000174612263016714020343 0ustar floflooofloflooo>seq1 this is the first sequence aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >seq2 cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc >seq3 gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg >seq4 tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt >seq5 last sequence, last comment aaaaaaaaaattttttttttttttttttttttttttttttttttttttttttttttttttttttttttttgggggggggg >seq6 0 bp sequence Grinder-0.5.4/t/data/database_dna.fa0000644000175000017500000000060512263016714017407 0ustar floflooofloflooo>gi|352962132|ref|NG_030353.1| Homo sapiens sal-like 3 (Drosophila) (SALL3), RefSeqGene on chromosome 18 TAATAATCGTTTCGGCCTCCCTATAGGCAAGGAGTCAAAGTTTTAACTTGCTAGCATTATTTATGTAATC ATACATGCTGAAATGTCCCTCCTGGTCTACATGCAGCCCCGAGCCACAGTTCAGCCATCAGGAGAGAAGT ACTTCACCATCGTTTGCATCCCTCAGTGCGAAGACGACTGTGAGCTGATGTTTCTGTGTATGCCATAAAA AGCCACGGAATGTTTGCCTCTGATGGCTACGGTGAAGCTACACAGCGTCCTGGAATAAACACACAGGAAG Grinder-0.5.4/t/data/reverse_forward_primers.fa0000644000175000017500000000006212263016714021756 0ustar floflooofloflooo>1392R ACGGGCGGTGTGTRC >926F AAACTYAAAKGAATTGRCGG Grinder-0.5.4/t/data/mids.fa0000644000175000017500000000003412263016714015751 0ustar floflooofloflooo>mid_1 ACGT >mid_2 AAAATTTT Grinder-0.5.4/t/data/forward_primer.fa0000644000175000017500000000003312263016714020036 0ustar floflooofloflooo>926F AAACTYAAAKGAATTGRCGG Grinder-0.5.4/t/data/shotgun_database_shared_kmers.fa0000644000175000017500000000172112263016714023063 0ustar floflooofloflooo>seq1 this is the first sequence aaccggttaaaaaaaaaaaaaaaaatgcatgctaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatgcatgctaaaaaaaaaaaaaaa >seq2 cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccatgcatgctcccccccc cccccccccccccccccccccccccccccccccccccaaccggttctacccccccccccccccccccccccccccccccc cccccccccccccccatgcatgcccccccccccccccccccccccccccccccccccccccccccccccccccccccccc >seq3 ggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggatgcatgctggggggggggg ggggaaccggttggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggatgcatgc >seq4 ttttttttttttttttttttttaaccggtttttttttatgcatgcttttttttttttttttttttttttttttttttttt >seq5 last sequence, last comment aaaaaaaaaattttttttttttttttttttttttttttttttttttttttaaccggttttatgcatgcttgggggggggg Grinder-0.5.4/t/data/forward_reverse_primers.fa0000644000175000017500000000006212263016714021756 0ustar floflooofloflooo>926F AAACTYAAAKGAATTGRCGG >1392R ACGGGCGGTGTGTRC Grinder-0.5.4/t/data/homopolymer_database.fa0000644000175000017500000000040012263016714021210 0ustar floflooofloflooo>seq1 homopolymers 1 to 10 bp long acgtaaccggttaaacccgggtttaaaaccccggggttttaaaaacccccgggggtttttaaaaaaccccccggggggttttttaaaaaaacccccccgggggggtttttttaaaaaaaaccccccccggggggggttttttttaaaaaaaaacccccccccgggggggggtttttttttaaaaaaaaaaccccccccccggggggggggtttttttttt Grinder-0.5.4/t/32-database.t0000644000175000017500000000513212264034167015756 0ustar floflooofloflooo#! perl use strict; use warnings; use t::TestUtils; use Test::More; use_ok 'Grinder::Database'; my ($db, $seq); # Test minium sequence length and forbidden characters ok $db = Grinder::Database->new( -fasta_file => data('shotgun_database.fa'), ); isa_ok $db, 'Grinder::Database'; is $db->get_minimum_length, 1; is $db->get_delete_chars, ''; is_deeply [sort @{$db->get_ids}], ['seq1', 'seq2', 'seq3', 'seq4', 'seq5']; $db->get_database->DESTROY; ok $db = Grinder::Database->new( -fasta_file => data('shotgun_database.fa'), -minimum_length => 200, ); is $db->get_minimum_length, 200; is_deeply [sort @{$db->get_ids}], ['seq1', 'seq2']; $db->get_database->DESTROY; ok $db = Grinder::Database->new( -fasta_file => data('shotgun_database.fa'), -delete_chars => 'ac', ); is $db->get_delete_chars, 'ac'; is_deeply [sort @{$db->get_ids}], ['seq3', 'seq4', 'seq5']; # Test retrieving sequences and subsequences is $db->get_seq('zzz'), undef; ok $seq = $db->get_seq('seq5'); is $seq->id, 'seq5'; is $seq->seq, 'aaaaaaaaaattttttttttttttttttttttttttttttttttttttttttttttttttttttttttttgggggggggg'; ok $seq = $db->get_seq('seq5:2..11'); is $seq->id, 'seq5'; is $seq->seq, 'aaaaaaaaat'; ok $seq = $db->get_seq('seq5/-1'); is $seq->id, 'seq5'; is $seq->seq, 'ccccccccccaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaatttttttttt'; ok $seq = $db->get_seq('seq5:2..11/-1'); is $seq->id, 'seq5'; is $seq->seq, 'attttttttt'; # Test alphabet is $db->get_alphabet, 'dna'; $db->get_database->DESTROY; ok $db = Grinder::Database->new( -fasta_file => data('database_dna.fa'), ); is $db->get_alphabet, 'dna'; $db->get_database->DESTROY; ok $db = Grinder::Database->new( -fasta_file => data('database_rna.fa'), ); is $db->get_alphabet, 'rna'; $db->get_database->DESTROY; ok $db = Grinder::Database->new( -fasta_file => data('database_protein.fa'), -unidirectional => 1, ); is $db->get_alphabet, 'protein'; $db->get_database->DESTROY; ok $db = Grinder::Database->new( -fasta_file => data('database_mixed.fa'), -unidirectional => 1, ); is $db->get_alphabet, 'protein'; $db->get_database->DESTROY; ####ok $db = Grinder::Database->new( #### -fasta_file => data('shotgun_database.fa'), #### -unidirectional => -1, ####); ####is $db->get_unidirectional, -1; ####$db = Grinder::Database->new( #### -fasta_file => data('shotgun_database.fa'), #### -unidirectional => #### -forward_reverse_primers => #### -abundance_file => #### -delete_chars => #### -min_len => 1 ####); # next seq for shotgun done_testing(); Grinder-0.5.4/t/11-tracking.t0000644000175000017500000001075612263016714016016 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read); # Tracking read information in the read description ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -total_reads => 10 , -unidirectional => 0 , -desc_track => 1 , ), 'Bidirectional shotgun tracking'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=.*position=(complement\()?\d+\.\.\d+(\))?/; } ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -total_reads => 10 , -unidirectional => 1 , -desc_track => 1 , ), 'Forward shotgun tracking'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=.*position=\d+\.\.\d+/; } ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -total_reads => 10 , -unidirectional => -1 , -desc_track => 1 , ), 'Reverse shotgun tracking'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=.*position=complement\(\d+\.\.\d+\)/; } ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa'), -forward_reverse => data('forward_primer.fa') , -length_bias => 0 , -unidirectional => 1 , -total_reads => 10 , -desc_track => 1 , ), 'Amplicon tracking'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=\S+.*amplicon=\d+\.\.\d+.*position=.*/; } ok $factory = Grinder->new( -reference_file => data('revcom_amplicon_database.fa'), -forward_reverse => data('forward_primer.fa') , -length_bias => 0 , -unidirectional => 1 , -total_reads => 10 , -desc_track => 1 , ), 'Reverse-complemented amplicon tracking'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=\S+.*amplicon=complement\(\d+\.\.\d+\).*position=.*/; } ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa'), -forward_reverse => data('forward_primer.fa') , -length_bias => 0 , -unidirectional => 1 , -total_reads => 10 , -desc_track => 1 , -chimera_perc => 100 , -chimera_dist => (1) , -chimera_kmer => 0 , ), 'Bimeric amplicon tracking'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=\S+,\S+.*amplicon=\d+\.\.\d+,\d+\.\.\d+.*position=.*/; } ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa'), -forward_reverse => data('forward_primer.fa') , -length_bias => 0 , -unidirectional => 1 , -total_reads => 10 , -desc_track => 1 , -chimera_perc => 100 , -chimera_dist => (0, 1) , -chimera_kmer => 10 , ), 'Trimeric amplicon tracking'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=\S+(,\S+){2}.*amplicon=\d+\.\.\d+(,\d+\.\.\d+){2}.*position=.*/; } ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -total_reads => 10 , -desc_track => 0 , ), 'No tracking'; ok $read = $factory->next_read; while ($factory->next_read) { is $read->desc, undef; } ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -total_reads => 10 , ), 'Tracking default'; ok $read = $factory->next_read; while ($factory->next_read) { like $read->desc, qr/reference=.*position=.*(complement\()?\d+\.\.\d+(\))?/; } done_testing(); Grinder-0.5.4/t/18-amplicon-multiple.t0000644000175000017500000001067612263016714017657 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read, $nof_reads, %got_amplicons, %expected_amplicons); # Template with several matching amplicons and forward primer only ok $factory = Grinder->new( -reference_file => data('multiple_amplicon_database.fa'), -forward_reverse => data('forward_primer.fa') , -length_bias => 0 , -unidirectional => 1 , -read_dist => 100 , -total_reads => 100 , ), 'Forward primer only'; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; $got_amplicons{$read->seq} = undef; ok_read_forward_only($read, 1, $nof_reads); }; is $nof_reads, 100; %expected_amplicons = ( 'AAACTTAAAGGAATTGRCGGttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttGTACACACCGCCCGT' => undef, 'AAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGTccccc' => undef, 'AAACTTAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaggggggggaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGTggggg' => undef, ); is_deeply( \%got_amplicons, \%expected_amplicons ); undef %got_amplicons; # Template with several matching amplicons and forward and reverse primers ok $factory = Grinder->new( -reference_file => data('multiple_amplicon_database.fa'), -forward_reverse => data('forward_reverse_primers.fa') , -length_bias => 0 , -unidirectional => 1 , -read_dist => 100 , -total_reads => 100 , ), 'Forward and reverse primers'; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; $got_amplicons{$read->seq} = undef; ok_read_forward_reverse($read, 1, $nof_reads); }; is $nof_reads, 100; %expected_amplicons = ( 'AAACTTAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGT' => undef, 'AAACTTAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaggggggggaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGT' => undef, 'AAACTTAAAGGAATTGRCGGttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttGTACACACCGCCCGT' => undef, 'AAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGT' => undef, ); is_deeply( \%got_amplicons, \%expected_amplicons ); undef %got_amplicons; # Template with several nested amplicons and forward and reverse primers ok $factory = Grinder->new( -reference_file => data('nested_amplicon_database.fa'), -forward_reverse => data('forward_reverse_primers.fa') , -length_bias => 0 , -unidirectional => 1 , -read_dist => 100 , -total_reads => 100 , ), 'Forward and reverse primers, nested amplicons'; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; $got_amplicons{$read->seq} = undef; ok_read_forward_reverse($read, 1, $nof_reads); }; is $nof_reads, 100; %expected_amplicons = ( 'AAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGT' => undef, 'AAACTTAAAGGAATTGRCGGttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttGTACACACCGCCCGT' => undef, 'AAACTTAAAGGAATTGACGGggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggGTACACACCGCCCGT' => undef, ); is_deeply( \%got_amplicons, \%expected_amplicons ); undef %got_amplicons; done_testing(); sub ok_read_forward_reverse { my ($read, $req_strand, $nof_reads) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; like $read->reference->id, qr/^seq\d+$/; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $readseq = $read->seq; is $read->id, $nof_reads; is $read->length, 95; } sub ok_read_forward_only { my ($read, $req_strand, $nof_reads) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; like $read->reference->id, qr/^seq\d+$/; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $readseq = $read->seq; is $read->id, $nof_reads; my $readlength = $read->length; ok ( ($readlength == 95) or ($readlength == 100) ); } Grinder-0.5.4/t/10-quality.t0000644000175000017500000000153412263016714015675 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read); # Outputing basic quality scores ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 52 , -total_reads => 10 , ), 'No quality scores'; ok $read = $factory->next_read; is_deeply $read->qual, []; ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 52 , -total_reads => 10 , -qual_levels => '30 10' , ), 'With quality scores'; ok $read = $factory->next_read; is scalar @{$read->qual}, 52; is_deeply $read->qual, [(30) x 52 ]; done_testing(); Grinder-0.5.4/t/29-kmer-collection.t0000644000175000017500000002123512264035457017314 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Bio::PrimarySeq; use_ok 'Grinder::KmerCollection'; my ($col, $seq1, $seq2, $by_kmer, $by_seq, $file, $sources, $counts, $freqs, $kmers, $pos, $weights); # Test the Grinder::KmerCollection module $seq1 = Bio::PrimarySeq->new( -id => 'seq1', -seq => 'AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAA' ); $seq2 = Bio::PrimarySeq->new( -id => 'seq4', -seq => 'CCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCCAAAAAAAACCCCCCCCCCCCCCCCCCCCCCCGGGGGGGGCCCCCCCC' ); ok $col = Grinder::KmerCollection->new( -k => 8 ); isa_ok $col, 'Grinder::KmerCollection'; is $col->k, 8; ok $col->add_seqs([$seq1]); ok $col->add_seqs([$seq2]); ok $by_kmer = $col->collection_by_kmer; ok exists $by_kmer->{'AAAAAAAA'}->{'seq1'}; ok exists $by_kmer->{'AAAAAAAA'}->{'seq4'}; ok exists $by_kmer->{'CCCCCCCC'}->{'seq4'}; ok exists $by_kmer->{'CCCCGGGG'}->{'seq4'}; ok exists $by_kmer->{'ACCCCCCC'}->{'seq4'}; ok $by_kmer = $col->collection_by_seq; ok exists $by_kmer->{'seq1'}->{'AAAAAAAA'}; ok exists $by_kmer->{'seq4'}->{'AAAAAAAA'}; ok exists $by_kmer->{'seq4'}->{'CCCCCCCC'}; ok exists $by_kmer->{'seq4'}->{'CCCCGGGG'}; ok exists $by_kmer->{'seq4'}->{'ACCCCCCC'}; ok $col = $col->filter_rare(2); isa_ok $col, 'Grinder::KmerCollection'; ok $by_kmer = $col->collection_by_kmer; ok exists $by_kmer->{'AAAAAAAA'}->{'seq1'}; ok exists $by_kmer->{'AAAAAAAA'}->{'seq4'}; ok exists $by_kmer->{'CCCCCCCC'}->{'seq4'}; ok not exists $by_kmer->{'CCCCGGGG'}; ok not exists $by_kmer->{'ACCCCCCC'}; ok $by_kmer = $col->collection_by_seq; ok exists $by_kmer->{'seq1'}->{'AAAAAAAA'}; ok exists $by_kmer->{'seq4'}->{'AAAAAAAA'}; ok exists $by_kmer->{'seq4'}->{'CCCCCCCC'}; ok not exists $by_kmer->{'seq4'}->{'CCCCGGGG'}; ok not exists $by_kmer->{'seq4'}->{'ACCCCCCC'}; ok $col = Grinder::KmerCollection->new( -k => 8, -seqs => [$seq1, $seq2] ); # Count of all kmers ($kmers, $counts) = $col->counts(); $kmers = [sort {$a cmp $b} @$kmers]; $counts = [sort {$a <=> $b} @$counts]; is_deeply $kmers , [ 'AAAAAAAA', 'AAAAAAAC', 'AAAAAACC', 'AAAAACCC', 'AAAACCCC', 'AAACCCCC', 'AACCCCCC', 'ACCCCCCC', 'CAAAAAAA', 'CCAAAAAA', 'CCCAAAAA', 'CCCCAAAA', 'CCCCCAAA', 'CCCCCCAA', 'CCCCCCCA', 'CCCCCCCC', 'CCCCCCCG', 'CCCCCCGG', 'CCCCCGGG', 'CCCCGGGG', 'CCCGGGGG', 'CCGGGGGG', 'CGGGGGGG', 'GCCCCCCC', 'GGCCCCCC', 'GGGCCCCC', 'GGGGCCCC', 'GGGGGCCC', 'GGGGGGCC', 'GGGGGGGC', 'GGGGGGGG' ]; is_deeply $counts, [ 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 43, 74, ]; # Frequency of kmers from position >= 40 ($kmers, $freqs) = $col->counts(undef, 40, 1); $kmers = [sort {$a cmp $b} @$kmers]; $freqs = [sort {$a <=> $b} @$freqs]; is_deeply $kmers , [ 'AAAAAAAA', 'AACCCCCC', 'ACCCCCCC', 'CCCCCCCC', 'CCCCCCCG', 'CCCCCCGG', 'CCCCCGGG', 'CCCCGGGG', 'CCCGGGGG', 'CCGGGGGG', 'CGGGGGGG', 'GCCCCCCC', 'GGCCCCCC', 'GGGCCCCC', 'GGGGCCCC', 'GGGGGCCC', 'GGGGGGCC', 'GGGGGGGC', 'GGGGGGGG' ]; is_deeply $freqs, [ '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.0147058823529412', '0.25', '0.5', ]; ($kmers, $freqs) = $col->counts('seq1', 40, 1); is_deeply $kmers, [ 'AAAAAAAA' ]; is_deeply $freqs, [ 1 ]; ($kmers, $freqs) = $col->counts('seq4', 40, 1); $kmers = [sort {$a cmp $b} @$kmers]; $freqs = [sort {$a <=> $b} @$freqs]; is_deeply $kmers , [ 'AACCCCCC', 'ACCCCCCC', 'CCCCCCCC', 'CCCCCCCG', 'CCCCCCGG', 'CCCCCGGG', 'CCCCGGGG', 'CCCGGGGG', 'CCGGGGGG', 'CGGGGGGG', 'GCCCCCCC', 'GGCCCCCC', 'GGGCCCCC', 'GGGGCCCC', 'GGGGGCCC', 'GGGGGGCC', 'GGGGGGGC', 'GGGGGGGG' ]; is_deeply $freqs, [ '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.0294117647058824', '0.5', ]; ok $col = $col->filter_shared(2); isa_ok $col, 'Grinder::KmerCollection'; ($kmers, $counts) = $col->counts(); is_deeply $kmers , ['AAAAAAAA']; is_deeply $counts, [ 74 ]; ok $by_kmer = $col->collection_by_kmer; ok exists $by_kmer->{'AAAAAAAA'}->{'seq1'}; ok exists $by_kmer->{'AAAAAAAA'}->{'seq4'}; ok not exists $by_kmer->{'CCCCCCCC'}; ok not exists $by_kmer->{'CCCCGGGG'}; ok not exists $by_kmer->{'ACCCCCCC'}; ok $by_kmer = $col->collection_by_seq; ok exists $by_kmer->{'seq1'}->{'AAAAAAAA'}; ok exists $by_kmer->{'seq4'}->{'AAAAAAAA'}; ok not exists $by_kmer->{'seq4'}->{'CCCCCCCC'}; ok not exists $by_kmer->{'seq4'}->{'CCCCGGGG'}; ok not exists $by_kmer->{'seq4'}->{'ACCCCCCC'}; ($sources, $counts) = $col->sources('AAAAAAAA'); my %values = ('seq1' => 73, 'seq4' => 1); is $values{$sources->[0]}, $counts->[0]; is $values{$sources->[1]}, $counts->[1]; ($sources, $counts) = $col->sources('AAAAAAAA', 'seq1'); is $values{$sources->[0]}, $counts->[0]; ($sources, $counts) = $col->sources('ZZZZZZZZ'); is_deeply $sources, []; is_deeply $counts , []; ($kmers, $counts) = $col->kmers('seq1'); is_deeply $kmers , ['AAAAAAAA']; is_deeply $counts, [ 73 ]; ($kmers, $counts) = $col->kmers('seq4'); is_deeply $kmers , ['AAAAAAAA']; is_deeply $counts, [ 1 ]; ($kmers, $counts) = $col->kmers('asdf'); is_deeply $kmers , []; is_deeply $counts, []; $pos = $col->positions('AAAAAAAA', 'seq1'); is_deeply $pos, [1..73]; $pos = $col->positions('AAAAAAAA', 'seq4'); is_deeply $pos, [34]; $pos = $col->positions('CCCCGGGG', 'seq4'); is_deeply $pos, []; $pos = $col->positions('AAAAAAAA', 'seq3'); is_deeply $pos, []; ok $col = Grinder::KmerCollection->new( -k => 8, -seqs => [$seq1, $seq2], -ids => ['abc', '123'], )->filter_rare(2); isa_ok $col, 'Grinder::KmerCollection'; ok $by_kmer = $col->collection_by_kmer; ok exists $by_kmer->{'AAAAAAAA'}->{'abc'}; ok exists $by_kmer->{'AAAAAAAA'}->{'123'}; ok $by_kmer = $col->collection_by_seq; ok exists $by_kmer->{'abc'}->{'AAAAAAAA'}; ok exists $by_kmer->{'123'}->{'AAAAAAAA'}; ($sources, $counts) = $col->sources('AAAAAAAA'); %values = ('123' => 1, 'abc' => 73); is $values{$sources->[0]}, $counts->[0]; is $values{$sources->[1]}, $counts->[1]; ($sources, $counts) = $col->sources('AAAAAAAA', 'abc'); is $values{$sources->[0]}, $counts->[0]; # Using weights ok $col = Grinder::KmerCollection->new( -k => 8, -seqs => [$seq1, $seq2], )->filter_shared(2); $weights = { 'seq1' => 10, 'seq4' => 0.1 }; ok $col->weights($weights); ($sources, $counts) = $col->sources('AAAAAAAA'); %values = ('seq1' => 730, 'seq4' => 0.1); is $values{$sources->[0]}, $counts->[0]; is $values{$sources->[1]}, $counts->[1]; ($kmers, $counts) = $col->counts(); is_deeply $kmers , ['AAAAAAAA']; is_deeply $counts, [ 730.1 ]; ($kmers, $counts) = $col->kmers('seq1'); is_deeply $kmers , ['AAAAAAAA']; is_deeply $counts, [ 730 ]; ($kmers, $counts) = $col->kmers('seq4'); is_deeply $kmers , ['AAAAAAAA']; is_deeply $counts, [ 0.1 ]; ok $col->weights({}); is_deeply $col->weights, {}; # Read from file $file = data('kmers.fa'); ok $col = Grinder::KmerCollection->new( -k => 8, -file => $file, ); done_testing; Grinder-0.5.4/t/pod.t0000644000175000017500000000034612263016714014551 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; # Ensure a recent version of Test::Pod my $min_tp = 1.22; eval "use Test::Pod $min_tp"; plan skip_all => "Test::Pod $min_tp required for testing POD" if $@; all_pod_files_ok(); Grinder-0.5.4/t/04-abundances.t0000644000175000017500000000671612263016714016322 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, %sources); # Specified genome abundance for a single shotgun library ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -abundance_file => data('abundances.txt') , -length_bias => 0 , -random_seed => 1910567890 , -total_reads => 1000 , ), 'Genome abundance for a single shotgun libraries'; while ( $read = $factory->next_read ) { my $source = $read->reference->id; if (not exists $sources{$source}) { $sources{$source} = 1; } else { $sources{$source}++; } }; ok exists $sources{'seq1'}; ok exists $sources{'seq2'}; ok not exists $sources{'seq3'}; ok exists $sources{'seq4'}; ok exists $sources{'seq5'}; # These tests are quite sensitive to the seed used. Ideal average answer should # be 250 here between_ok( $sources{'seq1'}, 230, 280 ); between_ok( $sources{'seq2'}, 230, 280 ); between_ok( $sources{'seq4'}, 230, 280 ); between_ok( $sources{'seq5'}, 230, 280 ); is $factory->next_lib, undef; %sources = (); # Specified genome abundance for a single amplicon library ok $factory = Grinder->new( -abundance_file => data('abundances2.txt') , -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -copy_bias => 0 , -unidirectional => 1 , -read_dist => 48 , -random_seed => 1910567890 , -total_reads => 1000 , ), 'Genome abundance for a single amplicon libraries'; while ( $read = $factory->next_read ) { my $source = $read->reference->id; # Strip amplicon sources of the 'amplicon' part $source =~ s/_amplicon.*$//; if (not exists $sources{$source}) { $sources{$source} = 1; } else { $sources{$source}++; } }; ok exists $sources{'seq1'}; ok exists $sources{'seq2'}; ok exists $sources{'seq3'}; # These tests are quite sensitive to the seed used. Ideal average answer should # be 600, 300 and 100 here between_ok( $sources{'seq1'}, 570, 630 ); between_ok( $sources{'seq2'}, 270, 330 ); between_ok( $sources{'seq3'}, 70 , 130 ); is $factory->next_lib, undef; %sources = (); # Specified genome abundance for multiple shotgun libraries ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -abundance_file => data('abundances_multiple.txt') , -length_bias => 0 , -random_seed => 1232567890 , -total_reads => 1000 , ), 'Genome abundance for multiple shotgun libraries'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++ }; is $nof_reads, 1000; ok $factory->next_lib; ok $factory->next_lib; while ( $read = $factory->next_read ) { my $source = $read->reference->id; if (not exists $sources{$source}) { $sources{$source} = 1; } else { $sources{$source}++; } }; ok not exists $sources{'seq1'}; ok not exists $sources{'seq2'}; ok exists $sources{'seq3'}; ok not exists $sources{'seq4'}; ok not exists $sources{'seq5'}; is $sources{'seq3'}, 1000; is $factory->next_lib, undef; done_testing(); Grinder-0.5.4/t/30-kmer-chimeras.t0000644000175000017500000001157012263016714016737 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read, $nof_reads); my %chim_sizes; my %refs; my %expected; my $delta = 0.2; # Bimeras ok $factory = Grinder->new( -reference_file => data('kmers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (1) , -chimera_kmer => 8 , -total_reads => 300 , ), 'Bimeras'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); is scalar @refs, 2; for my $ref (@refs) { $refs{$ref}++; } } ok exists $refs{'seq1'}; ok exists $refs{'seq2'}; ok exists $refs{'seq3'}; ok exists $refs{'seq4'}; ok not exists $refs{'seq5'}; # Trimeras ok $factory = Grinder->new( -reference_file => data('kmers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (0, 1) , -chimera_kmer => 8 , -total_reads => 300 , ), 'Trimeras'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); is scalar @refs, 3; for my $ref (@refs) { $refs{$ref}++; } } ok exists $refs{'seq1'}; ok exists $refs{'seq2'}; ok exists $refs{'seq3'}; ok exists $refs{'seq4'}; ok not exists $refs{'seq5'}; # Quadrameras ok $factory = Grinder->new( -reference_file => data('kmers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (0, 0, 1) , -chimera_kmer => 8 , -total_reads => 300 , ), 'Quadrameras'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); is scalar @refs, 4; for my $ref (@refs) { $refs{$ref}++; } } ok exists $refs{'seq1'}; ok exists $refs{'seq2'}; ok exists $refs{'seq3'}; ok exists $refs{'seq4'}; ok not exists $refs{'seq5'}; # 100% chimeras (bimeras, trimeras, quadrameras) ok $factory = Grinder->new( -reference_file => data('kmers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (1, 1, 1) , -chimera_kmer => 8 , -total_reads => 1000 , ), '100% chimeras (bimeras, trimeras, quadrameras)'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); my $nof_refs = scalar @refs; $chim_sizes{$nof_refs}++; between_ok( $nof_refs, 2, 4 ); for my $ref (@refs) { $refs{$ref}++; } } between_ok( $chim_sizes{2}, 333.3 * (1-$delta), 333.3 * (1+$delta) ); between_ok( $chim_sizes{3}, 333.3 * (1-$delta), 333.3 * (1+$delta) ); between_ok( $chim_sizes{4}, 333.3 * (1-$delta), 333.3 * (1+$delta) ); ok exists $refs{'seq1'}; ok exists $refs{'seq2'}; ok exists $refs{'seq3'}; ok exists $refs{'seq4'}; ok not exists $refs{'seq5'}; # From equal abundance sequences ok $factory = Grinder->new( -reference_file => data('kmers2.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (0, 0, 0, 0, 1) , -chimera_kmer => 8 , -total_reads => 1000 , ), 'From equal abundance sequences'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); is scalar @refs, 6; for my $ref (@refs) { $refs{$ref}++; } } %expected = ( 'seq1' => 6000 * 4/18, 'seq2' => 6000 * 6/18, 'seq3' => 6000 * 8/18, ); between_ok $refs{'seq1'}, $expected{'seq1'}*(1-$delta), $expected{'seq1'}*(1+$delta); between_ok $refs{'seq2'}, $expected{'seq2'}*(1-$delta), $expected{'seq2'}*(1+$delta); between_ok $refs{'seq3'}, $expected{'seq3'}*(1-$delta), $expected{'seq3'}*(1+$delta); # From differentially abundant sequences ok $factory = Grinder->new( -reference_file => data('kmers2.fa') , -abundance_file => data('abundance_kmers.txt'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (0, 0, 0, 0, 1) , -chimera_kmer => 8 , -total_reads => 1000 , ), 'From differentially abundant sequences'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); is scalar @refs, 6; for my $ref (@refs) { $refs{$ref}++; } } $delta = 0.2; cmp_ok $refs{'seq2'}, '<', 1100; # seq1 and seq3 should occur as frequently cmp_ok $refs{'seq1'}, '>', 2300; cmp_ok $refs{'seq3'}, '>', 2300; between_ok $refs{'seq1'}, $expected{'seq3'}*(1-$delta), $expected{'seq3'}*(1+$delta); done_testing(); Grinder-0.5.4/t/19-gene-copy-bias.t0000644000175000017500000000517312263016714017023 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, %sources); # Specified genome abundance for a single library, no copy bias ok $factory = Grinder->new( -abundance_file => data('abundances2.txt') , -reference_file => data('multiple_amplicon_database.fa'), -forward_reverse => data('forward_reverse_primers.fa') , -copy_bias => 0 , -unidirectional => 1 , -read_dist => 48 , -random_seed => 1910567890 , -total_reads => 1000 , ), 'Genome abundance for a single libraries'; while ( $read = $factory->next_read ) { my $source = $read->reference->id; # Strip amplicon sources of the 'amplicon' part $source =~ s/_amplicon.*$//; if (not exists $sources{$source}) { $sources{$source} = 1; } else { $sources{$source}++; } }; ok exists $sources{'seq1'}; ok exists $sources{'seq2'}; ok exists $sources{'seq3'}; # These tests are quite sensitive to the seed used. Ideal average answer should # be 600, 300 and 100 between_ok( $sources{'seq1'}, 580, 620 ); between_ok( $sources{'seq2'}, 280, 320 ); between_ok( $sources{'seq3'}, 80, 120 ); is $factory->next_lib, undef; %sources = (); # Specified genome abundance for a single library ok $factory = Grinder->new( -abundance_file => data('abundances2.txt') , -reference_file => data('multiple_amplicon_database.fa'), -forward_reverse => data('forward_reverse_primers.fa') , -copy_bias => 1 , -unidirectional => 1 , -read_dist => 48 , -random_seed => 1910567890 , -total_reads => 1000 , ), 'Genome abundance for a single libraries'; while ( $read = $factory->next_read ) { my $source = $read->reference->id; # Strip amplicon sources of the 'amplicon' part $source =~ s/_amplicon.*$//; if (not exists $sources{$source}) { $sources{$source} = 1; } else { $sources{$source}++; } }; ok exists $sources{'seq1'}; ok exists $sources{'seq2'}; ok exists $sources{'seq3'}; # These tests are quite sensitive to the seed used. Ideal average answer should # be 387.1, 580.6 and 32.3 between_ok( $sources{'seq1'}, 367, 407 ); between_ok( $sources{'seq2'}, 560, 600 ); between_ok( $sources{'seq3'}, 12, 52 ); is $factory->next_lib, undef; %sources = (); done_testing(); Grinder-0.5.4/t/16-profile.t0000644000175000017500000000244212263016714015652 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read); # No profile ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -read_dist => 50 , -total_reads => 100 , -unidirectional => 1 , ), 'No profile'; while ( $read = $factory->next_read ) { is $read->seq, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'; }; # Grinder profile file that contains the same parameters as the previous test ok $factory = Grinder->new( -profile_file => data('profile.txt'), ), 'Grinder profile'; while ( $read = $factory->next_read ) { is $read->seq, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'; }; # A mix of profile and other command-line arguments ok $factory = Grinder->new( -desc_track => 0 , -num_libraries => 2 , -profile_file => data('profile.txt'), -multiplex_ids => data('mids.fa') , -shared_perc => 100 , ), 'Mix of profile and manually-specified options'; while ( $read = $factory->next_read ) { is $read->seq, 'ACGTaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'; is $read->desc, undef; }; done_testing(); Grinder-0.5.4/t/13-insert-length.t0000644000175000017500000000535312263016714016776 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $mate1, $mate2, @ilengths, $min, $max, $mean, $stddev, $hist, $ehist, $coeff); # All inserts the same length ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -total_reads => 1000 , -read_dist => 50 , -insert_dist => 150 , ), 'Same size inserts'; while ( $mate1 = $factory->next_read ) { $mate2 = $factory->next_read; # insert size includes mate1 + spacer + mate2 push @ilengths, insert_length($mate1, $mate2); }; ($min, $max, $mean, $stddev) = stats(\@ilengths); is $min, 150; is $max, 150; is $mean, 150; is $stddev, 0; @ilengths = (); # Uniformly distributed inserts ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -total_reads => 1000 , -read_dist => 50 , -insert_dist => (150, 'uniform', 15) , ), 'Uniform distribution'; while ( $mate1 = $factory->next_read ) { $mate2 = $factory->next_read; # insert size includes mate1 + spacer + mate2 push @ilengths, insert_length($mate1, $mate2); }; ($min, $max, $mean, $stddev) = stats(\@ilengths); cmp_ok $min, '>=', 135; cmp_ok $max, '<=', 165; between_ok( $mean, 148, 152 ); between_ok( $stddev, 7, 10 ); $hist = hist(\@ilengths, 50, 250); $ehist = uniform(50, 250, 135, 165, 1000); $coeff = corr_coeff($hist, $ehist, $mean); cmp_ok $coeff, '>', 0.99; SKIP: { skip rfit_msg() if not can_rfit(); test_uniform_dist(\@ilengths, 135, 165); } @ilengths = (); # Normally distributed inserts ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -total_reads => 1000 , -read_dist => 50 , -insert_dist => (150, 'normal', 10) , ), 'Normal distribution'; while ( $mate1 = $factory->next_read ) { $mate2 = $factory->next_read; # insert size includes mate1 + spacer + mate2 push @ilengths, insert_length($mate1, $mate2); }; ($min, $max, $mean, $stddev) = stats(\@ilengths); between_ok( $mean, 149, 151 ); # should be 150 between_ok( $stddev, 9, 11 ); $hist = hist(\@ilengths, 50, 250); $ehist = normal(50, 250, $mean, $stddev**2, 1000); $coeff = corr_coeff($hist, $ehist, $mean); cmp_ok $coeff, '>', 0.99; SKIP: { skip rfit_msg() if not can_rfit(); test_normal_dist(\@ilengths, 150, 10); } @ilengths = (); done_testing(); sub insert_length { my ($mate1, $mate2) = @_; if ($mate1->end > $mate2->end) { ($mate1, $mate2) = ($mate2, $mate1); } my $length = $mate2->end - $mate1->start + 1; return $length; } Grinder-0.5.4/t/02-mates.t0000644000175000017500000000402012647201570015312 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use Test::Warn; use t::TestUtils; use Grinder; my ($factory, $read, $nof_reads); ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -total_reads => 101 , -read_dist => 48 , -insert_dist => 250 , ), 'Mate pairs'; warning_like { $factory->next_lib } qr{.*added a read.*}i; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_mate($read, undef, $nof_reads); }; is $nof_reads, 102; # Coverage fold ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -coverage_fold => 6.04 , -insert_dist => 250 , ), 'Coverage fold'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; }; is $nof_reads, 112; done_testing(); sub ok_mate { my ($read, $req_strand, $nof_reads) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; my $source = $read->reference->id; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $letters; if ( $source eq 'seq1' ) { $letters = 'a'; } elsif ( $source eq 'seq2' ) { $letters = 'c'; } elsif ( $source eq 'seq3' ) { $letters = 'g'; } elsif ( $source eq 'seq4' ) { $letters = 't'; } elsif ( $source eq 'seq5' ) { $letters = 'atg'; } elsif ( $source eq 'seq7' ) { $letters = 'a'; } if ( $req_strand == -1 ) { # Take the reverse complement $letters = Bio::PrimarySeq->new( -seq => $letters )->revcom->seq; }; like $read->seq, qr/[$letters]+/; my $id = round($nof_reads/2).'/'.($nof_reads%2?1:2); is $read->id, $id; if ($source eq 'seq7') { is $read->length, 1; } else { is $read->length, 48; } } Grinder-0.5.4/t/12-read-length.t0000644000175000017500000000401612263016714016377 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, @rlengths, $min, $max, $mean, $stddev, $hist, $ehist, $coeff); # All sequences the same length ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -read_dist => 50 , -total_reads => 1000 , ), 'Same length reads'; while ( $read = $factory->next_read ) { push @rlengths, $read->length; }; ($min, $max, $mean, $stddev) = stats(\@rlengths); is $min, 50; is $max, 50; is $mean, 50; is $stddev, 0; @rlengths = (); # Uniform distribution ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -read_dist => (50, 'uniform', 10) , -total_reads => 1000 , ), 'Uniform distribution'; while ( $read = $factory->next_read ) { push @rlengths, $read->length; }; ($min, $max, $mean, $stddev) = stats(\@rlengths); is $min, 40; is $max, 60; is round($mean), 50; between_ok( $stddev, 5.3, 6.3 ); # should be 5.79 $hist = hist(\@rlengths, 1, 100); $ehist = uniform(1, 100, 40, 60, 1000); $coeff = corr_coeff($hist, $ehist, $mean); cmp_ok $coeff, '>', 0.99; SKIP: { skip rfit_msg() if not can_rfit(); test_uniform_dist(\@rlengths, 40, 60); } @rlengths = (); # Normal distribution ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -read_dist => (50, 'normal', 5) , -total_reads => 1000 , ), 'Normal distribution'; while ( $read = $factory->next_read ) { push @rlengths, $read->length; } ($min, $max, $mean, $stddev) = stats(\@rlengths); between_ok( $mean, 49, 51 ); # should be 50.0 between_ok( $stddev, 4.5, 5.5 ); # should be 5.0 $hist = hist(\@rlengths, 1, 100); $ehist = normal(1, 100, $mean, $stddev**2, 1000); $coeff = corr_coeff($hist, $ehist, $mean); cmp_ok $coeff, '>', 0.99; SKIP: { skip rfit_msg() if not can_rfit(); test_normal_dist(\@rlengths, 50, 5); } @rlengths = (); done_testing(); Grinder-0.5.4/t/22-homopolymers.t0000644000175000017500000002053712263016714016751 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, $hpols, $min, $max, $mean, $stddev, $expected_mean, $expected_stddev, $hist, $ehist, $coeff); my $delta = 0.20; # 20% my $min_coeff = 0.75; # Balzer homopolymer distribution ok $factory = Grinder->new( -reference_file => data('homopolymer_database.fa'), -unidirectional => 1 , -read_dist => 220 , -total_reads => 1000 , -homopolymer_dist => 'balzer' , ), 'Balzer'; while ( $read = $factory->next_read ) { my ($error_str) = ($read->desc =~ /errors=(\S+)/); $hpols = add_homopolymers($error_str, $read->reference->seq, $hpols); if ($error_str) { unlike $error_str, qr/%/; like $error_str, qr/[+-]/; } else { ok 1; ok 1; } } for my $homo_len ( sort {$b <=> $a} (keys %$hpols) ) { last if $homo_len <= 3; ### TODO: go up to 2 my $values = $$hpols{$homo_len}; ($min, $max, $mean, $stddev) = stats($values); ($expected_mean, $expected_stddev) = balzer($homo_len); #print "Balzer homopolymer length: $homo_len\n"; #print " expected mean = $expected_mean, expected stddev = $expected_stddev\n"; #print " min = $min, max = $max, mean = $mean, stddev = $stddev\n"; between_ok( $mean, (1-$delta)*$expected_mean, (1+$delta)*$expected_mean ); between_ok( $stddev, (1-$delta)*$expected_stddev, (1+$delta)*$expected_stddev ); $hist = hist($$hpols{$homo_len}, 1, 20); $ehist = normal(1, 20, $mean, $stddev**2, 4000); # 4 homopolymers of each size in the 1000 reads $coeff = corr_coeff($hist, $ehist, $mean); cmp_ok $coeff, '>', $min_coeff; #### TODO: Better test of normality #SKIP: { # skip rfit_msg() if not can_rfit(); # test_normal_dist($values, $mean, $stddev); #} } $hpols = {}; # Richter homopolymer distribution ok $factory = Grinder->new( -reference_file => data('homopolymer_database.fa'), -unidirectional => 1 , -read_dist => 220 , -total_reads => 1000 , -homopolymer_dist => 'richter' , ), 'Richter'; while ( $read = $factory->next_read ) { my ($error_str) = ($read->desc =~ /errors=(\S+)/); $hpols = add_homopolymers($error_str, $read->reference->seq, $hpols); if ($error_str) { unlike $error_str, qr/%/; like $error_str, qr/[+-]/; } else { ok 1; ok 1; } } for my $homo_len ( sort {$b <=> $a} (keys %$hpols) ) { last if $homo_len <= 3; ### TODO: go up to 2 my $values = $$hpols{$homo_len}; ($min, $max, $mean, $stddev) = stats($values); ($expected_mean, $expected_stddev) = richter($homo_len); #print "Richter homopolymer length: $homo_len\n"; #print " expected mean = $expected_mean, expected stddev = $expected_stddev\n"; #print " min = $min, max = $max, mean = $mean, stddev = $stddev\n"; between_ok( $mean, (1-$delta)*$expected_mean, (1+$delta)*$expected_mean ); between_ok( $stddev, (1-$delta)*$expected_stddev, (1+$delta)*$expected_stddev ); $hist = hist($$hpols{$homo_len}, 1, 20); $ehist = normal(1, 20, $mean, $stddev**2, 4000); # 4 homopolymers of each size in the 1000 reads $coeff = corr_coeff($hist, $ehist, $mean); cmp_ok $coeff, '>', $min_coeff; #### TODO: Better test of normality #SKIP: { # skip rfit_msg() if not can_rfit(); # test_normal_dist($values, $mean, $stddev); #} } $hpols = {}; # Margulies homopolymer distribution ok $factory = Grinder->new( -reference_file => data('homopolymer_database.fa'), -unidirectional => 1 , -read_dist => 220 , -total_reads => 1000 , -homopolymer_dist => 'margulies' , ), 'Margulies'; while ( $read = $factory->next_read ) { my ($error_str) = ($read->desc =~ /errors=(\S+)/); $hpols = add_homopolymers($error_str, $read->reference->seq, $hpols); if ($error_str) { unlike $error_str, qr/%/; like $error_str, qr/[+-]/; } else { ok 1; ok 1; } } for my $homo_len ( sort {$b <=> $a} (keys %$hpols) ) { last if $homo_len <= 3; ### TODO: go up to 2 ($expected_mean, $expected_stddev) = margulies($homo_len); my $values = $$hpols{$homo_len}; ($min, $max, $mean, $stddev) = stats($values); #print "Margulies homopolymer length: $homo_len\n"; #print " expected mean = $expected_mean, expected stddev = $expected_stddev\n"; #print " min = $min, max = $max, mean = $mean, stddev = $stddev\n"; between_ok( $mean, (1-$delta)*$expected_mean, (1+$delta)*$expected_mean ); between_ok( $stddev, (1-$delta)*$expected_stddev, (1+$delta)*$expected_stddev ); $hist = hist($$hpols{$homo_len}, 1, 20); $ehist = normal(1, 20, $mean, $stddev**2, 4000); # 4 homopolymers of each size in the 1000 reads $coeff = corr_coeff($hist, $ehist, $mean); cmp_ok $coeff, '>', $min_coeff; #### TODO: Better test of normality #SKIP: { # skip rfit_msg() if not can_rfit(); # test_normal_dist($values, $mean, $stddev); #} } $hpols = {}; done_testing(); sub add_homopolymers { my ($err_str, $ref_seq, $err_h) = @_; # Record position and length of homopolymer errors my %errors; if (defined $err_str) { $err_str = combine_dels($err_str, $ref_seq); my @errors = split ',', $err_str; for my $error (@errors) { # Record homopolymer error my ($pos, $type, $repl) = ($error =~ m/(\d+)([%+-])(.*)/i); my $elen; # error length if ($type eq '-') { $repl .= '-'; $elen = - length($repl); } elsif ($type eq '+') { $elen = + length($repl); } $errors{$pos} = $elen; # Test that proper residue was added to homopolymer my $hres = substr $ref_seq, $pos-1, 1; # first residue of homopolymer my $new_res = substr $repl, 0, 1; # residue to add to homopolymer if ( $type eq '+' ) { # in case of insertion (not deletion) die if not $hres eq $new_res; } } } # Record all homopolymer lengths while ( $ref_seq =~ m/(.)(\1+)/g ) { # Found a homopolymer my $hlen = length($2) + 1; # length of the error-free homopolymer my $pos = pos($ref_seq) - $hlen + 1; # start of the homopolymer (residue no.) my $elen = $hlen + ($errors{$pos} || 0); # length of the error-containing homopolymer push @{$$err_h{$hlen}}, $elen; } return $err_h; } sub combine_dels { # Put homopolymer deletions at adjacent position into a single error entry # (but only if they affect the same homopolymer) # Ex: 45-,46-,47- becomes 45--- my ($err_str, $ref_seq) = @_; my %errors; for my $error (split ',', $err_str) { my ($pos, $type, $repl) = ($error =~ m/(\d+)([%+-])([a-z]*)/i); if ($type eq '-') { # Keep track of what was deleted $repl = substr $ref_seq, $pos-1, 1; $errors{$pos}{'-'} = [ $repl ]; } else { push @{$errors{$pos}{$type}}, $repl; } } for my $pos (sort {$b <=> $a} (keys %errors)) { if ( exists $errors{$pos}{'-'} && exists $errors{$pos-1} && exists $errors{$pos-1}{'-'} && ($errors{$pos}{'-'}[0] eq $errors{$pos-1}{'-'}[0]) ) { push @{$errors{$pos-1}{'-'}}, @{$errors{$pos}{'-'}}; delete $errors{$pos}{'-'}; delete $errors{$pos} if scalar keys %{$errors{$pos}} == 0; } } $err_str = ''; for my $pos (sort {$a <=> $b} (keys %errors)) { while ( my ($type, $repls) = each %{$errors{$pos}} ) { my $repl; if ($type eq '-') { $repl = '-' x (scalar @$repls - 1); } else { $repl = join '', @$repls; } $err_str .= $pos.$type.$repl.','; } } $err_str =~ s/,$//; return $err_str; } sub margulies { my ($homo_len) = @_; my $mean = $homo_len; my $stddev = $homo_len * 0.15; return $mean, $stddev; } sub richter { my ($homo_len) = @_; my $mean = $homo_len; my $stddev = sqrt($homo_len) * 0.15; return $mean, $stddev; } sub balzer { my ($homo_len) = @_; my $mean = $homo_len; my $variance = 0.03494 + $homo_len * 0.06856; return $mean, $stddev; } Grinder-0.5.4/t/24-mate-orientation.t0000644000175000017500000001651112263016714017472 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read1, $read2, $nof_reads); my $total_reads = 100; # FR-oriented mates ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => +1 , -mate_orientation => 'FR' , ), 'FR-oriented mates'; $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'FR'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => -1 , -mate_orientation => 'FR' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'FR'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => 0 , -mate_orientation => 'FR' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read2, $read1), 'FR'; }; is $nof_reads, $total_reads; # FF-oriented mates ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => +1 , -mate_orientation => 'FF' , ), 'FF-oriented mates'; $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'FF'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => -1 , -mate_orientation => 'FF' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'RR'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => 0 , -mate_orientation => 'FF' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; like type($read1, $read2), qr/(FF|RR)/; }; is $nof_reads, $total_reads; # RF-oriented mates ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => +1 , -mate_orientation => 'RF' , ), 'RF-oriented mates'; $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'RF'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => -1 , -mate_orientation => 'RF' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'RF'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => 0 , -mate_orientation => 'RF' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'RF'; }; is $nof_reads, $total_reads; # RR-oriented mates ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => +1 , -mate_orientation => 'RR' , ), 'RR-oriented mates'; $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'RR'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => -1 , -mate_orientation => 'RR' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; is type($read1, $read2), 'FF'; }; is $nof_reads, $total_reads; ok $factory = Grinder->new( -reference_file => data('oriented_database.fa'), -total_reads => $total_reads , -read_dist => 80 , -insert_dist => 240 , -unidirectional => 0 , -mate_orientation => 'RR' , ); $nof_reads = 0; while ( $read1 = $factory->next_read ) { $read2 = $factory->next_read; $nof_reads += 2; like type($read1, $read2), qr/(FF|RR)/; }; is $nof_reads, $total_reads; done_testing(); sub type { my ($read1, $read2) = @_; my $read1_t = { 'CCCaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa' => 'F', 'tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttGGG' => 'R', }; my $read2_t = { 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaTTT' => 'F', 'AAAttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt' => 'R', }; my $type = ''; if ( (exists $read1_t->{$read1->seq}) && (exists $read2_t->{$read2->seq}) ) { $type = $read1_t->{$read1->seq}.$read2_t->{$read2->seq}; } else { $type = $read1_t->{$read2->seq}.$read2_t->{$read1->seq}; } return $type; } Grinder-0.5.4/t/14-genome-length-bias.t0000644000175000017500000000215612263016714017657 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, %sources); # Specified genome abundance for a single library ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -abundance_file => data('abundances.txt') , -length_bias => 1 , -random_seed => 1910567890 , -total_reads => 1000 , ), 'Genome abundance for a single libraries'; while ( $read = $factory->next_read ) { my $source = $read->reference->id; if (not exists $sources{$source}) { $sources{$source} = 1; } else { $sources{$source}++; } }; ok exists $sources{'seq1'}; ok exists $sources{'seq2'}; ok not exists $sources{'seq3'}; ok exists $sources{'seq4'}; ok exists $sources{'seq5'}; # These tests are quite sensitive to the seed used between_ok( $sources{'seq1'}, 414, 477 ); # avg = 444 between_ok( $sources{'seq2'}, 303, 363 ); # avg = 333 between_ok( $sources{'seq4'}, 81, 141 ); # avg = 111 between_ok( $sources{'seq5'}, 81, 141 ); # avg = 111 done_testing(); Grinder-0.5.4/t/21-errors.t0000644000175000017500000001635612264032430015525 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, @epositions, $min, $max, $mean, $stddev, $prof, $eprof, $coeff, $nof_indels, $nof_substs); # No errors by default ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 50 , -total_reads => 1000 , ), 'No errors'; while ( $read = $factory->next_read ) { is $read->seq, 'aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa'; unlike $read->desc, qr/errors/; } # Substitutions ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 50 , -total_reads => 1000 , -mutation_ratio => (100, 0) , -mutation_dist => ('uniform', 10) , ), 'Substitutions only'; while ( $read = $factory->next_read ) { my ($error_str) = ($read->desc =~ /errors=(\S+)/); if ($error_str) { like $error_str, qr/%/; unlike $error_str, qr/[-+]/; } else { ok 1; ok 1; } } # Indels ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 50 , -total_reads => 1000 , -mutation_ratio => (0, 100) , -mutation_dist => ('uniform', 10) , ), 'Indels only'; while ( $read = $factory->next_read ) { my ($error_str) = ($read->desc =~ /errors=(\S+)/); if ($error_str) { unlike $error_str, qr/%/; like $error_str, qr/[-+]/; } else { ok 1; ok 1; } } # Indels and substitutions ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 50 , -total_reads => 1000 , -mutation_ratio => (50, 50) , -mutation_dist => ('uniform', 10) , ), 'Indels and substitutions'; while ( $read = $factory->next_read ) { my ($error_str) = ($read->desc =~ /errors=(\S+)/); if ($error_str) { like $error_str, qr/[-+%]/; $nof_indels += ($error_str =~ tr/-+//); $nof_substs += ($error_str =~ tr/%//); } else { ok 1; } } between_ok( $nof_substs / $nof_indels, 0.92, 1.08 ); # should be 1 # Uniform distribution (frequent errors) ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 50 , -total_reads => 1000 , -mutation_ratio => (50, 50) , -mutation_dist => ('uniform', 10) , -random_seed => 1233567880 , ), 'Uniform (frequent errors)'; while ( $read = $factory->next_read ) { my @positions = error_positions($read); push @epositions, @positions if scalar @positions > 0; } $prof = hist(\@epositions, 1, 50); ($min, $max, $mean, $stddev) = stats($prof); between_ok( $$prof[0] , 70, 130 ); # exp. number of errors at 1st pos is 100 (10%) between_ok( $$prof[24], 70, 130 ); # exp. number of errors at 25th pos is 100 (10%) between_ok( $$prof[-1], 70, 130 ); # exp. number of errors at last pos is 100 (10%) between_ok( $mean , 97, 103 ); # exp. mean number is 100 (10%) SKIP: { skip rfit_msg() if not can_rfit(); test_uniform_dist(\@epositions, 1, 50); } @epositions = (); # Uniform distribution (rare errors) ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 50 , -total_reads => 10000 , -mutation_ratio => (50, 50) , -mutation_dist => ('uniform', 0.1) , -random_seed => 1233567880 , ), 'Uniform (rare errors)'; while ( $read = $factory->next_read ) { my @positions = error_positions($read); push @epositions, @positions if scalar @positions > 0; } $prof = hist(\@epositions, 1, 50); ($min, $max, $mean, $stddev) = stats($prof); between_ok( $$prof[0] , 4, 16 ); # exp. number of errors at 1st pos is 100 (10%) between_ok( $$prof[24], 4, 16 ); # exp. number of errors at 25th pos is 100 (10%) between_ok( $$prof[-1], 4, 16 ); # exp. number of errors at last pos is 100 (10%) between_ok( $mean , 8, 12 ); # exp. mean number is 100 (10%) SKIP: { skip rfit_msg() if not can_rfit(); test_uniform_dist(\@epositions, 1, 50); } @epositions = (); # Linear distribution ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 50 , -total_reads => 1000 , -mutation_ratio => (50, 50) , -mutation_dist => ('linear', 5, 15) , -random_seed => 1233567880 , ), 'Linear'; while ( $read = $factory->next_read ) { my @positions = error_positions($read); push @epositions, @positions if scalar @positions > 0; } $prof = hist(\@epositions, 1, 50); ($min, $max, $mean, $stddev) = stats($prof); between_ok( $$prof[0] , 30, 70 ); # exp. number of errors at 1st pos is 50 (5%) between_ok( $$prof[24], 65, 135 ); # exp. number of errors at 25th pos is 100 (10%) between_ok( $$prof[-1], 115, 185 ); # exp. number of errors at last pos is 150 (15%) between_ok( $mean , 97, 103 ); # exp. mean number of errors is 100 #SKIP: { #skip rfit_msg() if not can_rfit(); #### TODO #TODO: { # $TODO = "Need to implement a linear density distribution in R"; # test_linear_dist(\@epositions, 1, 50, 0.0000000001); #} #} @epositions = (); # Fourth degree polynomial distribution ok $factory = Grinder->new( -reference_file => data('single_seq_database.fa'), -unidirectional => 1 , -read_dist => 100 , -total_reads => 1000 , -mutation_ratio => (50, 50) , -mutation_dist => ('poly4', 1, 4.4e-7) , -random_seed => 1233567880 , ), 'Polynomial'; while ( $read = $factory->next_read ) { my @positions = error_positions($read); push @epositions, @positions if scalar @positions > 0; } $prof = hist(\@epositions, 1, 100); ($min, $max, $mean, $stddev) = stats($prof); between_ok( $$prof[0] , 1, 27 ); # exp. number of errors at 1st is 10 (1%) between_ok( $$prof[49], 7, 67 ); # exp. number of errors at 50th is 37.4 (3.74%) between_ok( $$prof[-1], 405, 492 ); # exp. number of errors at last is 449 (44.9%) between_ok( $mean , 97, 103 ); # exp. mean number of errors is 100 (10.02%) #SKIP: { #skip rfit_msg() if not can_rfit(); #### TODO #TODO: { # $TODO = "Need to implement a polynomial distribution in R"; # test_polynomial_dist(\@epositions, 1, 50, 0.0000000001); #} #} @epositions = (); done_testing(); Grinder-0.5.4/t/07-diversity.t0000644000175000017500000000375512263016714016244 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, %sources); # Single library, single diversity ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567880 , -total_reads => 100 , -diversity => 2 , ), 'Single library, single diversity'; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$source} = undef; }; is scalar keys %sources, 2; %sources = (); # Two libraries, single diversity ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567880 , -total_reads => 100 , -num_libraries => 2 , -diversity => 2 , ), 'Two libraries, single diversity'; $factory->next_lib; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$source} = undef; }; is scalar keys %sources, 2; %sources = (); $factory->next_lib; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$source} = undef; }; is scalar keys %sources, 2; %sources = (); # Two libraries, two diversities ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567880 , -total_reads => 100 , -num_libraries => 2 , -diversity => (2, 3) , ), 'Two libraries, two diversities'; $factory->next_lib; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$source} = undef; }; is scalar keys %sources, 2; %sources = (); $factory->next_lib; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$source} = undef; }; is scalar keys %sources, 3; %sources = (); done_testing(); Grinder-0.5.4/t/27-stdin.t0000644000175000017500000000434212263016714015336 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read, $nof_reads); # Feed __DATA__ content to stdin *STDIN = *DATA; ok $factory = Grinder->new( -reference_file => '-' , -total_reads => 100 , -read_dist => 48 , ), 'Input from stdin'; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, undef, $nof_reads); }; is $nof_reads, 100; sub ok_read { my ($read, $req_strand, $nof_reads) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; my $source = $read->reference->id; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $letters; if ( $source eq 'seq1' ) { $letters = 'a'; } elsif ( $source eq 'seq2' ) { $letters = 'c'; } elsif ( $source eq 'seq3' ) { $letters = 'g'; } elsif ( $source eq 'seq4' ) { $letters = 't'; } elsif ( $source eq 'seq5' ) { $letters = 'atg'; } if ( $req_strand == -1 ) { # Take the reverse complement $letters = Bio::PrimarySeq->new( -seq => $letters )->revcom->seq; }; like $read->seq, qr/[$letters]+/; is $read->id, $nof_reads; is $read->length, 48; } done_testing(); __DATA__ >seq1 this is the first sequence aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa >seq2 cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc cccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccccc >seq3 gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg gggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggggg >seq4 tttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttttt >seq5 last sequence, last comment aaaaaaaaaattttttttttttttttttttttttttttttttttttttttttttttttttttttttttttgggggggggg Grinder-0.5.4/t/05-forbidden.t0000644000175000017500000000312312265727045016151 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read); # Exclude forbidden characters ok $factory = Grinder->new( -reference_file => data('dirty_database.fa'), -read_dist => 80 , -random_seed => 1233567890 , -total_reads => 10 , ), 'With dubious chars'; ok $read = $factory->next_read; like $read->seq, qr/[N-]/i; ok $factory = Grinder->new( -reference_file => data('dirty_database.fa'), -exclude_chars => 'n-' , # case independent -read_dist => 30 , -random_seed => 1233756782 , -total_reads => 10 , ), 'Exclude chars'; while ( $read = $factory->next_read ) { unlike $read->seq, qr/[N-]/i; } ok $factory = Grinder->new( -reference_file => data('dirty_database.fa'), -exclude_chars => 'N-' , -read_dist => 71 , -random_seed => 1233567890 , -total_reads => 10 , ), 'Cannot generate read'; eval { $read = $factory->next_read }; like $@, qr/error/i; # Delete forbidden characters ok $factory = Grinder->new( -reference_file => data('dirty_database.fa'), -delete_chars => 'N-' , -read_dist => 70 , -random_seed => 1233567890 , -total_reads => 10 , ), 'Delete chars'; while ( $read = $factory->next_read ) { unlike $read->seq, qr/[N-]/i; } done_testing(); Grinder-0.5.4/t/26-combined-errors.t0000644000175000017500000000363112263016714017306 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read); # Combined errors: indels, substitutions, homopolymers, chimeras ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -unidirectional => 1 , -read_dist => 48 , -total_reads => 1000 , -homopolymer_dist => 'balzer' , -mutation_ratio => (100, 0) , -mutation_dist => ('uniform', 10) , -chimera_perc => 10 , -chimera_dist => (100) , -chimera_kmer => 0 , ), 'Combined errors (uniform)'; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; isa_ok $read, 'Bio::Seq::SimulatedRead'; is $read->id, $nof_reads; }; is $nof_reads, 1000; # Combined errors with linear model ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -unidirectional => 1 , -read_dist => (20, 'normal', 10) , -total_reads => 1000 , -homopolymer_dist => 'balzer' , -mutation_ratio => (85, 15) , -mutation_dist => ('linear', 2, 2) , -chimera_perc => 10 , -chimera_dist => (100) , -chimera_kmer => 0 , ), 'Combined errors (linear)'; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; isa_ok $read, 'Bio::Seq::SimulatedRead'; is $read->id, $nof_reads; }; is $nof_reads, 1000; done_testing(); Grinder-0.5.4/t/20-community-structure.t0000644000175000017500000001713112264034610020264 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, @reads, $ra, $era, $coeff, $min, $max, $mean, $stddev, $struct, $param1, $param2); my $nof_refs = 6; my $max_refs = 10; # Uniform community structure ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -length_bias => 0 , -abundance_model => ('uniform', 0) , -total_reads => 1000 , -random_seed => 1234567890 , ), 'Uniform community structure'; while ( $read = $factory->next_read ) { push @reads, $read->reference->id; } $ra = rank_abundance(\@reads, $max_refs); ($min, $max, $mean, $stddev) = stats($ra); $era = uniform_cstruct($max_refs, $nof_refs, 1000); $coeff = corr_coeff($ra, $era, $mean); cmp_ok $coeff, '>', 0.97; @reads = (); # Linear community structure ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -length_bias => 0 , -abundance_model => ('linear', 0) , -total_reads => 1000 , -random_seed => 1234567890 , ), 'Linear community structure'; while ( $read = $factory->next_read ) { push @reads, $read->reference->id; } $ra = rank_abundance(\@reads, $max_refs); ($min, $max, $mean, $stddev) = stats($ra); $era = linear_cstruct($max_refs, $nof_refs, 1000); $coeff = corr_coeff($ra, $era, $mean); cmp_ok $coeff, '>', 0.97; @reads = (); # Power law community structure ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -length_bias => 0 , -abundance_model => ('powerlaw', 0.5) , -total_reads => 1000 , -random_seed => 1234567890 , ), 'Power law community structure'; while ( $read = $factory->next_read ) { push @reads, $read->reference->id; } $ra = rank_abundance(\@reads, $max_refs); ($min, $max, $mean, $stddev) = stats($ra); $era = powerlaw_cstruct($max_refs, $nof_refs, 0.5, 1000); $coeff = corr_coeff($ra, $era, $mean); cmp_ok $coeff, '>', 0.97; @reads = (); # Logarithmic community structure ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -length_bias => 0 , -abundance_model => ('logarithmic', 0.5) , -total_reads => 1000 , -random_seed => 1234567890 , ), 'Logarithmic community structure'; while ( $read = $factory->next_read ) { push @reads, $read->reference->id; } $ra = rank_abundance(\@reads, $max_refs); ($min, $max, $mean, $stddev) = stats($ra); $era = logarithmic_cstruct($max_refs, $nof_refs, 0.5, 1000); $coeff = corr_coeff($ra, $era, $mean); cmp_ok $coeff, '>', 0.97; @reads = (); # Exponential community structure ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -length_bias => 0 , -abundance_model => ('exponential', 0.5) , -total_reads => 1000 , -random_seed => 1234567890 , ), 'Exponential community structure'; $struct = $factory->next_lib; while ( $read = $factory->next_read ) { push @reads, $read->reference->id; } $ra = rank_abundance(\@reads, $max_refs); ($min, $max, $mean, $stddev) = stats($ra); $era = exponential_cstruct($max_refs, $nof_refs, 0.5, 1000); $coeff = corr_coeff($ra, $era, $mean); cmp_ok $coeff, '>', 0.97; is $struct->{param}, 0.5; @reads = (); # Communities with random structure parameter value ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -length_bias => 0 , -num_libraries => 2 , -shared_perc => 100 , -abundance_model => ('exponential') , -total_reads => 1000 , -random_seed => 1234567890 , ), 'Communities with random structure parameter value'; $struct = $factory->next_lib; while ( $read = $factory->next_read ) { push @reads, $read->reference->id; } $ra = rank_abundance(\@reads, $max_refs); ($min, $max, $mean, $stddev) = stats($ra); $param1 = $struct->{param}; between_ok( $param1, 0, 1000 ); $era = exponential_cstruct($max_refs, $nof_refs, $param1, 1000); $coeff = corr_coeff($ra, $era, $mean); cmp_ok $coeff, '>', 0.97; @reads = (); $struct = $factory->next_lib; while ( $read = $factory->next_read ) { push @reads, $read->reference->id; } $ra = rank_abundance(\@reads, $max_refs); ($min, $max, $mean, $stddev) = stats($ra); $param2 = $struct->{param}; between_ok( $param2, 0, 1000 ); $era = exponential_cstruct($max_refs, $nof_refs, $param2, 1000); $coeff = corr_coeff($ra, $era, $mean); cmp_ok $coeff, '>', 0.97; isnt $param1, $param2; @reads = (); done_testing(); sub uniform_cstruct { # Evaluate the uniform function in the given integer range my ($x_max, $max, $num) = @_; my @ys; my $width = $max; for my $x (1 .. $x_max) { my $y; if ( $x <= $max ) { $y = $num / $width; } else { $y = 0; } push @ys, $y; } return \@ys; } sub linear_cstruct { # Evaluate the linear function in the given integer range my ($x_max, $max, $num) = @_; my @ys; my $sum = 0; for (my $x = $max; $x >= 1; $x--) { my $y = $x; $sum += $y; push @ys, $y; } for (my $x = 0; $x < $max; $x++) { $ys[$x] *= $num / $sum; } push @ys, (0) x ($x_max - scalar @ys); return \@ys; } sub powerlaw_cstruct { # Evaluate the power function in the given integer range my ($x_max, $max, $param, $num) = @_; my @ys; my $sum = 0; for my $x (1 .. $max) { my $y = $x**(-$param); $sum += $y; push @ys, $y; } for (my $x = 0; $x < $max; $x++) { $ys[$x] *= $num / $sum; } push @ys, (0) x ($x_max - scalar @ys); return \@ys; } sub logarithmic_cstruct { # Evaluate the logarithmic function in the given integer range my ($x_max, $max, $param, $num) = @_; my @ys; my $sum = 0; for my $x (1 .. $max) { my $y = (log($x+1))**(-$param); $sum += $y; push @ys, $y; } for (my $x = 0; $x < $max; $x++) { $ys[$x] *= $num / $sum; } push @ys, (0) x ($x_max - scalar @ys); return \@ys; } sub exponential_cstruct { # Evaluate the exponential function in the given integer range my ($x_max, $max, $param, $num) = @_; my @ys; my $sum = 0; for my $x (1 .. $max) { my $y = exp(-$x*$param); $sum += $y; push @ys, $y; } for (my $x = 0; $x < $max; $x++) { $ys[$x] *= $num / $sum; } push @ys, (0) x ($x_max - scalar @ys); return \@ys; } sub rank_abundance { my ($data, $max) = @_; # Put a data series into bins my %hash; for my $val (@$data) { $hash{$val}++; } my @y_data = sort { $b <=> $a } (values %hash); push @y_data, (0) x ($max - scalar @y_data); return \@y_data; } Grinder-0.5.4/t/01-shotgun.t0000644000175000017500000000405112647157127015703 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read); # Initialization with short argument ok $factory = Grinder->new( -rf => data('shotgun_database_extended.fa'), -tr => 10 , ), 'Shotgun & short arguments'; ok $factory->next_read; # Total reads (long argument) ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -total_reads => 100 , ), 'Long arguments'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, undef, $nof_reads); }; is $nof_reads, 100; # Coverage fold ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -read_dist => 48 , -coverage_fold => 6.04 , ), 'Coverage fold'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; }; is $nof_reads, 111; done_testing(); sub ok_read { my ($read, $req_strand, $nof_reads) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; my $source = $read->reference->id; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $letters; if ( $source eq 'seq1' ) { $letters = 'a'; } elsif ( $source eq 'seq2' ) { $letters = 'c'; } elsif ( $source eq 'seq3' ) { $letters = 'g'; } elsif ( $source eq 'seq4' ) { $letters = 't'; } elsif ( $source eq 'seq5' ) { $letters = 'atg'; } elsif ( $source eq 'seq7' ) { $letters = 'a'; } if ( $req_strand == -1 ) { # Take the reverse complement $letters = Bio::PrimarySeq->new( -seq => $letters )->revcom->seq; }; like $read->seq, qr/[$letters]+/; is $read->id, $nof_reads; if ($source eq 'seq7') { is $read->length, 1; } else { is $read->length, 48; } } Grinder-0.5.4/t/TestUtils.pm0000644000175000017500000003146412265202755016111 0ustar floflooofloflooopackage t::TestUtils; use strict; use warnings; use POSIX qw( floor ceil ); use Test::More; use File::Spec::Functions; use List::Util qw( min max ); use vars qw{@ISA @EXPORT}; BEGIN { @ISA = 'Exporter'; @EXPORT = qw{ PI round between_ok data get_references get_chars stats hist uniform normal corr_coeff write_data can_rfit rfit_msg error_positions test_normal_dist test_linear_dist test_uniform_dist }; } our $can_rfit; #------------------------------------------------------------------------------# # The Pi mathematical constant use constant PI => 4 * atan2(1, 1); sub round { # Round the number given as argument return int(shift() + 0.5); } sub between_ok { # Test that a value is in the given range (inclusive) my ($value, $min, $max) = @_; cmp_ok( $value, '>=', $min ) and cmp_ok( $value, '<=', $max ) or diag("Got $value but the allowed range was [$min, $max]"); } sub data { # Get the complete filename of a test data file return catfile('t', 'data', @_); } sub get_references { # Get the number of references that a read comes from my ($read) = @_; my $desc = $read->desc; $desc =~ m/reference=(\S+)/; my $refs = $1; my @refs = split(',', $refs); return @refs; } sub get_chars { # Return a hashref where the keys are the characters seen in the specified # string my ($string) = @_; my %chars; for my $pos (0 .. length($string)-1) { my $char = substr $string, $pos, 1; $chars{$char} = undef; } my @chars = keys %chars; return \%chars; } sub stats { # Calculates min, max, mean, stddev my ($vals) = @_; my ($min, $max, $mean, $sum, $sqsum, $stddev) = (1E99, 0, 0, 0, 0, 0); my $num = scalar @$vals; for my $val (@$vals) { $min = $val if $val < $min; $max = $val if $val > $max; $sum += $val; $sqsum += $val**2 } $mean = $sum / $num; $stddev = sqrt( $sqsum / $num - $mean**2 ); return $min, $max, $mean, $stddev; } sub hist { # Count the number of occurence of each integer: # E.g. given the arrayref: # [ 1, 1, 1, 3, 3, 4 ] # Return the arrayref: # [ 3, 0, 2, 4 ] # The min and the max of the range to consider can be given as an option my ($data, $min, $max) = @_; if (not defined $data) { die "Error: no data provided to hist()\n"; } my %hash; for my $val (@$data) { $hash{$val}++; } $min = min(@$data) if not defined $min; $max = max(@$data) if not defined $max; my @y_data; for my $x ($min .. $max) { push @y_data, $hash{$x} || 0; } return \@y_data; } sub normal { # Evaluate the normal function in the given integer range my ($x_min, $x_max, $mean, $variance, $num) = @_; my @ys; for my $x ($x_min .. $x_max) { my $proba = 1 / sqrt(2 * PI * $variance) * exp( - ($x - $mean)**2 / (2 * $variance)); my $y = $proba * $num; push @ys, $y; } return \@ys; } sub uniform { # Evaluate the uniform function in the given integer range my ($x_min, $x_max, $min, $max, $num) = @_; my @ys; my $width = $max - $min + 1; for my $x ($x_min .. $x_max) { my $y; if ( ($x >= $min) and ($x <= $max) ) { $y = $num / $width; } else { $y = 0; } push @ys, $y; } return \@ys; } sub corr_coeff { # The correlation coefficient R2 is # R2 = 1 - ( SSerr / SStot ) # where # SSerr = sum( (y - f)**2 ) # and # SStot = sum( (y - mean)**2 ) my ($y, $f, $mean) = @_; my $SSerr = 0; my $SStot = 0; for my $i ( 0 .. scalar @$y - 1 ) { #print " ".($i+1)." ".$$y[$i]." ".$$f[$i]."\n"; $SSerr += ($$y[$i] - $$f[$i])**2; $SStot += ($$y[$i] - $mean)**2; } my $R2 = 1 - ($SSerr / $SStot); return $R2; } sub write_data { # Write a data series (array reference) to a file with the specified name, or # 'data.txt' by default my ($data, $filename) = @_; $filename = 'data.txt' if not defined $filename; open my $out, '>', $filename or die "Error: Could not write file $filename\n$!\n"; for my $datum (@$data) { print $out "$datum\n"; } close $out; return $filename; } sub can_rfit { # Determine if a system can run the fitdistrplus R module through the # Statistics::R Perl interface. Load Statistics::R if it can and return 1. # Return 0 otherwise. if (not defined $can_rfit) { eval { require Statistics::R; my $R = Statistics::R->new(); my $ret = $R->run(q`library(fitdistrplus)`); $R->stop(); }; if ($@) { $can_rfit = 0; warn "Skip: ".rfit_msg()."\n"; } else { $can_rfit = 1; } } return $can_rfit; } sub rfit_msg { return "Statistics::R, R or fitdistrplus not found"; } sub error_positions { my ($read) = @_; my ($err_str) = ($read->desc =~ /errors=(\S+)/); my @error_positions; if (defined $err_str) { for my $error (split ',', $err_str) { my ($pos, $type, $res) = ($error =~ m/(\d+)([%+-])([a-z]*)/i); push @error_positions, $pos; } } return @error_positions; } sub test_linear_dist { # Test that the datapoints provided follow a linear distribution my ($values, $want_min, $want_max, $want_slope) = @_; my ($min, $max, $ratio_lo, $ratio_hi, $slope, $chisqpvalue, $chisqtest) = fit_linear($values); is $want_min, $min, 'fitdist() linear'; is $want_max, $max; between_ok( 2, $ratio_lo, $ratio_hi ); between_ok( $slope, (1 - 0.05) * $want_slope, (1 + 0.05) * $want_slope ); # Allow a 5% standard deviation is( $chisqtest, 'not rejected', 'Chi square test') or diag("p-value was: $chisqpvalue"); return 1; } sub test_uniform_dist { # Test that the integer series provided follow a uniform distribution with the # specified minimum and maximum. Note that you probably need over 30-100 # values for the statistical test to work! my ($values, $want_min, $want_max) = @_; my ($min_lo, $min_hi, $max_lo, $max_hi, $chisqpvalue, $chisqtest) = fit_uniform($values, $want_min, $want_max); # Need to be more lenient since fitdistrplus is not too good with integers #between_ok( $want_min, $min_lo, $min_hi ); #between_ok( $want_max, $max_lo, $max_hi ); between_ok( round($want_min), floor($min_lo), ceil($min_hi) ); between_ok( round($want_max), floor($max_lo), ceil($max_hi) ); is( $chisqtest, 'not rejected', 'Chi square test') or diag("p-value was: $chisqpvalue"); return 1; } sub test_normal_dist { # Test that the integer series provided follow a normal distribution with the # specified mean and standard deviation. Note that you probably need over # 30-100 values for the statistical test to work! my ($values, $want_mean, $want_sd, $filename) = @_; my ($mean_lo, $mean_hi, $sd_lo, $sd_hi, $chisqpvalue, $chisqtest) = fit_normal($values, $want_mean, $want_sd); # Need to be more lenient since fitdistrplus is not too good with integers #between_ok( $want_mean, $mean_lo, $mean_hi ); #between_ok( $want_sd , $sd_lo , $sd_hi ); between_ok( round($want_mean), floor($mean_lo), ceil($mean_hi) ); between_ok( round($want_sd ), floor($sd_lo ), ceil($sd_hi ) ); is( $chisqtest, 'not rejected', 'Chi square test') or diag("p-value was: $chisqpvalue"); return 1; } #------------------------------------------------------------------------------# my $niter = 30; # number of iterations to fit the distributions sub fit_linear { my ($values) = @_; # Fit a linear distribution. Since R does not have a linear distribution, use # the beta distribution: # when beta shape1=1 & shape2=2, distribution is linearly decreasing (slope=-2) # when beta shape1=2 & shape2=1, distribution is linearly increasing (slope=2) # Find min and max of series my $min = min(@$values); my $max = max(@$values); # Rescale values between in 0 and 1 instead of min and max my $rescaled_values; if ( ($min == 0) and ($max == 1) ) { $rescaled_values = $values; } else { for my $value (@$values) { push @$rescaled_values, ($value - $min) / ($max - $min); } } # Now we can run fit_beta() my ($shape1_lo, $shape1_hi, $shape2_lo, $shape2_hi, $chisqpvalue, $chisqtest) = fit_beta($values, 2, 1, $min, $max); my $ratio_hi = $shape1_hi / $shape2_lo; my $ratio_lo = $shape2_hi / $shape1_lo; # Calculate the slope my $slope = 2 / ($max - $min); return $min, $max, $ratio_lo, $ratio_hi, $slope, $chisqpvalue, $chisqtest; } sub fit_beta { # Try to fit a beta distribution to a series of data points using a maximum # goodness of fit method. Return the 95% confidence interval for the shape1 # parameter, the shape2 parameter and the results of Chi square statistics. my ($values, $want_shape1, $want_shape2, $want_min, $want_max) = @_; my $break_num = $want_max - $want_min; my $break_size = 1 / $break_num; my $break_start = 0 - $break_size / 2; my $break_end = 1 + $break_size / 2; my $start_p = "start=list(shape1=$want_shape1, shape2=$want_shape2)"; my $breaks_p = "chisqbreaks=seq($break_start, $break_end, $break_size)"; #my $fit_cmd = "f <- fitdist(x, distr='beta', method='mle', $start_p)"; my $fit_cmd = "f <- fitdist(x, distr='beta', method='mge', gof='CvM', $start_p)"; my $boot_cmd = "fb <- bootdist(f, niter=$niter)"; my $gof_cmd = "g <- gofstat(f, $breaks_p)"; my $R = Statistics::R->new(); $R->set('x', $values); $R->run('library(fitdistrplus)'); $R->run($fit_cmd); $R->run($boot_cmd); $R->run($gof_cmd); my $shape1_lo = $R->get('fb$CI[1,2]'); my $shape1_hi = $R->get('fb$CI[1,3]'); my $shape2_lo = $R->get('fb$CI[2,2]'); my $shape2_hi = $R->get('fb$CI[2,3]'); my $chisqpvalue = $R->get('g$chisqpvalue'); my $chisqtest = test_result($chisqpvalue); $R->stop(); return $shape1_lo, $shape1_hi, $shape2_lo, $shape2_hi, $chisqpvalue, $chisqtest; } sub fit_uniform { # Try to fit a uniform distribution to a series of integers using a maximum # goodness of fit method. Return the 95% confidence interval for the mean, # the standard deviation and the results of the Chi square statistics. my ($values, $want_min, $want_max) = @_; my $range_min = min(@$values) - 0.5; my $range_max = max(@$values) + 0.5; my $breaks_p = "chisqbreaks=seq($range_min, $range_max)"; my $start_p = "start=list(min=$want_min, max=$want_max)"; my $fit_cmd = "f <- fitdist(x, distr='unif', method='mge', gof='CvM', $start_p)"; my $boot_cmd = "fb <- bootdist(f, niter=$niter)"; my $gof_cmd = "g <- gofstat(f, $breaks_p)"; my $R = Statistics::R->new(); $R->set('x', $values); $R->run('library(fitdistrplus)'); $R->run($fit_cmd); $R->run($boot_cmd); $R->run($gof_cmd); my $min_lo = $R->get('fb$CI[1,2]'); my $min_hi = $R->get('fb$CI[1,3]'); my $max_lo = $R->get('fb$CI[2,2]'); my $max_hi = $R->get('fb$CI[2,3]'); my $chisqpvalue = $R->get('g$chisqpvalue'); my $chisqtest = test_result($chisqpvalue); $R->stop(); return $min_lo, $min_hi, $max_lo, $max_hi, $chisqpvalue, $chisqtest; } sub fit_normal { # Try to fit a normal distribution to a series of integers using a maximum # likelihood method. Return the 95% confidence interval for the mean, the # standard deviation and the results of the Chi square statistics. my ($values, $want_mean, $want_sd) = @_; my $range_min = min(@$values) - 0.5; my $range_max = max(@$values) + 0.5; my $breaks_p = "chisqbreaks=seq($range_min, $range_max)"; my $start_p = "start=list(mean=$want_mean, sd=$want_sd)"; my $fit_cmd = "f <- fitdist(x, distr='norm', method='mle', $start_p)"; my $boot_cmd = "fb <- bootdist(f, niter=$niter)"; my $gof_cmd = "g <- gofstat(f, $breaks_p)"; my $R = Statistics::R->new(); $R->set('x', $values); $R->run('library(fitdistrplus)'); $R->run($fit_cmd); $R->run($boot_cmd); $R->run($gof_cmd); my $mean_lo = $R->get('fb$CI[1,2]'); my $mean_hi = $R->get('fb$CI[1,3]'); my $sd_lo = $R->get('fb$CI[2,2]'); my $sd_hi = $R->get('fb$CI[2,3]'); my $chisqpvalue = $R->get('g$chisqpvalue'); my $chisqtest = test_result($chisqpvalue); $R->stop(); return $mean_lo, $mean_hi, $sd_lo, $sd_hi, $chisqpvalue, $chisqtest; } sub test_result { # Reject a statistical test if the p value is less than 0.05 my ($p_value) = @_; my $test_result; if ( lc $p_value eq 'nan' ) { $p_value = 1; # probably a very large p value } my $thresh = 0.05; if ($p_value <= $thresh) { $test_result = 'rejected'; } elsif ($p_value > $thresh) { $test_result = 'not rejected'; } else { die "Error: '$p_value' is not a supported p-value\n"; } return $test_result; } 1; Grinder-0.5.4/t/00-load.t0000644000175000017500000000040512263016714015117 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; BEGIN { use_ok('Grinder'); use_ok('Grinder::KmerCollection'); use_ok('Grinder::Database'); use_ok('t::TestUtils'); } diag( "Testing Grinder $Grinder::VERSION, Perl $], $^X" ); done_testing(); Grinder-0.5.4/t/15-multiplex.t0000644000175000017500000000675112647202351016243 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use Test::Warn; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read); # Prepend a single multiplex identifier (MID), ACGT, to shotgun reads warning_like { $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -multiplex_ids => data('mids.fa') , -num_libraries => 1 , -read_dist => 52 , -total_reads => 9 , ) } qr{.*Ignoring extraneous MIDs.*}i, 'Single MID - shotgun'; while ( $read = $factory->next_read ) { is $read->length, 52; is substr($read->seq, 0, 4), 'ACGT'; }; # Prepend two multiplex identifiers, ACGT and AAAATTTT, to shotgun reads ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -multiplex_ids => data('mids.fa') , -num_libraries => 2 , -read_dist => 52 , -total_reads => 10 , ), 'Two MIDs - shotgun'; while ( $read = $factory->next_read ) { is $read->length, 52; like $read->id, qr/^1_/; is substr($read->seq, 0, 4), 'ACGT'; }; $factory->next_lib; while ( $read = $factory->next_read ) { like $read->id, qr/^2_/; is $read->length, 52; is substr($read->seq, 0, 8), 'AAAATTTT'; }; # Prepend a single multiplex identifier to amplicon reads warning_like { $factory = Grinder->new( -reference_file => data('single_amplicon_database.fa'), -multiplex_ids => data('mids.fa') , -num_libraries => 1 , -read_dist => 70 , -total_reads => 10 , -forward_reverse => data('forward_reverse_primers.fa') , -unidirectional => 1 , ) } qr{.*Ignoring extraneous MIDs.*}i, 'Single MID - amplicon'; while ( $read = $factory->next_read ) { is $read->seq, 'ACGTAAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGC'; }; # Request too long of a read warning_like { $factory = Grinder->new( -reference_file => data('single_amplicon_database.fa'), -multiplex_ids => data('mids.fa') , -num_libraries => 1 , -read_dist => 80 , -total_reads => 10 , -forward_reverse => data('forward_reverse_primers.fa') , -unidirectional => 1 , ) } qr{.*Ignoring extraneous MIDs.*}i, 'Single MID - amplicon too long'; while ( $read = $factory->next_read ) { is $read->seq, 'ACGTAAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGT'; }; # Prepend two multiplex identifiers to amplicon reads ok $factory = Grinder->new( -reference_file => data('single_amplicon_database.fa'), -multiplex_ids => data('mids.fa') , -num_libraries => 2 , -shared_perc => 100 , -read_dist => 74 , -total_reads => 10 , -forward_reverse => data('forward_reverse_primers.fa') , -unidirectional => 1 , ), 'Two MIDs - amplicon'; while ( $read = $factory->next_read ) { like $read->id, qr/^1_/; is $read->seq, 'ACGTAAACTUAAAGGAATTGACGGaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaGTACACACCGCCCGT'; }; done_testing(); Grinder-0.5.4/t/25-molecule-type.t0000644000175000017500000000446312263016714017003 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read); my $dna_want_chars = { 'A' => undef, 'C' => undef, 'G' => undef, 'T' => undef, }; my $rna_want_chars = { 'A' => undef, 'C' => undef, 'G' => undef, 'U' => undef, }; my $protein_want_chars = { 'A' => undef, 'R' => undef, 'N' => undef, 'D' => undef, 'C' => undef, 'Q' => undef, 'E' => undef, 'G' => undef, 'H' => undef, 'I' => undef, 'L' => undef, 'K' => undef, 'M' => undef, 'F' => undef, 'P' => undef, 'S' => undef, 'T' => undef, 'W' => undef, 'Y' => undef, 'V' => undef, }; # DNA database ok $factory = Grinder->new( -reference_file => data('database_dna.fa'), -read_dist => 240 , -total_reads => 100 , -mutation_ratio => (100, 0) , -mutation_dist => ('uniform', 20) , ), 'DNA'; is $factory->{alphabet}, 'dna'; while ($read = $factory->next_read) { my $got_chars = get_chars($read->seq); is_deeply $got_chars, $dna_want_chars; } # RNA ok $factory = Grinder->new( -reference_file => data('database_rna.fa'), -read_dist => 240 , -total_reads => 100 , -mutation_ratio => (100, 0) , -mutation_dist => ('uniform', 20) , ), 'RNA'; is $factory->{alphabet}, 'rna'; while ($read = $factory->next_read) { my $got_chars = get_chars($read->seq); is_deeply $got_chars, $rna_want_chars; } # Protein ok $factory = Grinder->new( -reference_file => data('database_protein.fa'), -total_reads => 100 , -read_dist => 240 , -mutation_ratio => (100, 0) , -mutation_dist => ('uniform', 20) , -unidirectional => +1 , ), 'Protein'; is $factory->{alphabet}, 'protein'; while ($read = $factory->next_read) { my $got_chars = get_chars($read->seq); is_deeply $got_chars, $protein_want_chars; } # Mixed ok $factory = Grinder->new( -reference_file => data('database_mixed.fa'), -total_reads => 100 , -unidirectional => +1 , ), 'Mixed'; is $factory->{alphabet}, 'protein'; done_testing(); Grinder-0.5.4/t/06-seed.t0000644000175000017500000000231112636326214015127 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $seed1, $seed2, $seed3, @dataset1, @dataset2); # Seed the pseudo-random number generator ok $factory = Grinder->new( -reference_file => data('shotgun_database_extended.fa'), -random_seed => 1233567890 , -total_reads => 10 , ), 'Set the seed'; ok $seed1 = $factory->get_random_seed(); is $seed1, 1233567890; # Get a seed automatically ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -total_reads => 10 , ), 'Get a seed automatically'; ok $seed2 = $factory->get_random_seed(); cmp_ok $seed2, '>', 0; while (my $read = $factory->next_read) { push @dataset1, $read; } # Specify the same seed ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -total_reads => 10 , -random_seed => $seed2 , ), 'Specify the same seed'; ok $seed3 = $factory->get_random_seed(); is $seed3, $seed2; while (my $read = $factory->next_read) { push @dataset2, $read; } is_deeply \@dataset1, \@dataset2; done_testing(); Grinder-0.5.4/t/17-libraries.t0000644000175000017500000000574512647200675016207 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $lib, $nof_libs, $nof_reads, $read); # Multiple shotgun libraries ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -read_dist => 48 , -num_libraries => 4 , -total_reads => 100 , ), 'Multiple shotgun libraries'; $nof_libs = 0; while ( $lib = $factory->next_lib ) { $nof_libs++; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, undef, $nof_reads, $nof_libs); }; is $nof_reads, 100; } is $nof_libs, 4; # Multiple mate pair libraries ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -total_reads => 100 , -read_dist => 48 , -num_libraries => 4 , -insert_dist => 250 , ), 'Multiple mate pair libraries'; $nof_libs = 0; while ( $lib = $factory->next_lib ) { $nof_libs++; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_mate($read, undef, $nof_reads, $nof_libs); }; is $nof_reads, 100; } is $nof_libs, 4; done_testing(); sub ok_read { my ($read, $req_strand, $nof_reads, $nof_libs) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; my $source = $read->reference->id; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $letters; if ( $source eq 'seq1' ) { $letters = 'a'; } elsif ( $source eq 'seq2' ) { $letters = 'c'; } elsif ( $source eq 'seq3' ) { $letters = 'g'; } elsif ( $source eq 'seq4' ) { $letters = 't'; } elsif ( $source eq 'seq5' ) { $letters = 'atg'; } if ( $req_strand == -1 ) { # Take the reverse complement $letters = Bio::PrimarySeq->new( -seq => $letters )->revcom->seq; } like $read->seq, qr/[$letters]+/; is $read->id, $nof_libs.'_'.$nof_reads; is $read->length, 48; } sub ok_mate { my ($read, $req_strand, $nof_reads, $nof_libs) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; my $source = $read->reference->id; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $letters; if ( $source eq 'seq1' ) { $letters = 'a'; } elsif ( $source eq 'seq2' ) { $letters = 'c'; } elsif ( $source eq 'seq3' ) { $letters = 'g'; } elsif ( $source eq 'seq4' ) { $letters = 't'; } elsif ( $source eq 'seq5' ) { $letters = 'atg'; } if ( $req_strand == -1 ) { # Take the reverse complement $letters = Bio::PrimarySeq->new( -seq => $letters )->revcom->seq; }; like $read->seq, qr/[$letters]+/; my $id = $nof_libs.'_'.round($nof_reads/2).'/'.($nof_reads%2?1:2); is $read->id, $id; is $read->length, 48; } Grinder-0.5.4/t/28-revcom-amplicon.t0000644000175000017500000000634512263016714017316 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read, $nof_reads); # Forward primer only, forward sequencing ok $factory = Grinder->new( -reference_file => data('revcom_amplicon_database.fa'), -forward_reverse => data('forward_primer.fa') , -length_bias => 0 , -unidirectional => 1 , -read_dist => 48 , -total_reads => 100 , ), 'Forward primer only, forward sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, 1, $nof_reads); }; is $nof_reads, 100; # Forward and reverse primers ok $factory = Grinder->new( -reference_file => data('revcom_amplicon_database.fa'), -forward_reverse => data('forward_reverse_primers.fa') , -length_bias => 0 , -unidirectional => 1 , -read_dist => 48 , -total_reads => 100 , ), 'Forward then reverse primers, forward sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, 1, $nof_reads); }; is $nof_reads, 100; # Reverse primer only, reverse sequencing ok $factory = Grinder->new( -reference_file => data('revcom_amplicon_database.fa'), -forward_reverse => data('reverse_primer.fa') , -length_bias => 0 , -unidirectional => -1 , -read_dist => 48 , -total_reads => 100 , ), 'Reverse primer only, reverse sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, -1, $nof_reads); }; is $nof_reads, 100; # Reverse and forward primers, reverse sequencing ok $factory = Grinder->new( -reference_file => data('revcom_amplicon_database.fa') , -forward_reverse => data('reverse_forward_primers.fa'), -length_bias => 0 , -unidirectional => -1 , -read_dist => 48 , -total_reads => 100 , ), 'Reverse then forward primers, reverse sequencing'; ok $factory->next_lib; $nof_reads = 0; while ( $read = $factory->next_read ) { $nof_reads++; ok_read($read, -1, $nof_reads); }; is $nof_reads, 100; done_testing(); sub ok_read { my ($read, $req_strand, $nof_reads) = @_; isa_ok $read, 'Bio::Seq::SimulatedRead'; my $source = $read->reference->id; my $strand = $read->strand; if (not defined $req_strand) { $req_strand = $strand; } else { is $strand, $req_strand; } my $letters; if ( $source =~ m/^seq1/ ) { $letters = 't'; } # Take the if ( $req_strand == 1 ) { # Take the reverse complement (of the reverse-complemented sequence) $letters = Bio::PrimarySeq->new( -seq => $letters )->revcom->seq; }; like $read->seq, qr/[$letters]+/; is $read->id, $nof_reads; is $read->length, 48; } Grinder-0.5.4/t/31-shotgun-chimeras.t0000644000175000017500000000323312263016714017466 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; # These tests are to exercise some edge cases of kmer chimeras. # But let's take the opportunity to do shotgun chimeras my ($factory, $read); my %refs; # Shotgun chimeras ok $factory = Grinder->new( -reference_file => data('shotgun_database_shared_kmers.fa'), -chimera_perc => 100 , -chimera_dist => (1, 1, 1) , -chimera_kmer => 10 , -total_reads => 300 , -diversity => 5 , ), 'Chimera from shotgun library'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); between_ok scalar @refs, 2, 4; for my $ref (@refs) { $refs{$ref}++; } } is $factory->next_lib, undef; # Use only some of the sequences ok $factory = Grinder->new( -reference_file => data('shotgun_database_shared_kmers.fa'), -chimera_perc => 100 , -chimera_dist => (1) , -chimera_kmer => 2 , -total_reads => 300 , -diversity => 3 , # 3 out of 5 reference sequences ), 'Use only some of the reference sequences'; %refs = (); while ( $read = $factory->next_read ) { my @refs = get_references($read); is scalar @refs, 2; for my $ref (@refs) { $refs{$ref}++; } } is scalar keys %refs, 3; is $factory->next_lib, undef; done_testing(); Grinder-0.5.4/t/08-shared.t0000644000175000017500000001460412263016714015464 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, $lib_num, %sources, %shared); # No species shared ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567890 , -abundance_model => 'uniform' , -length_bias => 0 , -total_reads => 100 , -num_libraries => 3 , -shared_perc => 0 , ), 'No species shared'; while ($factory->next_lib) { $lib_num = $factory->{cur_lib}; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$lib_num}{$source} = undef; # Is this source genome shared? $shared{$source} = undef if ( $lib_num == 3 && exists $sources{1}{$source} && exists $sources{2}{$source} ); } }; is scalar keys %sources, 3; is scalar keys %{$sources{1}}, 1; is scalar keys %{$sources{2}}, 1; is scalar keys %{$sources{3}}, 1; is scalar keys %shared, 0; %sources = (); %shared = (); # 50% species shared ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567890 , -abundance_model => 'uniform' , -length_bias => 0 , -total_reads => 100 , -num_libraries => 3 , -shared_perc => 50 , ), '50% species shared'; while ($factory->next_lib) { $lib_num = $factory->{cur_lib}; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$lib_num}{$source} = undef; # Is this source genome shared? $shared{$source} = undef if ( $lib_num == 3 && exists $sources{1}{$source} && exists $sources{2}{$source} ); } }; is scalar keys %sources, 3; is scalar keys %{$sources{1}}, 2; is scalar keys %{$sources{2}}, 2; is scalar keys %{$sources{3}}, 2; is scalar keys %shared, 1; %sources = (); %shared = (); # 66% species shared ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567890 , -abundance_model => 'uniform' , -length_bias => 0 , -total_reads => 100 , -num_libraries => 3 , -shared_perc => 66 , ), '66% species shared'; while ($factory->next_lib) { $lib_num = $factory->{cur_lib}; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$lib_num}{$source} = undef; # Is this source genome shared? $shared{$source} = undef if ( $lib_num == 3 && exists $sources{1}{$source} && exists $sources{2}{$source} ); } }; is scalar keys %sources, 3; is scalar keys %{$sources{1}}, 2; is scalar keys %{$sources{2}}, 2; is scalar keys %{$sources{3}}, 2; is scalar keys %shared, 1; %sources = (); %shared = (); # 67% species shared ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567890 , -abundance_model => 'uniform' , -length_bias => 0 , -total_reads => 100 , -num_libraries => 3 , -shared_perc => 67 , ), '67% species shared'; while ($factory->next_lib) { $lib_num = $factory->{cur_lib}; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$lib_num}{$source} = undef; # Is this source genome shared? $shared{$source} = undef if ( $lib_num == 3 && exists $sources{1}{$source} && exists $sources{2}{$source} ); } }; is scalar keys %sources, 3; is scalar keys %{$sources{1}}, 3; is scalar keys %{$sources{2}}, 3; is scalar keys %{$sources{3}}, 3; is scalar keys %shared, 2; %sources = (); %shared = (); # All species shared ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567890 , -abundance_model => 'uniform' , -length_bias => 0 , -total_reads => 100 , -num_libraries => 3 , -shared_perc => 100 , ), 'All species shared'; while ($factory->next_lib) { $lib_num = $factory->{cur_lib}; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$lib_num}{$source} = undef; # Is this source genome shared? $shared{$source} = undef if ( $lib_num == 3 && exists $sources{1}{$source} && exists $sources{2}{$source} ); } }; is scalar keys %sources, 3; is scalar keys %{$sources{1}}, 5; is scalar keys %{$sources{2}}, 5; is scalar keys %{$sources{3}}, 5; is scalar keys %shared, 5; %sources = (); %shared = (); # Inequal richness ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567890 , -abundance_model => 'uniform' , -total_reads => 100 , -length_bias => 0 , -num_libraries => 2 , -diversity => (3,5) , -shared_perc => 100 , ), 'Inequal richness'; while ($factory->next_lib) { $lib_num = $factory->{cur_lib}; while ( $read = $factory->next_read ) { my $source = $read->reference->id; $sources{$lib_num}{$source} = undef; # Is this source genome shared? $shared{$source} = undef if ( $lib_num == 2 && exists $sources{1}{$source} ); } }; is scalar keys %sources, 2; is scalar keys %{$sources{1}}, 3; is scalar keys %{$sources{2}}, 5; is scalar keys %shared, 3; %sources = (); %shared = (); done_testing(); Grinder-0.5.4/t/23-chimeras.t0000644000175000017500000001220712263016714016003 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $read, $nof_reads, $nof_chimeras, $nof_regulars); my %chim_sizes; # No Chimeras ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 0 , -chimera_dist => (1) , -chimera_kmer => 0 , -total_reads => 100 , ), 'No chimeras'; while ( $read = $factory->next_read ) { is scalar get_references($read), 1; # Remove forward and reverse primer my $seq = $read->seq; $seq = remove_primers($seq, 'AAACT.AAA.GAATTG.CGG', 'G.ACACACCGCCCGT'); # Now the amplicon is simply long homopolymeric sequences like $seq, qr/^(a+|c+|g+|t+)+$/; } # 50% chimeras (bimeras) ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 50 , -chimera_dist => (1) , -chimera_kmer => 0 , -total_reads => 100 , ), '50% chimeras (bimeras)'; while ( $read = $factory->next_read ) { # Remove forward and reverse primer $nof_chimeras += scalar get_references($read); $nof_regulars += scalar get_references($read); } between_ok( $nof_chimeras / $nof_regulars, 0.9, 1.1 ); # 100% chimeras (bimeras) ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (1) , -chimera_kmer => 0 , -total_reads => 100 , ), '100% chimeras (bimeras)'; while ( $read = $factory->next_read ) { # Remove forward and reverse primer is scalar get_references($read), 2; } # 100% chimeras (trimeras) ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (0, 1) , -chimera_kmer => 0 , -total_reads => 100 , ), '100% chimeras (trimeras)'; while ( $read = $factory->next_read ) { # Remove forward and reverse primer is scalar get_references($read), 3; } # 100% chimeras (quadrameras) ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (0, 0, 1) , -chimera_kmer => 0 , -total_reads => 100 , ), '100% chimeras (quadrameras)'; while ( $read = $factory->next_read ) { # Remove forward and reverse primer is scalar get_references($read), 4; } # 100% chimeras (bimeras, trimeras, quadrameras) ok $factory = Grinder->new( -reference_file => data('amplicon_database.fa') , -forward_reverse => data('forward_reverse_primers.fa'), -length_bias => 0 , -unidirectional => 1 , -chimera_perc => 100 , -chimera_dist => (1, 1, 1) , -chimera_kmer => 0 , -total_reads => 1000 , ), '100% chimeras (bimeras, trimeras, quadrameras)'; while ( $read = $factory->next_read ) { # Remove forward and reverse primer my $nof_refs = scalar get_references($read); $chim_sizes{$nof_refs}++; between_ok( $nof_refs, 2, 4 ); } between_ok( $chim_sizes{2}, 333.3 * 0.9, 333.3 * 1.1 ); between_ok( $chim_sizes{3}, 333.3 * 0.9, 333.3 * 1.1 ); between_ok( $chim_sizes{4}, 333.3 * 0.9, 333.3 * 1.1 ); done_testing(); sub remove_primers { my ($seq, $forward_re, $reverse_re) = @_; $seq =~ s/$forward_re//i; $seq =~ s/$reverse_re//i; return $seq; } sub matches_ref { my ($read) = @_; my $read_seq = $read->seq; my $ref_seq = $read->reference->seq; my $matches = 0; if ($ref_seq =~ m/$read_seq/) { $matches = 1; print "$read_seq\nmatches\n$ref_seq\n\n"; } return $matches; } Grinder-0.5.4/t/09-permuted.t0000644000175000017500000001523712263016714016047 0ustar floflooofloflooo#! perl use strict; use warnings; use Test::More; use t::TestUtils; use Grinder; my ($factory, $nof_reads, $read, $lib_num, $ranks1, $ranks2, $ranks3, $rank1_perm); # No species permuted ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1233567890 , -abundance_model => ('powerlaw', 1.8) , -total_reads => 1000 , -num_libraries => 2 , -length_bias => 0 , -shared_perc => 100 , -permuted_perc => 0 , ), 'No species permuted'; ok $factory->next_lib; $ranks1 = get_ranks($factory); is scalar @$ranks1, 5; ok $factory->next_lib; $ranks2 = get_ranks($factory); is scalar @$ranks2, 5; $rank1_perm = 0; compare_ranks( $ranks1, $ranks2, $rank1_perm ); # Cannot have 20% permuted (1 species) because in this permutation method, # top species are permuted amongst the top species # 40% species permuted (2 species permuted) # This test is very sensitive to the seed because when permuting the top 2 # species, 50% of the time, the answer will be (1,2) and the rest of the time, # it will be (2,1) ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 2183567890 , -abundance_model => ('powerlaw', 1.8) , -total_reads => 1000 , -num_libraries => 2 , -length_bias => 0 , -shared_perc => 100 , -permuted_perc => 40 , ), '40% species permuted'; ok $factory->next_lib; $ranks1 = get_ranks($factory); is scalar @$ranks1, 5; ok $factory->next_lib; $ranks2 = get_ranks($factory); is scalar @$ranks2, 5; $rank1_perm = 2; compare_ranks( $ranks1, $ranks2, $rank1_perm ); # 60% species permuted ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1095230708 , -abundance_model => ('powerlaw', 1.8) , -total_reads => 1000 , -num_libraries => 2 , -length_bias => 0 , -shared_perc => 100 , -permuted_perc => 60 , ), '60% species permuted'; ok $factory->next_lib; $ranks1 = get_ranks($factory); is scalar @$ranks1, 5; ok $factory->next_lib; $ranks2 = get_ranks($factory); is scalar @$ranks2, 5; $rank1_perm = 3; compare_ranks( $ranks1, $ranks2, $rank1_perm ); # 80% species permuted ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1095230708 , -abundance_model => ('powerlaw', 1.8) , -total_reads => 1000 , -num_libraries => 2 , -length_bias => 0 , -shared_perc => 100 , -permuted_perc => 80 , ), '80% species permuted'; ok $factory->next_lib; $ranks1 = get_ranks($factory); is scalar @$ranks1, 5; ok $factory->next_lib; $ranks2 = get_ranks($factory); is scalar @$ranks2, 5; $rank1_perm = 4; compare_ranks( $ranks1, $ranks2, $rank1_perm ); # All species permuted ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1933067890 , -abundance_model => ('powerlaw', 1.8) , -total_reads => 1000 , -num_libraries => 2 , -length_bias => 0 , -shared_perc => 100 , -permuted_perc => 100 , ), 'All species permuted'; ok $factory->next_lib; $ranks1 = get_ranks($factory); is scalar @$ranks1, 5; ok $factory->next_lib; $ranks2 = get_ranks($factory); is scalar @$ranks2, 5; $rank1_perm = 5; compare_ranks( $ranks1, $ranks2, $rank1_perm ); # Inequal richness ok $factory = Grinder->new( -reference_file => data('shotgun_database.fa'), -random_seed => 1243567820 , -abundance_model => ('powerlaw', 1.8) , -total_reads => 1000 , -num_libraries => 2 , -length_bias => 0 , -diversity => (3,5) , -shared_perc => 100 , -permuted_perc => 100 , ), 'Inequal richness'; ok $factory->next_lib; $ranks1 = get_ranks($factory); is scalar @$ranks1, 3; ok $factory->next_lib; $ranks2 = get_ranks($factory); is scalar @$ranks2, 5; $rank1_perm = 3; compare_ranks( $ranks1, $ranks2, $rank1_perm ); sub compare_ranks { # Compare genome ranks 2 to genome ranks 1 my ($ranks1, $ranks2, $rank1_perm) = @_; # Top genomes that should be permuted my @perm_ids; if ($rank1_perm == 0) { # nothing to do } elsif ($rank1_perm > 0) { @perm_ids = @$ranks1[0..$rank1_perm-1]; } # Copy arrays my %refs1; for my $rank (1 .. scalar @$ranks1) { $refs1{$$ranks1[$rank-1]} = $rank; } my %refs2; for my $rank (1 .. scalar @$ranks2) { $refs2{$$ranks2[$rank-1]} = $rank; } # Test that permuted genomes have a different rank # Note: This is not foolproof because at high percentage permuted, on samples # with a small richness, the high number of permutations can cause a # permuted genome to end up with the same rank as initially. Need to # play with the seed number to get it right. for my $perm_id ( @perm_ids ) { my $rank1 = $refs1{$perm_id}; my $rank2 = $refs2{$perm_id}; isnt $rank1, $rank2; delete $refs1{$perm_id}; delete $refs2{$perm_id}; } # Now, remaining genomes should have identical ranks (because it is 100% shared) my @refs1 = sort { $refs1{$a} <=> $refs1{$b} } (keys %refs1); my @refs2 = sort { $refs2{$a} <=> $refs2{$b} } (keys %refs2); # Length of the smallest array (number of species shared is relative to ) my $min_arr_len = scalar @refs1 < scalar @refs2 ? scalar @refs1 : scalar @refs2; for my $i (0 .. $min_arr_len - 1) { my $id1 = $refs1[$i]; my $id2 = $refs2[$i]; is $id1, $id2; } return 1; } done_testing(); sub get_ranks { my ($factory) = @_; my %sources; while ( $read = $factory->next_read ) { my $source = $read->reference->id; if (not exists $sources{$source}) { $sources{$source} = 1; } else { $sources{$source}++; } } my @ranks = sort { $sources{$b} <=> $sources{$a} } (keys %sources); return \@ranks } Grinder-0.5.4/META.yml0000644000175000017500000000216412647202457014617 0ustar floflooofloflooo--- abstract: 'A versatile omics shotgun and amplicon sequencing read simulator' author: - 'Florent Angly ' build_requires: ExtUtils::MakeMaker: 6.59 Test::More: 0 Test::Warn: 0 configure_requires: ExtUtils::MakeMaker: 6.59 distribution_type: module dynamic_config: 1 generated_by: 'Module::Install version 1.16' license: gpl3 meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: 1.4 name: Grinder no_index: directory: - inc - t requires: Bio::DB::Fasta: 0 Bio::Location::Split: 0 Bio::PrimarySeq: 0 Bio::Root::Root: 0 Bio::Root::Version: '1.006923' Bio::Seq::SimulatedRead: 0 Bio::SeqFeature::SubSeq: 0 Bio::SeqIO: 0 Bio::Tools::AmpliconSearch: 0 Getopt::Euclid: 0.4.4 List::Util: 0 Math::Random::MT: '1.16' perl: 5.6.0 version: '0.77' resources: bugtracker: http://sourceforge.net/tracker/?group_id=244196&atid=1124737 homepage: http://sourceforge.net/projects/biogrinder/ license: http://opensource.org/licenses/gpl-3.0.html repository: git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder version: '0.005004' Grinder-0.5.4/bin/0000755000175000017500000000000012647202511014102 5ustar flofloooflofloooGrinder-0.5.4/bin/grinder0000755000175000017500000000177112263016714015472 0ustar floflooofloflooo#! /usr/bin/env perl # This file is part of the Grinder package, copyright 2009-2013 # Florent Angly , under the GPLv3 license # Grinder is a program to create artificial random shotgun and amplicon sequence # libraries based on reference genomes in a FASTA file. # Grinder is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # Grinder is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # You should have received a copy of the GNU General Public License # along with Grinder. If not, see . use strict; use warnings; use FindBin qw($Bin); use lib "$Bin"; use Grinder; Grinder::Grinder(@ARGV); exit; Grinder-0.5.4/bin/grinder.pod0000644000175000017500000007403412647202457016261 0ustar floflooofloflooo# This file was generated dynamically by Getopt::Euclid. Do not edit it. =head1 NAME grinder - A versatile omics shotgun and amplicon sequencing read simulator =head1 DESCRIPTION Grinder is a versatile program to create random shotgun and amplicon sequence libraries based on DNA, RNA or proteic reference sequences provided in a FASTA file. Grinder can produce genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, metaproteomic shotgun and amplicon datasets from current sequencing technologies such as Sanger, 454, Illumina. These simulated datasets can be used to test the accuracy of bioinformatic tools under specific hypothesis, e.g. with or without sequencing errors, or with low or high community diversity. Grinder may also be used to help decide between alternative sequencing methods for a sequence-based project, e.g. should the library be paired-end or not, how many reads should be sequenced. Grinder features include: =over =item * shotgun or amplicon read libraries =item * omics support to generate genomic, transcriptomic, proteomic, metagenomic, metatranscriptomic or metaproteomic datasets =item * arbitrary read length distribution and number of reads =item * simulation of PCR and sequencing errors (chimeras, point mutations, homopolymers) =item * support for paired-end (mate pair) datasets =item * specific rank-abundance settings or manually given abundance for each genome, gene or protein =item * creation of datasets with a given richness (alpha diversity) =item * independent datasets can share a variable number of genomes (beta diversity) =item * modeling of the bias created by varying genome lengths or gene copy number =item * profile mechanism to store preferred options =item * available to biologists or power users through multiple interfaces: GUI, CLI and API =back Briefly, given a FASTA file containing reference sequence (genomes, genes, transcripts or proteins), Grinder performs the following steps: =over =item 1. Read the reference sequences, and for amplicon datasets, extracts full-length reference PCR amplicons using the provided degenerate PCR primers. =item 2. Determine the community structure based on the provided alpha diversity (number of reference sequences in the library), beta diversity (number of reference sequences in common between several independent libraries) and specified rank- abundance model. =item 3. Take shotgun reads from the reference sequences or amplicon reads from the full- length reference PCR amplicons. The reads may be paired-end reads when an insert size distribution is specified. The length of the reads depends on the provided read length distribution and their abundance depends on the relative abundance in the community structure. Genome length may also biases the number of reads to take for shotgun datasets at this step. Similarly, for amplicon datasets, the number of copies of the target gene in the reference genomes may bias the number of reads to take. =item 4. Alter reads by inserting sequencing errors (indels, substitutions and homopolymer errors) following a position-specific model to simulate reads created by current sequencing technologies (Sanger, 454, Illumina). Write the reads and their quality scores in FASTA, QUAL and FASTQ files. =back =head1 CITATION If you use Grinder in your research, please cite: Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW (2012), Grinder: a versatile amplicon and shotgun sequence simulator, Nucleic Acids Reseach Available from L. =head1 VERSION This document refers to grinder version 0.5.3 =head1 AUTHOR Florent Angly =head1 INSTALLATION =head2 Dependencies You need to install these dependencies first: =over =item * Perl (>= 5.6) L =item * make Many systems have make installed by default. If your system does not, you should install the implementation of make of your choice, e.g. GNU make: L =back The following CPAN Perl modules are dependencies that will be installed automatically for you: =over =item * Bioperl modules (>=1.6.901). Note that some unreleased Bioperl modules have been included in Grinder. =item * Getopt::Euclid (>= 0.3.4) =item * List::Util First released with Perl v5.7.3 =item * Math::Random::MT (>= 1.13) =item * version (>= 0.77) First released with Perl v5.9.0 =back =head2 Procedure To install Grinder globally on your system, run the following commands in a terminal or command prompt: On Linux, Unix, MacOS: perl Makefile.PL make And finally, with administrator privileges: make install On Windows, run the same commands but with nmake instead of make. =head2 No administrator privileges? If you do not have administrator privileges, Grinder needs to be installed in your home directory. First, follow the instructions to install local::lib at L. After local::lib is installed, every Perl module that you install manually or through the CPAN command-line application will be installed in your home directory. Then, install Grinder by following the instructions detailed in the "Procedure" section. =head1 RUNNING GRINDER After installation, you can run Grinder using a command-line interface (CLI), an application programming interface (API) or a graphical user interface (GUI) in Galaxy. To get the usage of the CLI, type: grinder --help More information, including the documentation of the Grinder API, which allows you to run Grinder from within other Perl programs, is available by typing: perldoc Grinder To run the GUI, refer to the Galaxy documentation at L. The 'utils' folder included in the Grinder package contains some utilities: =over =item average genome size: This calculates the average genome size (in bp) of a simulated random library produced by Grinder. =item change_paired_read_orientation: This reverses the orientation of each second mate-pair read (ID ending in /2) in a FASTA file. =back =head1 REFERENCE SEQUENCE DATABASE A variety of FASTA databases can be used as input for Grinder. For example, the GreenGenes database (L) contains over 180,000 16S rRNA clone sequences from various species which would be appropriate to produce a 16S rRNA amplicon dataset. A set of over 41,000 OTU representative sequences and their affiliation in seven different taxonomic sytems can also be used for the same purpose (L and L). The RDP (L) and Silva (L) databases also provide many 16S rRNA sequences and Silva includes eukaryotic sequences. While 16S rRNA is a popular gene, datasets containing any type of gene could be used in the same fashion to generate simulated amplicon datasets, provided appropriate primers are used. The >2,400 curated microbial genome sequences in the NCBI RefSeq collection (L) would also be suitable for producing 16S rRNA simulated datasets (using the adequate primers). However, the lower diversity of this database compared to the previous two makes it more appropriate for producing artificial microbial metagenomes. Individual genomes from this database are also very suitable for the simulation of single or double-barreled shotgun libraries. Similarly, the RefSeq database contains over 3,100 curated viral sequences (L) which can be used to produce artificial viral metagenomes. Quite a few eukaryotic organisms have been sequenced and their genome or genes can be the basis for simulating genomic, transcriptomic (RNA-seq) or proteomic datasets. For example, you can use the human genome available at L, the human transcripts downloadable from L or the human proteome at L. =head1 CLI EXAMPLES Here are a few examples that illustrate the use of Grinder in a terminal: =over =item 1. A shotgun DNA library with a coverage of 0.1X grinder -reference_file genomes.fna -coverage_fold 0.1 =item 2. Same thing but save the result files in a specific folder and with a specific name grinder -reference_file genomes.fna -coverage_fold 0.1 -base_name my_name -output_dir my_dir =item 3. A DNA shotgun library with 1000 reads grinder -reference_file genomes.fna -total_reads 1000 =item 4. A DNA shotgun library where species are distributed according to a power law grinder -reference_file genomes.fna -abundance_model powerlaw 0.1 =item 5. A DNA shotgun library with 123 genomes taken random from the given genomes grinder -reference_file genomes.fna -diversity 123 =item 6. Two DNA shotgun libraries that have 50% of the species in common grinder -reference_file genomes.fna -num_libraries 2 -shared_perc 50 =item 7. Two DNA shotgun library with no species in common and distributed according to a exponential rank-abundance model. Note that because the parameter value for the exponential model is omitted, each library uses a different randomly chosen value: grinder -reference_file genomes.fna -num_libraries 2 -abundance_model exponential =item 8. A DNA shotgun library where species relative abundances are manually specified grinder -reference_file genomes.fna -abundance_file my_abundances.txt =item 9. A DNA shotgun library with Sanger reads grinder -reference_file genomes.fna -read_dist 800 -mutation_dist linear 1 2 -mutation_ratio 80 20 =item 10. A DNA shotgun library with first-generation 454 reads grinder -reference_file genomes.fna -read_dist 100 normal 10 -homopolymer_dist balzer =item 11. A paired-end DNA shotgun library, where the insert size is normally distributed around 2.5 kbp and has 0.2 kbp standard deviation grinder -reference_file genomes.fna -insert_dist 2500 normal 200 =item 12. A transcriptomic dataset grinder -reference_file transcripts.fna =item 13. A unidirectional transcriptomic dataset grinder -reference_file transcripts.fna -unidirectional 1 Note the use of -unidirectional 1 to prevent reads to be taken from the reverse- complement of the reference sequences. =item 14. A proteomic dataset grinder -reference_file proteins.faa -unidirectional 1 =item 15. A 16S rRNA amplicon library grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 Note the use of -length_bias 0 because reference sequence length should not affect the relative abundance of amplicons. =item 16. The same amplicon library with 20% of chimeric reads (90% bimera, 10% trimera) grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -chimera_perc 20 -chimera_dist 90 10 =item 17. Three 16S rRNA amplicon libraries with specified MIDs and no reference sequences in common grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -num_libraries 3 -multiplex_ids MIDs.fna =item 18. Reading reference sequences from the standard input, which allows you to decompress FASTA files on the fly: zcat microbial_db.fna.gz | grinder -reference_file - -total_reads 100 =back =head1 CLI REQUIRED ARGUMENTS =over =item -rf | -reference_file | -gf | -genome_file FASTA file that contains the input reference sequences (full genomes, 16S rRNA genes, transcripts, proteins...) or '-' to read them from the standard input. See the README file for examples of databases you can use and where to get them from. Default: - =back =head1 CLI OPTIONAL ARGUMENTS =over =item -tr | -total_reads Number of shotgun or amplicon reads to generate for each library. Do not specify this if you specify the fold coverage. Default: 100 =item -cf | -coverage_fold Desired fold coverage of the input reference sequences (the output FASTA length divided by the input FASTA length). Do not specify this if you specify the number of reads directly. =item -rd ... | -read_dist ... Desired shotgun or amplicon read length distribution specified as: average length, distribution ('uniform' or 'normal') and standard deviation. Only the first element is required. Examples: All reads exactly 101 bp long (Illumina GA 2x): 101 Uniform read distribution around 100+-10 bp: 100 uniform 10 Reads normally distributed with an average of 800 and a standard deviation of 100 bp (Sanger reads): 800 normal 100 Reads normally distributed with an average of 450 and a standard deviation of 50 bp (454 GS-FLX Ti): 450 normal 50 Reference sequences smaller than the specified read length are not used. Default: 100 =item -id ... | -insert_dist ... Create paired-end or mate-pair reads spanning the given insert length. Important: the insert is defined in the biological sense, i.e. its length includes the length of both reads and of the stretch of DNA between them: 0 : off, or: insert size distribution in bp, in the same format as the read length distribution (a typical value is 2,500 bp for mate pairs) Two distinct reads are generated whether or not the mate pair overlaps. Default: 0 =item -mo | -mate_orientation When generating paired-end or mate-pair reads (see ), specify the orientation of the reads (F: forward, R: reverse): FR: ---> <--- e.g. Sanger, Illumina paired-end, IonTorrent mate-pair FF: ---> ---> e.g. 454 RF: <--- ---> e.g. Illumina mate-pair RR: <--- <--- Default: FR =item -ec | -exclude_chars Do not create reads containing any of the specified characters (case insensitive). For example, use 'NX' to prevent reads with ambiguities (N or X). Grinder will error if it fails to find a suitable read (or pair of reads) after 10 attempts. Consider using , which may be more appropriate for your case. Default: '' =item -dc | -delete_chars Remove the specified characters from the reference sequences (case-insensitive), e.g. '-~*' to remove gaps (- or ~) or terminator (*). Removing these characters is done once, when reading the reference sequences, prior to taking reads. Hence it is more efficient than . Default: =item -fr | -forward_reverse Use DNA amplicon sequencing using a forward and reverse PCR primer sequence provided in a FASTA file. The reference sequences and their reverse complement will be searched for PCR primer matches. The primer sequences should use the IUPAC convention for degenerate residues and the reference sequences that that do not match the specified primers are excluded. If your reference sequences are full genomes, it is recommended to use = 1 and = 0 to generate amplicon reads. To sequence from the forward strand, set to 1 and put the forward primer first and reverse primer second in the FASTA file. To sequence from the reverse strand, invert the primers in the FASTA file and use = -1. The second primer sequence in the FASTA file is always optional. Example: AAACTYAAAKGAATTGRCGG and ACGGGCGGTGTGTRC for the 926F and 1392R primers that target the V6 to V9 region of the 16S rRNA gene. =item -un | -unidirectional Instead of producing reads bidirectionally, from the reference strand and its reverse complement, proceed unidirectionally, from one strand only (forward or reverse). Values: 0 (off, i.e. bidirectional), 1 (forward), -1 (reverse). Use = 1 for amplicon and strand-specific transcriptomic or proteomic datasets. Default: 0 =item -lb | -length_bias In shotgun libraries, sample reference sequences proportionally to their length. For example, in simulated microbial datasets, this means that at the same relative abundance, larger genomes contribute more reads than smaller genomes (and all genomes have the same fold coverage). 0 = no, 1 = yes. Default: 1 =item -cb | -copy_bias In amplicon libraries where full genomes are used as input, sample species proportionally to the number of copies of the target gene: at equal relative abundance, genomes that have multiple copies of the target gene contribute more amplicon reads than genomes that have a single copy. 0 = no, 1 = yes. Default: 1 =item -md ... | -mutation_dist ... Introduce sequencing errors in the reads, under the form of mutations (substitutions, insertions and deletions) at positions that follow a specified distribution (with replacement): model (uniform, linear, poly4), model parameters. For example, for a uniform 0.1% error rate, use: uniform 0.1. To simulate Sanger errors, use a linear model where the errror rate is 1% at the 5' end of reads and 2% at the 3' end: linear 1 2. To model Illumina errors using the 4th degree polynome 3e-3 + 3.3e-8 * i^4 (Korbel et al 2009), use: poly4 3e-3 3.3e-8. Use the option to alter how many of these mutations are substitutions or indels. Default: uniform 0 0 =item -mr ... | -mutation_ratio ... Indicate the percentage of substitutions and the number of indels (insertions and deletions). For example, use '80 20' (4 substitutions for each indel) for Sanger reads. Note that this parameter has no effect unless you specify the option. Default: 80 20 =item -hd | -homopolymer_dist Introduce sequencing errors in the reads under the form of homopolymeric stretches (e.g. AAA, CCCCC) using a specified model where the homopolymer length follows a normal distribution N(mean, standard deviation) that is function of the homopolymer length n: Margulies: N(n, 0.15 * n) , Margulies et al. 2005. Richter : N(n, 0.15 * sqrt(n)) , Richter et al. 2008. Balzer : N(n, 0.03494 + n * 0.06856) , Balzer et al. 2010. Default: 0 =item -cp | -chimera_perc Specify the percent of reads in amplicon libraries that should be chimeric sequences. The 'reference' field in the description of chimeric reads will contain the ID of all the reference sequences forming the chimeric template. A typical value is 10% for amplicons. This option can be used to generate chimeric shotgun reads as well. Default: 0 % =item -cd ... | -chimera_dist ... Specify the distribution of chimeras: bimeras, trimeras, quadrameras and multimeras of higher order. The default is the average values from Quince et al. 2011: '314 38 1', which corresponds to 89% of bimeras, 11% of trimeras and 0.3% of quadrameras. Note that this option only takes effect when you request the generation of chimeras with the option. Default: 314 38 1 =item -ck | -chimera_kmer Activate a method to form chimeras by picking breakpoints at places where k-mers are shared between sequences. represents k, the length of the k-mers (in bp). The longer the kmer, the more similar the sequences have to be to be eligible to form chimeras. The more frequent a k-mer is in the pool of reference sequences (taking into account their relative abundance), the more often this k-mer will be chosen. For example, CHSIM (Edgar et al. 2011) uses this method with a k-mer length of 10 bp. If you do not want to use k-mer information to form chimeras, use 0, which will result in the reference sequences and breakpoints to be taken randomly on the "aligned" reference sequences. Note that this option only takes effect when you request the generation of chimeras with the option. Also, this options is quite memory intensive, so you should probably limit yourself to a relatively small number of reference sequences if you want to use it. Default: 10 bp =item -af | -abundance_file Specify the relative abundance of the reference sequences manually in an input file. Each line of the file should contain a sequence name and its relative abundance (%), e.g. 'seqABC 82.1' or 'seqABC 82.1 10.2' if you are specifying two different libraries. =item -am ... | -abundance_model ... Relative abundance model for the input reference sequences: uniform, linear, powerlaw, logarithmic or exponential. The uniform and linear models do not require a parameter, but the other models take a parameter in the range [0, infinity). If this parameter is not specified, then it is randomly chosen. Examples: uniform distribution: uniform powerlaw distribution with parameter 0.1: powerlaw 0.1 exponential distribution with automatically chosen parameter: exponential Default: uniform 1 =item -nl | -num_libraries Number of independent libraries to create. Specify how diverse and similar they should be with , and . Assign them different MID tags with . Default: 1 =item -mi | -multiplex_ids Specify an optional FASTA file that contains multiplex sequence identifiers (a.k.a MIDs or barcodes) to add to the sequences (one sequence per library, in the order given). The MIDs are included in the length specified with the -read_dist option and can be altered by sequencing errors. See the MIDesigner or BarCrawl programs to generate MID sequences. =item -di ... | -diversity ... This option specifies alpha diversity, specifically the richness, i.e. number of reference sequences to take randomly and include in each library. Use 0 for the maximum richness possible (based on the number of reference sequences available). Provide one value to make all libraries have the same diversity, or one richness value per library otherwise. Default: 0 =item -sp | -shared_perc This option controls an aspect of beta-diversity. When creating multiple libraries, specify the percent of reference sequences they should have in common (relative to the diversity of the least diverse library). Default: 0 % =item -pp | -permuted_perc This option controls another aspect of beta-diversity. For multiple libraries, choose the percent of the most-abundant reference sequences to permute (randomly shuffle) the rank-abundance of. Default: 100 % =item -rs | -random_seed Seed number to use for the pseudo-random number generator. =item -dt | -desc_track Track read information (reference sequence, position, errors, ...) by writing it in the read description. Default: 1 =item -ql ... | -qual_levels ... Generate basic quality scores for the simulated reads. Good residues are given a specified good score (e.g. 30) and residues that are the result of an insertion or substitution are given a specified bad score (e.g. 10). Specify first the good score and then the bad score on the command-line, e.g.: 30 10. Default: =item -fq | -fastq_output Whether to write the generated reads in FASTQ format (with Sanger-encoded quality scores) instead of FASTA and QUAL or not (1: yes, 0: no). need to be specified for this option to be effective. Default: 0 =item -bn | -base_name Prefix of the output files. Default: grinder =item -od | -output_dir Directory where the results should be written. This folder will be created if needed. Default: . =item -pf | -profile_file A file that contains Grinder arguments. This is useful if you use many options or often use the same options. Lines with comments (#) are ignored. Consider the profile file, 'simple_profile.txt': # A simple Grinder profile -read_dist 105 normal 12 -total_reads 1000 Running: grinder -reference_file viral_genomes.fa -profile_file simple_profile.txt Translates into: grinder -reference_file viral_genomes.fa -read_dist 105 normal 12 -total_reads 1000 Note that the arguments specified in the profile should not be specified again on the command line. =back =head1 CLI OUTPUT For each shotgun or amplicon read library requested, the following files are generated: =over =item * A rank-abundance file, tab-delimited, that shows the relative abundance of the different reference sequences =item * A file containing the read sequences in FASTA format. The read headers contain information necessary to track from which reference sequence each read was taken and what errors it contains. This file is not generated if option was provided. =item * If the option was specified, a file containing the quality scores of the reads (in QUAL format). =item * If the option was provided, a file containing the read sequences in FASTQ format. =back =head1 API EXAMPLES The Grinder API allows to conveniently use Grinder within Perl scripts. Here is a synopsis: use Grinder; # Set up a new factory (see the OPTIONS section for a complete list of parameters) my $factory = Grinder->new( -reference_file => 'genomes.fna' ); # Process all shotgun libraries requested while ( my $struct = $factory->next_lib ) { # The ID and abundance of the 3rd most abundant genome in this community my $id = $struct->{ids}->[2]; my $ab = $struct->{abs}->[2]; # Create shotgun reads while ( my $read = $factory->next_read) { # The read is a Bioperl sequence object with these properties: my $read_id = $read->id; # read ID given by Grinder my $read_seq = $read->seq; # nucleotide sequence my $read_mid = $read->mid; # MID or tag attached to the read my $read_errors = $read->errors; # errors that the read contains # Where was the read taken from? The reference sequence refers to the # database sequence for shotgun libraries, amplicon obtained from the # database sequence, or could even be a chimeric sequence my $ref_id = $read->reference->id; # ID of the reference sequence my $ref_start = $read->start; # start of the read on the reference my $ref_end = $read->end; # end of the read on the reference my $ref_strand = $read->strand; # strand of the reference } } # Similarly, for shotgun mate pairs my $factory = Grinder->new( -reference_file => 'genomes.fna', -insert_dist => 250 ); while ( $factory->next_lib ) { while ( my $read = $factory->next_read ) { # The first read is the first mate of the mate pair # The second read is the second mate of the mate pair # The third read is the first mate of the next mate pair # ... } } # To generate an amplicon library my $factory = Grinder->new( -reference_file => 'genomes.fna', -forward_reverse => '16Sgenes.fna', -length_bias => 0, -unidirectional => 1 ); while ( $factory->next_lib ) { while ( my $read = $factory->next_read) { # ... } } =head1 API METHODS The rest of the documentation details the available Grinder API methods. =head2 new Title : new Function: Create a new Grinder factory initialized with the passed arguments. Available parameters described in the OPTIONS section. Usage : my $factory = Grinder->new( -reference_file => 'genomes.fna' ); Returns : a new Grinder object =head2 next_lib Title : next_lib Function: Go to the next shotgun library to process. Usage : my $struct = $factory->next_lib; Returns : Community structure to be used for this library, where $struct->{ids} is an array reference containing the IDs of the genome making up the community (sorted by decreasing relative abundance) and $struct->{abs} is an array reference of the genome abundances (in the same order as the IDs). =head2 next_read Title : next_read Function: Create an amplicon or shotgun read for the current library. Usage : my $read = $factory->next_read; # for single read my $mate1 = $factory->next_read; # for mate pairs my $mate2 = $factory->next_read; Returns : A sequence represented as a Bio::Seq::SimulatedRead object =head2 get_random_seed Title : get_random_seed Function: Return the number used to seed the pseudo-random number generator Usage : my $seed = $factory->get_random_seed; Returns : seed number =head1 COPYRIGHT Copyright 2009-2013 Florent ANGLY Grinder is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Grinder. If not, see . =head1 BUGS All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: L Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge (L) and is under Git revision control. To get started with a patch, do: git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder Grinder-0.5.4/README0000644000175000017500000010027412647202457014227 0ustar flofloooflofloooNAME grinder - A versatile omics shotgun and amplicon sequencing read simulator DESCRIPTION Grinder is a versatile program to create random shotgun and amplicon sequence libraries based on DNA, RNA or proteic reference sequences provided in a FASTA file. Grinder can produce genomic, metagenomic, transcriptomic, metatranscriptomic, proteomic, metaproteomic shotgun and amplicon datasets from current sequencing technologies such as Sanger, 454, Illumina. These simulated datasets can be used to test the accuracy of bioinformatic tools under specific hypothesis, e.g. with or without sequencing errors, or with low or high community diversity. Grinder may also be used to help decide between alternative sequencing methods for a sequence-based project, e.g. should the library be paired-end or not, how many reads should be sequenced. Grinder features include: * shotgun or amplicon read libraries * omics support to generate genomic, transcriptomic, proteomic, metagenomic, metatranscriptomic or metaproteomic datasets * arbitrary read length distribution and number of reads * simulation of PCR and sequencing errors (chimeras, point mutations, homopolymers) * support for paired-end (mate pair) datasets * specific rank-abundance settings or manually given abundance for each genome, gene or protein * creation of datasets with a given richness (alpha diversity) * independent datasets can share a variable number of genomes (beta diversity) * modeling of the bias created by varying genome lengths or gene copy number * profile mechanism to store preferred options * available to biologists or power users through multiple interfaces: GUI, CLI and API Briefly, given a FASTA file containing reference sequence (genomes, genes, transcripts or proteins), Grinder performs the following steps: 1. Read the reference sequences, and for amplicon datasets, extracts full-length reference PCR amplicons using the provided degenerate PCR primers. 2. Determine the community structure based on the provided alpha diversity (number of reference sequences in the library), beta diversity (number of reference sequences in common between several independent libraries) and specified rank- abundance model. 3. Take shotgun reads from the reference sequences or amplicon reads from the full- length reference PCR amplicons. The reads may be paired-end reads when an insert size distribution is specified. The length of the reads depends on the provided read length distribution and their abundance depends on the relative abundance in the community structure. Genome length may also biases the number of reads to take for shotgun datasets at this step. Similarly, for amplicon datasets, the number of copies of the target gene in the reference genomes may bias the number of reads to take. 4. Alter reads by inserting sequencing errors (indels, substitutions and homopolymer errors) following a position-specific model to simulate reads created by current sequencing technologies (Sanger, 454, Illumina). Write the reads and their quality scores in FASTA, QUAL and FASTQ files. CITATION If you use Grinder in your research, please cite: Angly FE, Willner D, Rohwer F, Hugenholtz P, Tyson GW (2012), Grinder: a versatile amplicon and shotgun sequence simulator, Nucleic Acids Reseach Available from . VERSION This document refers to grinder version 0.5.3 AUTHOR Florent Angly INSTALLATION Dependencies You need to install these dependencies first: * Perl (>= 5.6) * make Many systems have make installed by default. If your system does not, you should install the implementation of make of your choice, e.g. GNU make: The following CPAN Perl modules are dependencies that will be installed automatically for you: * Bioperl modules (>=1.6.901). Note that some unreleased Bioperl modules have been included in Grinder. * Getopt::Euclid (>= 0.3.4) * List::Util First released with Perl v5.7.3 * Math::Random::MT (>= 1.13) * version (>= 0.77) First released with Perl v5.9.0 Procedure To install Grinder globally on your system, run the following commands in a terminal or command prompt: On Linux, Unix, MacOS: perl Makefile.PL make And finally, with administrator privileges: make install On Windows, run the same commands but with nmake instead of make. No administrator privileges? If you do not have administrator privileges, Grinder needs to be installed in your home directory. First, follow the instructions to install local::lib at . After local::lib is installed, every Perl module that you install manually or through the CPAN command-line application will be installed in your home directory. Then, install Grinder by following the instructions detailed in the "Procedure" section. RUNNING GRINDER After installation, you can run Grinder using a command-line interface (CLI), an application programming interface (API) or a graphical user interface (GUI) in Galaxy. To get the usage of the CLI, type: grinder --help More information, including the documentation of the Grinder API, which allows you to run Grinder from within other Perl programs, is available by typing: perldoc Grinder To run the GUI, refer to the Galaxy documentation at . The 'utils' folder included in the Grinder package contains some utilities: average genome size: This calculates the average genome size (in bp) of a simulated random library produced by Grinder. change_paired_read_orientation: This reverses the orientation of each second mate-pair read (ID ending in /2) in a FASTA file. REFERENCE SEQUENCE DATABASE A variety of FASTA databases can be used as input for Grinder. For example, the GreenGenes database () contains over 180,000 16S rRNA clone sequences from various species which would be appropriate to produce a 16S rRNA amplicon dataset. A set of over 41,000 OTU representative sequences and their affiliation in seven different taxonomic sytems can also be used for the same purpose ( and ). The RDP () and Silva () databases also provide many 16S rRNA sequences and Silva includes eukaryotic sequences. While 16S rRNA is a popular gene, datasets containing any type of gene could be used in the same fashion to generate simulated amplicon datasets, provided appropriate primers are used. The >2,400 curated microbial genome sequences in the NCBI RefSeq collection () would also be suitable for producing 16S rRNA simulated datasets (using the adequate primers). However, the lower diversity of this database compared to the previous two makes it more appropriate for producing artificial microbial metagenomes. Individual genomes from this database are also very suitable for the simulation of single or double-barreled shotgun libraries. Similarly, the RefSeq database contains over 3,100 curated viral sequences () which can be used to produce artificial viral metagenomes. Quite a few eukaryotic organisms have been sequenced and their genome or genes can be the basis for simulating genomic, transcriptomic (RNA-seq) or proteomic datasets. For example, you can use the human genome available at , the human transcripts downloadable from or the human proteome at . CLI EXAMPLES Here are a few examples that illustrate the use of Grinder in a terminal: 1. A shotgun DNA library with a coverage of 0.1X grinder -reference_file genomes.fna -coverage_fold 0.1 2. Same thing but save the result files in a specific folder and with a specific name grinder -reference_file genomes.fna -coverage_fold 0.1 -base_name my_name -output_dir my_dir 3. A DNA shotgun library with 1000 reads grinder -reference_file genomes.fna -total_reads 1000 4. A DNA shotgun library where species are distributed according to a power law grinder -reference_file genomes.fna -abundance_model powerlaw 0.1 5. A DNA shotgun library with 123 genomes taken random from the given genomes grinder -reference_file genomes.fna -diversity 123 6. Two DNA shotgun libraries that have 50% of the species in common grinder -reference_file genomes.fna -num_libraries 2 -shared_perc 50 7. Two DNA shotgun library with no species in common and distributed according to a exponential rank-abundance model. Note that because the parameter value for the exponential model is omitted, each library uses a different randomly chosen value: grinder -reference_file genomes.fna -num_libraries 2 -abundance_model exponential 8. A DNA shotgun library where species relative abundances are manually specified grinder -reference_file genomes.fna -abundance_file my_abundances.txt 9. A DNA shotgun library with Sanger reads grinder -reference_file genomes.fna -read_dist 800 -mutation_dist linear 1 2 -mutation_ratio 80 20 10. A DNA shotgun library with first-generation 454 reads grinder -reference_file genomes.fna -read_dist 100 normal 10 -homopolymer_dist balzer 11. A paired-end DNA shotgun library, where the insert size is normally distributed around 2.5 kbp and has 0.2 kbp standard deviation grinder -reference_file genomes.fna -insert_dist 2500 normal 200 12. A transcriptomic dataset grinder -reference_file transcripts.fna 13. A unidirectional transcriptomic dataset grinder -reference_file transcripts.fna -unidirectional 1 Note the use of -unidirectional 1 to prevent reads to be taken from the reverse- complement of the reference sequences. 14. A proteomic dataset grinder -reference_file proteins.faa -unidirectional 1 15. A 16S rRNA amplicon library grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 Note the use of -length_bias 0 because reference sequence length should not affect the relative abundance of amplicons. 16. The same amplicon library with 20% of chimeric reads (90% bimera, 10% trimera) grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -chimera_perc 20 -chimera_dist 90 10 17. Three 16S rRNA amplicon libraries with specified MIDs and no reference sequences in common grinder -reference_file 16Sgenes.fna -forward_reverse 16Sprimers.fna -length_bias 0 -unidirectional 1 -num_libraries 3 -multiplex_ids MIDs.fna 18. Reading reference sequences from the standard input, which allows you to decompress FASTA files on the fly: zcat microbial_db.fna.gz | grinder -reference_file - -total_reads 100 CLI REQUIRED ARGUMENTS -rf | -reference_file | -gf | -genome_file FASTA file that contains the input reference sequences (full genomes, 16S rRNA genes, transcripts, proteins...) or '-' to read them from the standard input. See the README file for examples of databases you can use and where to get them from. Default: - CLI OPTIONAL ARGUMENTS -tr | -total_reads Number of shotgun or amplicon reads to generate for each library. Do not specify this if you specify the fold coverage. Default: 100 -cf | -coverage_fold Desired fold coverage of the input reference sequences (the output FASTA length divided by the input FASTA length). Do not specify this if you specify the number of reads directly. -rd ... | -read_dist ... Desired shotgun or amplicon read length distribution specified as: average length, distribution ('uniform' or 'normal') and standard deviation. Only the first element is required. Examples: All reads exactly 101 bp long (Illumina GA 2x): 101 Uniform read distribution around 100+-10 bp: 100 uniform 10 Reads normally distributed with an average of 800 and a standard deviation of 100 bp (Sanger reads): 800 normal 100 Reads normally distributed with an average of 450 and a standard deviation of 50 bp (454 GS-FLX Ti): 450 normal 50 Reference sequences smaller than the specified read length are not used. Default: 100 -id ... | -insert_dist ... Create paired-end or mate-pair reads spanning the given insert length. Important: the insert is defined in the biological sense, i.e. its length includes the length of both reads and of the stretch of DNA between them: 0 : off, or: insert size distribution in bp, in the same format as the read length distribution (a typical value is 2,500 bp for mate pairs) Two distinct reads are generated whether or not the mate pair overlaps. Default: 0 -mo | -mate_orientation When generating paired-end or mate-pair reads (see ), specify the orientation of the reads (F: forward, R: reverse): FR: ---> <--- e.g. Sanger, Illumina paired-end, IonTorrent mate-pair FF: ---> ---> e.g. 454 RF: <--- ---> e.g. Illumina mate-pair RR: <--- <--- Default: FR -ec | -exclude_chars Do not create reads containing any of the specified characters (case insensitive). For example, use 'NX' to prevent reads with ambiguities (N or X). Grinder will error if it fails to find a suitable read (or pair of reads) after 10 attempts. Consider using , which may be more appropriate for your case. Default: '' -dc | -delete_chars Remove the specified characters from the reference sequences (case-insensitive), e.g. '-~*' to remove gaps (- or ~) or terminator (*). Removing these characters is done once, when reading the reference sequences, prior to taking reads. Hence it is more efficient than . Default: -fr | -forward_reverse Use DNA amplicon sequencing using a forward and reverse PCR primer sequence provided in a FASTA file. The reference sequences and their reverse complement will be searched for PCR primer matches. The primer sequences should use the IUPAC convention for degenerate residues and the reference sequences that that do not match the specified primers are excluded. If your reference sequences are full genomes, it is recommended to use = 1 and = 0 to generate amplicon reads. To sequence from the forward strand, set to 1 and put the forward primer first and reverse primer second in the FASTA file. To sequence from the reverse strand, invert the primers in the FASTA file and use = -1. The second primer sequence in the FASTA file is always optional. Example: AAACTYAAAKGAATTGRCGG and ACGGGCGGTGTGTRC for the 926F and 1392R primers that target the V6 to V9 region of the 16S rRNA gene. -un | -unidirectional Instead of producing reads bidirectionally, from the reference strand and its reverse complement, proceed unidirectionally, from one strand only (forward or reverse). Values: 0 (off, i.e. bidirectional), 1 (forward), -1 (reverse). Use = 1 for amplicon and strand-specific transcriptomic or proteomic datasets. Default: 0 -lb | -length_bias In shotgun libraries, sample reference sequences proportionally to their length. For example, in simulated microbial datasets, this means that at the same relative abundance, larger genomes contribute more reads than smaller genomes (and all genomes have the same fold coverage). 0 = no, 1 = yes. Default: 1 -cb | -copy_bias In amplicon libraries where full genomes are used as input, sample species proportionally to the number of copies of the target gene: at equal relative abundance, genomes that have multiple copies of the target gene contribute more amplicon reads than genomes that have a single copy. 0 = no, 1 = yes. Default: 1 -md ... | -mutation_dist ... Introduce sequencing errors in the reads, under the form of mutations (substitutions, insertions and deletions) at positions that follow a specified distribution (with replacement): model (uniform, linear, poly4), model parameters. For example, for a uniform 0.1% error rate, use: uniform 0.1. To simulate Sanger errors, use a linear model where the errror rate is 1% at the 5' end of reads and 2% at the 3' end: linear 1 2. To model Illumina errors using the 4th degree polynome 3e-3 + 3.3e-8 * i^4 (Korbel et al 2009), use: poly4 3e-3 3.3e-8. Use the option to alter how many of these mutations are substitutions or indels. Default: uniform 0 0 -mr ... | -mutation_ratio ... Indicate the percentage of substitutions and the number of indels (insertions and deletions). For example, use '80 20' (4 substitutions for each indel) for Sanger reads. Note that this parameter has no effect unless you specify the option. Default: 80 20 -hd | -homopolymer_dist Introduce sequencing errors in the reads under the form of homopolymeric stretches (e.g. AAA, CCCCC) using a specified model where the homopolymer length follows a normal distribution N(mean, standard deviation) that is function of the homopolymer length n: Margulies: N(n, 0.15 * n) , Margulies et al. 2005. Richter : N(n, 0.15 * sqrt(n)) , Richter et al. 2008. Balzer : N(n, 0.03494 + n * 0.06856) , Balzer et al. 2010. Default: 0 -cp | -chimera_perc Specify the percent of reads in amplicon libraries that should be chimeric sequences. The 'reference' field in the description of chimeric reads will contain the ID of all the reference sequences forming the chimeric template. A typical value is 10% for amplicons. This option can be used to generate chimeric shotgun reads as well. Default: 0 % -cd ... | -chimera_dist ... Specify the distribution of chimeras: bimeras, trimeras, quadrameras and multimeras of higher order. The default is the average values from Quince et al. 2011: '314 38 1', which corresponds to 89% of bimeras, 11% of trimeras and 0.3% of quadrameras. Note that this option only takes effect when you request the generation of chimeras with the option. Default: 314 38 1 -ck | -chimera_kmer Activate a method to form chimeras by picking breakpoints at places where k-mers are shared between sequences. represents k, the length of the k-mers (in bp). The longer the kmer, the more similar the sequences have to be to be eligible to form chimeras. The more frequent a k-mer is in the pool of reference sequences (taking into account their relative abundance), the more often this k-mer will be chosen. For example, CHSIM (Edgar et al. 2011) uses this method with a k-mer length of 10 bp. If you do not want to use k-mer information to form chimeras, use 0, which will result in the reference sequences and breakpoints to be taken randomly on the "aligned" reference sequences. Note that this option only takes effect when you request the generation of chimeras with the option. Also, this options is quite memory intensive, so you should probably limit yourself to a relatively small number of reference sequences if you want to use it. Default: 10 bp -af | -abundance_file Specify the relative abundance of the reference sequences manually in an input file. Each line of the file should contain a sequence name and its relative abundance (%), e.g. 'seqABC 82.1' or 'seqABC 82.1 10.2' if you are specifying two different libraries. -am ... | -abundance_model ... Relative abundance model for the input reference sequences: uniform, linear, powerlaw, logarithmic or exponential. The uniform and linear models do not require a parameter, but the other models take a parameter in the range [0, infinity). If this parameter is not specified, then it is randomly chosen. Examples: uniform distribution: uniform powerlaw distribution with parameter 0.1: powerlaw 0.1 exponential distribution with automatically chosen parameter: exponential Default: uniform 1 -nl | -num_libraries Number of independent libraries to create. Specify how diverse and similar they should be with , and . Assign them different MID tags with . Default: 1 -mi | -multiplex_ids Specify an optional FASTA file that contains multiplex sequence identifiers (a.k.a MIDs or barcodes) to add to the sequences (one sequence per library, in the order given). The MIDs are included in the length specified with the -read_dist option and can be altered by sequencing errors. See the MIDesigner or BarCrawl programs to generate MID sequences. -di ... | -diversity ... This option specifies alpha diversity, specifically the richness, i.e. number of reference sequences to take randomly and include in each library. Use 0 for the maximum richness possible (based on the number of reference sequences available). Provide one value to make all libraries have the same diversity, or one richness value per library otherwise. Default: 0 -sp | -shared_perc This option controls an aspect of beta-diversity. When creating multiple libraries, specify the percent of reference sequences they should have in common (relative to the diversity of the least diverse library). Default: 0 % -pp | -permuted_perc This option controls another aspect of beta-diversity. For multiple libraries, choose the percent of the most-abundant reference sequences to permute (randomly shuffle) the rank-abundance of. Default: 100 % -rs | -random_seed Seed number to use for the pseudo-random number generator. -dt | -desc_track Track read information (reference sequence, position, errors, ...) by writing it in the read description. Default: 1 -ql ... | -qual_levels ... Generate basic quality scores for the simulated reads. Good residues are given a specified good score (e.g. 30) and residues that are the result of an insertion or substitution are given a specified bad score (e.g. 10). Specify first the good score and then the bad score on the command-line, e.g.: 30 10. Default: -fq | -fastq_output Whether to write the generated reads in FASTQ format (with Sanger-encoded quality scores) instead of FASTA and QUAL or not (1: yes, 0: no). need to be specified for this option to be effective. Default: 0 -bn | -base_name Prefix of the output files. Default: grinder -od | -output_dir Directory where the results should be written. This folder will be created if needed. Default: . -pf | -profile_file A file that contains Grinder arguments. This is useful if you use many options or often use the same options. Lines with comments (#) are ignored. Consider the profile file, 'simple_profile.txt': # A simple Grinder profile -read_dist 105 normal 12 -total_reads 1000 Running: grinder -reference_file viral_genomes.fa -profile_file simple_profile.txt Translates into: grinder -reference_file viral_genomes.fa -read_dist 105 normal 12 -total_reads 1000 Note that the arguments specified in the profile should not be specified again on the command line. CLI OUTPUT For each shotgun or amplicon read library requested, the following files are generated: * A rank-abundance file, tab-delimited, that shows the relative abundance of the different reference sequences * A file containing the read sequences in FASTA format. The read headers contain information necessary to track from which reference sequence each read was taken and what errors it contains. This file is not generated if option was provided. * If the option was specified, a file containing the quality scores of the reads (in QUAL format). * If the option was provided, a file containing the read sequences in FASTQ format. API EXAMPLES The Grinder API allows to conveniently use Grinder within Perl scripts. Here is a synopsis: use Grinder; # Set up a new factory (see the OPTIONS section for a complete list of parameters) my $factory = Grinder->new( -reference_file => 'genomes.fna' ); # Process all shotgun libraries requested while ( my $struct = $factory->next_lib ) { # The ID and abundance of the 3rd most abundant genome in this community my $id = $struct->{ids}->[2]; my $ab = $struct->{abs}->[2]; # Create shotgun reads while ( my $read = $factory->next_read) { # The read is a Bioperl sequence object with these properties: my $read_id = $read->id; # read ID given by Grinder my $read_seq = $read->seq; # nucleotide sequence my $read_mid = $read->mid; # MID or tag attached to the read my $read_errors = $read->errors; # errors that the read contains # Where was the read taken from? The reference sequence refers to the # database sequence for shotgun libraries, amplicon obtained from the # database sequence, or could even be a chimeric sequence my $ref_id = $read->reference->id; # ID of the reference sequence my $ref_start = $read->start; # start of the read on the reference my $ref_end = $read->end; # end of the read on the reference my $ref_strand = $read->strand; # strand of the reference } } # Similarly, for shotgun mate pairs my $factory = Grinder->new( -reference_file => 'genomes.fna', -insert_dist => 250 ); while ( $factory->next_lib ) { while ( my $read = $factory->next_read ) { # The first read is the first mate of the mate pair # The second read is the second mate of the mate pair # The third read is the first mate of the next mate pair # ... } } # To generate an amplicon library my $factory = Grinder->new( -reference_file => 'genomes.fna', -forward_reverse => '16Sgenes.fna', -length_bias => 0, -unidirectional => 1 ); while ( $factory->next_lib ) { while ( my $read = $factory->next_read) { # ... } } API METHODS The rest of the documentation details the available Grinder API methods. new Title : new Function: Create a new Grinder factory initialized with the passed arguments. Available parameters described in the OPTIONS section. Usage : my $factory = Grinder->new( -reference_file => 'genomes.fna' ); Returns : a new Grinder object next_lib Title : next_lib Function: Go to the next shotgun library to process. Usage : my $struct = $factory->next_lib; Returns : Community structure to be used for this library, where $struct->{ids} is an array reference containing the IDs of the genome making up the community (sorted by decreasing relative abundance) and $struct->{abs} is an array reference of the genome abundances (in the same order as the IDs). next_read Title : next_read Function: Create an amplicon or shotgun read for the current library. Usage : my $read = $factory->next_read; # for single read my $mate1 = $factory->next_read; # for mate pairs my $mate2 = $factory->next_read; Returns : A sequence represented as a Bio::Seq::SimulatedRead object get_random_seed Title : get_random_seed Function: Return the number used to seed the pseudo-random number generator Usage : my $seed = $factory->get_random_seed; Returns : seed number COPYRIGHT Copyright 2009-2013 Florent ANGLY Grinder is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License (GPL) as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. Grinder is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with Grinder. If not, see . BUGS All complex software has bugs lurking in it, and this program is no exception. If you find a bug, please report it on the SourceForge Tracker for Grinder: Bug reports, suggestions and patches are welcome. Grinder's code is developed on Sourceforge () and is under Git revision control. To get started with a patch, do: git clone git://biogrinder.git.sourceforge.net/gitroot/biogrinder/biogrinder