paleomix-1.2.12/.gitignore
*.swp
.\#*
\#*
*~
*.py[cod]
.coverage
MANIFEST

# Packages
dist
build
sdist
tests/runs
tests/links/
tests/all_modules.py
*.egg/
*.egg-info/
.eggs
.tox
.vscode
docs/_build

paleomix-1.2.12/CHANGES.md
# Changelog

## [1.2.12] - 2017-08-13
### Fixed
- Fixed input / output files not being listed in 'pipe.errors' files.
- Use the same max open files limit for Picard (ulimit -n minus headroom)
  when determining if the default should be changed and as the final value.

### Added
- The 'vcf_to_fasta' command now supports VCFs containing haploid genotype
  calls, courtesy of Graham Gower.

### Changed
- Require Pysam version 0.10.0 or later.


## [1.2.11] - 2017-06-09
### Fixed
- Fixed unhandled exception if a FASTA file for a prefix is missing in a BAM
  pipeline makefile.
- Fixed the 'RescaleQualities' option not being respected for non-global
  options in BAM pipeline makefiles.


## [1.2.10] - 2017-05-29
### Added
- Preliminary support for CSI indexed BAM files, required for genomes with
  chromosomes > 2^29 - 1 bp in size. Support is still missing in HTSJDK, so
  GATK cannot currently be used with such genomes. CSI indexing is enabled
  automatically when required.

### Fixed
- Reference sequences placed in the current directory no longer cause the
  BAM pipeline to complain about non-writable directories.
- The maximum number of temporary files used by Picard will no longer be
  increased above the default value used by the Picard tools.

### Changed
- The 'Status' of processes terminated by the pipeline will now be reported
  as 'Automatically terminated by PALEOMIX'. This is to help differentiate
  between processes that failed or were killed by an external source, and
  processes that were cleaned up by the pipeline itself.
- Pretty-printing of commands shown when commands fail has been revised to
  make it more readable, including explicit descriptions when output is
  piped from one process to another and vice versa.
- Commands are now shown in a format more suitable for running on the
  command-line, instead of as a Python list, when a node fails. Pipes are
  still specified separately.
- Improved error messages for missing programs during version checks, and
  for exceptions raised when calling Popen during version checks.
- Strip MC tags from reads with unmapped mates during cleanup; this is
  required since Picard (v2.9.0) ValidateSamFile considers such tags invalid.


## [1.2.9] - 2017-05-01
### Fixed
- Improved handling of BAM tags to prevent unintended type changes.
- Fixed 'rmdup_collapsed' underreporting the number of duplicate reads (in
  the 'XP' tag) when duplicates with different CIGAR strings were processed.

### Changed
- PCR duplicates detected for collapsed reads using 'rmdup_collapsed' are
  now identified based on alignments that include clipped bases. This
  matches the behavior of the Picard 'MarkDuplicates' command.
- Depending on work-load, 'rmdup_collapsed' may now run up to twice as fast.


## [1.2.8] - 2017-04-28
### Added
- Added a FILTER entry for the 'F' filter used in vcf_filter. This
  corresponds to heterozygous sites where the allele frequency was not
  determined.
- Added 'dupcheck' command.
  This command roughly corresponds to the DetectInputDuplication step that
  is part of the BAM pipeline, and attempts to identify duplicate data (not
  PCR duplicates) by locating reads mapped to the same position, with the
  same name, sequence, and quality scores.
- Added a link to the sample data used in the publication to the Zonkey
  documentation.

### Changed
- Only letters, numbers, and '-', '_', and '.' are allowed in sample names
  used in Zonkey, in order to prevent invalid filenames and certain programs
  breaking on whitespace. Trailing whitespace is stripped.
- Show more verbose output when building Zonkey pipelines.
- Picard tools version 1.137 or later is now required by the BAM pipeline.
  This is necessary as newer BAM files (header version 1.5) would fail to
  validate when using earlier versions of Picard tools.

### Fixed
- Fixed validation nodes failing on output paths without a directory.
- Fixed possible uncaught exceptions when terminating cat commands used by
  FASTQ validation nodes, resulting in loss of error messages.
- Fixed makefile validation failing with an unhandled TypeError if
  unhashable types were found in unexpected locations, for example a dict
  found where a subset of strings was allowed. These now result in a proper
  MakeFileError.
- Fixed user options in the 'BWA' section of BAM pipeline makefiles not
  being correctly applied when using the 'mem' or the 'bwasw' algorithms.
- Fixed some unit tests failing when the environment caused getlogin to
  fail.


## [1.2.7] - 2017-01-03
### Added
- PALEOMIX now includes the 'Zonkey' pipeline, a pipeline for detecting
  equine F1 hybrids from archaeological remains. Usage is described in the
  documentation.

### Changed
- The wrongly named per-sample option 'Gender' in the phylogenetic pipeline
  makefile has been replaced with a 'Sex' option. This does not break
  backwards compatibility, and makefiles using the old name will still work
  correctly.
- The 'RescaleQualities' option has been merged with the 'mapDamage' feature
  in the BAM pipeline makefile. The 'mapDamage' feature now takes the
  options 'plot', 'model', and 'rescale', allowing more fine-grained
  control.

### Fixed
- Fixed the phylogenetic pipeline complaining about missing sample genders
  (now sex) if no regions of interest had been specified. The pipeline will
  now complain about there being no regions of interest, instead.
- The 'random sampling' genotyper would misinterpret mapping qualities 10
  (encoded as '+') and 12 (encoded as '-') as indels, resulting in the
  genotyping failing. These mapping qualities are now correctly ignored.


## [1.2.6] - 2016-10-12
### Changed
- PALEOMIX now uses the 'setproctitle' module for better compatibility;
  installing / upgrading PALEOMIX using pip (or equivalent tools) should
  automatically install this dependency.

### Fixed
- mapDamage plots should not require indexed BAMs; this fixed missing file
  errors for some makefile configurations.
- Version check for Java now works correctly for OpenJDK JVMs.
- Pressing 'l' or 'L' to list the currently running tasks now correctly
  reports the total runtime of the pipeline, rather than 0s.
- Fixed broken version check in setup.py breaking on versions of Python
  older than 2.7, preventing a meaningful message from being shown (patch by
  beeso018).
- The logger will automatically create the output directory if it does not
  already exist; previously, attempting to log messages could cause the
  pipeline to fail, even if the messages were not in themselves fatal.
- Executables required for version checks are now included in the prior
  checks for missing executables, to avoid version checks failing due to
  missing executables.

### Added
- PALEOMIX will attempt to automatically limit the per-process maximum
  number of file-handles used when invoking Picard tools, in order to
  prevent failures due to exceeding the system limits (ulimit -n).


## [1.2.5] - 2015-06-06
### Changed
- Improved information capture when a node raises an unexpected exception,
  mainly for nodes implementing their own 'run' function (not CommandNodes).
- Improved printing of the state of output files when using the command-line
  option --list-output-files. Outdated files are now always listed as
  outdated, where previously these could be listed as 'Missing' if the task
  in question was queued to be run next.
- Don't attempt to validate prefixes when running 'trim_pipeline'; note that
  the structure of the Prefix section in the makefile still has to be valid.
- Reverted the commit normalizing the strand of unmapped reads.
- The commands 'paleomix coverage' and 'paleomix depths' now accept records
  lacking read-group information by default; such records are reported under
  a placeholder name in the sample and library columns. It is further
  possible to ignore all read-group information using the
  --ignore-readgroups command-line option.
- The 'bam_pipeline mkfile' command now does limited validation of input
  'SampleSheet.csv' files, prints generated targets sorted alphabetically,
  and automatically generates unique names for identically named lanes.
  Finally, the target template is not included when automatically generating
  a makefile.
- The 'coverage' and 'depth' commands are now capable of processing files
  containing reads with and without read-groups, without requiring the use
  of the --ignore-readgroups command-line option. Furthermore, reads for
  which the read-group is missing in the BAM header are treated as if no
  read-group was specified for that read.
- The 'coverage' and 'depth' commands now check that input BAM files are
  sorted during startup and while processing a file.
- Normalized the information printed by different progress UIs
  (--progress-ui), and included the maximum number of threads allowed.
- Restructured CHANGELOG based on http://keepachangelog.com/

### Fixed
- Fixed mislabeling of BWA nodes; all were labeled as 'SE'.
- Terminate read duplication checks when reaching the trailing, unmapped
  reads; this fixes uncontrolled memory growth when an alignment produces a
  large number of unmapped reads.
- Fixed the pipeline demanding the existence of files from lanes that had
  been entirely excluded due to ExcludeReads settings.
- Fixed some tasks needlessly depending on BAM files being indexed (e.g.
  depth histograms of a single BAM), resulting in missing file errors for
  certain makefile configurations.
- Fixed the per-prefix scan for duplicate input data not being run if no
  BAMs were set to be generated in the makefile, i.e. if both 'RawBAM' and
  'RealignedBAM' were set to 'off'.

### Deprecated
- Removed the BAM file from the bam_pipeline example, and added a
  deprecation warning; support for including preexisting BAMs will be
  removed in a future version of PALEOMIX.


## [1.2.4] - 2015-03-14
### Added
- Included PATH in the 'pipe.errors' file, to assist debugging of failed
  nodes.
### Fixed
- Fix a regression causing 'fixmate' not to be run on paired-end reads. This
  would occasionally cause paired-end mapping to fail during validation.


## [1.2.3] - 2015-03-11
### Added
- Added the ability for the pipelines to output the list of input files
  required for a given makefile, excluding any file built by the pipeline
  itself. Use the --list-input-files command-line option to view these.

### Changed
- Updated the 'bam_pipeline' makefile template; prefixes and targets are
  described more explicitly, and values for the prefix are commented out by
  default. The 'Label' option is not included in the template, as it is
  considered deprecated.
- Allow the 'trim_pipeline' to be run on a makefile without any prefixes;
  this eases use of this pipeline in the case where a mapping is not wanted.
- Improved handling of unmapped reads in 'paleomix cleanup'; additional
  flags (in particular 0x2; proper alignment) are now cleared if the mate is
  unmapped, and unmapped reads are always represented on the positive strand
  (clearing 0x10 and / or 0x20).


## [1.2.2] - 2015-03-10
### Added
- Documented work-arounds for a problem caused when upgrading an old version
  of PALEOMIX (< 1.2.0) by using 'pip' to install a newer version, in which
  all command-line aliases invoke the same tool.
- Added an expanded description of PALEOMIX to the README file.
- The tool 'paleomix vcf_filter' can now clear any existing value in the
  FILTER column, and only record the result of running the filters
  implemented by this tool. This behavior may be enabled by running
  vcf_filter with the command-line option '--reset-filter yes'.

### Changed
- Improved parsing of 'depths' histograms when running the phylogenetic
  pipeline genotyping step with 'MaxDepth: auto'; mismatches between the
  sample name in the table and in the makefile now only cause a warning,
  allowing for the common case where depths were manually recalculated (and
  --target was not set), or where files were renamed.
- The tool 'paleomix rmdup_collapsed' now assumes that ALL single-end reads
  (flag 0x1 not set) are collapsed. This ensures that pre-collapsed reads
  used in the pipeline are correctly filtered. Furthermore, reads without
  quality scores will be filtered, but are only selected as the unique
  representative for a set of potential duplicates if no reads have quality
  scores. In that case, a random read is selected among the candidates.

### Fixed
- Fixed failure during mapping when using SAMTools v1.x.


## [1.2.1] - 2015-03-08
### Changed
- Removed the dependency on BEDTools from the phylogenetic pipeline.
- Changed paleomix.__version__ to follow PEP 0396.

### Fixed
- Stop 'phylo_pipeline makefile' from always printing the help text.
- Fixed a bug causing the phylo_pipeline to throw an exception if no
  additional command-line arguments were given.
- Allow the simulation of reads for the phylogenetic pipeline example to be
  executed when PALEOMIX is run from a virtual environment.


## [1.2.0] - 2015-02-24

This is a major revision of PALEOMIX, mainly focused on reworking the
internals of the PALEOMIX framework, as well as cleaning up several warts in
the BAM pipeline. As a result, the default makefile has changed in a number
of ways, but backwards compatibility is still retained with older makefiles,
with one exception: where previously 'FilterUnmappedReads' would only be in
effect when 'MinQuality' was set to 0, this option is now independent of the
'MinQuality' option (see the sketch below).
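A minimal sketch of how these two options relate in a BAM pipeline makefile;
the exact nesting under the 'Aligners' section is an assumption based on the
makefile template described in this changelog, and the values shown are
purely illustrative:

```yaml
Options:
  Aligners:
    Program: BWA
    BWA:
      # As of 1.2.0, these two options act independently:
      MinQuality: 30            # filter hits with mapping quality below 30
      FilterUnmappedReads: yes  # applied even when MinQuality is 0
```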
In addition, it is now possible to install PALEOMIX via PyPI, as described
in the (partially) updated documentation now hosted on ReadTheDocs.

### Changed
- Initial version of the updated documentation hosted on ReadTheDocs, to
  replace the documentation currently hosted on the repository wiki.
- mapDamage files and models are now only kept in the
  {Target}.{Prefix}.mapDamage folder to simplify the file-structure;
  consequently, re-scaling can be re-done with different parameters by
  re-running the model step in these folders.
- Reworked BWA backtrack mapping to be carried out in two steps; this
  requires saving the .sai files (and hence more disk-space used by
  intermediate files, which can be removed afterwards), but allows better
  control over thread and memory usage.
- Validate paths in BAM makefiles, to ensure that these can be parsed, and
  that these do not contain keys other than '{Pair}'.
- The mapping-quality filter in the BAM pipeline / 'cleanup' command now
  only applies to mapped reads; consequently, setting a non-zero mapq value
  and setting 'FilterUnmappedReads' to 'no' will not result in unmapped
  reads being filtered.
- Improved the cleanup of BAM records following mapping, to better ensure
  that the resulting records follow the recommendations in the SAM spec.
  with regards to what fields / flags are set.
- Configuration files are now expected to be located in ~/.paleomix or
  /etc/paleomix rather than ~/.pypeline and /etc/pypeline. To ensure
  backwards compatibility, ~/.pypeline will be migrated when a pipeline is
  first run, and replaced with a symbolic link to the new location.
  Furthermore, files in /etc/pypeline are still read, but settings in
  /etc/paleomix take precedence.
- When parsing GTF files with 'gtf_to_bed', use either the attribute
  'gene_type' or 'gene_biotype', defaulting to the value 'unknown_genetype'
  if neither attribute can be found; also support reading of gz / bz2 files.
- The "ExcludeReads" section of the BAM pipeline makefile is now a
  dictionary rather than a list of strings. Furthermore, 'Singleton' reads
  are now considered separately from 'Single'-end reads, and may be excluded
  independently of those. This does not break backwards compatibility, but
  as a consequence 'Single' includes both single-end and singleton reads
  when using old makefiles.
- Added the command-line option --nth-sample to the 'vcf_to_fasta' command,
  allowing FASTA construction from multi-sample VCFs; furthermore, if no BED
  file is specified, the entire genotype is constructed assuming that the
  VCF header is present.
- Modified the FASTA indexing node so that SAMTools v0.1.x and v1.x can be
  used (added a workaround for a missing feature in v1.x).
- The "Features" section of the BAM pipeline makefile is now a dictionary
  rather than a list of strings, and spaces have been removed from feature
  names. This does not break backwards compatibility.
- EXaML v3.0+ is now required; the name of the EXaML parser executable is
  required to be 'parse-examl' (previously expected to be 'examlParser'),
  following the name used by EXaML v3.0+.
- Pysam v0.8.3+ is now required.
- AdapterRemoval v2.1.5+ is now required; it is now possible to provide a
  list of adapter sequences using --adapter-list, and to specify the number
  of threads used by AdapterRemoval via the --adapterremoval-max-threads
  command-line option.
- Renamed the module from 'pypeline' to 'paleomix' to avoid conflicts.
- Improved handling of FASTQ paths containing wildcards in the BAM pipeline,
  including additional checks to catch unequal numbers of files for
  paired-end reads (see the sketch below).
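A hedged sketch tying together the '{Pair}' path validation, the wildcard
handling, and the new 'ExcludeReads' dictionary described above; the target,
sample, and library names are invented, and the exact set of read-type keys
is an assumption rather than a verbatim copy of the template:

```yaml
Options:
  ExcludeReads:    # now a dictionary of read types rather than a list
    Single: no     # with old makefiles, 'Single' also covers singletons
    Singleton: no  # new in 1.2.0; may be excluded independently
    Paired: no
    Collapsed: no

MyTarget:
  MySample:
    MyLibrary:
      # '{Pair}' is the only key permitted in lane paths and is replaced by
      # the mate number; wildcards may match multiple FASTQ files per lane.
      Lane_1: "reads/run1_R{Pair}_*.fastq.gz"
```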
- Switched to setuptools in preparation for PyPI registration.
- Avoid separate indexing of intermediate BAMs when possible, reducing the
  total number of steps required for typical runs.
- Restructured tests, removing (mostly unused) node tests.
- Reworked sub-command handling to enable the migration to setuptools, and
  improved the safety of invoking these from the pipeline itself.
- The output of "trim_pipeline mkfile" now includes the section for
  AdapterRemoval, which was previously mistakenly omitted.
- Increased the speed of the checks for duplicate input data (i.e. the same
  FASTQ record(s) included multiple times in one or more files) by ~4x.

### Added
- PALEOMIX v1.2.0 is now available via PyPI ('pip install paleomix').
- Added the command 'paleomix ena', which is designed to ease the
  preparation of FASTQ reads previously recorded in a BAM pipeline makefile
  for submission to the European Nucleotide Archive; this command is
  currently unstable, and not available by default (see comments in
  'main.py').
- Exposed the 'bam_pipeline remap' command, which eases re-mapping the hits
  identified against one prefix against other prefixes.
- Added validation of BED files supplied to the BAM pipeline, and expanded
  validation of BED files supplied to the phylogenetic pipeline, to catch
  some cases that may cause unexpected behavior or failure during runtime.
- Support SAMTools v1.x in the BAM pipeline; note, however, that the
  phylogenetic pipeline still requires SAMTools v0.1.19, due to major
  changes to BCFTools 1.x, which is not yet supported.
- Modified 'bam_cleanup' to support SAMTools 1.x; SAMTools v0.1.19 or v1.x+
  is henceforth required by this tool.
- The gender 'NA' may now be used for samples for which no filtering of sex
  chromosomes is to be carried out, and defaults to an empty set of
  chromosomes unless explicitly overridden.
- Pipeline examples are now available following installation via the
  commands "bam_pipeline example" and "phylo_pipeline example", which copy
  the example files to a folder specified by the user.
- Added the ability to specify the maximum number of threads used by GATK;
  currently only applicable for training of the indel realigner.

### Fixed
- Ensured that only a single header is generated when using multiple threads
  during genotyping, in order to avoid issues with programs unable to handle
  multiple headers.
- Information / error messages are now more consistently logged to stderr,
  to better ensure that results printed to stdout are not mixed with those.
- Fixed a bug which could cause the data duplication detection to fail when
  unmapped reads were included.
- Fixed default values not being shown for 'vcf_filter --help'.
- Fixed 'vcf_filter' when using Pysam v0.8.4; it would raise an exception
  due to changes to the VCF record class.

### Removed
- Removed the 'paleomix zip' command, as this is no longer needed thanks to
  the built-in gzip / bzip2 support in AdapterRemoval v2.
- Removed the command-line options --allow-missing-input-files,
  --list-orphan-files, --target, and --list-targets.


## [1.1.1] - 2015-10-10
### Changed
- Detect the presence of carriage-returns ('\r') in FASTA files used as
  prefixes; these cause issues with some tools, and files should be
  converted using e.g. 'dos2unix' first.

### Fixed
- Minor fix to the help-text displayed as part of the running information.

### Deprecated
- AdapterRemoval v1.x is now considered deprecated, and support will be
  dropped shortly.
  Please upgrade to v2.1 or later, which can be found at
  https://github.com/MikkelSchubert/adapterremoval

### Removed
- Dropped support for Picard tools versions prior to 1.124; this was
  necessitated by Picard tools merging into a single jar for all commands.
  This jar (picard.jar) is expected to be located in the --jar-root folder.


## [1.1.0] - 2015-09-08
### Added
- Check that regions of interest specified in the PhylogeneticInference
  section correspond to those specified earlier in the makefile.
- Added the ability to automatically read MaxReadDepth values from
  depth-histograms generated by the BAM pipeline to the genotyping step.
- Added support for the BWA algorithms "bwasw" and "mem", which are
  recommended for longer sequencing reads. The default remains the
  "backtrack" algorithm.
- Include the list of filters in 'vcf_filter' output and renamed these to be
  compatible with GATK (using ':' instead of '=').
- Support for genotyping an entire BAM (once, and only once), even if only a
  set of regions are to be called; this is useful in the context of larger
  projects, and when multiple overlapping regions are to be genotyped.
- Added validation of FASTA files for the BAM pipeline, in order to catch
  several types of errors that may lead to failure during mapping.
- Added options to the BAM / Phylo pipelines for writing a Dot-file of the
  full dependency tree of a pipeline.
- Added the ability to change the number of threads, and more, while the
  pipeline is running. Currently, already running tasks are not terminated
  if the maximum number of threads is decreased. Press 'h' during runtime to
  list commands.
- Support for AdapterRemoval v2.
- Allow the -Xmx option for Java to be overridden by the user.

### Changed
- Prohibit whitespace and parentheses in prefix paths; these cause problems
  with Bowtie2, due to the wrapper script used by this program.
- Allow "*" as the name for prefixes, when selecting prefixes by wildcards.
- Reworked the genotyping step to improve performance when genotyping sparse
  regions (e.g. genes), and to allow transparent parallelization.
- Require BWA 0.5.9, 0.5.10, 0.6.2, or 0.7.9+ for BWA backtrack; other
  versions have never been tested, or are known to contain bugs that result
  in invalid BAM files.
- The memory limit is no longer increased for 32-bit JREs by default, as the
  value used by the pipeline exceeded the maximum for this architecture.
- Improved verification of singleton-filtering settings in makefiles.
- Reworked the 'sample_pileup' command, to reduce the memory usage for
  larger regions (e.g. entire chromosomes) by an order of magnitude. Also
  fixed some inconsistency in the calculation of distance to indels,
  resulting in some changes in results.
- Changed 'gtf_to_bed' to group by the gene biotype, instead of the source.

### Fixed
- Fixed a bug preventing new tasks from being started immediately after a
  task had failed; new tasks would only be started once a task had finished,
  or no running tasks were left.
- Fixed the MaxDepth calculation being limited to depths in the range
  0 .. 200.
- Added a workaround for a bug in Pysam, which caused parsing of some GTF
  files to fail if these contained unquoted values (e.g. "exon_number 2;").
- Fixed a bug causing some tasks to not be re-run if the input file changed.
- Fixed an off-by-one error for coverages near the end of regions / contigs.
- Ensure that the correct 'paleomix' wrapper script is called when invoking
  the various other tools, even if this is not located in the current PATH.
- Parse newer SAMTools / BCFTools version strings, so that a meaningful
  version check failure can be reported, as these versions are not supported
  yet due to missing functionality.
- Fixed a potential deadlock in the genotyping tool, which could occur if
  either of the invoked commands failed to start or crashed / were killed
  during execution.
- Fixed an error in which summary files could not be generated if two (or
  more) prefixes using the same label contained contigs with overlapping
  names but different sizes.
- Fixed problems calculating coverage, depths, and others, when using a
  user-provided BED without a name column.
- Improved termination of child-processes when the pipeline is interrupted.

### Deprecated
- The 'mkfile' command has been renamed to 'makefile' for both pipelines;
  the old command is still supported, but considered deprecated.

### Removed
- Dropped support for the "verbose" terminal output due to excessive
  verbosity (yes, really). The new default is "running" (previously called
  "quiet"), which shows a list of currently running nodes at every update.


## [1.0.1] - 2014-04-30
### Added
- Added the 'paleomix' command, which provides an interface for the various
  tools included in the PALEOMIX pipeline; this reduces the number of
  executables exposed by the pipeline, and allows for prerequisite checks to
  be done in one place.
- Added a warning if HomozygousContigs contains contigs not included in any
  of the prefixes specified in the makefile (see the sketch below).

### Changed
- Reworked version checking; added checks for the JRE version (1.6+) and for
  GATK (to check that the JRE can run it), and improved error messages for
  unidentified and / or outdated versions, and the reporting of version
  numbers and requirements.
- Dispose of hsperfdata_* folders created by certain JREs when using a
  custom temporary directory, when running Picard tools.
- Cleanup of the error-message displayed if the Pysam version is outdated.
- Ensure that file-handles are closed in the main process before subprocess
  execution, to ensure that these receive SIGPIPE upon broken pipes.
- Improvements to the handling of implicit empty lists in makefiles; it is
  now no longer required to explicitly specify an empty list. Thus, the
  following are equivalent, assuming that the pipeline expects a list:

      ExplicitEmptyList: []
      ImplicitEmptyList:

- Tweaked the makefile templates; the phylo makefile now specifies the
  Male/Female genders with chrM and chrX; for the BAM pipeline the ROIs
  sub-tree and Label are commented out by default, as these are optional.
- Reduced start-up time for bigger pipelines.

### Fixed
- Fixed the manifest, ensuring that all files are included in the source
  distribution.
- Fixed a regression in coverage / depths, which would fail if invoked for
  specific regions of interest.
- Fixed a bug preventing Padding from being set to zero when genotyping.


## [1.0.0] - 2014-04-16
### Changed
- Switching to more traditional version-number tracking.
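A minimal sketch of the 'HomozygousContigs' layout and the chrM / chrX
defaults referenced in the 1.0.1 entries above; the exact nesting within the
phylogenetic pipeline makefile is an assumption for illustration, not
verbatim template text:

```yaml
Genders:
  Female:
    HomozygousContigs: [chrM]
  Male:
    HomozygousContigs: [chrM, chrX]
```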
[Unreleased]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.12...HEAD
[1.2.12]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.11...v1.2.12
[1.2.11]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.10...v1.2.11
[1.2.10]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.9...v1.2.10
[1.2.9]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.8...v1.2.9
[1.2.8]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.7...v1.2.8
[1.2.7]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.6...v1.2.7
[1.2.6]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.5...v1.2.6
[1.2.5]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.4...v1.2.5
[1.2.4]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.3...v1.2.4
[1.2.3]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.2...v1.2.3
[1.2.2]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.1...v1.2.2
[1.2.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.0...v1.2.1
[1.2.0]: https://github.com/MikkelSchubert/paleomix/compare/v1.1.1...v1.2.0
[1.1.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.1.0...v1.1.1
[1.1.0]: https://github.com/MikkelSchubert/paleomix/compare/v1.0.1...v1.1.0
[1.0.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.0.0...v1.0.1
[1.0.0]: https://github.com/MikkelSchubert/paleomix/compare/v1.0.0-RC...v1.0.0

paleomix-1.2.12/INSTALL.rst
For detailed installation instructions, please refer to
http://paleomix.readthedocs.io/

paleomix-1.2.12/MANIFEST.in
include README.rst
include CHANGES.md
include MANIFEST.in
include pylint.conf licenses/gpl.txt licenses/mit.txt
include paleomix/yaml/CHANGES
include paleomix/yaml/LICENSE
include paleomix/yaml/PKG-INFO
include paleomix/yaml/README

# Examples
recursive-include paleomix/resources *

# Misc tools
recursive-include misc *.py *.sh *.yaml

# Tests
recursive-include tests *

paleomix-1.2.12/README.rst
**********************
The PALEOMIX pipelines
**********************

The PALEOMIX pipelines are a set of pipelines and tools designed to aid the
rapid processing of High-Throughput Sequencing (HTS) data: the BAM pipeline
processes de-multiplexed reads from one or more samples, through sequence
processing and alignment, to generate BAM alignment files useful in
downstream analyses; the Phylogenetic pipeline carries out genotyping and
phylogenetic inference on BAM alignment files, either produced using the BAM
pipeline or generated elsewhere; and the Zonkey pipeline carries out a suite
of analyses on low coverage equine alignments, in order to detect the
presence of F1-hybrids in archaeological assemblages. In addition, PALEOMIX
aids in the metagenomic analysis of extracts.

The pipelines have been designed with ancient DNA (aDNA) in mind, and include
several features especially useful for the analyses of ancient samples, but
can all be used for the processing of modern samples, in order to ensure
consistent data processing.

For a detailed description of the pipelines, please refer to the PALEOMIX
website and the `documentation <http://paleomix.readthedocs.io/>`_; for
questions, bug reports, and/or suggestions, use the
`GitHub tracker <https://github.com/MikkelSchubert/paleomix/issues>`_, or
contact Mikkel Schubert at `MSchubert@snm.ku.dk <mailto:MSchubert@snm.ku.dk>`_.
The PALEOMIX pipelines have been published in Nature Protocols; if you make
use of PALEOMIX in your work, then please cite

  Schubert M, Ermini L, Sarkissian CD, Jónsson H, Ginolhac A, Schaefer R,
  Martin MD, Fernández R, Kircher M, McCue M, Willerslev E, and Orlando L.
  "**Characterization of ancient and modern genomes by SNP detection and
  phylogenomic and metagenomic analysis using PALEOMIX**". Nat Protoc. 2014
  May;9(5):1056-82. doi: `10.1038/nprot.2014.063
  <https://doi.org/10.1038/nprot.2014.063>`_. Epub 2014 Apr 10. PubMed PMID:
  `24722405 <https://www.ncbi.nlm.nih.gov/pubmed/24722405>`_.

The Zonkey pipeline has been published in Journal of Archaeological Science;
if you make use of this pipeline in your work, then please cite

  Schubert M, Mashkour M, Gaunitz C, Fages A, Seguin-Orlando A, Sheikhi S,
  Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Chuang R, Ermini L, Gamba C,
  Weinstock J, Vedat O, and Orlando L. "**Zonkey: A simple, accurate and
  sensitive pipeline to genetically identify equine F1-hybrids in
  archaeological assemblages**". Journal of Archaeological Science. 2017
  Feb; 78:147-157. doi: `10.1016/j.jas.2016.12.005
  <https://doi.org/10.1016/j.jas.2016.12.005>`_.

paleomix-1.2.12/bin/bam_pipeline
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2014 Mikkel Schubert <MSchubert@snm.ku.dk>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""
Legacy script for invoking the PALEOMIX command 'bam_pipeline'; main scripts
are otherwise created by setuptools during the installation.
"""
import sys

try:
    import paleomix
except ImportError:
    error = sys.exc_info()[1]  # Python 2/3 compatible exception syntax
    sys.stderr.write("""Error importing required PALEOMIX module 'paleomix':
    - %s

Please make sure that PYTHONPATH points to the location of the 'paleomix'
module. This may be done permanently by appending the following to your
~/.bashrc file (if using Bash):
    export PYTHONPATH=${PYTHONPATH}:/path/to/paleomix/checkout/...
""" % (error,))
    sys.exit(1)


if __name__ == '__main__':
    sys.exit(paleomix.run_bam_pipeline())

paleomix-1.2.12/bin/bam_rmdup_collapsed
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2014 Mikkel Schubert <MSchubert@snm.ku.dk>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""
Legacy script for invoking the PALEOMIX command "rmdup_collapsed"; main
scripts are otherwise created by setuptools during the installation.
"""
import sys

try:
    import paleomix
except ImportError:
    error = sys.exc_info()[1]  # Python 2/3 compatible exception syntax
    sys.stderr.write("""Error importing required PALEOMIX module 'paleomix':
    - %s

Please make sure that PYTHONPATH points to the location of the 'paleomix'
module. This may be done permanently by appending the following to your
~/.bashrc file (if using Bash):
    export PYTHONPATH=${PYTHONPATH}:/path/to/paleomix/checkout/...
""" % (error,))
    sys.exit(1)


if __name__ == '__main__':
    sys.exit(paleomix.run_rmdup_collapsed())

paleomix-1.2.12/bin/conv_gtf_to_bed
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2014 Mikkel Schubert <MSchubert@snm.ku.dk>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""
Legacy script for invoking the PALEOMIX command "gtf_to_bed"; main scripts
are otherwise created by setuptools during the installation.
"""
import sys

try:
    import paleomix
except ImportError:
    error = sys.exc_info()[1]  # Python 2/3 compatible exception syntax
    sys.stderr.write("""Error importing required PALEOMIX module 'paleomix':
    - %s

Please make sure that PYTHONPATH points to the location of the 'paleomix'
module. This may be done permanently by appending the following to your
~/.bashrc file (if using Bash):
    export PYTHONPATH=${PYTHONPATH}:/path/to/paleomix/checkout/...
""" % (error,))
    sys.exit(1)


if __name__ == '__main__':
    sys.exit(paleomix.run_gtf_to_bed())

paleomix-1.2.12/bin/paleomix
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2014 Mikkel Schubert <MSchubert@snm.ku.dk>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""
Legacy script for invoking PALEOMIX; main scripts are otherwise created by
setuptools during the installation.
"""
import sys

try:
    import paleomix
except ImportError:
    error = sys.exc_info()[1]  # Python 2/3 compatible exception syntax
    sys.stderr.write("""Error importing required PALEOMIX module 'paleomix':
    - %s

Please make sure that PYTHONPATH points to the location of the 'paleomix'
module. This may be done permanently by appending the following to your
~/.bashrc file (if using Bash):
    export PYTHONPATH=${PYTHONPATH}:/path/to/paleomix/checkout/...
""" % (error,))
    sys.exit(1)


if __name__ == '__main__':
    sys.exit(paleomix.run())

paleomix-1.2.12/bin/phylo_pipeline
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2014 Mikkel Schubert <MSchubert@snm.ku.dk>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""
Legacy script for invoking the PALEOMIX command "phylo_pipeline"; main
scripts are otherwise created by setuptools during the installation.
"""
import sys

try:
    import paleomix
except ImportError:
    error = sys.exc_info()[1]  # Python 2/3 compatible exception syntax
    sys.stderr.write("""Error importing required PALEOMIX module 'paleomix':
    - %s

Please make sure that PYTHONPATH points to the location of the 'paleomix'
module. This may be done permanently by appending the following to your
~/.bashrc file (if using Bash):
    export PYTHONPATH=${PYTHONPATH}:/path/to/paleomix/checkout/...
""" % (error,))
    sys.exit(1)


if __name__ == '__main__':
    sys.exit(paleomix.run_phylo_pipeline())

paleomix-1.2.12/bin/trim_pipeline
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# Copyright (c) 2014 Mikkel Schubert <MSchubert@snm.ku.dk>
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
"""
Legacy script for invoking the PALEOMIX command "trim_pipeline"; main
scripts are otherwise created by setuptools during the installation.
"""
import sys

try:
    import paleomix
except ImportError:
    error = sys.exc_info()[1]  # Python 2/3 compatible exception syntax
    sys.stderr.write("""Error importing required PALEOMIX module 'paleomix':
    - %s

Please make sure that PYTHONPATH points to the location of the 'paleomix'
module. This may be done permanently by appending the following to your
~/.bashrc file (if using Bash):
    export PYTHONPATH=${PYTHONPATH}:/path/to/paleomix/checkout/...
""" % (error,))
    sys.exit(1)


if __name__ == '__main__':
    sys.exit(paleomix.run_trim_pipeline())

paleomix-1.2.12/docs/Makefile
# Makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
PAPER         =
BUILDDIR      = _build

# User-friendly check for sphinx-build
ifeq ($(shell which $(SPHINXBUILD) >/dev/null 2>&1; echo $$?), 1)
$(error The '$(SPHINXBUILD)' command was not found. Make sure you have Sphinx installed, then set the SPHINXBUILD environment variable to point to the full path of the '$(SPHINXBUILD)' executable. Alternatively you can add the directory with the executable to your PATH. If you don't have Sphinx installed, grab it from http://sphinx-doc.org/)
endif

# Internal variables.
PAPEROPT_a4     = -D latex_paper_size=a4
PAPEROPT_letter = -D latex_paper_size=letter
ALLSPHINXOPTS   = -d $(BUILDDIR)/doctrees $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .
# the i18n builder cannot share the environment and doctrees with the others
I18NSPHINXOPTS  = $(PAPEROPT_$(PAPER)) $(SPHINXOPTS) .

.PHONY: help clean html dirhtml singlehtml pickle json htmlhelp qthelp devhelp epub latex latexpdf text man changes linkcheck doctest coverage gettext

help:
	@echo "Please use \`make <target>' where <target> is one of"
	@echo "  html       to make standalone HTML files"
	@echo "  dirhtml    to make HTML files named index.html in directories"
	@echo "  singlehtml to make a single large HTML file"
	@echo "  pickle     to make pickle files"
	@echo "  json       to make JSON files"
	@echo "  htmlhelp   to make HTML files and a HTML help project"
	@echo "  qthelp     to make HTML files and a qthelp project"
	@echo "  applehelp  to make an Apple Help Book"
	@echo "  devhelp    to make HTML files and a Devhelp project"
	@echo "  epub       to make an epub"
	@echo "  latex      to make LaTeX files, you can set PAPER=a4 or PAPER=letter"
	@echo "  latexpdf   to make LaTeX files and run them through pdflatex"
	@echo "  latexpdfja to make LaTeX files and run them through platex/dvipdfmx"
	@echo "  text       to make text files"
	@echo "  man        to make manual pages"
	@echo "  texinfo    to make Texinfo files"
	@echo "  info       to make Texinfo files and run them through makeinfo"
	@echo "  gettext    to make PO message catalogs"
	@echo "  changes    to make an overview of all changed/added/deprecated items"
	@echo "  xml        to make Docutils-native XML files"
	@echo "  pseudoxml  to make pseudoxml-XML files for display purposes"
	@echo "  linkcheck  to check all external links for integrity"
	@echo "  doctest    to run all doctests embedded in the documentation (if enabled)"
	@echo "  coverage   to run coverage check of the documentation (if enabled)"

clean:
	rm -rf $(BUILDDIR)/*

html:
	$(SPHINXBUILD) -b html $(ALLSPHINXOPTS) $(BUILDDIR)/html
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/html."

dirhtml:
	$(SPHINXBUILD) -b dirhtml $(ALLSPHINXOPTS) $(BUILDDIR)/dirhtml
	@echo
	@echo "Build finished. The HTML pages are in $(BUILDDIR)/dirhtml."

singlehtml:
	$(SPHINXBUILD) -b singlehtml $(ALLSPHINXOPTS) $(BUILDDIR)/singlehtml
	@echo
	@echo "Build finished. The HTML page is in $(BUILDDIR)/singlehtml."

pickle:
	$(SPHINXBUILD) -b pickle $(ALLSPHINXOPTS) $(BUILDDIR)/pickle
	@echo
	@echo "Build finished; now you can process the pickle files."

json:
	$(SPHINXBUILD) -b json $(ALLSPHINXOPTS) $(BUILDDIR)/json
	@echo
	@echo "Build finished; now you can process the JSON files."

htmlhelp:
	$(SPHINXBUILD) -b htmlhelp $(ALLSPHINXOPTS) $(BUILDDIR)/htmlhelp
	@echo
	@echo "Build finished; now you can run HTML Help Workshop with the" \
	      ".hhp project file in $(BUILDDIR)/htmlhelp."

qthelp:
	$(SPHINXBUILD) -b qthelp $(ALLSPHINXOPTS) $(BUILDDIR)/qthelp
	@echo
	@echo "Build finished; now you can run "qcollectiongenerator" with the" \
	      ".qhcp project file in $(BUILDDIR)/qthelp, like this:"
	@echo "# qcollectiongenerator $(BUILDDIR)/qthelp/PALEOMIX.qhcp"
	@echo "To view the help file:"
	@echo "# assistant -collectionFile $(BUILDDIR)/qthelp/PALEOMIX.qhc"

applehelp:
	$(SPHINXBUILD) -b applehelp $(ALLSPHINXOPTS) $(BUILDDIR)/applehelp
	@echo
	@echo "Build finished. The help book is in $(BUILDDIR)/applehelp."
	@echo "N.B. You won't be able to view it unless you put it in" \
	      "~/Library/Documentation/Help or install it in your application" \
	      "bundle."

devhelp:
	$(SPHINXBUILD) -b devhelp $(ALLSPHINXOPTS) $(BUILDDIR)/devhelp
	@echo
	@echo "Build finished."
	@echo "To view the help file:"
	@echo "# mkdir -p $$HOME/.local/share/devhelp/PALEOMIX"
	@echo "# ln -s $(BUILDDIR)/devhelp $$HOME/.local/share/devhelp/PALEOMIX"
	@echo "# devhelp"

epub:
	$(SPHINXBUILD) -b epub $(ALLSPHINXOPTS) $(BUILDDIR)/epub
	@echo
	@echo "Build finished. The epub file is in $(BUILDDIR)/epub."

latex:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo
	@echo "Build finished; the LaTeX files are in $(BUILDDIR)/latex."
	@echo "Run \`make' in that directory to run these through (pdf)latex" \
	      "(use \`make latexpdf' here to do that automatically)."

latexpdf:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through pdflatex..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

latexpdfja:
	$(SPHINXBUILD) -b latex $(ALLSPHINXOPTS) $(BUILDDIR)/latex
	@echo "Running LaTeX files through platex and dvipdfmx..."
	$(MAKE) -C $(BUILDDIR)/latex all-pdf-ja
	@echo "pdflatex finished; the PDF files are in $(BUILDDIR)/latex."

text:
	$(SPHINXBUILD) -b text $(ALLSPHINXOPTS) $(BUILDDIR)/text
	@echo
	@echo "Build finished. The text files are in $(BUILDDIR)/text."

man:
	$(SPHINXBUILD) -b man $(ALLSPHINXOPTS) $(BUILDDIR)/man
	@echo
	@echo "Build finished. The manual pages are in $(BUILDDIR)/man."

texinfo:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo
	@echo "Build finished. The Texinfo files are in $(BUILDDIR)/texinfo."
	@echo "Run \`make' in that directory to run these through makeinfo" \
	      "(use \`make info' here to do that automatically)."

info:
	$(SPHINXBUILD) -b texinfo $(ALLSPHINXOPTS) $(BUILDDIR)/texinfo
	@echo "Running Texinfo files through makeinfo..."
	make -C $(BUILDDIR)/texinfo info
	@echo "makeinfo finished; the Info files are in $(BUILDDIR)/texinfo."

gettext:
	$(SPHINXBUILD) -b gettext $(I18NSPHINXOPTS) $(BUILDDIR)/locale
	@echo
	@echo "Build finished. The message catalogs are in $(BUILDDIR)/locale."

changes:
	$(SPHINXBUILD) -b changes $(ALLSPHINXOPTS) $(BUILDDIR)/changes
	@echo
	@echo "The overview file is in $(BUILDDIR)/changes."

linkcheck:
	$(SPHINXBUILD) -b linkcheck $(ALLSPHINXOPTS) $(BUILDDIR)/linkcheck
	@echo
	@echo "Link check complete; look for any errors in the above output " \
	      "or in $(BUILDDIR)/linkcheck/output.txt."

doctest:
	$(SPHINXBUILD) -b doctest $(ALLSPHINXOPTS) $(BUILDDIR)/doctest
	@echo "Testing of doctests in the sources finished, look at the " \
	      "results in $(BUILDDIR)/doctest/output.txt."

coverage:
	$(SPHINXBUILD) -b coverage $(ALLSPHINXOPTS) $(BUILDDIR)/coverage
	@echo "Testing of coverage in the sources finished, look at the " \
	      "results in $(BUILDDIR)/coverage/python.txt."

xml:
	$(SPHINXBUILD) -b xml $(ALLSPHINXOPTS) $(BUILDDIR)/xml
	@echo
	@echo "Build finished. The XML files are in $(BUILDDIR)/xml."

pseudoxml:
	$(SPHINXBUILD) -b pseudoxml $(ALLSPHINXOPTS) $(BUILDDIR)/pseudoxml
	@echo
	@echo "Build finished. The pseudo-XML files are in $(BUILDDIR)/pseudoxml."
paleomix-1.2.12/docs/_static/zonkey/incl_ts_0_tree_rooted.png
paleomix-1.2.12/docs/_static/zonkey/incl_ts_0_tree_unrooted.png
paleomix-1.2.12/docs/_static/zonkey/mito_phylo.png
[binary PNG image data omitted; archive truncated within the last file]
paleomix-1.2.12/docs/acknowledgements.rst

================
Acknowledgements
================

The PALEOMIX pipeline has been developed by researchers at the `Orlando Group`_ at the `Centre for GeoGenetics`_, University of Copenhagen, Denmark. Its development was supported by the Danish Council for Independent Research, Natural Sciences (FNU); the Danish National Research Foundation (DNRF94); a Marie-Curie Career Integration grant (FP7 CIG-293845); and the Lundbeck foundation (R52-A5062).

.. _Orlando Group: http://geogenetics.ku.dk/research_groups/palaeomix_group/
.. _Centre for GeoGenetics: http://geogenetics.ku.dk/

paleomix-1.2.12/docs/bam_pipeline/configuration.rst

.. highlight:: ini

.. _bam_configuration:

Configuring the BAM pipeline
============================

The BAM pipeline exposes a number of options, including the maximum number of threads used overall, the maximum number of threads used by individual programs, the location of JAR files, and more. These may be set using the corresponding command-line options (e.g. --max-threads). However, it is also possible to set default values for such options, including on a per-host basis. This is accomplished by executing the following command, which generates a configuration file at ~/.paleomix/bam_pipeline.ini:

.. code-block:: bash

    $ paleomix bam_pipeline --write-config

The resulting file contains a list of options which can be overwritten::

    [Defaults]
    max_threads = 16
    log_level = warning
    jar_root = /home/username/install/jar_root
    bwa_max_threads = 1
    progress_ui = quiet
    temp_root = /tmp/username/bam_pipeline
    jre_options =
    bowtie2_max_threads = 1
    ui_colors = on

.. note::
    Options in the configuration file correspond directly to command-line options for the BAM pipeline, with two significant differences: The leading dashes (--) are removed, and any remaining dashes are changed to underscores (_); as an example, the command-line option --max-threads becomes max_threads in the configuration file, as shown above.

These values will be used by the pipeline, unless the corresponding option is also supplied on the command-line. I.e. if "max_threads" is set to 4 in the "bam_pipeline.ini" file, but the pipeline is run using "paleomix bam_pipeline --max-threads 10", then the max threads value is set to 10.

.. note::
    If no value is given for --max-threads in the ini-file or on the command-line, then the maximum number of threads is set to the number of CPUs available on the current host.

It is furthermore possible to set specific options depending on the current host-name. Assuming that the pipeline was run on multiple servers sharing a single home folder, one might set the maximum number of threads on a per-server basis as follows::

    [Defaults]
    max_threads = 32
    [BigServer]
    max_threads = 64
    [SmallServer]
    max_threads = 16

The names used (here "BigServer" and "SmallServer") should correspond to the hostname, i.e. the value returned by the "hostname" command:

.. code-block:: bash

    $ hostname
    BigServer

Any value set in the section matching the name of the current host will take precedence over the 'Defaults' section, but can still be overridden by specifying the same option on the command-line.

paleomix-1.2.12/docs/bam_pipeline/filestructure.rst

.. highlight:: Yaml

.. _bam_filestructure:

File structure
==============

The following section explains the file structure of the BAM pipeline example project (see :ref:`examples`), which results if that project is executed::

    ExampleProject:              # Target name
      Synthetic_Sample_1:        # Sample name
        ACGATA:                  # Library 1
          Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz
          Lane_2:
            Single: 000_data/ACGATA_L2/reads.singleton.truncated.gz
            Collapsed: 000_data/ACGATA_L2/reads.collapsed.gz
            CollapsedTruncated: 000_data/ACGATA_L2/reads.collapsed.truncated.gz

        GCTCTG:                  # Library 2
          Lane_1: 000_data/GCTCTG_L1_R1_*.fastq.gz
          Lane_2:
            rCRS: 000_data/GCTCTG_L2.bam

        TGCTCA:                  # Library 3
          Options:
            SplitLanesByFilenames: no

          Lane_1: 000_data/TGCTCA_L1_R1_*.fastq.gz
          Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

Once executed, this example is expected to generate the following result files, depending on which options are enabled:

* ExampleProject.rCRS.bam
* ExampleProject.rCRS.bai
* ExampleProject.rCRS.realigned.bam
* ExampleProject.rCRS.realigned.bai
* ExampleProject.rCRS.coverage
* ExampleProject.rCRS.depths
* ExampleProject.rCRS.duphist
* ExampleProject.rCRS.mapDamage
* ExampleProject.summary

As well as a single folder containing intermediate results:

* ExampleProject/

.. warning::
    Please be aware that the internal file structure of PALEOMIX may change between major revisions (e.g. v1.1 to v1.2), but is not expected to change between minor revisions (v1.1.1 to v1.1.2). Consequently, if you wish to re-run an old project with the PALEOMIX pipeline, it is recommended to either use the same version of PALEOMIX, or to remove the folder containing intermediate files before starting (see below), in order to ensure that analyses are re-run from scratch.
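For example, if an old project is to be re-run from scratch with a newer version of PALEOMIX, the intermediate files might simply be deleted first. The following is a minimal sketch, using the 'ExampleProject' target from this section; depending on the situation, the final result files may also need to be removed:

.. code-block:: bash

    # Remove the folder of intermediate files for the target, so that
    # the pipeline redoes all analyses on the next run
    $ rm -r ExampleProject/
    $ paleomix bam_pipeline run makefile.yaml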
note:: Options in the configuration file correspond directly to command-line options for the BAM pipeline, with two significant differences: The leading dashes (--) are removed and any remaining dashes are changed to underscores (_); as an example, the command-line option --max-threads becomes max\_threads in the configuration file, as shown above. These values will be used by the pipeline, unless the corresponding option is also supplied on the command-line. I.e. if "max_threads" is set to 4 in the "bam_pipeline.ini" file, but the pipeline is run using "paleomix bam_pipeline --max-threads 10", then the max threads value is set to 10. .. note:: If no value is given for --max-threads in ini-file or on the command-line, then the maximum number of threads is set to the number of CPUs available for the current host. It is furthermore possible to set specific options depending on the current host-name. Assuming that the pipeline was run on multiple servers sharing a single home folder, one might set the maximum number of threads on a per-server basis as follows:: [Defaults] max_threads = 32 [BigServer] max_threads = 64 [SmallServer] max_threads = 16 The names used (here "BigServer" and "SmallServer") should correspond to the hostname, i.e. the value returned by the "hostname" command: .. code-block:: bash $ hostname BigServer Any value set in the section matching the name of the current host will take precedence over the 'Defaults' section, but can still be overridden by specifying the same option on the command-line.paleomix-1.2.12/docs/bam_pipeline/filestructure.rst000066400000000000000000000225721314402124200224220ustar00rootroot00000000000000.. highlight:: Yaml .. _bam_filestructure: File structure ============== The following section explains the file structure of the BAM pipeline example project (see :ref:`examples`), which results if that project is executed:: ExampleProject: # Target name Synthetic_Sample_1: # Sample name ACGATA: # Library 1 Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz Lane_2: Single: 000_data/ACGATA_L2/reads.singleton.truncated.gz Collapsed: 000_data/ACGATAr_L2/reads.collapsed.gz CollapsedTruncated: 000_data/ACGATA_L2/reads.collapsed.truncated.gz GCTCTG: # Library 2 Lane_1: 000_data/GCTCTG_L1_R1_*.fastq.gz Lane_2: rCRS: 000_data/GCTCTG_L2.bam TGCTCA: # Library 3 Options: SplitLanesByFilenames: no Lane_1: 000_data/TGCTCA_L1_R1_*.fastq.gz Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz Once executed, this example is expected to generate the following result files, depending on which options are enabled: * ExampleProject.rCRS.bam * ExampleProject.rCRS.bai * ExampleProject.rCRS.realigned.bam * ExampleProject.rCRS.realigned.bai * ExampleProject.rCRS.coverage * ExampleProject.rCRS.depths * ExampleProject.rCRS.duphist * ExampleProject.rCRS.mapDamage * ExampleProject.summary As well as a single folder containing intermediate results: * ExampleProject/ .. warning:: Please be aware that the internal file structure of PALEOMIX may change between major revisions (e.g. v1.1 to 1.2), but is not expected to change between minor revisions (v1.1.1 to v1.1.2). Consequently, if you wish to re-run an old project with the PALEOMIX pipeline, it is recommended to either use the same version of PALEOMIX, or remove the folder containing intermediate files before starting (see below), in order to ensure that analyses are re-run from scratch. 
Primary results --------------- These files are the main results generated by the PALEOMIX pipeline: **ExampleProject.rCRS.bam** and **ExampleProject.rCRS.bai** Final BAM file, which has not realigned using the GATK Indel Realigner, and it's index file (.bai), created using the "samtools index". If rescaling has been enabled, this BAM will contain reads processed by mapDamage. **ExampleProject.rCRS.realigned.bam** and **ExampleProject.rCRS.realigned.bai** BAM file realigned using the GATK Indel Realigner, and it's index file (.bai), created using the "samtools index". If rescaling has been enabled, this BAM will contain reads processed by mapDamage. **ExampleProject.rCRS.mapDamage/** Per-library analyses generated using mapDamage2.0. If rescaling is enabled, these folders also includes the model files generated for each library. See the `mapDamage2.0 documentation`_ for a description of these files. **ExampleProject.rCRS.coverage** Coverage statistics generated using the 'paleomix coverage' command. These include per sample, per library and per contig / chromosome breakdowns. **ExampleProject.rCRS.depths** Depth-histogram generated using 'paleomix depths' commands. As with the coverage, this information is broken down by sample, library, and contig / chromosome. **ExampleProject.rCRS.duphist** Per-library histograms of PCR duplicates; for use with *`preseq`_* [Daley2013]_ to estimate the remaining molecular complexity of these libraries. Please refer to the original PALEOMIX publication [Schubert2014]_ for more information. **ExampleProject.summary** A summary table, which is created for each target if enabled in the makefile. This table contains contains a summary of the project, including the number / types of reads processed, average coverage, and other statistics broken down by prefix, sample, and library. .. warning:: Some statistics will missing if pre-trimmed reads are included in the makefile, since PALEOMIX relies on the output from the adapter trimming software to collect these values. Intermediate results -------------------- Internally, the BAM pipeline uses a simple file structure which corresponds to the visual structure of the makefile. For each target (in this case "ExampleProject") a folder of the same name is created in the directory in which the makefile is executed. This folder contains a folder containing the trimmed / collapsed reads, and a folder for each prefix (in this case, only "rCRS"), as well as some additional files used in certain analytical steps (see below): .. code-block:: bash $ ls ExampleProject/ reads/ rCRS/ [...] Trimmed reads ^^^^^^^^^^^^^ Each of these folders in turn contains a directory structure that corresponds to the names of the samples, libraries, and lanes, shown here for Lane_1 in library ACGATA. If the option "SplitLanesByFilenames" is enabled (as shown here), several numbered folders may be created for each lane, using a 3-digit postfix: .. code-block:: bash ExampleProject/ reads/ Synthetic_Sample_1/ ACGATA/ Lane_1_001/ Lane_1_002/ Lane_1_003/ [...] The contents of the lane folders contains the output of AdapterRemoval, with most filenames corresponding to the read-types listed in the makefile under the option "ExcludeReads": .. 
Mapped reads (BAM format)
^^^^^^^^^^^^^^^^^^^^^^^^^

The file-structure used for mapped reads is similar to that described for the trimmed reads, but includes a larger number of files. Using lane "Lane_1" of library "ACGATA" as an example, the following files are created in each folder for that lane, with each type of reads represented (collapsed, collapsedtruncated, paired, and single) depending on the lane type (SE or PE):

.. code-block:: bash

    $ ls ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1_001/
    collapsed.bai        # Index file used for accessing the .bam file
    collapsed.bam        # The mapped reads in BAM format
    collapsed.coverage   # Coverage statistics
    collapsed.validated  # Log-file from Picard ValidateSamFile, marking that the .bam file has been validated
    [...]

For each library, two sets of files are created in the folder corresponding to the sample; these correspond to the way in which duplicates are filtered, with one method for "normal" reads (paired and single-ended reads), and one method for "collapsed" reads (taking advantage of the fact that both external coordinates of the mapping are informative). Note, however, that "collapsedtruncated" reads are included among normal reads, as at least one of the external coordinates is unreliable for these. Thus, the following files are observed:

.. code-block:: bash

    ExampleProject/
        rCRS/
            Synthetic_Sample_1/
                ACGATA.duplications_checked
                ACGATA.rmdup.*.bai
                ACGATA.rmdup.*.bam
                ACGATA.rmdup.*.coverage
                ACGATA.rmdup.*.validated

With the exception of the "duplicates_checked" file, these correspond to the files created in the lane folder. The "duplicates_checked" file marks the successful completion of a validation step which attempts to detect data duplication due to the inclusion of the same reads / files multiple times (not PCR duplicates!).

If rescaling is enabled, a set of files is created for each library, containing the BAM file generated using the mapDamage2.0 quality rescaling functionality, but otherwise corresponding to the files described above:

.. code-block:: bash

    ExampleProject/
        rCRS/
            Synthetic_Sample_1/
                ACGATA.rescaled.bai
                ACGATA.rescaled.bam
                ACGATA.rescaled.coverage
                ACGATA.rescaled.validated
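These per-library BAMs can be inspected using standard tools. For example (a sketch, assuming that samtools is installed, and assuming that a duplicate-filtered BAM for collapsed reads exists at the path shown; the exact wildcard part of the filename depends on the read types present):

.. code-block:: bash

    # Print basic alignment statistics for one of the per-library BAMs
    $ samtools flagstat ExampleProject/rCRS/Synthetic_Sample_1/ACGATA.rmdup.collapsed.bam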
Finally, the resulting BAMs for each library (rescaled or not) are merged (optionally using GATK) and validated. This results in the creation of the following files in the target folder:

.. code-block:: bash

    ExampleProject/
        rCRS.validated              # Signifies that the "raw" BAM has been validated
        rCRS.realigned.validated    # Signifies that the realigned BAM has been validated
        rCRS.intervals              # Intervals selected by the GATK IndelRealigner during training
        rCRS.duplications_checked   # Similar to above, but catches duplicates across samples / libraries

.. _mapDamage2.0 documentation: http://ginolhac.github.io/mapDamage/\#a7
.. _preseq: http://smithlabresearch.org/software/preseq/

paleomix-1.2.12/docs/bam_pipeline/index.rst

.. _bam_pipeline:

BAM Pipeline
============

**Table of Contents:**

.. toctree::

   overview.rst
   requirements.rst
   configuration.rst
   usage.rst
   makefile.rst
   filestructure.rst

The BAM Pipeline is a pipeline designed for the processing of demultiplexed high-throughput sequencing (HTS) data, primarily that generated by Illumina high-throughput sequencing platforms. The pipeline carries out trimming of adapter sequences, filtering of low-quality reads, merging of overlapping mate-pairs to reduce the error rate, mapping of reads against one or more reference genomes / sequences, filtering of PCR duplicates, analyses of / correction for post-mortem DNA damage, estimation of average coverages and depth-of-coverage histograms, and more.

To ensure the correctness of the results, the pipeline invokes frequent validation of intermediate results (*e.g.* using Picard Tools ValidateSamFile.jar), and attempts to detect common errors in input files. To allow tailoring of the process to the needs of individual projects, many features may be disabled, and the behavior of most programs can be tweaked to suit the specifics of a given project, down to and including only carrying out trimming of FASTQ reads, to facilitate use in other contexts.

paleomix-1.2.12/docs/bam_pipeline/makefile.rst

.. highlight:: YAML

.. _bam_makefile:

Makefile description
====================

.. contents::

The following sections review the options available in the BAM pipeline makefiles. As described in the :ref:`bam_usage` section, a default makefile may be generated using the 'paleomix bam\_pipeline makefile' command. For clarity, the location of options in subsections is specified by concatenating the names using '\:\:' as a separator. Thus, in the following (simplified) example, the 'UseSeed' option (line 13) would be referred to as 'Options \:\: Aligners \:\: BWA \:\: UseSeed':

.. code-block:: yaml
    :emphasize-lines: 13
    :linenos:

    Options:
      # Settings for aligners supported by the pipeline
      Aligners:
        # Choice of aligner software to use, either "BWA" or "Bowtie2"
        Program: BWA

        # Settings for mappings performed using BWA
        BWA:
          # May be disabled ("no") for aDNA alignments, as post-mortem damage
          # localizes to the seed region, which BWA expects to have few
          # errors (sets "-l"). See http://pmid.us/22574660
          UseSeed: yes

Specifying command-line options
-------------------------------

For several programs it is possible to directly specify command-line options; this is accomplished in one of 3 ways. Firstly, for command-line options that take a single value, this is accomplished simply by specifying the option and value as any other option.
For example, if we wish to supply the option --mm 5 to AdapterRemoval, then we would list it as "--mm: 5" (all other options omitted for brevity)::

    AdapterRemoval:
      --mm: 5

For options that do not take any values, such as the AdapterRemoval '--trimns' (enabling the trimming of Ns in the reads), these are specified either as "--trimns: ", with the value left blank, or as "--trimns: yes". The following are therefore equivalent::

    AdapterRemoval:
      --trimns:      # Method 1
      --trimns: yes  # Method 2

In some cases the BAM pipeline will enable features by default, but still allow these to be overridden. In those cases, the feature can be disabled by setting the value to 'no' (without quotes), as shown here::

    AdapterRemoval:
      --trimns: no

If you need to provide the text "yes" or "no" as the value for an option, it is necessary to put these in quotes::

    --my-option: "yes"
    --my-option: "no"

In some cases it is possible or even necessary to specify an option multiple times. Due to the way YAML works, this is not possible to do directly. Instead, the pipeline allows multiple instances of the same option by providing these as a list::

    --my-option:
      - "yes"
      - "no"
      - "maybe"

The above will be translated into calling the program in question with the options "--my-option yes --my-option no --my-option maybe".

Options section
---------------

By default, the 'Options' section of the makefile contains the following:

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:

Options: General
^^^^^^^^^^^^^^^^

**Options \:\: Platform**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 7
    :lines: 7-8

The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The `SAM/BAM specification`_ lists the valid platforms, which currently include 'CAPILLARY', 'HELICOS', 'ILLUMINA', 'IONTORRENT', 'LS454', 'ONT', 'PACBIO', and 'SOLID'.

**Options \:\: QualityOffset**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 9
    :lines: 9-13

The QualityOffset option refers to the starting ASCII value used to encode `Phred quality-scores`_ in user-provided FASTQ files, with the possible values of 33, 64, and 'Solexa'. For most modern data, this will be 33, corresponding to ASCII characters in the range '!' to 'J'. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range '@' to 'h', and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ';' to 'h'. For a visual representation of this, refer to the Wikipedia article linked above.
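One quick way to sanity-check the encoding is simply to look at the quality lines of a few reads (a sketch; 'reads_R1.fastq.gz' is a placeholder for one of your FASTQ files):

.. code-block:: bash

    # Print the first two FASTQ records; every 4th line contains the quality scores
    $ gunzip -c reads_R1.fastq.gz | head -n 8

If the quality strings contain characters such as '!' to '9', the data is almost certainly encoded with offset 33; quality strings consisting mostly of characters in the range '@' to 'h' suggest offset 64.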
**Options \:\: SplitLanesByFilenames**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 14
    :lines: 14-17

This option influences how the BAM pipeline handles lanes that include multiple files. By default (corresponding to a value of 'yes'), the pipeline will process individual files in parallel, potentially allowing for greater throughput. If set to 'no', all files in a lane are merged during processing, resulting in a single set of trimmed reads per lane. The only effect of this option on the final result is a greater number of read-groups specified in the final BAM files. See the :ref:`bam_filestructure` section for more details on how this is handled.

.. warning::
    This option is deprecated, and will be removed in future versions of PALEOMIX.

**Options \:\: CompressionFormat**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 18
    :lines: 18-19

This option determines which type of compression is carried out on trimmed FASTQ reads; if set to 'gz', reads are gzip compressed, and if set to 'bz2', reads are compressed using bzip2. This option has no effect on the final results, but may be used to trade off space (gz) for some additional runtime (bz2).

.. warning::
    This option is deprecated, and may be removed in future versions of PALEOMIX.

**Options \:\: PCRDuplicates**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 72
    :lines: 72-79

This option determines how the BAM pipeline handles PCR duplicates following the mapping of trimmed reads. At present, three options are available. The first option is 'filter', which corresponds to running Picard MarkDuplicates and 'paleomix rmdup_collapsed' on the input files, and removing any read determined to be a PCR duplicate. The second option, 'mark', functions like the 'filter' option, except that reads are not removed from the output; instead, the read flag is marked using the 0x400 bit (see the `SAM/BAM specification`_ for more information), in order to allow downstream tools to identify these as duplicates. The final option is 'no' (without quotes), in which case no PCR duplicate detection / filtering is carried out on the aligned reads, which is useful for data generated using amplification-free sequencing.
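When using the 'mark' option, the number of reads flagged as duplicates can be counted with samtools (a sketch, assuming that samtools is installed; the BAM filename is an example):

.. code-block:: bash

    # Count reads with the PCR / optical duplicate flag (0x400 = 1024) set
    $ samtools view -c -f 1024 ExampleProject.rCRS.bam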
Options: Adapter Trimming
^^^^^^^^^^^^^^^^^^^^^^^^^

The "AdapterRemoval" subsection allows for options that are applied when AdapterRemoval is applied to the FASTQ reads supplied by the user. For a more detailed description of command-line options, please refer to the `AdapterRemoval documentation`_. A few particularly important options are described here:

**Options \:\: AdapterRemoval \:\: --adapter1** and **Options \:\: AdapterRemoval \:\: --adapter2**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 23
    :lines: 23-25

These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination. Thus, the sequence provided for --adapter1 is expected to be found in the mate 1 reads, and the sequence specified for --adapter2 is expected to be found in the mate 2 reads. In both cases, these should be specified in the orientation in which they appear in these files (i.e. it should be possible to grep the files for these, assuming that the reads were long enough, and treating Ns as wildcards). It is very important that these be specified correctly. Please refer to the `AdapterRemoval documentation`_ for more information.

.. note::
    As of AdapterRemoval version 2.1, it is possible to use multiple threads to speed up trimming of adapter sequences. This is accomplished not by setting the --threads command-line option in the makefile, but by supplying the --adapterremoval-max-threads option to the BAM pipeline itself:

    .. code-block:: bash

        $ paleomix bam_pipeline run makefile.yaml --adapterremoval-max-threads 2

.. warning::
    Older versions of PALEOMIX may use the --pcr1 and --pcr2 options instead of --adapter1 and --adapter2; for new projects, using --adapter1 and --adapter2 is strongly recommended, due to the simpler semantics (described above). If your project uses the --pcr1 and --pcr2 options, then refer to the `AdapterRemoval documentation`_ for how to proceed!
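If you are unsure which adapters are present in a paired-end dataset, recent versions of AdapterRemoval can attempt to infer them from read overlaps. The following is a sketch, assuming that AdapterRemoval v2.x is installed; the filenames are placeholders:

.. code-block:: bash

    # Attempt to reconstruct adapter sequences from overlapping mate pairs
    $ AdapterRemoval --identify-adapters \
        --file1 reads_R1.fastq.gz \
        --file2 reads_R2.fastq.gz

The consensus sequences printed by this command can then be compared with the --adapter1 / --adapter2 values in the makefile.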
**Options \:\: AdapterRemoval \:\: --mm**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 28
    :lines: 28

Sets the fraction of mismatches allowed when aligning reads / adapter sequences. If the specified value (MM) is greater than 1, this is calculated as 1 / MM, otherwise the value is used directly. To set, replace the default value as desired::

    --mm: 3    # Maximum mismatch rate of 1 / 3
    --mm: 5    # Maximum mismatch rate of 1 / 5
    --mm: 0.2  # Maximum mismatch rate of 1 / 5

**Options \:\: AdapterRemoval \:\: --minlength**

The minimum length required after read merging, adapter trimming, and base-quality trimming; resulting reads shorter than this length are discarded, and thereby excluded from further analyses by the pipeline. A value of at least 25 bp is recommended to cut down on the rate of spurious alignments; if possible, a value of 30 bp may be used to greatly reduce the fraction of spurious alignments, with smaller gains for greater minimums [Schubert2012]_.

.. warning::
    The default value used by PALEOMIX for '--minlength' (25 bp) differs from the default value for AdapterRemoval (15 bp). Thus, if a minimum length of 15 bp is desired, it is necessary to explicitly state so in the makefile; simply commenting out this command-line argument is not sufficient.

**Options \:\: AdapterRemoval \:\: --collapse**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 31
    :lines: 31

If enabled, AdapterRemoval will attempt to combine overlapping paired-end reads into a single (potentially longer) sequence. This has at least two advantages, namely that longer reads allow for less ambiguous alignments against the target reference genome, and that the fidelity of the overlapping region (potentially the entire read) is improved by selecting the highest quality base when discrepancies are observed. The names of reads thus merged are prefixed with either 'M\_' or 'MT\_', with the latter marking reads that have been trimmed from the 5' or 3' termini following collapse, and which therefore do not represent the full insert. To disable this behavior, set the option to 'no' (without quotes)::

    --collapse: yes  # Option enabled
    --collapse: no   # Option disabled

.. note::
    This option may be combined with the 'ExcludeReads' option (see below), to either eliminate or select for short inserts, depending on the expectations from the experiment. I.e. for ancient samples, where most inserts should be short enough to allow collapsing (< 2x read length - 11, by default), excluding paired (uncollapsed) and singleton reads may help reduce the fraction of exogenous DNA mapped.

**Options \:\: AdapterRemoval \:\: --trimns**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 32
    :lines: 32

If set to 'yes' (without quotes), AdapterRemoval will trim uncalled bases ('N') from the 5' and 3' ends of the reads. Trimming will stop at the first called base ('A', 'C', 'G', or 'T'). If both --trimns and --trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' ends of the reads. To disable, set the option to 'no' (without quotes)::

    --trimns: yes  # Option enabled
    --trimns: no   # Option disabled

**Options \:\: AdapterRemoval \:\: --trimqualities**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 33
    :lines: 33

If set to 'yes' (without quotes), AdapterRemoval will trim low-quality bases from the 5' and 3' ends of the reads. Trimming will stop at the first base whose quality is greater than the (Phred encoded) minimum quality score specified using the command-line option --minquality; this value defaults to 2. If both --trimns and --trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' ends of the reads. To disable, set the option to 'no' (without quotes)::

    --trimqualities: yes  # Option enabled
    --trimqualities: no   # Option disabled

Options: Short read aligners
^^^^^^^^^^^^^^^^^^^^^^^^^^^^

This section allows selection between the supported short read aligners (currently BWA [Li2009a]_ and Bowtie2 [Langmead2012]_), as well as setting options for these, individually:

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 35
    :lines: 35-39

To select a mapping program, set the 'Program' option appropriately::

    Program: BWA      # Using BWA to map reads
    Program: Bowtie2  # Using Bowtie2 to map reads
Options: Short read aligners - BWA
""""""""""""""""""""""""""""""""""

The following options are applied only when running the BWA short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 40
    :lines: 40-54

**Options \:\: Aligners \:\: BWA \:\: Algorithm**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 42
    :lines: 42-44

The mapping algorithm to use; options are 'backtrack' (corresponding to 'bwa aln'), 'bwasw', and 'mem'. Additional command-line options may be specified for these. Algorithms are selected as follows::

    Algorithm: backtrack  # 'Backtrack' algorithm, using the command 'bwa aln'
    Algorithm: bwasw      # 'SW' algorithm for long queries, using the command 'bwa bwasw'
    Algorithm: mem        # 'mem' algorithm, using the command 'bwa mem'

.. warning::
    The alignment algorithms 'bwasw' and 'mem' currently cannot be used with input data that is encoded using QualityOffset 64 or 'Solexa'. This is a limitation of PALEOMIX, and will be resolved in future versions. In the mean time, this can be circumvented by converting FASTQ reads to the standard quality-offset 33, using for example `seqtk`_.
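A minimal sketch of such a conversion using seqtk (assuming that seqtk is installed; the input / output filenames are placeholders):

.. code-block:: bash

    # Convert offset-64 quality scores to the standard offset 33
    $ seqtk seq -Q64 -V reads_offset64.fastq.gz | gzip > reads_offset33.fastq.gz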
**Options \:\: Aligners \:\: BWA \:\: MinQuality**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 45
    :lines: 45-46

Specifies the minimum mapping quality of alignments produced by BWA. Any aligned read with a quality score below this value is removed during the mapping process. Note that while unmapped reads have a quality of zero, these are not excluded by a non-zero 'MinQuality' value. To filter unmapped reads, use the option 'FilterUnmappedReads' (see below). To set this option, replace the default value with the desired minimum::

    MinQuality: 0   # Keep all hits
    MinQuality: 25  # Keep only hits where mapping-quality >= 25

**Options \:\: Aligners \:\: BWA \:\: FilterUnmappedReads**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 47
    :lines: 47-48

Specifies whether or not unmapped reads (reads not aligned to a target sequence) are to be retained in the resulting BAM files. If set to 'yes' (without quotes), all unmapped reads are discarded during the mapping process, while setting the option to 'no' (without quotes) retains these reads in the BAM. By convention, paired reads in which one mate is unmapped are assigned the same chromosome and position, while no chromosome / position is assigned to unmapped single-end reads. To change this setting, replace the value with either 'yes' or 'no' (without quotes)::

    FilterUnmappedReads: yes  # Remove unmapped reads during alignment
    FilterUnmappedReads: no   # Keep unmapped reads

**Options \:\: Aligners \:\: BWA \:\: \***

Additional command-line options may be specified for the selected alignment algorithm, as described in the "Specifying command-line options" section above. See also the examples listed for Bowtie2 below. Note that for the 'backtrack' algorithm, it is only possible to specify options for the 'bwa aln' call.

Options: Short read aligners - Bowtie2
""""""""""""""""""""""""""""""""""""""

The following options are applied only when running the Bowtie2 short read aligner; see the section "Options: Short read aligners" above for how to select this aligner.

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 56
    :lines: 56-70

**Options \:\: Aligners \:\: Bowtie2 \:\: MinQuality**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 58
    :lines: 58-59

See 'Options \:\: Aligners \:\: BWA \:\: MinQuality' above.

**Options \:\: Aligners \:\: Bowtie2 \:\: FilterUnmappedReads**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 60
    :lines: 60-61

See 'Options \:\: Aligners \:\: BWA \:\: FilterUnmappedReads' above.

**Options \:\: Aligners \:\: Bowtie2 \:\: \***

Additional command-line options may be specified for Bowtie2, as described in the "Specifying command-line options" section above. Please refer to the `Bowtie2 documentation`_ for more information about available command-line options.
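As an illustration (a sketch only; whether these particular settings are appropriate depends on the project), a Bowtie2 preset and an N-penalty might be supplied as follows::

    Bowtie2:
      # Use the (slower) very-sensitive end-to-end preset
      --very-sensitive:
      # Do not penalize positions where the read or reference contains an N
      --np: 0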
Options: mapDamage plots and rescaling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 80
    :lines: 80-86

This subsection is used to specify options for mapDamage2.0, when plotting *post-mortem* DNA damage, when building models of the *post-mortem* damage, and when rescaling quality scores to account for this damage. In order to enable plotting, modeling, or rescaling of quality scores, please see the 'mapDamage' option in the 'Features' section below.

.. note::
    It may be worthwhile to tweak mapDamage parameters before building a model of *post-mortem* DNA damage; this may be accomplished by running the pipeline without rescaling, with the 'mapDamage' feature set to 'plot' (with or without quotes), inspecting the plots generated per-library, and then tweaking parameters as appropriate, before setting the 'mapDamage' feature to 'model' (with or without quotes). Disabling the construction of the final BAMs may be accomplished by setting the features 'RawBam' and 'RealignedBAM' to 'no' (without quotes) in the 'Features' section (see below), and then setting the desired option to yes again after enabling rescaling and adding the desired options to the mapDamage section. Should you wish to change the modeling and rescaling parameters after having already run the pipeline with rescaling enabled, simply remove the mapDamage files generated for the relevant libraries (see the :ref:`bam_filestructure` section).

.. warning::
    Rescaling requires a certain minimum number of C>T and G>A substitutions before it is possible to construct a model of *post-mortem* DNA damage. If mapDamage fails with an error indicating that "DNA damage levels are too low", then it is necessary to disable rescaling for that library to continue.

**Options \:\: mapDamage \:\: --downsample**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 84
    :lines: 84-86

By default, the BAM pipeline only samples 100k reads for use in constructing mapDamage plots; in our experience, this is sufficient for accurate plots and models. If no downsampling is to be done, this value can be set to 0 to disable this feature::

    --downsample: 100000   # Sample 100 thousand reads
    --downsample: 1000000  # Sample 1 million reads
    --downsample: 0        # No downsampling

**Options \:\: mapDamage \:\: \***

Additional command-line options may be supplied to mapDamage, just like the '--downsample' parameter, as described in the "Specifying command-line options" section above. These are used during plotting and rescaling (if enabled).
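Should the plots or models need to be re-generated after changing these options, the corresponding mapDamage output can be removed before re-running the pipeline. The following is a sketch only; the exact per-library paths depend on the project (see the :ref:`bam_filestructure` section):

.. code-block:: bash

    # Remove existing mapDamage results for one library, then re-run the pipeline
    $ rm -r ExampleProject.rCRS.mapDamage/ACGATA
    $ paleomix bam_pipeline run makefile.yaml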
Options: Excluding read-types
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 88
    :lines: 88-102

During the adapter-trimming and read-merging step, AdapterRemoval will generate a selection of different read types. This option allows certain read-types to be excluded from further analyses. In particular, it may be useful to exclude non-collapsed (paired and singleton) reads when processing (ancient) DNA for which only short inserts are expected, since this may help exclude exogenous DNA. The following read types are currently recognized:

*Single*
    Single-end reads; these are the (trimmed) reads generated from supplying single-end FASTQ files to the pipeline.

*Paired*
    Paired-end reads; these are the (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only the subset of paired reads for which *both* mates were retained, and which were not merged into a single read (if --collapse is set for AdapterRemoval).

*Singleton*
    Paired-end reads; these are (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only those reads in which one of the two mates was discarded due to either the '--maxns', the '--minlength', or the '--maxlength' options supplied to AdapterRemoval. Consequently, these reads are mapped and PCR-duplicate filtered in single-end mode.

*Collapsed*
    Paired-end reads, for which the sequences overlapped, and which were consequently merged by AdapterRemoval into a single sequence (enabled by the --collapse command-line option). These sequences are expected to represent the complete insert, and while they are mapped in single-end mode, PCR duplicate filtering is carried out in a manner that treats these as paired reads. Note that all collapsed reads are tagged by prefixing the read name with 'M\_'.

*CollapsedTruncated*
    Paired-end reads (like *Collapsed*), which were trimmed due to the '--trimqualities' or the '--trimns' command-line options supplied to AdapterRemoval. Consequently, and as these sequences do not represent the entire insert, these reads are mapped and PCR-duplicate filtered in single-end mode. Note that all collapsed, truncated reads are tagged by prefixing the read name with 'MT\_'.

To enable / disable exclusion of a read type, set the value for the appropriate type to 'yes' or 'no' (without quotes)::

    Singleton: no   # Singleton reads are NOT excluded
    Singleton: yes  # Singleton reads are excluded
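For example, a setup aimed at short-insert ancient DNA might retain only reads derived from merged pairs. This is a sketch only; whether it is appropriate depends on the expected insert sizes of the experiment::

    ExcludeReads:
      Single: no
      Paired: yes              # Exclude intact (non-collapsed) pairs
      Singleton: yes           # Exclude pairs where one mate was discarded
      Collapsed: no
      CollapsedTruncated: no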
Options: Optional features
^^^^^^^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 104
    :lines: 104-127

This section lists several optional features, in particular determining which BAM files and which summary statistics are generated when running the pipeline. Currently, the following options are available:

*RawBAM*
    If enabled, the pipeline will generate a final BAM, which is NOT processed using the GATK Indel Realigner (see below), following all other processing steps.

*RealignedBAM*
    If enabled, the pipeline will generate a final BAM, which is processed using the GATK Indel Realigner [McKenna2010]_, in order to improve the alignment near indels, by performing a multiple sequence alignment in regions containing putative indels.

*mapDamage*
    The 'mapDamage' option accepts four possible values: 'no', 'plot', 'model', and 'rescale'. The default value ('plot') will cause mapDamage to be run in order to generate simple plots of the *post-mortem* DNA damage rates, as well as base composition plots, and more. If set to 'model', mapDamage will firstly generate the plots described for 'plot', but also construct models of DNA damage parameters, as described in [Jonsson2013]_. Note that a minimum amount of DNA damage is required to be present in order to build these models. If the option is set to 'rescale', both plots and models will be constructed using mapDamage, and in addition, the quality scores of bases will be down-graded based on how likely they are to represent *post-mortem* DNA damage (see above).

*Coverage*
    If enabled, a table summarizing the number of hits, the number of aligned bases, bases inserted, and bases deleted, as well as the mean coverage, is generated for each reference sequence, stratified by sample, library, and contig.

*Depths*
    If enabled, a table containing a histogram of the depth of coverage, ranging from 0 to 200, is generated for each reference sequence, stratified by sample, library, and contig. These files may further be used by the Phylogenetic pipeline, in order to automatically select a maximum read depth during SNP calling (see the :ref:`phylo_usage` section for more information).

*Summary*
    If enabled, a single summary table will be generated per target, containing information about the number of reads processed, hits and fraction of PCR duplicates (per prefix and per library), and much more.

*DuplicateHist*
    If enabled, a histogram of the estimated number of PCR duplicates observed per DNA fragment is generated per library. This may be used with the 'preseq' program in order to estimate the (remaining) complexity of a given library, and thereby direct future sequencing efforts [Daley2013]_.

For a description of where files are placed, refer to the :ref:`bam_filestructure` section. It is possible to run the BAM pipeline without any of these options enabled, and this may be useful in certain cases (if only the statistics or per-library BAMs are needed). To enable / disable a feature, set the value for that feature to 'yes' or 'no' (without quotes)::

    Summary: no   # Do NOT generate a per-target summary table
    Summary: yes  # Generate a per-target summary table

Prefixes section
----------------

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 129
    :lines: 129-149

Reference genomes used for mapping are specified by listing these (one or more) in the 'Prefixes' section. Each reference genome is associated with a name (used in summary statistics and as part of the resulting filenames), and the path to a FASTA file which contains the reference genome.
Several other options are also available, but only the name and the 'Path' value are required, as shown here for several examples::

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, and optional label, and an option
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      MyPrefix1:
        # Path to FASTA file containing reference genome; must end with '.fasta'
        Path: /path/to/genomes/file_1.fasta
      MyPrefix2:
        Path: /path/to/genomes/file_2.fasta
      MyPrefix3:
        Path: /path/to/genomes/AE008922_1.fasta

Each sample in the makefile is mapped against each prefix, and BAM files are generated according to the enabled 'Features' (see above). In addition to the path, two other options are available per prefix, namely 'Label' and 'RegionsOfInterest', which are described below.

.. warning::
    FASTA files used in the BAM pipeline *must* be named with a .fasta file extension. Furthermore, if alignments are to be carried out against the human nuclear genome, chromosomes MUST be ordered by their number for GATK to work! See the `GATK FAQ`_ for more information.

Regions of interest
^^^^^^^^^^^^^^^^^^^

It is possible to specify one or more "regions of interest" for a particular reference genome. Doing so results in coverage and depth tables being generated for those regions (if these features are enabled, see above), as well as additional information in the summary table (if enabled, see above).

Such regions are specified using a BED file containing one or more regions; in particular, the first three columns (name, 0-based start coordinate, and 1-based end coordinate) are required, with the 4th column (the name) being optional. Strand information (the 6th column) is not used, but must still be valid according to the BED format.

If these regions are named, statistics are merged by these names (essentially treating them as pseudo contigs), while unnamed regions are merged by contig. Thus, it is important to ensure that names are unique if statistics are desired for every single region, individually.

Specifying regions of interest is accomplished by providing a name and a path for each set of regions of interest under the 'RegionsOfInterest' section for a given prefix::

    # Produce additional coverage / depth statistics for a set of
    # regions defined in a BED file; if no names are specified for the
    # BED records, results are named after the chromosome / contig.
    RegionsOfInterest:
      MyRegions: /path/to/my_regions.bed
      MyOtherRegions: /path/to/my_other_regions.bed

The following is a simple example of such a BED file, for an alignment against the rCRS (`NC_012920.1`_)::

    NC_012920_1 3306 4262 region_a
    NC_012920_1 4469 5510 region_b
    NC_012920_1 5903 7442 region_a

In this case, the resulting tables will contain information about two different regions, namely region_a (2495 bp, resulting from merging the two individual regions specified) and region_b (1041 bp). The order of lines in this file does not matter.

Adding multiple prefixes
^^^^^^^^^^^^^^^^^^^^^^^^

In cases where it is necessary to map samples against a large number of reference genomes, it may become impractical to add these to the makefile by hand. To allow such use-cases, it is possible to specify the location of the reference genomes via a path containing wild-cards, letting the BAM pipeline collect these automatically. For the following example, we assume that we have a folder '/path/to/genomes', which contains our reference genomes:

.. code-block:: bash

    $ ls /path/to/genomes
    AE000516_2.fasta
    AE004091_2.fasta
    AE008922_1.fasta
    AE008923_1.fasta

To automatically add these (4) reference genomes to the makefile, we would add a prefix as follows::

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, and optional label, and an option
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      MyGenomes*:
        # Path to .fasta file containing a set of reference sequences.
        Path: /path/to/genomes/*.fasta

There are two components to this, namely the name of the pseudo-prefix, which *must* end with a star (\*), and the path, which may contain one or more wild-cards. If the prefix name does not end with a star, the BAM pipeline will simply treat the path as a regular path. In this particular case, the BAM pipeline will perform the equivalent of 'ls /path/to/genomes/\*.fasta', and then add each file it has located, using the filename without extensions as the name of the prefix. In other words, the above is equivalent to the following::

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, and optional label, and an option
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      AE000516_2:
        Path: /path/to/genomes/AE000516_2.fasta
      AE004091_2:
        Path: /path/to/genomes/AE004091_2.fasta
      AE008922_1:
        Path: /path/to/genomes/AE008922_1.fasta
      AE008923_1:
        Path: /path/to/genomes/AE008923_1.fasta

A makefile including such prefixes is executed as any other makefile.

.. note::
    The name provided for the pseudo-prefix (here 'MyGenomes') is not used by the pipeline, and can instead be used to document the nature of the files being included.

.. warning::
    Just like regular prefixes, it is required that the filename of the reference genome ends with '.fasta'. However, the pipeline will attempt to add *any* file found using the provided path with wildcards, and care should therefore be taken to avoid including non-FASTA files. For example, if the path '/path/to/genomes/\*' was used instead of '/path/to/genomes/\*.fasta', this would cause the pipeline to abort due to the inclusion of (for example) non-FASTA index files generated at this location by the pipeline itself.

Prefix labels
^^^^^^^^^^^^^

.. code-block:: yaml

    Prefixes:
      # Uncomment and replace 'NAME_OF_PREFIX' with name of the prefix; this name
      # is used in summary statistics and as part of output filenames.
      # NAME_OF_PREFIX:
        # ...
        # (Optional) Uncomment and replace 'LABEL' with one of 'nuclear',
        # 'mitochondrial', 'chloroplast', 'plasmid', 'bacterial', or 'viral'.
        # Label: LABEL

The label option for prefixes allows a prefix to be classified according to one of several categories, currently including 'nuclear', 'mitochondrial', 'chloroplast', 'plasmid', 'bacterial', and 'viral'. This is only used when generating the .summary files (if the 'Summary' feature is enabled), in which case the label is used instead of the prefix name, and the results for prefixes with the same label are combined.

.. warning::
    Labels are deprecated, and will either be removed in future versions of PALEOMIX, or significantly changed.

Targets section
---------------

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 152
    :lines: 152-

In the BAM pipeline, the term 'Target' is used to refer not to a particular sample (though in typical usage a target includes just one sample), but rather to one or more samples to be processed together to generate a BAM file per prefix (see above). A sample included in a target may likewise contain one or more libraries, for each of which one or more sets of FASTQ reads are specified.

The following simplified example, derived from the makefile constructed as part of the :ref:`bam_usage` section, exemplifies this:

.. code-block:: yaml
    :linenos:

    # Target name; all output files use this name as a prefix
    MyFilename:
      # Sample name; used to tag data for segregation in downstream analyses
      MySample:
        # Library name; used to tag data for segregation in downstream analyses
        TGCTCA:
          # Lane / run names and paths to FASTQ files
          Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
          Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

*Target name*
    The first top section of this target (line 1, 'MyFilename') constitutes the target name. This name is used as part of summary statistics and, more importantly, determines the first part of the name of files generated as part of the processing of data specified for this target. Thus, in this example all files and folders generated during the processing of this target will start with 'MyFilename'; for example, the summary table normally generated from running the pipeline will be placed in a file named 'MyFilename.summary'.

*Sample name*
    The subsections listed in the 'Target' section (line 2, 'MySample') constitute the (biological) samples included in this target; in the vast majority of analyses, you will have only a single sample per target, and in that case it is considered good practice to use the same name for both the target and the sample. A single target can, however, contain any number of samples, the data for which are tagged according to the names given in the makefile, using the SAM/BAM readgroup ('RG') tags.

*Library name*
    The subsections listed in the 'Sample' section (line 3, 'TGCTCA') constitute the sequencing libraries constructed during the extraction and library building for the current sample. For modern samples, there is typically only a single library per sample, but more complex sequencing projects (modern and ancient) may involve any number of libraries constructed from one or more extracts. It is very important that libraries be listed correctly (see below).

    .. warning::
        Note that the BAM pipeline imposes the restriction that each library name specified for a target must be unique, even if these are located in two different samples. This restriction may be removed in future versions of PALEOMIX.

*Lane name*
    The subsections of each library are used to specify the names of the individual lanes / runs, and the location of the corresponding input data (typically paths to FASTQ files, optionally containing wild-cards).

In addition to these target (sub)sections, it is possible to specify 'Options' for individual targets, samples, and libraries, similarly to how this is done globally at the top of the makefile. This is described below.

.. warning::
    It is very important that lanes are assigned to their corresponding libraries in the makefile; while it is possible to simply record every sequencing run / lane under a single library and run the pipeline like that, this will result in several unintended side effects: Firstly, the BAM pipeline uses the library information to ensure that PCR duplicates are filtered correctly. Wrongly grouping lanes together will result in the loss of sequences which are not, in fact, PCR duplicates, while wrongly splitting a library into multiple entries will result in PCR duplicates not being correctly identified across these. Furthermore, GATK and mapDamage analyses make use of this information to carry out various analyses on a per-library basis, which may similarly be negatively impacted by incorrect specification of libraries.
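Since the sample and library names are recorded in the BAM read-groups, the grouping of lanes can be double-checked on a finished BAM. The following is a sketch, assuming that samtools is installed; the filename is from the example project:

.. code-block:: bash

    # List the read-groups, including the sample (SM) and library (LB) tags
    $ samtools view -H ExampleProject.rCRS.bam | grep '^@RG'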
.. warning::
    It is very important that lanes are assigned to their corresponding libraries in the makefile; while it is possible to simply record every sequencing run / lane under a single library and run the pipeline like that, this will result in several unintended side effects: Firstly, the BAM pipeline uses the library information to ensure that PCR duplicates are filtered correctly. Wrongly grouping together lanes from different libraries will result in the loss of sequences that are not, in fact, PCR duplicates, while wrongly splitting a library into multiple entries will result in PCR duplicates not being correctly identified across these entries. Furthermore, GATK and mapDamage analyses make use of this information to carry out various analyses on a per-library basis, which may similarly be negatively impacted by an incorrect specification of libraries.

Including already trimmed reads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In some cases it is useful to include FASTQ reads that have already been trimmed for adapter sequences. While this is not recommended in general, as it may introduce systematic bias if some data has been processed differently from the remaining FASTQ reads, the BAM pipeline makes it simple to incorporate both 'raw' and trimmed FASTQ reads, and to ensure that these integrate in the pipeline.

To include already trimmed reads, these are specified as values belonging to a lane, using the same names for read-types as in the 'ExcludeReads' option (see above). The following minimal example demonstrates this:

.. code-block:: yaml
    :linenos:

    MyFilename:
      MySample:
        ACGATA:
          # Regular lane, containing reads that are not already trimmed
          Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

          # Lane containing pre-trimmed reads of each type
          Lane_2:
            # Single-end reads
            Single: /path/to/single_end_reads.fastq.gz

            # Paired-end reads where one mate has been discarded
            Singleton: /path/to/singleton_reads.fastq.gz

            # Paired end reads; note that the {Pair} key is required,
            # just like with raw, paired-end reads
            Paired: /path/to/paired_end_{Pair}.fastq.gz

            # Paired-end reads merged into a single sequence
            Collapsed: /path/to/collapsed.fastq.gz

            # Paired-end reads merged into a single sequence, and then truncated
            CollapsedTruncated: /path/to/collapsed_truncated.fastq.gz

The above example shows how each type of read is to be listed, but it is not necessary to specify more than a single type of pre-trimmed reads in the makefile.

.. note::
    Including already trimmed reads currently results in the absence of some summary statistics in the .summary file, namely the number of raw reads, as well as trimming statistics, since the BAM pipeline currently relies on AdapterRemoval to collect these statistics.

Overriding global settings
^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to the 'Options' section included, by default, at the beginning of every makefile, it is possible to specify / override options at a Target, Sample, and Library level. This allows, for example, that different adapter sequences be specified for each library generated for a sample, or options that should only be applied to a particular sample among several included in a makefile. The following demonstration uses the makefile constructed as part of the :ref:`bam_usage` section as the base:

.. code-block:: yaml
    :linenos:
    :emphasize-lines: 2-7, 10-14, 20-23

    MyFilename:
      # These options apply to all samples with this filename
      Options:
        # In this example, we override the default adapter sequences
        AdapterRemoval:
          --adapter1: AGATCGGAAGAGC
          --adapter2: AGATCGGAAGAGC

      MySample:
        # These options apply to libraries 'ACGATA', 'GCTCTG', and 'TGCTCA'
        Options:
          # In this example, we assume that FASTQ files for our libraries
          # include Phred quality scores encoded with offset 64.
          QualityOffset: 64

        ACGATA:
          Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

        GCTCTG:
          # These options apply to 'Lane_1' in the 'GCTCTG' library
          Options:
            # It is possible to override options we have previously overridden
            QualityOffset: 33
          Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

        TGCTCA:
          Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
          Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

In this example, we have overridden options in three places:

* The first place (lines 2 - 7) will be applied to *all* samples, libraries, and lanes in this target, unless subsequently overridden. In this example, we have set a new pair of adapter sequences, which we wish to use for these data.

* The second place (lines 10 - 14) is applied to the sample 'MySample' that we have included in this target, and consequently applies to all libraries specified for this sample ('ACGATA', 'GCTCTG', and 'TGCTCA'). In most cases you will only have a single sample, and so it will not make a difference whether you override options for the entire target (e.g. lines 2 - 7) or just for that sample (e.g. lines 10 - 14).

* Finally, the third place (lines 20 - 23) demonstrates how options can be overridden for a particular library. In this example, we have chosen to override an option (for this library only!) that we previously overrode for the sample (the 'QualityOffset' option).

.. note::
    It is currently not possible to override options for a single lane; it is only possible to override options for all lanes in a library.

.. warning::
    It is currently not possible to set the 'Features' except in the global 'Options' section at the top of the makefile; this limitation will be removed in future versions of PALEOMIX.
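When layering options at several levels like this, it is easy to lose track of which settings apply where; it can therefore help to let the pipeline check the makefile and list the planned tasks without executing anything. The following is a sketch, assuming that your version of the BAM pipeline provides the 'dryrun' mode:

.. code-block:: bash

    # Validate the makefile and list the tasks that would be executed,
    # without actually running any of them (the 'dryrun' mode is assumed
    # to be available in your version of the BAM pipeline)
    $ paleomix bam_pipeline dryrun makefile.yaml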
.. _AdapterRemoval documentation: https://github.com/MikkelSchubert/adapterremoval
.. _Bowtie2 documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
.. _GATK FAQ: http://www.broadinstitute.org/gatk/guide/article?id=1204
.. _NC_012920.1: http://www.ncbi.nlm.nih.gov/nuccore/251831106
.. _Phred quality-scores: https://en.wikipedia.org/wiki/FASTQ_format#Quality
.. _SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
.. _seqtk: https://github.com/lh3/seqtk

paleomix-1.2.12/docs/bam_pipeline/makefile.yaml

# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below. This does not include
# "Features", which may only be specified globally.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33
  # Split a lane into multiple entries, one for each (pair of) file(s)
  # found using the search-string specified for a given lane. Each
  # lane is named by adding a number to the end of the given barcode.
  SplitLanesByFilenames: yes
  # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
  CompressionFormat: bz2

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
    # Adapter sequences, set and uncomment to override defaults
#    --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
#    --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    # Some BAM pipeline defaults differ from AR defaults;
    # To override, change these value(s):
    --mm: 3
    --minlength: 25
    # Extra features enabled by default; change 'yes' to 'no' to disable
    --collapse: yes
    --trimns: yes
    --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments, as post-mortem damage
      # localizes to the seed region, which BWA expects to have few
      # errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes
      # Additional command-line options may be specified for the "aln"
      # call(s), as described for Bowtie2 below.

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
#      --trim5: 5
#      --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
#      --rg:
#        - CN:SequencingCenterNameHere
#        - DS:DescriptionOfReadGroup

  # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
  # removed from the output files; if set to 'mark', PCR duplicates are
  # flagged with bit 0x400, and not removed from the output files; if set to
  # 'no', the reads are assumed to not have been amplified. Collapsed reads
  # are filtered using the command 'paleomix rmdup_collapsed', while "normal"
  # reads are filtered using Picard MarkDuplicates.
  PCRDuplicates: filter

  # Command-line options for mapDamage; note that the long-form
  # options are expected; --length, not -l, etc. Uncomment the
  # "mapDamage" line when adding command-line options below.
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

  # Set to 'yes' to exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

  # Optional steps to perform during processing.
  Features:
    # Generate BAM from the alignments without indel realignment (yes / no)
    RawBAM: no
    # Generate indel-realigned BAM using the GATK Indel realigner (yes / no)
    RealignedBAM: yes
    # To disable mapDamage, write 'no'; to generate basic mapDamage plots,
    # write 'plot'; to build post-mortem damage models, write 'model',
    # and to produce rescaled BAMs, write 'rescale'. The 'model' option
    # includes the 'plot' output, and the 'rescale' option includes both
    # 'plot' and 'model' results. All analyses are carried out per library.
    mapDamage: plot
    # Generate coverage information for the raw BAM (wo/ indel realignment).
    # If one or more 'RegionsOfInterest' have been specified for a prefix,
    # additional coverage files are generated for each alignment (yes / no)
    Coverage: yes
    # Generate histogram of number of sites with a given read-depth, from 0
    # to 200. If one or more 'RegionsOfInterest' have been specified for a
    # prefix, additional histograms are generated for each alignment (yes / no)
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes
    # Generate histogram of PCR duplicates, for use with PreSeq (yes / no)
    DuplicateHist: no

# Map of prefixes by name, each having a Path key, which specifies the
# location of the BWA/Bowtie2 index, an optional label, and an optional
# set of regions for which additional statistics are produced.
Prefixes:
  # Replace 'NAME_OF_PREFIX' with name of the prefix; this name
  # is used in summary statistics and as part of output filenames.
  NAME_OF_PREFIX:
    # Replace 'PATH_TO_PREFIX' with the path to the .fasta file containing the
    # references against which reads are to be mapped. Using the same name
    # as the filename is strongly recommended (e.g. /path/to/Human_g1k_v37.fasta
    # should be named 'Human_g1k_v37').
    Path: PATH_TO_PREFIX

    # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a
    # .bed file listing extra regions for which coverage / depth statistics
    # should be calculated; if no names are specified for the BED records,
    # results are named after the chromosome / contig. Change 'NAME' to the
    # name to be used in summary statistics and output filenames.
#    RegionsOfInterest:
#      NAME: PATH_TO_BEDFILE


# Mapping targets are specified using the following structure. Uncomment and
# replace 'NAME_OF_TARGET' with the desired prefix for filenames.
#NAME_OF_TARGET:
   # Uncomment and replace 'NAME_OF_SAMPLE' with the name of this sample.
#  NAME_OF_SAMPLE:
     # Uncomment and replace 'NAME_OF_LIBRARY' with the name of this library.
#    NAME_OF_LIBRARY:
       # Uncomment and replace 'NAME_OF_LANE' with the name of this lane,
       # and replace 'PATH_WITH_WILDCARDS' with the path to the FASTQ files
       # to be trimmed and mapped for this lane (may include wildcards).
#      NAME_OF_LANE: PATH_WITH_WILDCARDS

paleomix-1.2.12/docs/bam_pipeline/overview.rst

Overview of analytical steps
============================

During a typical analysis, the BAM pipeline will proceed through the following steps. Note that the exact order in which each step is carried out during execution is not necessarily as shown below, since the exact steps depend on the user settings, and since the pipeline will automatically run steps as soon as possible:

1. Initial steps

   1. Each prefix (reference sequences in FASTA format) is indexed using either "bwa index" or "bowtie-build", depending on the configuration used.

   2. Each prefix is indexed using "samtools faidx".
   3. A sequence dictionary is built for each prefix using Picard BuildSequenceDictionary.jar.

2. Preprocessing of reads

   1. Adapter sequences, low quality bases and ambiguous bases are trimmed; overlapping paired-end reads are merged, and short reads are filtered using AdapterRemoval [Lindgreen2012]_.

3. Mapping of reads

   1. Processed reads resulting from the adapter-trimming / read-collapsing step above are mapped using the chosen aligner (BWA or Bowtie2). The resulting alignments are tagged using the information specified in the makefile (sample, library, lane, etc.).

   2. The records of the resulting BAM are updated using "samtools fixmate" to ensure that PE reads contain the correct information about the mate read.

   3. The BAM is sorted using "samtools sort", indexed using "samtools index" (if required based on the current configuration), and validated using Picard ValidateSamFile.jar.

   4. Finally, the records are updated using "samtools calmd" to ensure consistent reporting of the number of mismatches relative to the reference genome (BAM tag 'NM').

4. Processing of preexisting BAM files

   1. Any preexisting BAM files are re-tagged using Picard 'AddOrReplaceReadGroups.jar' to match the tagging of other reads processed by the pipeline.

   2. The resulting BAM is sorted, updated using "samtools calmd", indexed using "samtools index" (if required), and validated using Picard ValidateSamFile.jar.

5. Filtering of duplicates, rescaling of quality scores, and validation

   1. If enabled, PCR duplicates are filtered using Picard MarkDuplicates.jar (for SE and PE reads) and "paleomix rmdup_collapsed" (for collapsed reads; see the :ref:`other_tools` section). PCR filtering is carried out per library.

   2. If "Rescaling" is enabled, quality scores of bases that are potentially the result of *post-mortem* DNA damage are recalculated using mapDamage2.0 [Jonsson2013]_.

   3. The resulting BAMs are indexed and validated using Picard ValidateSamFile.jar. In addition, mapped reads at each position of the alignments are compared using the query name, sequence, and qualities; if a match is found, it is assumed to represent a duplication of input data (see :ref:`troubleshooting_bam`).

6. Generation of final BAMs

   1. If the "Raw BAM" feature is enabled, each BAM in the previous step is merged into a final BAM file.

   2. If the "Realigned BAM" feature is enabled, each BAM generated in the previous step is merged, and GATK IndelRealigner is used to perform local realignment around indels, to improve downstream analyses. The resulting BAM is updated using "samtools calmd" as above.

7. Statistics

   1. If the "Summary" feature is enabled, a single summary table is generated for each target. This table summarizes the input data in terms of the raw number of reads, the number of reads following filtering / collapsing, the fraction of reads mapped to each prefix, the fraction of reads filtered as duplicates, and more.

   2. Coverage statistics are calculated for the intermediate and final BAM files using "paleomix coverage", depending on makefile settings. Statistics are calculated genome-wide and for any regions of interest specified by the user.

   3. Depth histograms are calculated using "paleomix depths"; similar to the coverage statistics, these are calculated genome-wide and for any regions of interest specified by the user.

   4. If the "mapDamage" feature or "Rescaling" is enabled, mapDamage plots are generated; if rescaling is enabled, a model of the post-mortem DNA damage is also generated.
If the "DuplicateHist" feature is enabled, histograms of PCR duplicates are estimated for each library, for use with the 'preseq' tool[Daley2013]_, to estimate the complexity of the libraries. paleomix-1.2.12/docs/bam_pipeline/requirements.rst000066400000000000000000000052601314402124200222400ustar00rootroot00000000000000.. highlight:: Bash .. _bam_requirements: Software requirements ===================== In addition to the requirements listed in the ref:`installation` section, the BAM pipeline requires that a several other pieces of software be installed. The plus-sign following version numbers are used to indicate that versions newer than that version are also supported: * `AdapterRemoval`_ v2.1+ [Lindgreen2012]_ * `SAMTools`_ v0.1.18+ [Li2009b]_ * `Picard Tools`_ v1.137+ The Picard Tools JAR-file (picard.jar) is expected to be located in ~/install/jar_root/ by default, but this behavior may be changed using either the --jar-root command-line option, or via the global configuration file (see section :ref:`bam_configuration`). Furthermore, one or both of the following sequence aligners must be installed: * `Bowtie2`_ v2.1.0+ [Langmead2012]_ * `BWA`_ v0.5.9+, v0.6.2, or v0.7.9+ [Li2009a]_ In addition, the following packages are used by default, but can be omitted if disabled during runtime: * `mapDamage`_ 2.0.2+ [Jonsson2013]_ * `Genome Analysis ToolKit`_ [McKenna2010]_ If mapDamage is used to perform rescaling of post-mortem DNA damage, then the GNU Scientific Library (GSL) and the R packages listed in the mapDamage installation instructions are required; these include 'inline', 'gam', 'Rcpp', 'RcppGSL' and 'ggplot2' (>=0.9.2). Use the following commands to verify that these packages have been correctly installed:: $ gsl-config Usage: gsl-config [OPTION] ... $ mapDamage --check-R-packages All R packages are present The GATK JAR is only required if the user wishes to carry out local realignment near indels (recommended), and is expected to be placed in the same folder as the Picard Tools JAR (see above). The example projects included in the PALEOMIX source distribution may be used to test that PALEOMIX and the BAM pipeline has been correctly installed. See the :ref:`examples` section for more information. In case of errors, please consult the :ref:`troubleshooting` section. Testing the pipeline -------------------- An example project is included with the BAM pipeline, and it is recommended to run this project in order to verify that the pipeline and required applications have been correctly installed. See the :ref:`examples` section for a description of how to run this example project. .. _AdapterRemoval: https://github.com/MikkelSchubert/adapterremoval .. _Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/ .. _BWA: http://bio-bwa.sourceforge.net/ .. _mapDamage: http://ginolhac.github.io/mapDamage/ .. _Genome Analysis ToolKit: http://www.broadinstitute.org/gatk/ .. _SAMTools: https://samtools.github.io .. _Picard Tools: http://broadinstitute.github.io/picard/paleomix-1.2.12/docs/bam_pipeline/usage.rst000066400000000000000000000750421314402124200206260ustar00rootroot00000000000000.. highlight:: Yaml .. _bam_usage: Pipeline usage ============== The following describes, step by step, the process of setting up a project for mapping FASTQ reads against a reference sequence using the BAM pipeline. 
For a detailed description of the configuration file (makefile) used by the BAM pipeline, please refer to the section :ref:`bam_makefile`, and for a detailed description of the files generated by the pipeline, please refer to the section :ref:`bam_filestructure`.

The BAM pipeline is invoked using either the 'paleomix' command, which offers access to all tools included with PALEOMIX (see section :ref:`other_tools`), or using the (deprecated) alias 'bam_pipeline'. Thus, all commands in the following may take one of the following (equivalent) forms:

.. code-block:: bash

    $ paleomix bam_pipeline [...]
    $ bam_pipeline [...]

For the purpose of these instructions, we will make use of a tiny FASTQ data set included with the PALEOMIX pipeline, consisting of synthetic FASTQ reads simulated against the human mitochondrial genome. To follow along, first create a local copy of the BAM pipeline example data:

.. code-block:: bash

    $ paleomix bam_pipeline example .

This will create a folder named 'bam_pipeline' in the current folder, which contains the example FASTQ reads and a 'makefile' showcasing various features of the BAM pipeline ('000\_makefile.yaml'). We will make use of a subset of the data, but we will not make use of the makefile. The data we will use consists of 3 simulated ancient DNA libraries (independent amplifications), for which either one or two lanes have been simulated:

+-------------+------+------+---------------------------------+
| Library     | Lane | Type | Files                           |
+-------------+------+------+---------------------------------+
| ACGATA      | 1    | PE   | 000_data/ACGATA\_L1\_*.fastq.gz |
+-------------+------+------+---------------------------------+
| GCTCTG      | 1    | SE   | 000_data/GCTCTG\_L1\_*.fastq.gz |
+-------------+------+------+---------------------------------+
| TGCTCA      | 1    | SE   | 000_data/TGCTCA\_L1\_*.fastq.gz |
+-------------+------+------+---------------------------------+
|             | 2    | PE   | 000_data/TGCTCA\_L2\_*.fastq.gz |
+-------------+------+------+---------------------------------+

.. warning::
    The BAM pipeline largely relies on the existence of final and intermediate files in order to detect if a given analytical step has been carried out. Changes made to a makefile after the pipeline has already been run (even if not run to completion) may therefore not cause analytical steps affected by these changes to be re-run. If changes are to be made at such a point, it is typically necessary to manually remove affected intermediate files before running the pipeline again. See the section :ref:`bam_filestructure` for more information about the layout of files generated by the pipeline.

Creating a makefile
-------------------

As described in the :ref:`introduction`, the BAM pipeline operates based on 'makefiles', which serve to specify the location and structure of input data (samples, libraries, lanes, etc.), which tasks are to be run, and which settings are to be used. The makefiles are written using the human-readable YAML format, which may be edited using any regular text editor.

For a brief introduction to the YAML format, please refer to the :ref:`yaml_intro` section, and for a detailed description of the BAM Pipeline makefile, please refer to section :ref:`bam_makefile`.

To start a new project, we must first generate a makefile template using the following command, which for the purpose of this tutorial we place in the example folder:
.. code-block:: bash

    $ cd bam_pipeline/
    $ paleomix bam_pipeline mkfile > makefile.yaml

Once you open the resulting file ('makefile.yaml') in your text editor of choice, you will find that BAM pipeline makefiles are split into 3 major sections, representing 1) the default options used for processing the data; 2) the reference genomes against which reads are to be mapped; and 3) sets of input files for one or more samples which are to be processed.

In a typical project, we will need to review the default options, add one or more reference genomes which we wish to target, and list the input data to be processed.

Default options
^^^^^^^^^^^^^^^

The makefile starts with an "Options" section, which is applied to every set of input-files in the makefile unless explicitly overwritten for a given sample (this is described in the :ref:`bam_makefile` section). For the most part, the default values should be suitable for a given project, but special attention should be paid to the following options (colons indicate subsections):

**Options\:Platform**
    The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The `SAM/BAM specification`_ lists the valid platforms, which currently include 'CAPILLARY', 'HELICOS', 'ILLUMINA', 'IONTORRENT', 'LS454', 'ONT', 'PACBIO', and 'SOLID'.

**Options\:QualityOffset**
    The QualityOffset option refers to the starting ASCII value used to encode `Phred quality-scores`_ in user-provided FASTQ files, with the possible values of 33, 64, and 'Solexa'. For most modern data, this will be 33, corresponding to ASCII characters in the range '!' to 'J'. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range '@' to 'h', and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ';' to 'h'. For a visual representation of this, refer to the Wikipedia article linked above.

    .. warning::
        By default, the adapter trimming software used by PALEOMIX expects quality-scores no higher than 41, corresponding to the ASCII character 'J' when encoded using offset 33. If the input-data contains quality-scores greater than this value, then it is necessary to specify the maximum value using the '--qualitymax' command-line option. See below.

    .. warning::
        Presently, quality-offsets other than 33 are not supported when using the BWA 'mem' or the BWA 'bwasw' algorithms. To use these algorithms with quality-offset 64 data, it is therefore necessary to first convert these data to offset 33. This can be accomplished using the `seqtk`_ tool.

**Options\:AdapterRemoval\:--adapter1**

**Options\:AdapterRemoval\:--adapter2**
    These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination using AdapterRemoval. Thus, the sequence provided for --adapter1 is expected to be found in the mate 1 reads, and the sequence specified for --adapter2 is expected to be found in the mate 2 reads. In both cases, these should be specified in the orientation in which they appear in these files (i.e. it should be possible to grep the files for these, assuming that the reads were long enough, and treating Ns as wildcards). It is very important that these be specified correctly. Please refer to the `AdapterRemoval documentation`_ for more information.
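As a quick sanity check, the grep described above can be carried out directly on the raw FASTQ files. The following sketch uses one of the example files and the first 13 bases of the default mate 1 adapter; both should be replaced to match your own data:

.. code-block:: bash

    # Count mate 1 reads containing (a prefix of) the expected adapter;
    # a substantial number of hits suggests the correct adapter was chosen
    $ zcat 000_data/ACGATA_L1_R1_01.fastq.gz | grep -c "AGATCGGAAGAGC"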
**Aligners\:Program**
    The short read alignment program to use to map the (trimmed) reads to the reference genome. Currently, users may choose between 'BWA' and 'Bowtie2', with additional options available for each program.

**Aligners\:BWA\:MinQuality** and **Aligners\:Bowtie2\:MinQuality**
    The minimum mapping quality of hits to retain during the mapping process. If this option is set to a non-zero value, any hits with a mapping quality below this value are removed from the resulting BAM file (this option does not apply to unmapped reads). If the final BAM should contain all reads in the input files, this option must be set to 0, and the 'FilterUnmappedReads' option set to 'no'.

**Aligners\:BWA\:UseSeed**
    Enable/disable the use of a seed region when mapping reads using the BWA 'backtrack' alignment algorithm (the default). Disabling this option may yield some improvements in the alignment of highly damaged ancient DNA, at the cost of significantly increasing the running time. As such, this option is not recommended for modern samples [Schubert2012]_.

For the purpose of the example project, we need only change a few options. Since the reads were simulated using a Phred score offset of 33, there is no need to change the 'QualityOffset' option, and since the simulated adapter sequences match the adapters that AdapterRemoval searches for by default, we do not need to set either of '--adapter1' or '--adapter2'. We will, however, use the default mapping program (BWA) and algorithm ('backtrack'), but change the minimum mapping quality to 30 (corresponding to an error probability of 0.001). Changing the minimum quality is accomplished by locating the 'Aligners' section of the makefile, and changing the 'MinQuality' value from 0 to 30 (line 49):

.. code-block:: yaml
    :emphasize-lines: 12
    :linenos:
    :lineno-start: 38

    # Settings for aligners supported by the pipeline
    Aligners:
      # Choice of aligner software to use, either "BWA" or "Bowtie2"
      Program: BWA

      # Settings for mappings performed using BWA
      BWA:
        # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
        # for a description of each algorithm (defaults to 'backtrack')
        Algorithm: backtrack
        # Filter aligned reads with a mapping quality (Phred) below this value
        MinQuality: 30
        # Filter reads that did not map to the reference sequence
        FilterUnmappedReads: yes
        # Should be disabled ("no") for aDNA alignments, as post-mortem damage
        # localizes to the seed region, which BWA expects to have few
        # errors (sets "-l"). See http://pmid.us/22574660
        UseSeed: yes

Since the data we will be mapping represents (simulated) ancient DNA, we will furthermore set the UseSeed option to 'no' (line 55), in order to recover a small additional amount of alignments during mapping (c.f. [Schubert2012]_):

.. code-block:: yaml
    :emphasize-lines: 18
    :linenos:
    :lineno-start: 38

    # Settings for aligners supported by the pipeline
    Aligners:
      # Choice of aligner software to use, either "BWA" or "Bowtie2"
      Program: BWA

      # Settings for mappings performed using BWA
      BWA:
        # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
        # for a description of each algorithm (defaults to 'backtrack')
        Algorithm: backtrack
        # Filter aligned reads with a mapping quality (Phred) below this value
        MinQuality: 30
        # Filter reads that did not map to the reference sequence
        FilterUnmappedReads: yes
        # Should be disabled ("no") for aDNA alignments, as post-mortem damage
        # localizes to the seed region, which BWA expects to have few
        # errors (sets "-l"). See http://pmid.us/22574660
        UseSeed: no
Once this is done, we can proceed to specify the location of the reference genome(s) that we wish to map our reads against.

Reference genomes (prefixes)
----------------------------

Mapping is carried out using one or more reference genomes (or other sequences) in the form of FASTA files, which are indexed for use in read mapping (automatically, by the pipeline) using either the "bwa index" or "bowtie2-build" commands. Since sequence alignment indexes are generated at the location of these files, reference genomes are also referred to as "prefixes" in the documentation. In other words, using BWA as an example, the PALEOMIX pipeline will generate an index (prefix) of the reference genome using a command corresponding to the following:

.. code-block:: bash

    $ bwa index prefixes/my_genome.fasta

In addition to the BWA / Bowtie2 index, several other related files are also automatically generated, including a FASTA index file (.fai) and a sequence dictionary (.dict), which are required for various operations of the pipeline. These are similarly located in the same folder as the reference FASTA file. For a more detailed description, please refer to the :ref:`bam_filestructure` section.

.. warning::
    Since the pipeline automatically carries out indexing of the FASTA files, it requires write-access to the folder containing the FASTA files. If this is not possible, one may simply create a local folder containing symbolic links to the original FASTA file(s), and point the makefile to this location. All automatically generated files will then be placed in this location.
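A minimal sketch of this symbolic-link workaround, assuming a hypothetical read-only location '/shared/genomes' containing the reference FASTA file:

.. code-block:: bash

    # Create a writable folder of symbolic links to the read-only FASTA
    # file; the pipeline will then place its index files next to the links
    $ mkdir -p my_prefixes
    $ ln -s /shared/genomes/my_genome.fasta my_prefixes/my_genome.fasta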
Specifying which FASTA files to align sequences against is accomplished by listing these in the "Prefixes" section in the makefile. For example, assuming that we had a FASTA file named "my\_genome.fasta", which is located in the folder "my\_prefixes", the following might be used::

    Prefixes:
      my_genome:
        Path: my_prefixes/my_genome.fasta

The name of the prefix (here 'my\_genome') will be used to name the resulting files and in various tables that are generated by the pipeline. Typical names include 'hg19', 'EquCab20', and other standard abbreviations for reference genomes, accession numbers, and the like. Multiple prefixes can be specified, but each name MUST be unique::

    Prefixes:
      my_genome:
        Path: my_prefixes/my_genome.fasta

      my_other_genome:
        Path: my_prefixes/my_other_genome.fasta

.. warning::
    FASTA files used in the BAM pipeline *must* be named with a .fasta file extension. Furthermore, if alignments are to be carried out against the human nuclear genome, chromosomes MUST be ordered by their number for GATK to work! See the `GATK FAQ`_ for more information.

In the case of this example project, we will be mapping our data against the revised Cambridge Reference Sequence (rCRS) for the human mitochondrial genome, which is included in the examples folder under '000\_prefixes', as a file named 'rCRS.fasta'. To add it to the makefile, locate the 'Prefixes' section located below the 'Options' section, and update it as described above (lines 130 and 132):

.. code-block:: yaml
    :emphasize-lines: 6,8
    :linenos:
    :lineno-start: 125

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, an optional label, and an optional
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      rCRS:
        # Path to .fasta file containing a set of reference sequences.
        Path: 000_prefixes/rCRS.fasta

Once this is done, we may specify the input data that we wish the pipeline to process for us.

Specifying read data
--------------------

A single makefile may be used to process one or more samples, to generate one or more BAM files and supplementary statistics. In this project we will only deal with a single sample, which we accomplish by creating our own section at the end of the makefile.

The first step is to determine the name for the files generated by the BAM pipeline. Specifically, we will specify a name which is prefixed to all output generated for our sample (here named 'MyFilename'), by adding the following line to the end of the makefile:

.. code-block:: yaml
    :linenos:
    :lineno-start: 145

    # You can also add comments like these to document your experiment
    MyFilename:

This first name, or grouping, is referred to as the target, and typically corresponds to the name of the sample being processed, though any name may do. The actual sample-name is specified next (it is possible, but uncommon, for a single target to contain multiple samples), and is both used in tables of summary statistics and recorded in the resulting BAM files. This is accomplished by adding another line below the target name:

.. code-block:: yaml
    :linenos:
    :lineno-start: 145

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:

Similarly, we need to specify the name of each library in our dataset. By convention, I often use the index used to construct the library as the library name (which allows for easy identification), but any name may be used for a library, provided that it is unique to that sample. As described near the start of this document, we are dealing with 3 libraries:

+-------------+------+------+---------------------------------+
| Library     | Lane | Type | Files                           |
+-------------+------+------+---------------------------------+
| ACGATA      | 1    | PE   | 000_data/ACGATA\_L1\_*.fastq.gz |
+-------------+------+------+---------------------------------+
| GCTCTG      | 1    | SE   | 000_data/GCTCTG\_L1\_*.fastq.gz |
+-------------+------+------+---------------------------------+
| TGCTCA      | 1    | SE   | 000_data/TGCTCA\_L1\_*.fastq.gz |
+-------------+------+------+---------------------------------+
|             | 2    | PE   | 000_data/TGCTCA\_L2\_*.fastq.gz |
+-------------+------+------+---------------------------------+

It is important to correctly specify the libraries, since the pipeline will not only use this information for summary statistics and record it in the resulting BAM files, but will also carry out filtering of PCR duplicates (and other analyses) on a per-library basis. Wrongly grouping together data will therefore result in a loss of useful alignments wrongly identified as PCR duplicates, or, similarly, in the inclusion of reads that should have been filtered as PCR duplicates.

The library names are added below the name of the sample ('MySample'), in a similar manner to the sample itself:

.. code-block:: yaml
    :linenos:
    :lineno-start: 145

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:
        GCTCTG:
        TGCTCA:

The final step involves specifying the location of the raw FASTQ reads that should be processed for each library, and consists of specifying one or more "lanes" of reads, each of which must be given a unique name. For single-end reads, this is accomplished simply by providing a path (with optional wildcards) to the location of the file(s).
For example, for lane 1 of library GCTCTG, the files are located at 000_data/GCTCTG\_L1\_*.fastq.gz:

.. code-block:: bash

    $ ls 000_data/GCTCTG_L1_*.fastq.gz
    000_data/GCTCTG_L1_R1_01.fastq.gz
    000_data/GCTCTG_L1_R1_02.fastq.gz
    000_data/GCTCTG_L1_R1_03.fastq.gz

We simply specify these paths for each of the single-end lanes, here using the lane number to name these (similarly to the above, this name is used to tag the data in the resulting BAM file):

.. code-block:: yaml
    :linenos:
    :lineno-start: 145

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:

        GCTCTG:
          Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

        TGCTCA:
          Lane_1: 000_data/TGCTCA_L1_*.fastq.gz

Specifying the location of paired-end data is slightly more complex, since the pipeline needs to be able to locate both files in a pair. This is accomplished by making the assumption that paired-end files are numbered as either mate 1 or mate 2, as shown here for 4 pairs of files with the common _R1 and _R2 labels:
.. code-block:: bash

    $ ls 000_data/ACGATA_L1_*.fastq.gz
    000_data/ACGATA_L1_R1_01.fastq.gz
    000_data/ACGATA_L1_R1_02.fastq.gz
    000_data/ACGATA_L1_R1_03.fastq.gz
    000_data/ACGATA_L1_R1_04.fastq.gz
    000_data/ACGATA_L1_R2_01.fastq.gz
    000_data/ACGATA_L1_R2_02.fastq.gz
    000_data/ACGATA_L1_R2_03.fastq.gz
    000_data/ACGATA_L1_R2_04.fastq.gz

Knowing that the files contain a number specifying which file in a pair they correspond to, we can then construct a path that includes the keyword '{Pair}' in place of that number. For the above example, that path would therefore be '000_data/ACGATA\_L1\_R{Pair}_*.fastq.gz' (corresponding to '000_data/ACGATA\_L1\_R[12]_*.fastq.gz'):

.. code-block:: yaml
    :linenos:
    :lineno-start: 145

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:
          Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

        GCTCTG:
          Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

        TGCTCA:
          Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
          Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

.. note::
    Note that while the paths given here are relative to the location of where the pipeline is run, it is also possible to provide absolute paths, should the files be located in an entirely different location.

.. note::
    At the time of writing, the PALEOMIX pipeline supports uncompressed, gzipped, and bzipped FASTQ reads. It is not necessary to use any particular file extension for these, as the compression method (if any) is detected automatically.

The final makefile
------------------

Once we've completed the steps described above, the resulting makefile should look like the following, shown here with the modifications that we've made highlighted:

.. code-block:: yaml
    :emphasize-lines: 49,55,130,132,146-156
    :linenos:

    # -*- mode: Yaml; -*-
    # Timestamp: 2016-02-04T10:53:59.906883
    #
    # Default options.
    # Can also be specific for a set of samples, libraries, and lanes,
    # by including the "Options" hierarchy at the same level as those
    # samples, libraries, or lanes below. This does not include
    # "Features", which may only be specified globally.
    Options:
      # Sequencing platform, see SAM/BAM reference for valid values
      Platform: Illumina
      # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
      # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
      # specify 'Solexa', to handle reads on the Solexa scale. This is
      # used during adapter-trimming and sequence alignment
      QualityOffset: 33
      # Split a lane into multiple entries, one for each (pair of) file(s)
      # found using the search-string specified for a given lane. Each
      # lane is named by adding a number to the end of the given barcode.
      SplitLanesByFilenames: yes
      # Compression format for FASTQ reads; 'gz' for GZip, 'bz2' for BZip2
      CompressionFormat: bz2

      # Settings for trimming of reads, see AdapterRemoval man-page
      AdapterRemoval:
        # Adapter sequences, set and uncomment to override defaults
    #     --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
    #     --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
        # Some BAM pipeline defaults differ from AR defaults;
        # To override, change these value(s):
        --mm: 3
        --minlength: 25
        # Extra features enabled by default; change 'yes' to 'no' to disable
        --collapse: yes
        --trimns: yes
        --trimqualities: yes

      # Settings for aligners supported by the pipeline
      Aligners:
        # Choice of aligner software to use, either "BWA" or "Bowtie2"
        Program: BWA

        # Settings for mappings performed using BWA
        BWA:
          # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
          # for a description of each algorithm (defaults to 'backtrack')
          Algorithm: backtrack
          # Filter aligned reads with a mapping quality (Phred) below this value
          MinQuality: 30
          # Filter reads that did not map to the reference sequence
          FilterUnmappedReads: yes
          # Should be disabled ("no") for aDNA alignments, as post-mortem damage
          # localizes to the seed region, which BWA expects to have few
          # errors (sets "-l"). See http://pmid.us/22574660
          UseSeed: no
          # Additional command-line options may be specified for the "aln"
          # call(s), as described for Bowtie2 below.

        # Settings for mappings performed using Bowtie2
        Bowtie2:
          # Filter aligned reads with a mapping quality (Phred) below this value
          MinQuality: 0
          # Filter reads that did not map to the reference sequence
          FilterUnmappedReads: yes
          # Examples of how to add additional command-line options
    #      --trim5: 5
    #      --trim3: 5
          # Note that the colon is required, even if no value is specified
          --very-sensitive:
          # Example of how to specify multiple values for an option
    #      --rg:
    #        - CN:SequencingCenterNameHere
    #        - DS:DescriptionOfReadGroup

      # Mark / filter PCR duplicates. If set to 'filter', PCR duplicates are
      # removed from the output files; if set to 'mark', PCR duplicates are
      # flagged with bit 0x400, and not removed from the output files; if set to
      # 'no', the reads are assumed to not have been amplified. Collapsed reads
      # are filtered using the command 'paleomix rmdup_collapsed', while "normal"
      # reads are filtered using Picard MarkDuplicates.
      PCRDuplicates: filter

      # Carry out quality base re-scaling of libraries using mapDamage
      # This will be done using the options set for mapDamage below
      RescaleQualities: no

      # Command-line options for mapDamage; note that the long-form
      # options are expected; --length, not -l, etc. Uncomment the
      # "mapDamage" line when adding command-line options below.
      mapDamage:
        # By default, the pipeline will downsample the input to 100k hits
        # when running mapDamage; remove to use all hits
        --downsample: 100000

      # Set to 'yes' to exclude a type of trimmed reads from alignment / analysis;
      # possible read-types reflect the output of AdapterRemoval
      ExcludeReads:
        Single: no              # Single-ended reads / Orphaned paired-ended reads
        Paired: no              # Paired ended reads
        Singleton: no           # Paired reads for which the mate was discarded
        Collapsed: no           # Overlapping paired-ended reads collapsed into a
                                # single sequence by AdapterRemoval
        CollapsedTruncated: no  # Like 'Collapsed', except that the reads are
                                # truncated due to the presence of ambiguous
                                # bases or low quality bases at read termini.
      # Optional steps to perform during processing
      Features:
        RawBAM: no          # Generate BAM from the raw libraries (no indel realignment)
                            #   Location: {Destination}/{Target}.{Genome}.bam
        RealignedBAM: yes   # Generate indel-realigned BAM using the GATK Indel realigner
                            #   Location: {Destination}/{Target}.{Genome}.realigned.bam
        mapDamage: yes      # Generate mapDamage plot for each (unrealigned) library
                            #   Location: {Destination}/{Target}.{Genome}.mapDamage/{Library}/
        Coverage: yes       # Generate coverage information for the raw BAM (wo/ indel realignment)
                            #   Location: {Destination}/{Target}.{Genome}.coverage
        Depths: yes         # Generate histogram of number of sites with a given read-depth
                            #   Location: {Destination}/{Target}.{Genome}.depths
        Summary: yes        # Generate summary table for each target
                            #   Location: {Destination}/{Target}.summary
        DuplicateHist: no   # Generate histogram of PCR duplicates, for use with PreSeq
                            #   Location: {Destination}/{Target}.{Genome}.duphist/{Library}/

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, an optional label, and an optional
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      rCRS:
        # Path to .fasta file containing a set of reference sequences.
        Path: 000_prefixes/rCRS.fasta

        # Label for prefix: One of nuclear, mitochondrial, chloroplast,
        # plasmid, bacterial, or viral. Is used in the .summary files.
        # Label: ...

        # Produce additional coverage / depth statistics for a set of
        # regions defined in a BED file; if no names are specified for the
        # BED records, results are named after the chromosome / contig.
        # RegionsOfInterest:
        #   NAME: PATH_TO_BEDFILE


    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:
          Lane_1: 000_data/ACGATA_L1_R{Pair}_*.fastq.gz

        GCTCTG:
          Lane_1: 000_data/GCTCTG_L1_*.fastq.gz

        TGCTCA:
          Lane_1: 000_data/TGCTCA_L1_*.fastq.gz
          Lane_2: 000_data/TGCTCA_L2_R{Pair}_*.fastq.gz

With this makefile in hand, the pipeline may be executed using the following command:

.. code-block:: bash

    $ paleomix bam_pipeline run makefile.yaml

The pipeline will run as many simultaneous processes as there are cores in the current system, but this behavior may be changed using the '--max-threads' command-line option. Use the '--help' command-line option to view additional options available when running the pipeline. By default, output files are placed in the same folder as the makefile, but this behavior may be changed by setting the '--destination' command-line option.
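For instance, the two options just described might be combined as follows (a sketch; the thread count and destination path are examples):

.. code-block:: bash

    # Run at most 4 simultaneous processes, and place the output files
    # in /path/to/results instead of next to the makefile
    $ paleomix bam_pipeline run --max-threads 4 --destination /path/to/results makefile.yaml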
For this project, the generated files include the following:

.. code-block:: bash

    $ ls -d MyFilename*
    MyFilename
    MyFilename.rCRS.coverage
    MyFilename.rCRS.depths
    MyFilename.rCRS.mapDamage
    MyFilename.rCRS.realigned.bai
    MyFilename.rCRS.realigned.bam
    MyFilename.summary

The files include a table of the average coverages, a histogram of the per-site coverages (depths), a folder containing one set of mapDamage plots per library, and the final BAM file and its index (the .bai file), as well as a table summarizing the entire analysis. For a more detailed description of the files generated by the pipeline, please refer to the :ref:`bam_filestructure` section; should problems occur during the execution of the pipeline, then please verify that the makefile is correctly filled out as described above, and refer to the :ref:`troubleshooting_bam` section.

.. note::
    The first item, 'MyFilename', is a folder containing intermediate files generated while running the pipeline, required due to the many steps involved in a typical analysis, and which also allows the pipeline to resume should the process be interrupted. This folder will typically take up 3-4x the disk-space used by the final BAM file(s), and can safely be removed once the pipeline has run to completion, in order to reduce disk-usage.

.. _SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
.. _GATK FAQ: http://www.broadinstitute.org/gatk/guide/article?id=1204
.. _seqtk: https://github.com/lh3/seqtk
.. _Phred quality-scores: https://en.wikipedia.org/wiki/FASTQ_format#Quality
.. _AdapterRemoval documentation: https://github.com/MikkelSchubert/adapterremoval

paleomix-1.2.12/docs/conf.py

# -*- coding: utf-8 -*-
#
# PALEOMIX documentation build configuration file, created by
# sphinx-quickstart on Mon Nov 30 22:47:26 2015.
#
# This file is execfile()d with the current directory set to its
# containing dir.
#
# Note that not all possible configuration values are present in this
# autogenerated file.
#
# All configuration values have a default; values that are commented out
# serve to show the default.

import sys
import os
import shlex

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#sys.path.insert(0, os.path.abspath('.'))

# -- General configuration ------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
#needs_sphinx = '1.0'

# Add any Sphinx extension module names here, as strings. They can be
# extensions coming with Sphinx (named 'sphinx.ext.*') or your custom
# ones.
extensions = []

# Add any paths that contain templates here, relative to this directory.
templates_path = []

# The suffix(es) of source filenames.
# You can specify multiple suffix as a list of string:
# source_suffix = ['.rst', '.md']
source_suffix = '.rst'

# The encoding of source files.
#source_encoding = 'utf-8-sig'

# The master toctree document.
master_doc = 'index'

# General information about the project.
project = u'PALEOMIX'
copyright = u'2015, Mikkel Schubert'
author = u'Mikkel Schubert'

# The version info for the project you're documenting, acts as replacement for
# |version| and |release|, also used in various other places throughout the
# built documents.
#
# The short X.Y version.
version = u'1.2'
# The full version, including alpha/beta/rc tags.
release = u'1.2.12'

# The language for content autogenerated by Sphinx. Refer to documentation
# for a list of supported languages.
#
# This is also used if you do content translation via gettext catalogs.
# Usually you set "language" from the command line for these cases.
language = None

# There are two options for replacing |today|: either, you set today to some
# non-false value, then it is used:
#today = ''
# Else, today_fmt is used as the format for a strftime call.
#today_fmt = '%B %d, %Y'

# List of patterns, relative to source directory, that match files and
# directories to ignore when looking for source files.
exclude_patterns = ['_build']

# The reST default role (used for this markup: `text`) to use for all
# documents.
#default_role = None

# If true, '()' will be appended to :func: etc. cross-reference text.
#add_function_parentheses = True

# If true, the current module name will be prepended to all description
# unit titles (such as .. function::).
#add_module_names = True

# If true, sectionauthor and moduleauthor directives will be shown in the
# output. They are ignored by default.
#show_authors = False

# The name of the Pygments (syntax highlighting) style to use.
pygments_style = 'sphinx'

# A list of ignored prefixes for module index sorting.
#modindex_common_prefix = []

# If true, keep warnings as "system message" paragraphs in the built documents.
#keep_warnings = False

# If true, `todo` and `todoList` produce output, else they produce nothing.
todo_include_todos = False

# -- Options for HTML output ----------------------------------------------

# The theme to use for HTML and HTML Help pages.  See the documentation for
# a list of builtin themes.
html_theme = 'classic'

# Theme options are theme-specific and customize the look and feel of a theme
# further.  For a list of options available for each theme, see the
# documentation.
#html_theme_options = {}

# Add any paths that contain custom themes here, relative to this directory.
#html_theme_path = []

# The name for this set of Sphinx documents.  If None, it defaults to
# "<project> v<release> documentation".
#html_title = None

# A shorter title for the navigation bar.  Default is the same as html_title.
#html_short_title = None

# The name of an image file (relative to this directory) to place at the top
# of the sidebar.
#html_logo = None

# The name of an image file (within the static path) to use as favicon of the
# docs.  This file should be a Windows icon file (.ico) being 16x16 or 32x32
# pixels large.
#html_favicon = None

# Add any paths that contain custom static files (such as style sheets) here,
# relative to this directory. They are copied after the builtin static files,
# so a file named "default.css" will overwrite the builtin "default.css".
html_static_path = ['_static']

# Add any extra paths that contain custom files (such as robots.txt or
# .htaccess) here, relative to this directory. These files are copied
# directly to the root of the documentation.
#html_extra_path = []

# If not '', a 'Last updated on:' timestamp is inserted at every page bottom,
# using the given strftime format.
#html_last_updated_fmt = '%b %d, %Y'

# If true, SmartyPants will be used to convert quotes and dashes to
# typographically correct entities.
# Disabled as it also converts double-dashes in, for example, command-line
# options into a single long-dash.
html_use_smartypants = False

# Custom sidebar templates, maps document names to template names.
#html_sidebars = {}

# Additional templates that should be rendered to pages, maps page names to
# template names.
#html_additional_pages = {}

# If false, no module index is generated.
#html_domain_indices = True

# If false, no index is generated.
#html_use_index = True

# If true, the index is split into individual pages for each letter.
#html_split_index = False

# If true, links to the reST sources are added to the pages.
#html_show_sourcelink = True

# If true, "Created using Sphinx" is shown in the HTML footer. Default is True.
#html_show_sphinx = True

# If true, "(C) Copyright ..." is shown in the HTML footer. Default is True.
#html_show_copyright = True

# If true, an OpenSearch description file will be output, and all pages will
# contain a <link> tag referring to it. The value of this option must be the
# base URL from which the finished HTML is served.
#html_use_opensearch = ''

# This is the file name suffix for HTML files (e.g. ".xhtml").
#html_file_suffix = None

# Language to be used for generating the HTML full-text search index.
# Sphinx supports the following languages:
#   'da', 'de', 'en', 'es', 'fi', 'fr', 'hu', 'it', 'ja'
#   'nl', 'no', 'pt', 'ro', 'ru', 'sv', 'tr'
#html_search_language = 'en'

# A dictionary with options for the search language support, empty by default.
# Now only 'ja' uses this config value
#html_search_options = {'type': 'default'}

# The name of a javascript file (relative to the configuration directory) that
# implements a search results scorer. If empty, the default will be used.
#html_search_scorer = 'scorer.js'

# Output file base name for HTML help builder.
htmlhelp_basename = 'PALEOMIXdoc'

# -- Options for LaTeX output ---------------------------------------------

latex_elements = {
    # The paper size ('letterpaper' or 'a4paper').
    #'papersize': 'letterpaper',

    # The font size ('10pt', '11pt' or '12pt').
    #'pointsize': '10pt',

    # Additional stuff for the LaTeX preamble.
    #'preamble': '',

    # Latex figure (float) alignment
    #'figure_align': 'htbp',
}

# Grouping the document tree into LaTeX files. List of tuples
# (source start file, target name, title,
#  author, documentclass [howto, manual, or own class]).
latex_documents = [
    (master_doc, 'PALEOMIX.tex', u'PALEOMIX Documentation',
     u'Mikkel Schubert', 'manual'),
]

# The name of an image file (relative to this directory) to place at the top of
# the title page.
#latex_logo = None

# For "manual" documents, if this is true, then toplevel headings are parts,
# not chapters.
#latex_use_parts = False

# If true, show page references after internal links.
#latex_show_pagerefs = False

# If true, show URL addresses after external links.
#latex_show_urls = False

# Documents to append as an appendix to all manuals.
#latex_appendices = []

# If false, no module index is generated.
#latex_domain_indices = True

# -- Options for manual page output ---------------------------------------

# One entry per manual page. List of tuples
# (source start file, name, description, authors, manual section).
man_pages = [
    (master_doc, 'paleomix', u'PALEOMIX Documentation',
     [author], 1)
]

# If true, show URL addresses after external links.
#man_show_urls = False

# -- Options for Texinfo output -------------------------------------------

# Grouping the document tree into Texinfo files. List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (master_doc, 'PALEOMIX', u'PALEOMIX Documentation',
     author, 'PALEOMIX', 'TODO',
     'Miscellaneous'),
]

# Documents to append as an appendix to all manuals.
#texinfo_appendices = []

# If false, no module index is generated.
#texinfo_domain_indices = True

# How to display URL addresses: 'footnote', 'no', or 'inline'.
#texinfo_show_urls = 'footnote'

# If true, do not generate a @detailmenu in the "Top" node's menu.
#texinfo_no_detailmenu = False

paleomix-1.2.12/docs/examples.rst

.. _examples:

Example projects and data-sets
==============================

The PALEOMIX pipeline contains small example projects for the larger pipelines, which are designed to be executed in a short amount of time, and to help verify that the pipelines have been correctly installed.
.. _examples_bam:

BAM Pipeline example project
----------------------------

The example project for the BAM pipeline involves the processing of a small
data set consisting of (simulated) ancient sequences derived from the human
mitochondrial genome. The runtime of this project on a typical desktop or
laptop ranges from around 1 minute to around 1 hour (when building of models
of the ancient DNA damage patterns is enabled).

To access this example project, use the 'example' command for the
bam\_pipeline to copy the project files to a given directory (here, the
current directory)::

    $ paleomix bam_pipeline example .
    $ cd bam_pipeline
    $ paleomix bam_pipeline run 000_makefile.yaml

By default, this example project includes the recalibration of quality scores
for bases that are identified as putative *post-mortem* damage (see
[Jonsson2013]_). However, this greatly increases the time needed to run this
example. While it is recommended to run this step, it may be disabled by
setting the value of the 'RescaleQualities' option in the
'000\_makefile.yaml' file to 'no'.

Before:

.. code-block:: yaml
    :emphasize-lines: 3
    :linenos:
    :lineno-start: 83

    # Carry out quality base re-scaling of libraries using mapDamage
    # This will be done using the options set for mapDamage below
    RescaleQualities: yes

After:

.. code-block:: yaml
    :emphasize-lines: 3
    :linenos:
    :lineno-start: 83

    # Carry out quality base re-scaling of libraries using mapDamage
    # This will be done using the options set for mapDamage below
    RescaleQualities: no

The output generated by the pipeline is described in the
:ref:`bam_filestructure` section. Please see the :ref:`troubleshooting`
section if you run into problems running the pipeline.

.. _examples_phylo:

Phylogenetic Pipeline example project
-------------------------------------

The example project for the Phylogenetic pipeline involves the processing and
mapping of a small data set consisting of (simulated) sequences derived from
the human and primate mitochondrial genome, followed by the genotyping of
gene sequences and the construction of a maximum likelihood phylogeny. Since
this example project starts from raw reads, it requires that the BAM pipeline
has been correctly installed, as described in section
:ref:`bam_requirements`. The runtime of this project on a typical desktop or
laptop ranges from around 30 minutes to around 1 hour.

To access this example project, use the 'example' command for the
phylo\_pipeline to copy the project files to a given directory (here, the
current directory), and then run the 'setup.sh' script in the root directory,
to generate the data set::

    $ paleomix phylo_pipeline example .
    $ cd phylo_pipeline
    $ ./setup.sh

Once the example data has been generated, the two pipelines may be executed::

    $ cd alignment
    $ bam_pipeline run 000_makefile.yaml
    $ cd ../phylogeny
    $ phylo_pipeline genotyping+msa+phylogeny 000_makefile.yaml

The output generated by the pipeline is described in the
:ref:`phylo_filestructure` section. Please see the :ref:`troubleshooting`
section if you run into problems running the pipeline.

.. _examples_zonkey:

Zonkey Pipeline example project
-------------------------------

The example project for the Zonkey pipeline is based on a synthetic hybrid
between a Domestic donkey and an Arabian horse (obtained from
[Orlando2013]_), using a low number of reads (1200). The runtime of these
examples on a typical desktop or laptop ranges from around 30 minutes to
around 1 hour, depending on your local configuration.

To access this example project, download the Zonkey reference database (see
the 'Prerequisites' section of the :ref:`zonkey_usage` page for
instructions), and use the 'example' command for zonkey to copy the project
files to a given directory. Here, the current directory is used; to place the
example files in a different location, simply replace the '.' with the full
path to the desired directory::

    $ paleomix zonkey example database.tar .
    $ cd zonkey_pipeline

The example directory contains 3 BAM files; one containing a nuclear
alignment ('nuclear.bam'); one containing a mitochondrial alignment
('mitochondrial.bam'); and one containing a combined nuclear and
mitochondrial alignment ('combined.bam'). In addition, a sample table is
included, which shows how multiple samples may be specified and processed at
once. Each of these may be run as follows::

    # Process only the nuclear BAM;
    # by default, results are saved in 'nuclear.zonkey'
    $ paleomix zonkey run database.tar nuclear.bam

    # Process only the mitochondrial BAM;
    # by default, results are saved in 'mitochondrial.zonkey'
    $ paleomix zonkey run database.tar mitochondrial.bam

    # Process both the nuclear and the mitochondrial BAMs;
    # note that it is necessary to specify an output directory
    $ paleomix zonkey run database.tar nuclear.bam mitochondrial.bam results

    # Process the combined nuclear and mitochondrial BAM;
    # by default, results are saved in 'combined.zonkey'
    $ paleomix zonkey run database.tar combined.bam

    # Process multiple samples; the table corresponds to the four
    # cases listed above.
    $ paleomix zonkey run database.tar samples.txt

Please see the :ref:`troubleshooting` section if you run into problems
running the pipeline. The output generated by the pipeline is described in
the :ref:`zonkey_filestructure` section.

paleomix-1.2.12/docs/index.rst000066400000000000000000000057311314402124200162030ustar00rootroot00000000000000

Welcome to PALEOMIX's documentation!
====================================

The PALEOMIX pipelines are a set of pipelines and tools designed to aid the
rapid processing of High-Throughput Sequencing (HTS) data: The BAM pipeline
processes de-multiplexed reads from one or more samples, through sequence
processing and alignment, to generate BAM alignment files useful in
downstream analyses; the Phylogenetic pipeline carries out genotyping and
phylogenetic inference on BAM alignment files, either produced using the BAM
pipeline or generated elsewhere; and the Zonkey pipeline carries out a suite
of analyses on low coverage equine alignments, in order to detect the
presence of F1-hybrids in archaeological assemblages. In addition, PALEOMIX
aids in the metagenomic analysis of sequencing extracts. The pipelines have
been designed with ancient DNA (aDNA) in mind, and include several features
especially useful for the analyses of ancient samples, but can all be used
for the processing of modern samples, in order to ensure consistent data
processing.

For a detailed description of the pipelines, please refer to the PALEOMIX
website and this documentation; for questions, bug reports, and/or
suggestions, use the `GitHub tracker
<https://github.com/MikkelSchubert/paleomix/issues>`_, or contact Mikkel
Schubert at `MSchubert@snm.ku.dk <mailto:MSchubert@snm.ku.dk>`_.

The PALEOMIX pipelines have been published in Nature Protocols; if you make
use of PALEOMIX in your work, then please cite

  Schubert M, Ermini L, Sarkissian CD, Jónsson H, Ginolhac A, Schaefer R,
  Martin MD, Fernández R, Kircher M, McCue M, Willerslev E, and Orlando L.
"**Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX**". Nat Protoc. 2014 May;9(5):1056-82. doi: `10.1038/nprot.2014.063 `_. Epub 2014 Apr 10. PubMed PMID: `24722405 `_. The Zonkey pipeline has been published in Journal of Archaeological Science; if you make use of this pipeline in your work, then please cite Schubert M, Mashkour M, Gaunitz C, Fages A, Seguin-Orlando A, Sheikhi S, Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Chuang R, Ermini L, Gamba C, Weinstock J, Vedat O, and Orlando L. "**Zonkey: A simple, accurate and sensitive pipeline to genetically identify equine F1-hybrids in archaeological assemblages**". Journal of Archaeological Science. 2007 Feb; 78:147-157. doi: `10.1016/j.jas.2016.12.005 `_. **Table of Contents:** .. toctree:: :maxdepth: 2 introduction.rst installation.rst bam_pipeline/index.rst phylo_pipeline/index.rst zonkey_pipeline/index.rst other_tools.rst examples.rst troubleshooting/index.rst yaml.rst acknowledgements.rst related.rst references.rst Indices and tables ================== * :ref:`genindex` * :ref:`search` paleomix-1.2.12/docs/installation.rst000066400000000000000000000124621314402124200175740ustar00rootroot00000000000000.. highlight:: Bash .. _installation: Installation ============ The following instructions will install PALEOMIX for the current user, but does not include specific programs required by the pipelines. For pipeline specific instructions, refer to the requirements sections for the :ref:`BAM `, the :ref:`Phylogentic `, and the :ref:`Zonkey ` pipeline. The recommended way of installing PALEOMIX is by use of the `pip`_ package manager for Python. If Pip is not installed, then please consult the documentation for your operating system. In addition to the `pip`_ package manager for Python, the pipelines require `Python`_ 2.7, and `Pysam`_ v0.8.3+, which in turn requires both Python and libz development files (see the :ref:`troubleshooting_install` section). When installing PALEOMIX using pip, Pysam is automatically installed as well. However, note that installing Pysam requires the zlib and Python 2.7 development files. On Debian based distributions, these may be installed as follows: # apt-get install libz-dev python2.7-dev .. warning:: PALEOMIX has been developed for 64 bit systems, and has not been extensively tested on 32 bit systems! Regular installation -------------------- The following command will install PALEOMIX, and the Python modules required to run it, for the current user only:: $ pip install --user paleomix To perform a system-wide installation, simply remove the --user option, and run as root:: $ sudo pip install paleomix To verify that the installation was carried out correctly, run the command 'paleomix':: $ paleomix PALEOMIX - pipelines and tools for NGS data analyses. Version: v1.0.1 Usage: paleomix [options] [...] If the command fails, then please refer to the :ref:`troubleshooting` section. Self-contained installation --------------------------- In some cases, it may be useful to make a self-contained installation of PALEOMIX, *e.g.* on shared servers. This is because Python modules that have been installed system-wide take precendence over user-installed modules (this is a limitation of Python itself), which may cause problems both with PALEOMIX itself, and with its Python dependencies. 

This is accomplished using `virtualenv`_ for Python, which may be installed
using `pip`_ as follows::

    $ pip install --user virtualenv

or (for a system-wide installation)::

    $ sudo pip install virtualenv

The following example installs PALEOMIX in a virtual environment located in
*~/install/virtualenvs/paleomix*, but any location may be used::

    $ virtualenv ~/install/virtualenvs/paleomix
    $ source ~/install/virtualenvs/paleomix/bin/activate
    (paleomix) $ pip install paleomix
    (paleomix) $ deactivate

Following successful completion of these commands, the paleomix tools will be
accessible in the ~/install/virtualenvs/paleomix/bin/ folder. However, as
this folder also contains a copy of Python itself, it is not recommended to
add it to your PATH. Instead, simply link the paleomix commands to a folder
in your PATH. This can, for example, be accomplished as follows::

    $ mkdir ~/bin/
    $ echo 'export PATH=~/bin:$PATH' >> ~/.bashrc
    $ ln -s ~/install/virtualenvs/paleomix/bin/paleomix ~/bin/

PALEOMIX also includes a number of optional shortcuts which may be used in
place of running 'paleomix <command>' (for example, the command
'bam_pipeline' is equivalent to running 'paleomix bam_pipeline')::

    $ ln -s ~/install/virtualenvs/paleomix/bin/bam_pipeline ~/bin/
    $ ln -s ~/install/virtualenvs/paleomix/bin/conv_gtf_to_bed ~/bin/
    $ ln -s ~/install/virtualenvs/paleomix/bin/phylo_pipeline ~/bin/
    $ ln -s ~/install/virtualenvs/paleomix/bin/bam_rmdup_collapsed ~/bin/
    $ ln -s ~/install/virtualenvs/paleomix/bin/trim_pipeline ~/bin/

Upgrading an existing installation
----------------------------------

Upgrading an existing installation of PALEOMIX, installed using the methods
described above, may also be accomplished using pip. To upgrade a regular
installation, simply run pip install with the --upgrade option, for a user
installation::

    $ pip install --user --upgrade paleomix

Or for a system-wide installation::

    $ sudo pip install --upgrade paleomix

To upgrade a self-contained installation, simply activate the environment
before proceeding::

    $ source ~/install/virtualenvs/paleomix/bin/activate
    (paleomix) $ pip install --upgrade paleomix
    (paleomix) $ deactivate

Upgrading from PALEOMIX v1.1.x
------------------------------

When upgrading to v1.2.x or later from version 1.1.x or before, it is
necessary to perform a manual installation the first time. This is
accomplished by downloading and unpacking the desired version of PALEOMIX
from the list of releases, and then invoking setup.py. For example::

    $ wget https://github.com/MikkelSchubert/paleomix/archive/v1.2.4.tar.gz
    $ tar xvzf v1.2.4.tar.gz
    $ cd paleomix-1.2.4/
    # Either for the current user:
    $ python setup.py install --user
    # Or, for all users:
    $ sudo python setup.py install

Once this has been done, pip may be used to perform future upgrades as
described above.

.. _pip: https://pip.pypa.io/en/stable/
.. _Pysam: https://github.com/pysam-developers/pysam/
.. _Python: http://www.python.org/
.. _virtualenv: https://virtualenv.readthedocs.org/en/latest/

paleomix-1.2.12/docs/introduction.rst000066400000000000000000000051411314402124200176100ustar00rootroot00000000000000

.. _introduction:

============
Introduction
============

The PALEOMIX pipeline is a set of pipelines and tools designed to enable the
rapid processing of High-Throughput Sequencing (HTS) data from modern and
ancient samples.

Currently, PALEOMIX consists of 2 major pipelines and one protocol described
in [Schubert2014]_, as well as one as-yet unpublished pipeline:

* **The BAM pipeline** operates on de-multiplexed NGS reads, and carries out
  the steps necessary to produce high-quality alignments against a reference
  sequence, ultimately outputting one or more annotated BAM files.

* **The Metagenomic pipeline** is a protocol describing how to carry out
  metagenomic analyses on reads processed by the BAM pipeline, allowing for
  the characterisation of the metagenomic population of ancient samples. This
  protocol makes use of tools included with PALEOMIX.

* **The Phylogenetic pipeline** carries out genotyping, multiple sequence
  alignment, and phylogenetic inference on a set of regions derived from BAM
  files (e.g. produced using the BAM Pipeline).

* **The Zonkey Pipeline** is a smaller, experimental pipeline, for the
  detection of F1 hybrids in equids, based on low coverage nuclear genomes
  (as few as thousands of aligned reads) and (optionally) mitochondrial DNA.

All pipelines operate through a mix of standard bioinformatics tools,
including SAMTools [Li2009b]_, BWA [Li2009a]_, and more, as well as custom
scripts written to support the pipelines. The automated pipelines have been
designed to run analytical steps in parallel where possible, and to run with
minimal user-intervention. To guard against failed steps and to allow easy
debugging of failures, all analyses are run in individual temporary folders,
all output is logged (though only retained if the command fails), and results
are only moved into the destination directory upon successful completion of
the given task.

In order to facilitate automatic execution, and to ensure that analyses are
documented and can be replicated easily, the BAM and the Phylogenetic
Pipelines make use of configuration files (hence-forth "makefiles") in
`YAML`_ format; these are text files which describe a project in terms of
input files, settings for programs run as part of the pipeline, and which
steps to run. For an overview of the YAML format, refer to the included
introduction to :ref:`yaml_intro`, or to the official `YAML`_ website. For a
thorough discussion of the makefiles used by either pipeline, please refer to
the respective sections of the documentation (*i.e.* the :ref:`BAM
<bam_pipeline>` and :ref:`Phylogenetic <phylo_pipeline>` pipeline).

.. _YAML: http://www.yaml.org

paleomix-1.2.12/docs/other_tools.rst000066400000000000000000000076671314402124200174470ustar00rootroot00000000000000

.. _other_tools:

Other tools
===========

On top of the pipelines described in the major sections of the documentation,
PALEOMIX comes bundled with several other, smaller tools, all accessible via
the 'paleomix' command. These tools are (briefly) described in this section.

paleomix cleanup
----------------

.. TODO:
.. paleomix cleanup -- Reads SAM file from STDIN, and outputs sorted,
..                     tagged, and filter BAM, for which NM and MD tags have
..                     been updated.

paleomix coverage
-----------------

.. TODO:
.. paleomix coverage -- Calculate coverage across reference sequences
..                      or regions of interest.

paleomix depths
---------------

.. TODO:
.. paleomix depths -- Calculate depth histograms across reference
..                    sequences or regions of interest.

paleomix duphist
----------------

.. TODO:
.. paleomix duphist -- Generates PCR duplicate histogram; used with
..                     the 'Preseq' tool.

paleomix rmdup_collapsed
------------------------

.. TODO:
.. paleomix rmdup_collapsed -- Filters PCR duplicates for collapsed paired-
..                             ended reads generated by the AdapterRemoval
..                             tool.

paleomix genotype
-----------------

.. TODO:
.. paleomix genotype -- Creates bgzipped VCF for a set of (sparse) BED
..                      regions, or for entire chromosomes / contigs
..                      using SAMTools / BCFTools.

paleomix gtf_to_bed
-------------------

.. TODO:
.. paleomix gtf_to_bed -- Convert GTF file to BED files grouped by
..                        feature (coding, RNA, etc).

paleomix sample_pileup
----------------------

.. TODO:
.. paleomix sample_pileup -- Randomly sample sites in a pileup to generate a
..                           FASTA sequence.

.. warning::
    This tool is deprecated, and will be removed in future versions of
    PALEOMIX.

paleomix vcf_filter
-------------------

.. TODO:
.. paleomix vcf_filter -- Quality filters for VCF records, similar to
..                        'vcfutils.pl varFilter'.

paleomix vcf_to_fasta
---------------------

The 'paleomix vcf\_to\_fasta' command is used to generate FASTA sequences
from a VCF file, based either on a set of BED coordinates provided by the
user, or for the entire genome covered by the VCF file. By default,
heterozygous SNPs are represented using IUPAC codes; if a haploid sequence is
desired, random sampling of heterozygous sites may be enabled.

paleomix cat
------------

The 'paleomix cat' command provides a simple wrapper around the commands
'cat', 'gzip', and 'bzip2', calling each as appropriate depending on the
files listed on the command-line. This tool is primarily used in order to
allow the on-the-fly decompression of input for various programs that do not
support both gzip and bzip2 compressed input.

**Usage:**

::

    usage: paleomix cat [options] files

    positional arguments:
      files            One or more input files; these may be uncompressed,
                       compressed using gzip, or compressed using bzip2.

    optional arguments:
      -h, --help       show this help message and exit
      --output OUTPUT  Write output to this file; by default, output written
                       to STDOUT.

**Example:**

.. code-block:: bash

    $ echo "Simple file" > file1.txt
    $ echo "Gzip'ed file" | gzip > file2.txt.gz
    $ echo "Bzip2'ed file" | bzip2 > file3.txt.bz2
    $ paleomix cat file1.txt file2.txt.gz file3.txt.bz2
    Simple file
    Gzip'ed file
    Bzip2'ed file

.. warning::
    The 'paleomix cat' command works by opening the input files sequentially,
    identifying the compression scheme, and then calling the appropriate
    command. Therefore, this command only works on regular files and not on
    (named) pipes.

paleomix-1.2.12/docs/phylo_pipeline/000077500000000000000000000000001314402124200173545ustar00rootroot00000000000000paleomix-1.2.12/docs/phylo_pipeline/configuration.rst000066400000000000000000000001771314402124200227600ustar00rootroot00000000000000

.. highlight:: ini

.. _phylo_configuration:

Configuring the phylogenetic pipeline
=====================================

TODO

paleomix-1.2.12/docs/phylo_pipeline/filestructure.rst000066400000000000000000000001211314402124200230010ustar00rootroot00000000000000

.. highlight:: Yaml

.. _phylo_filestructure:

File structure
==============

TODO

paleomix-1.2.12/docs/phylo_pipeline/index.rst000066400000000000000000000023341314402124200212170ustar00rootroot00000000000000

.. _phylo_pipeline:

Phylogenetic Pipeline
=====================

**Table of Contents:**

.. toctree::

   overview.rst
   requirements.rst
   configuration.rst
   usage.rst
   makefile.rst
   filestructure.rst

.. warning::
    This section of the documentation is currently undergoing a complete
    rewrite, and may therefore be incomplete in places.

The Phylogenetic Pipeline is a pipeline designed for the processing of one or
more BAMs, in order to carry out genotyping of a set of regions of interest.
Following genotyping, multiple sequence alignment may optionally be carried
out (this is required if indels were called), and phylogenetic inference may
be done on the regions of interest, using a supermatrix approach through
ExaML.

Regions of interest, as defined for the Phylogenetic pipeline, are simply any
set of regions in a reference sequence, and may span anything from a few
short genomic regions, to the complete exome of complex organisms (tens of
thousands of genes), and even entire genomes.

The Phylogenetic pipeline is designed for ease of use in conjunction with the
BAM pipeline, but can be used on arbitrary BAM files, provided that these
follow the expected naming scheme (see the :ref:`phylo_usage` section).

paleomix-1.2.12/docs/phylo_pipeline/makefile.rst000066400000000000000000000002511314402124200216610ustar00rootroot00000000000000

.. highlight:: Bash

.. _phylo_makefile:

Makefile description
====================

TODO

TODO: Describe how to use 'MaxDepth: auto' with custom region, by creating new

paleomix-1.2.12/docs/phylo_pipeline/overview.rst000066400000000000000000000016771314402124200217660ustar00rootroot00000000000000

Overview of analytical steps
============================

During a typical analysis, the Phylogenetic pipeline will proceed through the
following steps:

1. Genotyping

   1. SNPs are called on the provided regions using SAMTools, and the
      resulting SNPs are filtered using the 'paleomix vcf_filter' tool.

   2. FASTA sequences are constructed for the regions of interest, using the
      filtered SNPs generated above, one FASTA file per set of regions and
      per sample.

2. Multiple sequence alignment

   1. Per-sample files generated in step 1 are collected, and used to build
      unaligned multi-FASTA files, one per region of interest.

   2. If enabled, multiple-sequence alignment is carried out on these files
      using MAFFT, to generate aligned multi-FASTA files.

3. Phylogenetic inference

   Following construction of (aligned) multi-FASTA sequences, phylogenetic
   inference may be carried out using a partitioned maximum likelihood
   approach via ExaML.

paleomix-1.2.12/docs/phylo_pipeline/requirements.rst000066400000000000000000000040251314402124200226320ustar00rootroot00000000000000

.. highlight:: Bash

.. _phylo_requirements:

Software requirements
=====================

Depending on the parts of the Phylogenetic pipeline used, different programs
are required. The following lists which programs are required for each part
of the pipeline, as well as the minimum version required:

Genotyping
----------

* `SAMTools <http://samtools.sourceforge.net>`_ v0.1.18+ [Li2009b]_
* `Tabix`_ v0.2.5

Both the 'tabix' and the 'bgzip' executable from the Tabix package must be
installed.

Multiple Sequence Alignment
---------------------------

* `MAFFT`_ v7+ [Katoh2013]_

Note that the pipeline requires the algorithm-specific MAFFT commands (e.g.
'mafft-ginsi', 'mafft-fftnsi') to be available. These are automatically
created by the 'make install' command.

Phylogenetic Inference
----------------------

* `RAxML`_ v7.3.2+ [Stamatakis2006]_
* `ExaML`_ v1.0.5+

The pipeline expects a single-threaded binary named 'raxmlHPC' for RAxML. The
pipeline expects the ExaML binary to be named 'examl', and the parser binary
to be named 'parse-examl'. Compiling and running ExaML requires an MPI
implementation (e.g. `OpenMPI`_), even if ExaML is run single-threaded. On
Debian and Debian-based distributions, this may be accomplished by installing
'mpi-default-dev' and 'mpi-default-bin'.
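
For example, on such systems both packages may be installed with apt-get as
follows (assuming root privileges)::

    # apt-get install mpi-default-dev mpi-default-bin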

Both programs offer a variety of makefiles suited for different
server-architectures and use-cases. If in doubt, use the Makefile.SSE3.gcc
makefiles, which are compatible with most modern systems::

    $ make -f Makefile.SSE3.gcc

Testing the pipeline
--------------------

An example project is included with the phylogenetic pipeline, and it is
recommended to run this project in order to verify that the pipeline and
required applications have been correctly installed. See the :ref:`examples`
section for a description of how to run this example project.

.. _Tabix: http://samtools.sourceforge.net/
.. _MAFFT: http://mafft.cbrc.jp/alignment/software/
.. _RAxML: https://github.com/stamatak/standard-RAxML
.. _EXaML: https://github.com/stamatak/ExaML
.. _OpenMPI: http://www.open-mpi.org/

paleomix-1.2.12/docs/phylo_pipeline/usage.rst000066400000000000000000000267331314402124200212250ustar00rootroot00000000000000

.. highlight:: Yaml

.. _phylo_usage:

Pipeline usage
==============

The 'phylo\_pipeline mkfile' command can be used to create a makefile
template, as with the 'bam\_pipeline mkfile' command (see section
:ref:`bam_usage`). This makefile is used to specify the samples, regions of
interest (to be analysed), and options for the various programs:

.. code-block:: bash

    $ phylo_pipeline mkfile > makefile.yaml

Note that filenames are not specified explicitly with this pipeline, but are
instead inferred from the names of samples, prefixes, etc. as described
below.

To execute the pipeline, a command corresponding to the step to be invoked is
used (see below):

.. code-block:: bash

    $ phylo_pipeline <command> [OPTIONS]

Samples
-------

The phylogenetic pipeline expects a number of samples to be specified. Each
sample has a name, a gender, and a genotyping method::

    Samples:
      <GROUP>:
        SAMPLE_NAME:
          Gender: ...
          Genotyping Method: ...

Gender is required, and is used to filter SNPs at homozygous sex chromosomes
(e.g. chrX and chrY for male humans). Any names may be used, and can simply
be set to e.g. 'NA' in case this feature is not used.

The genotyping method is either "SAMTools" for the default genotyping
procedure using samtools mpileup | bcftools view, or "Random Sampling" to
sample one random nucleotide in the pileup at each position. This key may be
left out to use the default (SAMTools) method.

Groups are optional, and may be used either for the sake of the reader, or to
specify a group of samples in lists of samples, e.g. when excluding samples
from a subsequent step, when filtering singletons, or when rooting
phylogenetic trees (see below).

For a given sample with name S, and a prefix with name P, the pipeline will
expect files to be located at ./data/samples/*S*.*P*.bam, or at
./data/samples/*S*.*P*.realigned.bam if the "Realigned" option is enabled
(see below).

Regions of interest
-------------------

Analysis is carried out for a set of "Regions of Interest", which is defined
as a set of named regions specified using BED files::

    RegionsOfInterest:
      NAME:
        Prefix: NAME_OF_PREFIX
        Realigned: yes/no
        ProteinCoding: yes/no
        IncludeIndels: yes/no

The options 'ProteinCoding' and 'IncludeIndels' take the values 'yes' and
'no' (without quotation marks), and determine the behavior when calling
indels. If 'IncludeIndels' is set to yes, indels are included in the
consensus sequence, and if 'ProteinCoding' is set to yes, only indels that
are a multiple of 3bp long are included.
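
For example, a set of regions named 'genes' on a prefix named 'rCRS' (both
names are purely illustrative) might be specified as follows, with the
corresponding files expected at the locations described next::

    RegionsOfInterest:
      genes:
        Prefix: rCRS
        Realigned: yes
        ProteinCoding: yes
        IncludeIndels: yes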

The name and the prefix determine the location of the expected BED file and
the FASTA file for the prefix: For a region of interest named R, and a prefix
named P, the pipeline will expect the BED file to be located at
./data/regions/P.R.bed. The prefix file is expected to be located at
./data/prefixes/P.fasta.

Genotyping
----------

Genotyping is done either by random sampling of positions, or by building a
pileup using samtools and calling SNPs / indels using bcftools. The command
used for full genotyping is similar to the following command:

.. code-block:: bash

    $ samtools mpileup [OPTIONS] | bcftools view [OPTIONS] -

In addition, SNPs / indels are filtered using the script 'vcf_filter', which
is included with the pipeline. This script implements the filters found in
"vcfutils.pl varFilter", with some additions.

Options for either method, including options for both the "samtools mpileup"
and the "bcftools view" commands, are set using the **Genotyping** section of
the makefile, and may be set for all regions of interest (default behavior)
or for each set of regions of interest::

    Genotyping:
      Defaults:
        ...

The 'Defaults' key specifies that the options given here apply to all regions
of interest; in addition to this key, the name of each set of regions of
interest may be used, to set specific values for one set of regions vs.
another set. Thus, assuming regions of interest 'ROI\_a' and 'ROI\_b',
options may be set as follows::

    Genotyping:
      Defaults:
        ...
      ROI_a:
        ...
      ROI_b:
        ...

For each set of regions of interest named ROI, the final settings are derived
by first taking the Defaults, and then overwriting values using the value
taken from the ROI section (if one such exists). The following shows how to
change values in Defaults for a single ROI::

    Genotyping:
      Defaults:
        --switch: value_a
      ROI_N:
        --switch: value_b

In the above, all ROI except "ROI\_N" will use the switch with 'value\_a',
while "ROI\_N" will use 'value\_b'. Executing the 'genotyping' step is
described below.

Finally, note the "Padding" option; this option specifies a number of bases
to include around each interval in a set of regions of interest. The purpose
of this padding is to allow filtering of SNPs based on the distance from
indels, in the case where the indels are outside the intervals themselves.

Multiple sequence alignment
---------------------------

Multiple sequence alignment (MSA) is currently carried out using MAFFT, if
enabled. Note that it is still necessary to run the MSA command (see below),
even if the multiple sequence alignment itself is disabled (for example in
the case where indels are not called in the genotyping step). This is because
the MSA step is responsible for generating both the unaligned multi-FASTA
files, and the aligned multi-FASTA files. It is necessary to run the
'genotyping' step prior to running the MSA step (see above).
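
For example, the 'genotyping' and 'msa' steps may be executed together in a
single invocation, using the plus-syntax described under *Executing the
pipeline* below; this assumes that the makefile generated above was saved as
'makefile.yaml':

.. code-block:: bash

    $ phylo_pipeline genotyping+msa makefile.yaml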

It is possible to select among the various MAFFT algorithms using the
"Algorithm" key, and additionally to specify command-line options for the
selected algorithm::

    MultipleSequenceAlignment:
      Defaults:
        Enabled: yes

        MAFFT:
          Algorithm: G-INS-i
          --maxiterate: 1000

Currently supported algorithms are as follows (as described on the `MAFFT
website`_):

* mafft - The basic program (mafft)
* auto - Equivalent to command 'mafft --auto'
* fft-ns-1 - Equivalent to the command 'fftns --retree 1'
* fft-ns-2 - Equivalent to the command 'fftns'
* fft-ns-i - Equivalent to the command 'fftnsi'
* nw-ns-i - Equivalent to the command 'nwnsi'
* l-ins-i - Equivalent to the command 'linsi'
* e-ins-i - Equivalent to the command 'einsi'
* g-ins-i - Equivalent to the command 'ginsi'

Command line options are specified as key / value pairs, as shown above for
the --maxiterate option, in the same manner that options are specified for
the genotyping section. Similarly, options may be specified for all regions
of interest ("Defaults"), or using the name of a set of regions of interest,
in order to set options for only that set of regions.

Phylogenetic inference
----------------------

Maximum likelihood phylogenetic inference is carried out using the ExaML
program. A phylogeny consists of a named set of (subsets of) one or more sets
of regions of interest, with individual regions partitioned according to some
scheme, and rooted on the midpoint of the tree or on one or more taxa::

    PhylogeneticInference:
      PHYLOGENY_NAME:
        ExcludeSamples: ...
        RootTreesOn: ...
        PerGeneTrees: yes/no
        RegionsOfInterest:
          REGIONS_NAME:
            Partitions: "111"
            SubsetRegions: SUBSET_NAME
        ExaML:
          Replicates: 1
          Bootstraps: 100
          Model: GAMMA

A phylogeny may exclude any number of samples specified in the Samples
section, by listing them under ExcludeSamples. Furthermore, if groups have
been specified for samples (e.g. "<GROUP>"), then these may be used as a
short-hand for multiple samples, by using the name of the group including the
angle-brackets ("<GROUP>").

Rooting is determined using the RootTreesOn option; if this option is not
set, then the resulting trees are rooted on the midpoint of the tree,
otherwise they are rooted on the clade containing all the given taxa. If the
taxa do not form a monophyletic clade, then rooting is done on the
monophyletic clade containing the given taxa.

If PerGeneTrees is set to yes, a tree is generated for every named feature in
the regions of interest (e.g. genes), otherwise a super-matrix is created
based on all features in all the regions of interest specified for the
current phylogeny.

Each phylogeny may include one or more sets of regions of interest, specified
under "RegionsOfInterest", using the same names as those specified under the
Project section.

Each feature in a set of regions of interest may be partitioned according to
a position-specific scheme. These are specified using a string of numbers
(0-9), which is then applied across the selected sequences to determine the
model for each position. For example, for the scheme "012" and a given
nucleotide sequence, models are applied as follows::

    AAGTAACTTCACCGTTGTGA
    01201201201201201201

Thus, the default partitioning scheme ("111") will use the same model for all
positions, and is equivalent to the schemes "1", "11", "1111", etc.
Similarly, a per-codon-position scheme may be accomplished using "123" or a
similar string. In addition to numbers, the character 'X' may be used to
exclude specific positions in an alignment. E.g. to exclude the third
position in codons, use a string like "11X".

Alternatively, Partitions may be set to 'no' to disable per-feature
partitions; instead a single partition is used per set of regions of
interest.

The options in the ExaML section specify the number of bootstrap trees to
generate from the original supermatrix (Bootstraps), the number of
phylogenetic inferences to carry out on the original supermatrix
(Replicates), and the model used (cf. the ExaML documentation).

The name (PHYLOGENY_NAME) is used to determine the location of the resulting
files, by default ./results/TITLE/phylogenies/NAME/. If per-gene trees are
generated, an additional two folders are used, namely the name of the regions
of interest, and the name of the gene / feature.

For each phylogeny, the following files are generated:

**alignments.partitions**:
    List of partitions used when running ExaML; the "reduced" file contains
    the same list of partitions, after empty columns (no called bases) have
    been excluded.

**alignments.phy**:
    Super-matrix used in conjunction with the list of partitions when calling
    ExaML; the "reduced" file contains the same matrix, but with empty
    columns (no bases called) excluded.

**alignments.reduced.binary**:
    The reduced supermatrix / partitions in the binary format used by ExaML.

**bootstraps.newick**:
    List of bootstrap trees in Newick format, rooted as specified in the
    makefile.

**replicates.newick**:
    List of phylogenies inferred from the full super-matrix, rooted as
    specified in the makefile.

**replicates.support.newick**:
    List of phylogenies inferred from the full super-matrix, with support
    values calculated using the bootstrap trees, and rooted as specified in
    the makefile.

Executing the pipeline
----------------------

The phylogenetic pipeline is executed similarly to the BAM pipeline, except
that a command is provided for each step ('genotyping', 'msa', and
'phylogeny'):

.. code-block:: bash

    $ phylo_pipeline <command> [OPTIONS]

Thus, to execute the genotyping step, the following command is used:

.. code-block:: bash

    $ phylo_pipeline genotyping [OPTIONS]

In addition, it is possible to run multiple steps by joining these with the
plus-symbol. To run both the 'genotyping' and 'msa' steps at the same time,
use the following command:

.. code-block:: bash

    $ phylo_pipeline genotyping+msa [OPTIONS]

.. _MAFFT website: http://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html

paleomix-1.2.12/docs/references.rst000066400000000000000000000102431314402124200172070ustar00rootroot00000000000000

==========
References
==========

.. [Alexander2009] Alexander *et al*. "**Fast model-based estimation of
   ancestry in unrelated individuals**". Genome Res. 2009 Sep;19(9):1655-64.
   doi:10.1101/gr.094052.109

.. [Chang2015] Chang *et al*. "**Second-generation PLINK: rising to the
   challenge of larger and richer datasets**". Gigascience. 2015 Feb 25;4:7.
   doi:10.1186/s13742-015-0047-8

.. [Daley2013] Daley and Smith. "**Predicting the molecular complexity of
   sequencing libraries**". Nat Methods. 2013 Apr;10(4):325-7.
   doi:10.1038/nmeth.2375

.. [DerSarkissian2015] Der Sarkissian *et al*. "**Evolutionary Genomics and
   Conservation of the Endangered Przewalski's Horse**". Curr Biol. 2015 Oct
   5;25(19):2577-83. doi:10.1016/j.cub.2015.08.032

.. [Jonsson2013] Jónsson *et al*. "**mapDamage2.0: fast approximate Bayesian
   estimates of ancient DNA damage parameters**". Bioinformatics. 2013 Jul
   1;29(13):1682-4. doi:10.1093/bioinformatics/btt193

.. [Jonsson2014] Jónsson *et al*. "**Speciation with gene flow in equids
   despite extensive chromosomal plasticity**". PNAS. 2014 Dec
   30;111(52):18655-60.
   doi:10.1073/pnas.1412627111

.. [Katoh2013] Katoh and Standley. "**MAFFT multiple sequence alignment
   software version 7: improvements in performance and usability**". Mol
   Biol Evol. 2013 Apr;30(4):772-80. doi:10.1093/molbev/mst010

.. [Langmead2012] Langmead and Salzberg. "**Fast gapped-read alignment with
   Bowtie 2**". Nat Methods. 2012 Mar 4;9(4):357-9. doi:10.1038/nmeth.1923

.. [Li2009a] Li and Durbin. "**Fast and accurate short read alignment with
   Burrows-Wheeler transform**". Bioinformatics. 2009 Jul 15;25(14):1754-60.
   doi:10.1093/bioinformatics/btp324

.. [Li2009b] Li *et al*. "**The Sequence Alignment/Map format and
   SAMtools**". Bioinformatics. 2009 Aug 15;25(16):2078-9.
   doi:10.1093/bioinformatics/btp352

.. [Lindgreen2012] Lindgreen. "**AdapterRemoval: Easy Cleaning of Next
   Generation Sequencing Reads**". BMC Research Notes. 2012; 5:337.

.. [McKenna2010] McKenna *et al*. "**The Genome Analysis Toolkit: a MapReduce
   framework for analyzing next-generation DNA sequencing data**". Genome
   Res. 2010 Sep;20(9):1297-303. doi:10.1101/gr.107524.110

.. [Orlando2013] Orlando *et al*. "**Recalibrating Equus evolution using the
   genome sequence of an early Middle Pleistocene horse**". Nature. 2013
   Jul;499(7456):74-78. doi:10.1038/nature12323

.. [Paradis2004] Paradis *et al*. "**APE: Analyses of Phylogenetics and
   Evolution in R language**". Bioinformatics. 2004 Jan 22;20(2):289-90.
   doi:10.1093/bioinformatics/btg412

.. [Patterson2006] Patterson *et al*. "**Population structure and
   eigenanalysis**". PLoS Genet. 2006 Dec;2(12):e190.
   doi:10.1371/journal.pgen.0020190

.. [Peltzer2016] Peltzer *et al*. "**EAGER: efficient ancient genome
   reconstruction**". Genome Biology. 2016 Mar 9; 17:60.
   doi:10.1186/s13059-016-0918-z

.. [Pickrell2012] Pickrell and Pritchard. "**Inference of population splits
   and mixtures from genome-wide allele frequency data**". PLoS Genet.
   2012;8(11):e1002967. doi:10.1371/journal.pgen.1002967

.. [Price2006] Price *et al*. "**Principal components analysis corrects for
   stratification in genome-wide association studies**". Nat Genet. 2006
   Aug;38(8):904-9. Epub 2006 Jul 23. doi:10.1038/ng1847

.. [Quinlan2010] Quinlan and Hall. "**BEDTools: a flexible suite of utilities
   for comparing genomic features**". Bioinformatics. 2010 Mar
   15;26(6):841-2. doi:10.1093/bioinformatics/btq033

.. [Schubert2012] Schubert *et al*. "**Improving ancient DNA read mapping
   against modern reference genomes**". BMC Genomics. 2012 May 10;13:178.
   doi:10.1186/1471-2164-13-178

.. [Schubert2014] Schubert *et al*. "**Characterization of ancient and modern
   genomes by SNP detection and phylogenomic and metagenomic analysis using
   PALEOMIX**". Nature Protocols. 2014 May;9(5):1056-82.
   doi:10.1038/nprot.2014.063

.. [Stamatakis2006] Stamatakis. "**RAxML-VI-HPC: maximum likelihood-based
   phylogenetic analyses with thousands of taxa and mixed models**".
   Bioinformatics. 2006 Nov 1;22(21):2688-90.

.. [Wickham2007] Wickham. "**Reshaping Data with the reshape Package**".
   Journal of Statistical Software. 2007 21(1).

.. [Wickham2009] Wickham. "**ggplot2: Elegant Graphics for Data Analysis**".
   Springer-Verlag New York 2009. ISBN:978-0-387-98140-6

paleomix-1.2.12/docs/related.rst000066400000000000000000000013341314402124200165070ustar00rootroot00000000000000

.. _related_tools:

Related Tools
=============

**Pipelines:**

* EAGER - Efficient Ancient GEnome Reconstruction (`website `_;
  [Peltzer2016]_)

  EAGER provides an intuitive and user-friendly way for researchers to
  address two problems in current ancient genome reconstruction projects;
  firstly, EAGER allows users to efficiently preprocess, map and analyze
  ancient genomic data using a standardized general framework for small to
  larger genome reconstruction projects. Secondly, EAGER provides a
  user-friendly interface that allows users to run EAGER without needing to
  fully understand all the underlying technical details.

  *(Description paraphrased from the EAGER website)*

paleomix-1.2.12/docs/troubleshooting/000077500000000000000000000000001314402124200175635ustar00rootroot00000000000000paleomix-1.2.12/docs/troubleshooting/bam_pipeline.rst000066400000000000000000000336701314402124200227520ustar00rootroot00000000000000

.. _troubleshooting_bam:

Troubleshooting the BAM Pipeline
================================

Troubleshooting BAM pipeline makefiles
--------------------------------------

**Path included multiple times in target**:
This message is triggered if the same target includes one or more input files
more than once::

    Error reading makefiles:
      MakefileError:
        Path included multiple times in target:
          - Record 1: Name: ExampleProject, Sample: Synthetic_Sample_1, Library: ACGATA, Barcode: Lane_1_001
          - Record 2: Name: ExampleProject, Sample: Synthetic_Sample_1, Library: ACGATA, Barcode: Lane_3_001
          - Canonical path 1: /home/username/temp/bam_example/000_data/ACGATA_L1_R1_01.fastq.gz
          - Canonical path 2: /home/username/temp/bam_example/000_data/ACGATA_L1_R2_01.fastq.gz

This may be caused by overly broad wildcards, or by simple mistakes. The
message indicates the lane in which the files were included, as well as the
"canonical" (i.e. following the resolution of symbolic links, etc.) path to
each of the files. To resolve this issue, ensure that each input file is only
included once for a given target.

**Target name used multiple times**:
If running multiple makefiles in the same folder, it is important that the
names given to targets in each makefile are unique, as the pipeline will
otherwise mix files between different projects (see the section
:ref:`bam_filestructure` for more information). The PALEOMIX pipeline
attempts to detect this, and prevents the pipeline from running in this
case::

    Error reading makefiles:
      MakefileError:
        Target name 'ExampleProject' used multiple times; output files would
        be clobbered!

**OutOfMemoryException (PicardTools, GATK, etc.):**
By default, the BAM pipeline will limit the amount of heap-space used by Java
programs to 4GB (on 64-bit systems; JVM defaults are used on 32-bit systems),
which may prove insufficient in some instances. This will result in the
failing program terminating with a stacktrace, such as the following::

    Exception in thread "main" java.lang.OutOfMemoryError
            at net.sf.samtools.util.SortingLongCollection.(SortingLongCollection.java:101)
            at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:443)
            at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:115)
            at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:158)
            at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:97)

To resolve this issue, increase the maximum amount of heap-space by using the
"--jre-option" command-line option; this permits the passing of options to
the Java Runtime Environment (JRE).
For example, to increase the maximum to 8GB, run the BAM pipeline as
follows::

    $ bam_pipeline run --jre-option -Xmx8g [...]

Troubleshooting AdapterRemoval
------------------------------

The AdapterRemoval task will attempt to verify the quality-offset specified
in the makefile; if the contents of the file do not match the expected offset
(i.e. contain quality scores that fall outside the range expected for that
offset, cf. http://en.wikipedia.org/wiki/FASTQ_format#Encoding), the task
will be aborted.

**Incorrect quality offsets specified in makefile**:
In cases where the sequence data can be determined to contain FASTQ records
with a different quality offset than that specified in the makefile, the task
will be aborted with a message corresponding to the following::

    'ExampleProject/reads/Synthetic_Sample_1/TGCTCA/Lane_1_002/reads.*'>:
      Error occurred running command:
        Error(s) running Node:
          Temporary directory: '/path/to/temp/folder'

          FASTQ file contains quality scores with wrong quality score offset
          (33); expected reads with quality score offset 64. Ensure that the
          'QualityOffset' specified in the makefile corresponds to the input.
          Filename = 000_data/TGCTCA_L1_R1_02.fastq.gz

Please verify the format of the input file, and update the makefile to use
the correct QualityOffset before starting the pipeline.

**Input file contains mixed FASTQ quality scores**:
In cases where the sequence data can be determined to contain FASTQ records
with quality scores corresponding to both of the possible offsets (for
example both "!" and "a"), the task will be aborted with a message
corresponding to the following example::

    'ExampleProject/reads/Synthetic_Sample_1/TGCTCA/Lane_1_002/reads.*'>:
      Error occurred running command:
        Error(s) running Node:
          Temporary directory: '/path/to/temp/folder'

          FASTQ file contains quality scores with both quality offsets (33
          and 64); file may be unexpected format or corrupt. Please ensure
          that this file contains valid FASTQ reads from a single source.
          Filename = '000_data/TGCTCA_L1_R1_02.fastq.gz'

This error would suggest that the input-file contains a mix of FASTQ records
from multiple sources, e.g. resulting from the concatenation of multiple sets
of data. If so, make use of the original data, and ensure that the quality
score offset for each set is specified correctly.

**Input file does not contain quality scores**:
If the input files do not contain any quality scores (e.g. due to malformed
FASTQ records), the task will terminate, as these are required by the
AdapterRemoval program. Please ensure that the input files are valid FASTQ
files before proceeding.

**Input files in FASTA format / not in FASTQ format**:
If the input file can be determined to be in FASTA format, or otherwise be
determined to not be in FASTQ format, the task will terminate with the
following message::

    'ExampleProject/reads/Synthetic_Sample_1/TGCTCA/Lane_1_002/reads.*'>:
      Error occurred running command:
        Error(s) running Node:
          Temporary directory: '/path/to/temp/folder'

          Input file appears to be in FASTA format (header starts with '>',
          expected '@'), but only FASTQ files are supported.
          Filename = '000_data/TGCTCA_L1_R1_02.fastq.gz'

Note that the pipeline only supports FASTQ files as input for the trimming
stage, and that these have to be either uncompressed, gzipped, or bzipped.
Other compression schemes are not supported at this point in time.
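
When troubleshooting the quality-offset problems described above, it may be
helpful to inspect the quality scores of an input file directly. The
following one-liner is merely an illustrative sketch (re-using the filename
from the examples above); it lists the distinct quality-score characters
observed in the first 1,000 reads. Characters in the range '!' to '9' are
only produced by offset 33, while characters greater than 'J' (e.g. lowercase
letters) typically indicate offset 64::

    $ gunzip -c 000_data/TGCTCA_L1_R1_02.fastq.gz \
        | awk 'NR % 4 == 0' | head -n 1000 \
        | fold -w1 | sort -u | tr -d '\n'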

Troubleshooting BWA
-------------------

The BAM pipeline has primarily been tested with BWA v0.5.x; this is due in
part to a number of issues with the Backtrack algorithm in later versions of
BWA. For this reason, it is recommended to use either v0.5.9-10, or a recent
version of BWA 0.7.x. Currently there is no version of BWA 0.7.x prior to
0.7.9a for which bugs have not been observed (see sub-sections below),
excepting BWA v0.7.0, which does however lack several important bug-fixes
added to later versions (see the BWA changelog).

**BWA prefix generated using different version of BWA / corrupt index**:
Between versions 0.5 and 0.6, BWA changed the binary format used to store the
index sequences produced using the command "bwa index". Version 0.7 is
compatible with indexes generated using v0.6. The pipeline will attempt to
detect the case where the current version of BWA does not correspond to the
version used to generate the index, and will terminate if that is the case.

As the two formats both contain files with the same names, they cannot
co-exist in the same location. Thus, to resolve this issue, either create a
new index in a new location and update the makefile to use that location, or
delete the old index files (path/to/prefix.fasta.*) and re-index the FASTA
file, either by using the command "bwa index path/to/prefix.fasta", or by
simply re-starting the pipeline.

However, because the filenames used by v0.6+ are a subset of the filenames
used by v0.5.x, it is possible to accidentally end up with a prefix that
appears to be v0.5.x to the pipeline, but in fact contains a mix of v0.5.x
and v0.6+ files. This situation, as well as corruption of the index, may
result in the following errors:

1. [bwt_restore_sa] SA-BWT inconsistency: seq_len is not the same
2. [bns_restore_core] fail to open file './rCRS.fasta.nt.ann'
3. Segmentation faults when running 'bwa aln'; these are reported as
   "SIGSEGV" in the file pipe.errors

If this occurs, removing the old prefix files and generating a new index is
advised (see above).

**[gzclose] buffer error**:
On some systems, BWA may terminate with a "[gzclose] buffer error" error when
mapping empty files (sometimes produced by AdapterRemoval). This is caused by
a bug / regression in some versions of zlib (http://www.zlib.net/), included
with some distributions. As it is typically not possible to upgrade zlib
without a full system update, BWA may instead be compiled using an up-to-date
version of zlib, as shown here for zlib v1.2.8 and BWA v0.5.10::

    $ wget http://downloads.sourceforge.net/project/bio-bwa/bwa-0.5.10.tar.bz2
    $ tar xvjf bwa-0.5.10.tar.bz2
    $ cd bwa-0.5.10
    $ sed -e's#INCLUDES=#INCLUDES=-Izlib-1.2.8/ #' -e's#-lz#zlib-1.2.8/libz.a#' Makefile > Makefile.zlib
    $ wget http://zlib.net/zlib-1.2.8.tar.gz
    $ tar xvzf zlib-1.2.8.tar.gz
    $ cd zlib-1.2.8
    $ ./configure
    $ make
    $ cd ..
    $ make -f Makefile.zlib

The resulting "bwa" executable must be placed in the PATH *before* the
version of BWA built against the outdated version of zlib.

Troubleshooting validation of BAM files
---------------------------------------

**Both mates are marked as second / first of pair**:
This error message may occur during validation of the final (realigned) BAM,
if the input files specified for different libraries contained duplicate
reads (*not* PCR duplicates). In that case, the final BAM will contain
multiple copies of the same data, thereby risking a significant bias in
downstream analyses.

The following demonstrates this problem, using a contrived example based on
the examples/bam_example project included with the pipeline::

    $ bam_pipeline run 000_makefile.yaml
    [...]
    : Error occurred running command:
        Error(s) running Node:
          Temporary directory: '/path/to/temp/folder'

        Error(s) running Node:
          Return-codes:        [1]
          Temporary directory: '/path/to/temp/folder'
The location of the log-file may be specified using the --log-file command-line option, but if --log-file is not specified, a time-stamped log-file is generated in the temporary folder specified using the --temp-root command-line option, and the location of this log-file is printed by the pipeline during execution:: $ 2014-01-07 09:46:19 Pipeline; 1 failed, 202 done of 203 tasks: Log-file located at '/path/to/temp/folder/bam_pipeline.20140107_094554_00.log' [...] Most error-messages will involve a message in the following form:: : Error occurred running command: Error(s) running Node: Return-codes: [1] Temporary directory: '/path/to/temp/folder' The task that failed was the validation of the BAM 'ExampleProject.rCRS.realigned.bam' using Picard ValidateSamFile, which terminated with return-code 1. For each command involved in a given task ('node'), the command-line (as the list passed to 'Popen'http://docs.python.org/2.7/library/subprocess.html), return code, and the current working directory (CWD) is shown. In addition, STDOUT and STDERR are always either piped to files, or to a different command. In the example given, STDOUT is piped to the file 'rCRS.realigned.validated', while STDERR is piped to the file 'pipe_java_4454836272.stderr'. The asterisks in 'STDERR*' indicates that this filename was generated by the pipeline itself, and that this file is only kept if the command failed. To determine the cause of the failure (indicated by the non-zero return-code), examine the output of each command involved in the node. Normally, messages relating to failures may be found in the STDERR file, but in some cases (and in this case) the cause is found in the STDOUT file:: $ cat /path/to/temp/folder/rCRS.realigned.validated ERROR: Record 87, Read name [...], Both mates are marked as second of pair ERROR: Record 110, Read name [...], Both mates are marked as first of pair [...] This particular error indicates that the same reads have been included multiple times in the makefile (see section [sub:Troubleshooting-BAM]). Normally it is nessesary to consult the documentation of the specified program in order to determine the cause of the failure. In addition, the pipeline performs a number of which during startup, which may result in the following issues being detected: **Required file does not exist, and is not created by a node**: Before start, the BAM and Phylogenetic pipeline checks for the presence of all required files. Should one or more files be missing, and the missing file is NOT created by the pipeline itself, an error similar to the following will be raised:: $ bam_pipeline run 000_makefile.yaml [...] Errors detected during graph construction (max 20 shown): Required file does not exist, and is not created by a node: Filename: 000_prefix/rCRS.fasta Dependent node(s): [...] This typically happens if the Makefile contains typos, or if the required files have been moved since the last time the makefile was executed. To proceed, it is necessary to determine the current location of the files in question, and/or update the makefile. **Required executables are missing**: Before starting to execute a makefile, the pipeline will check that the requisite programs are installed, and verify that the installed versions meet the minimum requirements. Should an executable be missing, an error similar to the following will be issued, and the pipeline will not run:: $ bam_pipeline run 000_makefile.yaml [...] 
    $ bam_pipeline run 000_makefile.yaml
    [...]
    Errors detected during graph construction (max 20 shown):
      Required executables are missing:
        bwa

In that case, please verify that all required programs are installed (see sections TODO) and ensure that these are accessible via the current user's PATH (i.e. can be executed on the command-line using just the executable name).

**Version requirement not met**:

In addition to checking for the presence of required executables (including java JARs), the version of each program is checked. Should the version of a program not be compatible with the pipeline (e.g. because it is too old), the following error is raised::

    $ bam_pipeline run 000_makefile.yaml
    [...]
    Version requirement not met for 'BWA';
    please refer to the PALEOMIX documentation for more information.

      Executable:    /Users/mischu/bin/bwa
      Call:          bwa
      Version:       v0.5.7.x
      Required:      v0.5.9.x or v0.5.10.x or v0.6.2.x or at least v0.7.9.x

If so, please refer to the documentation for the pipeline in question, and install/update the program to the version required by the pipeline. Note that the executable MUST be accessible via the PATH variable. If multiple versions of a program are installed, the version required by the pipeline must come first, which may be verified by using the "which" command::

    $ which -a bwa
    /home/username/bin/bwa
    /usr/local/bin/bwa

**Java Runtime Environment outdated / UnsupportedClassVersionError**:

If the version of the Java Runtime Environment (JRE) is too old, the pipeline may fail to run with the following message::

    The version of the Java Runtime Environment on this system is too old;
    please check the requirements for the program and upgrade your version
    of Java. See the documentation for more information.

Alternatively, Java programs may fail with a message similar to the following, as reported in the pipe_*.stderr file (abbreviated)::

    Exception in thread "main" java.lang.UnsupportedClassVersionError:
        org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0
        at [...]

This problem is typically caused by the GenomeAnalysisTK (GATK), which as of version 2.6 requires Java 1.7 (see `their website`_). To solve this problem, you will need to upgrade your copy of Java.

.. _their website: http://www.broadinstitute.org/gatk/guide/article?id=2846

paleomix-1.2.12/docs/troubleshooting/index.rst

.. _troubleshooting:

Troubleshooting
===============

.. toctree::

   install.rst
   common.rst
   bam_pipeline.rst
   phylo_pipeline.rst
   zonkey_pipeline.rst

For troubleshooting of individual pipelines, please see the BAM pipeline :ref:`troubleshooting_bam` section, the Phylo pipeline :ref:`troubleshooting_phylo` section, and the Zonkey pipeline :ref:`troubleshooting_zonkey` section.

paleomix-1.2.12/docs/troubleshooting/install.rst

.. highlight:: Bash

.. _troubleshooting_install:

Troubleshooting the installation
================================

**Pysam / Cython installation fails with "Python.h: No such file or directory" or "pyconfig.h: No such file or directory"**:

Installation of Pysam and Cython requires that Python development files are installed. On Debian based distributions, for example, this may be accomplished by running the following command::

    $ sudo apt-get install python-dev

**Pysam installation fails with "zlib.h: No such file or directory"**:

Installation of Pysam requires that "libz" development files are installed.
On Debian based distributions, for example, this may be accomplished by running the following command::

    $ sudo apt-get install libz-dev

**Command not found when attempting to run 'paleomix'**:

By default, the PALEOMIX executables ('paleomix', etc.) are installed in ~/.local/bin. You must ensure that this path is included in your PATH::

    $ export PATH=$PATH:~/.local/bin

To automatically apply this setting on subsequent logins (assuming that you are using Bash), run the following command::

    $ echo "export PATH=\$PATH:~/.local/bin" >> ~/.bash_profile

**PALEOMIX command-line aliases invoke wrong tools**:

When upgrading an old PALEOMIX installation (prior to v1.2.x) using pip, the existence of old files may result in all command-line aliases ('bam\_pipeline', 'phylo\_pipeline', 'bam\_rmdup\_collapsed', etc.) invoking the same command (typically 'phylo_pipeline')::

    $ bam_pipeline makefile.yaml
    Phylogeny Pipeline v1.2.1
    [...]

This can be solved by removing these aliases and then re-installing PALEOMIX, shown here for a system-wide install::

    $ sudo rm -v /usr/local/bin/bam_pipeline /usr/local/bin/conv_gtf_to_bed \
        /usr/local/bin/phylo_pipeline /usr/local/bin/bam_rmdup_collapsed \
        /usr/local/bin/trim_pipeline
    $ sudo python setup.py install

Alternatively, this may be resolved by downloading and manually installing PALEOMIX::

    $ wget https://github.com/MikkelSchubert/paleomix/archive/v1.2.4.tar.gz
    $ tar xvzf v1.2.4.tar.gz
    $ cd paleomix-1.2.4/
    # Either for the current user:
    $ python setup.py install --user
    # Or, for all users:
    $ sudo python setup.py install

paleomix-1.2.12/docs/troubleshooting/phylo_pipeline.rst

.. _troubleshooting_phylo:

Troubleshooting the Phylogenetic Pipeline
=========================================

TODO

TODO: MaxDepth not found in depth files

.. --target must match name used in makefile

paleomix-1.2.12/docs/troubleshooting/zonkey_pipeline.rst

.. _troubleshooting_zonkey:

Troubleshooting the Zonkey Pipeline
===================================

TODO

paleomix-1.2.12/docs/yaml.rst

.. highlight:: YAML

.. _yaml_intro:

YAML usage in PALEOMIX
======================

`YAML`_ is a simple markup language adopted for use in configuration files by the pipelines included in PALEOMIX. YAML was chosen because it is a plain-text format that is easy to read and write by hand. Since YAML files are plain-text, they may be edited using any standard text editor, with the following caveats:

* YAML exclusively uses spaces (space-bar) for indentation, not tabs; attempting to use tabs in YAML files will cause failures when the file is read by the associated program.

* YAML is case-sensitive; an option such as 'QualityOffset' is therefore not the same as 'qualityoffset'.

* It is strongly recommended that all files be named using the '.yaml' file-extension; setting the extension helps ensure proper handling by editors that natively support the YAML format.

Only a subset of YAML features are actually used by PALEOMIX, which are described below. These include **mappings**, by which values are identified by names; **lists** of values; and **numbers**, **text-strings**, and **true** / **false** values, typically representing program options, file-paths, and the like. In addition, comments prefixed by the hash-sign (#) are frequently used to provide documentation.
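Because mistakes such as tab-based indentation only surface when a file is read, it can be useful to check a YAML file before starting a pipeline. The following is merely one way of doing so; it assumes that the PyYAML module is installed, and 'makefile.yaml' is a hypothetical file to be checked:

.. code-block:: bash

    # Attempt to parse the file; prints nothing if the file is well-formed YAML
    $ python -c "import sys, yaml; yaml.safe_load(open(sys.argv[1]))" makefile.yaml

If the file contains a syntax error, such as a tab used for indentation, the parser instead reports the offending line and column.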
Comments
--------

Comments are specified by prefixing unquoted text with the hash-sign (#); all comments are ignored, and have no effect on the operation of the program. Comments are used solely to document the YAML files used by the pipelines::

    # This is a comment; the next line contains both a value and a comment:
    123  # Comments may be placed on the same line as values.

For the purposes of PALEOMIX, the above is equivalent to the following YAML code::

    123

As noted above, this only applies to unquoted text, and the following is therefore not a comment, but rather a text-string::

    "# this is not a comment"

Comments are used in the following sections to provide context.

Numbers (integers and floats)
-----------------------------

Numbers in YAML files include whole numbers (integers) as well as real numbers (floating point numbers). Numbers are mostly used for program options, such as a minimum read length option; most such options involve whole numbers, but a few involve real numbers. Numbers may be written as follows::

    # This is an integer:
    123

    # This is a float:
    123.5

    # This is a float written using scientific notation:
    1.235e2

Truth-values (booleans)
-----------------------

Truth-values (*true* and *false*) are frequently used to enable or disable options in PALEOMIX configuration files. Several synonyms are available, which help improve readability. More specifically, all of the following values are interpreted as *true* by the pipelines::

    true
    yes
    on

And similarly, the following values are all interpreted as *false*::

    false
    no
    off

Template files included with the pipelines mostly use 'yes' and 'no', but any of the above corresponding values may be used. Note, however, that none of these values are quoted: if single or double-quotation marks were used, then these values would be read as text rather than truth-values, as described next.

Text (strings)
--------------

Text, or strings, is the most commonly used type of value in the PALEOMIX YAML files; strings are used both as labels and as values for options, including paths to the files to use in an analysis::

    "Example"
    "This is a longer string"
    'This is also a string'
    "/path/to/my/files/reads.fastq"

For the most part it is not necessary to use quotation marks, and the above could instead be written as follows::

    Example
    This is a longer string
    This is also a string
    /path/to/my/files/reads.fastq

However, it is important to make sure that values that are intended to be used as strings are not misinterpreted as a different type of value. For example, without the quotation marks, the following values would be interpreted as numbers or truth-values::

    "true"
    "20090212"
    "17e13"

Mappings
--------

Mappings associate a value with a label (key), and are used for the majority of options. A mapping is simply a label followed by a colon, and then the value associated with that label::

    MinimumQuality: 17
    EnableFoo: no
    NameOfTest: "test 17"

In PALEOMIX configuration files, labels are always strings, and are normally not quoted.
However, in some cases, such as when using numerical labels, it may be useful to quote the labels::

    "A Label": on
    "12032016": "CPT"

Sections (mappings in mappings)
-------------------------------

In addition to mapping to a single value, a mapping may also itself contain one or more mappings::

    Top level:
      Second level: 'a value'
      Another value: true

Mappings can be nested any number of times, which is used in this manner to create sections and sub-sections in configuration files, grouping related options together::

    Options:
      Options for program:
        Option1: yes
        Option2: 17

      Another program:
        Option1: /path/to/file.fastq
        Option2: no

Note that the two mappings belonging to the 'Options' mapping are both indented the same number of spaces, which is what allows the program to figure out which values belong to which label. It is therefore important to keep indentation consistent.

Lists of values
---------------

In some cases, it is possible to specify zero or more values with labels. This is accomplished using lists, which consist of values prefixed with a dash::

    Section:
      - First value
      - Second value
      - Third value

Note that the indentation of each item must be the same, similar to how the indentation of sub-sections must be the same (see above).

Full example
------------

The following showcases the basic structure of a YAML document, as used by the pipelines::

    # This is a comment; this line is completely ignored
    This is a section:
      This is a subsection:
        # This subsection contains 3 label / value pairs:
        First label: "First value"
        Second label: 2
        Third label: 3.

    This is just another label: "Value!"

    This is a section containing a list:
      - The first item
      - The second item

.. _YAML: http://www.yaml.org

paleomix-1.2.12/docs/zonkey_pipeline/configuration.rst

.. highlight:: ini

.. _zonkey_configuration:

Configuring the Zonkey pipeline
===============================

Unlike the :ref:`bam_pipeline` and the :ref:`phylo_pipeline`, the :ref:`zonkey_pipeline` does not make use of makefiles. However, the pipeline does expose a number of options, including the maximum number of threads used, various program parameters, and more. These may be set using the corresponding command-line options (e.g. --max-threads to set the maximum number of threads used during runtime). However, it is also possible to set default values for such options, including on a per-host basis. This is accomplished by executing the following command, in order to generate a configuration file at ~/.paleomix/zonkey.ini:

.. code-block:: bash

    $ paleomix zonkey --write-config

The resulting file contains a list of options which can be overwritten::

    [Defaults]
    max_threads = 1
    log_level = warning
    progress_ui = progress
    treemix_k = 0
    admixture_replicates = 1
    ui_colors = on
    downsample_to = 1000000

These values will be used by the pipeline, unless the corresponding option is also supplied on the command-line. For example, if "max_threads" is set to 4 in the "zonkey.ini" file, but the pipeline is run using "paleomix zonkey run --max-threads 10", then the max threads value is set to 10.
.. note::
    Options in the configuration file correspond directly to command-line options for the Zonkey pipeline, with two significant differences: the leading dashes (--) are removed, and any remaining dashes are changed to underscores (_); as an example, the command-line option --max-threads becomes max\_threads in the configuration file, as shown above.

It is furthermore possible to set specific options depending on the current host-name. Assuming that the pipeline was run on multiple servers sharing a single home directory, one might set the maximum number of threads on a per-server basis as follows::

    [Defaults]
    max_threads = 32

    [BigServer]
    max_threads = 64

    [SmallServer]
    max_threads = 16

The names used (here "BigServer" and "SmallServer") should correspond to the hostname, i.e. the value returned by the "hostname" command:

.. code-block:: bash

    $ hostname
    BigServer

Any value set in the section matching the name of the current host will take precedence over the 'Defaults' section, but can still be overridden by specifying the same option on the command-line, as described above.

paleomix-1.2.12/docs/zonkey_pipeline/filestructure.rst

.. highlight:: Bash

.. _zonkey_filestructure:

File structure
==============

The following section explains the file structure for results generated by the Zonkey pipeline, based on the output produced when analyzing the example files included with the pipeline (see :ref:`examples_zonkey`).

Single sample analysis
----------------------

The following is based on running case 4a, as described in the :ref:`zonkey_usage` section. More specifically, the example in which the analyses are carried out on a BAM alignment file containing both nuclear and mitochondrial alignments::

    # Case 4a: Analyse both nuclear and mitochondrial genomes; results are placed in 'combined.zonkey'
    $ paleomix zonkey run database.tar combined.bam

As noted in the comment, executing this command places the results in the directory 'combined.zonkey'. For a completed analysis, the results directory is expected to contain an (HTML) report and a directory containing each of the figures generated by the pipeline:

* report.css
* report.html
* figures/

The report may be opened with any modern browser. Each figure displayed in the report is also available as a PDF file, accessed by clicking on a given figure in the report, or directly in the figures/ sub-directory.

Analysis result files
^^^^^^^^^^^^^^^^^^^^^

In addition, the following directories are generated by the analytical steps, and contain the various files used by or generated by the programs run as part of the Zonkey pipeline:

* admixture/
* mitochondria/
* pca/
* plink/
* treemix/

In general, files in these directories are sorted by the prefix 'incl\_ts' and the prefix 'excl\_ts', which indicate that sites containing transitions (C<->T and A<->G) have been included or excluded from the analyses, respectively. For a detailed description of the files generated by each analysis, please refer to the documentation for the respective programs used in said analyses.

Additionally, the results directory is expected to contain a 'temp' directory. This directory may safely be removed following the completion of a Zonkey run, but should be empty unless one or more analytical steps have failed.

Multi-sample analysis
---------------------

When multiple samples are processed at once, as described in case 5 (:ref:`zonkey_usage`), results are written to a single 'results' directory.
This directory will contain a summary report for all samples, as well as a sub-directory for each sample listed in the table of samples provided when running the pipeline. Thus, for the samples table shown in case 5::

    $ cat my_samples.txt
    example1 combined.bam
    example2 nuclear.bam
    example3 mitochondrial.bam
    example4 nuclear.bam mitochondrial.bam

    # Case 5a) Analyse 4 samples; results are placed in 'my_samples.zonkey'
    $ paleomix zonkey run database.tar my_samples.txt

The results directory is expected to contain the following files and directories:

* summary.html
* summary.css
* example1/
* example2/
* example3/
* example4/

The summary report may be opened with any modern browser, and offers a quick overview of all samples processed as part of this analysis. The individual report for each sample may furthermore be accessed by clicking on the headers corresponding to the name of a given sample.

The per-sample directories correspond exactly to the result directories that would have been generated if each sample had been processed by itself (see above), except that only a single 'temp' directory, located in the root of the results directory, is used.

paleomix-1.2.12/docs/zonkey_pipeline/index.rst

.. _zonkey_pipeline:

Zonkey Pipeline
===============

**Table of Contents:**

.. toctree::

   overview.rst
   requirements.rst
   configuration.rst
   usage.rst
   panel.rst
   filestructure.rst

The Zonkey Pipeline is an easy-to-use pipeline designed for the analyses of low-coverage, ancient DNA derived from historical equid samples, with the purpose of determining the species of the sample, as well as determining possible hybridization between horses, zebras, and asses (see :ref:`zonkey_usage`). This is accomplished by comparing one or more samples aligned against the *Equus caballus* 2.0 reference sequence with a reference panel of modern equids, including wild and domesticated equids.

The reference panel is further described in the :ref:`zonkey_panel` section.

The Zonkey pipeline has been published in the Journal of Archaeological Science; if you make use of this pipeline in your work, then please cite

    Schubert M, Mashkour M, Gaunitz C, Fages A, Seguin-Orlando A, Sheikhi S, Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Chuang R, Ermini L, Gamba C, Weinstock J, Vedat O, and Orlando L. "**Zonkey: A simple, accurate and sensitive pipeline to genetically identify equine F1-hybrids in archaeological assemblages**". Journal of Archaeological Science. 2017 Feb; 78:147-157. doi: `10.1016/j.jas.2016.12.005 <https://doi.org/10.1016/j.jas.2016.12.005>`_.

The sequencing data used in the Zonkey publication is available on `ENA`_ under the accession number `PRJEB15037`_.

.. _ENA: https://www.ebi.ac.uk/ena/
.. _PRJEB15037: https://www.ebi.ac.uk/ena/data/view/PRJEB15037

paleomix-1.2.12/docs/zonkey_pipeline/overview.rst

Overview of analytical steps
============================

Briefly, the Zonkey pipeline can run admixture tests on pre-defined species categories (asses, horses, and zebras) to evaluate the ancestry proportions found in the test samples. F1-hybrids are expected to show a balanced mixture of two species ancestries, although this balance can deviate from the 50:50 expectation in case limited genetic information is available. This is accomplished using ADMIXTURE [Alexander2009]_.
The Zonkey pipeline additionally builds maximum likelihood phylogenetic trees, using RAxML [Stamatakis2006]_ for mitochondrial sequence data and using TreeMix [Pickrell2012]_ for autosomal data. In the latter case, phylogenetic affinities are reconstructed twice: first considering no migration edges, and secondly allowing for one migration edge. This allows for fine-grained testing of admixture between the sample and any of the species represented in the reference panel.

In cases where an admixture signal is found, the location of the sample in the mitochondrial tree allows for the identification of the maternal species contributing to the hybrid being examined. For equids, this is essential to distinguish between the possible hybrid forms, such as distinguishing between mules (|female| horse x |male| donkey F1-hybrid) and hinnies (|male| horse x |female| donkey F1-hybrid).

Analyses are presented in HTML reports, one per sample and one summary report when analyzing multiple samples. Figures are generated in both PNG and PDF format in order to facilitate use in publications (see :ref:`zonkey_filestructure`).

Individual analytical steps
---------------------------

During a typical analysis, the Zonkey pipeline will proceed through the following major analytical steps:

1. Analyzing nuclear alignments:

   1. Input BAMs are indexed using the equivalent of 'samtools index'.

   2. Nucleotides at sites overlapping SNPs in the reference panel are sampled to produce a pseudo-haploid sequence, one in which transitions are included and one in which transitions are excluded, in order to account for the presence of *post-mortem* deamination causing base substitutions. The resulting tables are processed using PLINK to generate the prerequisite files for further analyses.

   3. PCA plots are generated using SmartPCA from the EIGENSOFT suite of tools, for both panels of SNPs (including and excluding transitions).

   4. Admixture estimates are carried out using ADMIXTURE, with a partially supervised approach, by assigning each sample in the reference panel to one of either two groups (caballine and non-caballine equids) or three groups (asses, horses, and zebras), and processing the SNP panels including and excluding transitions. The input sample is not assigned to a group.

   5. Migration edges are modeled using TreeMix, assuming either 0 or 1 migration edges; analyses are carried out on both the SNP panel including transitions and on the SNP panel excluding transitions.

   6. PNG and PDF figures are generated for each analytical step; in addition, the per-chromosome coverage of the nuclear genome is plotted.

2. Analyzing mitochondrial alignments:

   1. Input BAMs are indexed using the equivalent of 'samtools index'.

   2. The majority nucleotide at each position in the BAM is determined, and the resulting sequence is added to the mitochondrial reference multiple sequence alignment included in the reference panel.

   3. A maximum likelihood phylogeny is inferred using RAxML, and the resulting tree is drawn, rooted on the midpoint of the phylogeny.

3. Generating reports and summaries:

   1. A HTML report is generated for each sample, summarizing the data used and presenting (graphically) the results of each analysis carried out above. All figures are available as PNG and PDF (each figure in the report links to its PDF equivalent).

   2. If multiple samples were processed, a summary of all samples is generated, which presents the major results in an abbreviated form.
.. note::
    While the above shows an ordered list of steps, the pipeline may interleave individual steps during runtime, and may execute multiple steps in parallel when running in multi-threaded mode (see :ref:`zonkey_usage` for how to run the Zonkey pipeline using multiple threads).

.. |male| unicode:: U+02642 .. MALE
.. |female| unicode:: U+02640 .. FEMALE

paleomix-1.2.12/docs/zonkey_pipeline/panel.rst

.. _zonkey_panel:

Reference Panel
===============

The :ref:`zonkey_pipeline` operates using a reference panel of SNPs generated from a selection of extant equid species, including the domestic horse (Equus caballus) and the Przewalski's wild horse (Equus ferus przewalskii); within African asses, the domestic donkey (Equus asinus) and the Somali wild ass (Equus africanus); within Asian asses, the onager (Equus hemionus) and the Tibetan kiang (Equus kiang); and, within zebras, the plains zebra (Equus quagga), the mountain zebra (Equus hartmannae), and the Grevy's zebra (Equus grevyi). These samples were obtained from [Orlando2013]_, [DerSarkissian2015]_, and in particular from [Jonsson2014]_, which published genomes of every remaining extant equid species.

The reference panel has been generated using alignments against the Equus caballus reference nuclear genome (equCab2, via `UCSC`_) and the horse mitochondrial genome (NC\_001640.1, via `NCBI`_). The exact samples used to create the latest version of the reference panel are described below.

Obtaining the reference panel
-----------------------------

The latest version of the Zonkey reference panel (dated 2016-11-01) may be downloaded via the following website:

http://geogenetics.ku.dk/publications/zonkey

Once this reference panel has been downloaded, it is strongly recommended that you decompress it using the 'bunzip2' command, since this speeds up several analytical steps (at the cost of about 600 MB of additional disk usage). To decompress the reference panel, simply run 'bunzip2' on the file, as shown here:

.. code-block:: bash

    $ bunzip2 database.tar.bz2

.. warning::
    Do not untar the reference panel. The Zonkey pipeline currently expects data files to be stored in a tar archive, and will not work if the files have been extracted into a folder. This may change in the future.

Once this has been done, the Zonkey pipeline may be used as described in the :ref:`zonkey_usage` section.

Samples used in the reference panel
-----------------------------------

The following samples have been used in the construction of the latest version of the reference panel:

====== =================== ====== =========== =============================
Group  Species             Sex    Sample Name Publication
====== =================== ====== =========== =============================
Horses *E. caballus*       Male   FM1798      doi:`10.1016/j.cub.2015.08.032 <https://doi.org/10.1016/j.cub.2015.08.032>`_
.      *E. przewalskii*    Male   SB281       doi:`10.1016/j.cub.2015.08.032 <https://doi.org/10.1016/j.cub.2015.08.032>`_
Asses  *E. a. asinus*      Male   Willy       doi:`10.1038/nature12323 <https://doi.org/10.1038/nature12323>`_
.      *E. kiang*          Female KIA         doi:`10.1073/pnas.1412627111 <https://doi.org/10.1073/pnas.1412627111>`_
.      *E. h. onager*      Male   ONA         doi:`10.1073/pnas.1412627111 <https://doi.org/10.1073/pnas.1412627111>`_
.      *E. a. somaliensis* Female SOM         doi:`10.1073/pnas.1412627111 <https://doi.org/10.1073/pnas.1412627111>`_
Zebras *E. q. boehmi*      Female BOE         doi:`10.1073/pnas.1412627111 <https://doi.org/10.1073/pnas.1412627111>`_
.      *E. grevyi*         Female GRE         doi:`10.1073/pnas.1412627111 <https://doi.org/10.1073/pnas.1412627111>`_
.      *E. z. hartmannae*  Female HAR         doi:`10.1073/pnas.1412627111 <https://doi.org/10.1073/pnas.1412627111>`_
====== =================== ====== =========== =============================

Constructing a reference panel
==============================

The following section describes the format used for the reference panel in Zonkey. It is intended for people who are interested in constructing their own reference panels for a set of species.

.. warning::
    At the time of writing, the number of ancestral groups is hardcoded to 2 and 3 groups; support for any number of ancestral groups is planned. Contact me if this is something you need, and I'll prioritize adding this to the Zonkey pipeline.

It is important to note that a reference panel is created relative to a single reference genome. For example, for the equine reference panel, all alignments and positions are listed relative to the EquCab2.0 reference genome.

The reference panel consists of a number of files, which are described below:

settings.yaml
-------------

The settings file is a simple YAML-markup file, which specifies global options that apply to the reference panel. The current settings file looks as follows:

.. code-block:: yaml

    # Database format; is incremented when the format changes
    Format: 1

    # Revision number; is incremented when the database (but not format) changes
    Revision: 20161101

    # Arguments passed to plink
    Plink: "--horse"

    # Number of chromosomes; required for e.g. PCA analyses
    NChroms: 31

    # N bases of padding used for mitochondrial sequences; the last N bases are
    # expected to be the same as the first N bases, in order to allow alignments
    # at this region of the genome, and are combined to generate final consensus.
    MitoPadding: 31

    # The minimum distance between SNPs, assuming an even distribution of SNPs
    # across the genome. Used when --treemix-k is set to 'auto', which is the
    # default behavior. Value from McCue 2012 (doi:10.1371/journal.pgen.1002451).
    SNPDistance: 150000

The *Format* option defines the panel format, and reflects the version of the Zonkey pipeline that supports this panel. It should therefore not be changed unless the format, as described on this page, is changed.

The *Revision* reflects the version of a specific reference panel, and should be updated every time data or settings in the reference panel are changed. The equid reference panel simply uses the date at which a given version was created as the revision number.

The *Plink* option lists specific options passed to plink. In the above, this includes just the '--horse' option, which specifies the number of chromosomes expected for the horse genome and for data aligned against the horse genome.

The *NChroms* option specifies the number of autosomal chromosomes for the reference genome used to construct the reference panel. This is required for running PCA, but will likely be removed in the future (it is redundant due to contigs.txt).

The *MitoPadding* option is used for the mitochondrial reference sequences, and specifies that some number of bases at the end of the sequences are identical to the first bases in the sequence. Such duplication (or padding) is used to enable alignments spanning the break introduced when representing a circular genome as a linear FASTA sequence. If no such padding has been used, then this may simply be set to 0.

The *SNPDistance* option is used to calculate the number of SNPs per block when the --treemix-k option is set to 'auto' (the default behavior).
This option assumes that SNPs are evenly distributed across the genome, and calculates the block size based on the number of SNPs covered for a given sample.

contigs.txt
-----------

The 'contigs.txt' file contains a table describing the chromosomes included in the Zonkey analyses:

.. code-block:: text

    ID  Size       Checksum  Ns
    1   185838109  NA        2276254
    2   120857687  NA        1900145
    3   119479920  NA        1375010
    4   108569075  NA        1172002
    5   99680356   NA        1937819
    X   124114077  NA        2499591

The *ID* column specifies the name of the chromosome. Note that these names are expected to be either numerical (i.e. 1, 2, 21, 31) or sex chromosomes (X or Y).

The *Size* column must correspond to the length of the chromosome in the reference genome. The *Ns* column, on the other hand, allows for the number of uncalled bases in the reference to be specified. This value is subtracted from the chromosome size when calculating the relative coverage for sex determination.

The *Checksum* column should contain the MD5 sum calculated for the reference sequence, or 'NA' if not available. If specified, this value is intended to be compared with the MD5 sums listed in the headers of BAM files analyzed by the Zonkey pipeline, to ensure that the correct reference sequence is used.

.. note::
    This checksum check is currently not supported, but will be added soon.

.. note::
    The mitochondria is not included in this table; only list autosomes to be analyzed.

samples.txt
-----------

The 'samples.txt' table should contain a list of all samples included in the reference panel, and provides various information about these, the most important of which is the ancestral groups to which a given sample belongs:

.. code-block:: text

    ID    Group(3)  Group(2)      Species         Sex     SampleID  Publication
    ZBoe  Zebra     NonCaballine  E. q. boehmi    Female  BOE       doi:10.1073/pnas.1412627111
    AOna  Ass       NonCaballine  E. h. onager    Male    ONA       doi:10.1073/pnas.1412627111
    HPrz  Horse     Caballine     E. przewalskii  Male    SB281     doi:10.1016/j.cub.2015.08.032

The *ID* column is used as the name of the sample in the text, tables, and figures generated when running the Zonkey pipeline. It is advised to keep this name short and preferably descriptive of the group to which the sample belongs.

The *Group(2)* and *Group(3)* columns specify the ancestral groups to which the sample belongs, when considering either 2 or 3 ancestral groups. Note that Zonkey currently only supports 2 and 3 ancestral groups (see above).

The *Species*, *Sex*, *SampleID*, and *Publication* columns are meant to contain extra information about the samples, used in the reports generated by the Zonkey pipeline, and are not used directly by the pipeline.

mitochondria.fasta
------------------

The 'mitochondria.fasta' file is expected to contain a multiple sequence alignment involving two different sets of sequences: firstly, it must contain one or more reference sequences against which the input mitochondrial alignments have been carried out. In addition, it should contain at least one sequence per species in the reference panel.

Zonkey will compare the reference sequences (whether or not subtracting the amount of padding specified in the 'settings.yaml' file) against the contigs in the input BAM in order to identify mitochondrial sequences. The Zonkey pipeline then uses the alignment of the identified reference sequence to place the sample into the multiple sequence alignment.

By default, all sequences in the 'mitochondria.fasta' file are included in the mitochondrial phylogeny.
However, reference sequences can be excluded by adding an 'EXCLUDE' label after the sequence name:

.. code-block:: text

    >5835107Eq_mito3 EXCLUDE
    gttaatgtagcttaataatat-aaagcaaggcactgaaaatgcctagatgagtattctta

Sequences thus marked are not used for the phylogenetic inference itself.

simulations.txt
---------------

The 'simulations.txt' file contains the results of analyzing simulated data sets, in order to generate an empirical distribution of deviations from the expected admixture values:

.. code-block:: text

    NReads  K  Sample1    Sample2       HasTS  Percentile  Value
    1000    2  Caballine  NonCaballine  FALSE  0.000       7.000000e-06
    1000    2  Caballine  NonCaballine  FALSE  0.001       1.973480e-04
    1000    2  Caballine  NonCaballine  FALSE  0.002       2.683880e-04
    1000    2  Caballine  NonCaballine  FALSE  0.003       3.759840e-04
    1000    2  Caballine  NonCaballine  FALSE  0.004       4.595720e-04
    1000    2  Caballine  NonCaballine  FALSE  0.005       5.518900e-04
    1000    2  Caballine  NonCaballine  FALSE  0.006       6.591180e-04

The *NReads* column specifies the number of sequence alignments used in the simulated sample (e.g. 1000, 10000, 100000, and 1000000). Zonkey uses these simulations for different numbers of reads to establish lower and upper bounds on the empirical p-values: the lower bound is based on the largest *NReads* less than or equal to the number of reads analyzed, and the upper bound on the smallest *NReads* greater than or equal to that number.

The *K* column lists the number of ancestral groups specified when the sample was analyzed; in the equine reference panel, this is either 2 or 3.

The *Sample1* and *Sample2* columns list the two ancestral groups from which the synthetic hybrid was produced. The order in which these are listed does not matter.

The *HasTS* column specifies whether transitions were included (TRUE) or excluded (FALSE).

The *Percentile* column specifies the percentage of simulations with a *Value* less than or equal to the current *Value*. The *Value* column lists the absolute observed deviation from the expected admixture proportion (i.e. 0.5).

There is currently no way to generate this table automatically, but adding support for doing so is planned. Note also that Zonkey can be run using a hidden option '--admixture-only', which skips all analyses except those required in order to run ADMIXTURE on the data, and thereby makes it trivial to run ADMIXTURE exactly as it would be run by Zonkey. For example::

    $ paleomix zonkey run --admixture-only database.tar simulation.bam

genotypes.txt
-------------

The 'genotypes.txt' file contains a table of variable (heterozygous or otherwise non-reference) sites relative to the reference sequence used for the reference panel.

.. warning::
    Columns in the 'genotypes.txt' file are expected to be in the exact order shown below.

.. code-block:: text

    Chrom  Pos   Ref  AAsi;AKia;AOna;ASom;HCab;HPrz;ZBoe;ZGre;ZHar
    1      1094  A    CAACAAAAA
    1      1102  G    AGGAGGGGG
    1      1114  A    AAAAAAAGA
    1      1126  C    CCCCCCCYC
    1      1128  C    CCCCCCCGC
    1      1674  T    GGGGTTGGG
    1      1675  G    GCCGGGGGG

The *Chrom* column is expected to contain only those contigs / chromosomes listed in the 'contigs.txt' file; the *Pos* column contains the 1-based positions of the variable sites relative to the reference sequence. The *Ref* column contains the nucleotide observed in the reference sequence at the current position; it is currently not used, and may be removed in future versions of Zonkey.

The header of the final column consists of the names of every sample listed in 'samples.txt', joined by semi-colons, and each row contains one single-letter nucleotide per sample, encoded using IUPAC codes (i.e. A equals AA, W equals AT).
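When constructing a custom panel, a simple consistency check of this table can catch rows in which the number of encoded genotypes does not match the number of samples. The following sketch is merely one way of doing so, and assumes that the columns are tab-separated (adjust the -F option if another separator is used):

.. code-block:: bash

    # Report data rows whose genotype string length differs from the number
    # of sample names listed in the header of the final column
    $ awk -F'\t' 'NR == 1 { n = split($4, names, ";") }
                  NR > 1 && length($4) != n { print "Malformed row at line " NR }' genotypes.txt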
The equine reference panel only includes sites that were called in every sample, but sites with missing data may be included by setting the nucleotide to 'N' for the samples in which they were not called.

Packaging the files
-------------------

The reference panel is distributed as a tar archive. For best performance, the files should be laid out so that the genotypes.txt file is the last file in the archive. This may be accomplished with the following command:

.. code-block:: bash

    $ tar cvf database.tar settings.yaml contigs.txt samples.txt \
        mitochondria.fasta simulations.txt examples genotypes.txt

The tar file may be compressed for distribution (bzip2 or gzip), but should be used uncompressed for best performance.

.. _NCBI: https://www.ncbi.nlm.nih.gov/nuccore/5835107
.. _UCSC: https://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Horse&db=0

paleomix-1.2.12/docs/zonkey_pipeline/requirements.rst

.. highlight:: bash

.. _zonkey_requirements:

Software Requirements
=====================

The Zonkey pipeline requires PALEOMIX version 1.2.7 or later. In addition to the requirements listed for the PALEOMIX pipeline itself in the :ref:`installation` section, the Zonkey pipeline requires that a number of other pieces of software be installed:

* RScript from the `R`_ package, v3.1+.
* SmartPCA from the `EIGENSOFT`_ package, v13050+ [Patterson2006]_, [Price2006]_
* `ADMIXTURE`_ v1.23+ [Alexander2009]_
* `PLINK`_ v1.7+ [Chang2015]_
* `RAxML`_ v7.3.2+ [Stamatakis2006]_
* `SAMTools`_ v0.1.19+ [Li2009b]_
* `TreeMix`_ v1.12+ [Pickrell2012]_

The following R packages are required in order to carry out the plotting:

* `RColorBrewer`_
* `ape`_ [Paradis2004]_
* `ggplot2`_ [Wickham2009]_
* `ggrepel`_
* `reshape2`_ [Wickham2007]_

The R packages may be installed using the following commands::

    $ R
    > install.packages(c('RColorBrewer', 'ape', 'ggrepel', 'ggplot2', 'reshape2'))

Installing under OSX
--------------------

Installing the Zonkey pipeline under OSX poses several difficulties, mainly due to SmartPCA. In the following, it is assumed that the `Brew package manager`_ has been installed, as this greatly simplifies the installation of other, required pieces of software.

Firstly, install the software and libraries required to compile SmartPCA::

    $ brew install gcc
    $ brew install homebrew/dupes/lapack
    $ brew install homebrew/science/openblas

In each case, note down the values indicated for LDFLAGS, CFLAGS, CPPFLAGS, etc.

Next, download and unpack the `EIGENSOFT`_ software. The following has been tested on EIGENSOFT version 6.1.1 ('EIG6.1.1.tar.gz'). To build SmartPCA it may further be necessary to remove the use of the 'real-time' library::

    $ sed -e's# -lrt##' Makefile > Makefile.no_rt

Once you have done this, you can build SmartPCA using the locally installed libraries::

    $ env CC="/usr/local/opt/gcc/bin/gcc-6" \
        LDFLAGS="-L/usr/local/opt/openblas/lib/" \
        CFLAGS="-flax-vector-conversions -I/usr/local/opt/lapack/include/" \
        make -f Makefile.no_rt

The above worked on my installation, but you may need to correct the variables using the values provided by Brew, which you noted down after running the 'install' commands. You may also need to change the location of GCC set in the CC variable.

Testing the pipeline
--------------------

An example project is included with the Zonkey pipeline, and it is recommended to run this project in order to verify that the pipeline and required applications have been correctly installed.
See the :ref:`examples_zonkey` section for a description of how to run this example project.

.. _ADMIXTURE: https://www.genetics.ucla.edu/software/admixture/
.. _EIGENSOFT: http://www.hsph.harvard.edu/alkes-price/software/
.. _PLINK: https://www.cog-genomics.org/plink2
.. _R: http://www.r-base.org/
.. _RAxML: https://github.com/stamatak/standard-RAxML
.. _RColorBrewer: https://cran.r-project.org/web/packages/RColorBrewer/index.html
.. _SAMTools: https://samtools.github.io
.. _TreeMix: http://pritchardlab.stanford.edu/software.html
.. _ape: https://cran.r-project.org/web/packages/ape/index.html
.. _ggrepel: https://cran.r-project.org/web/packages/ggrepel/index.html
.. _ggplot2: https://cran.r-project.org/web/packages/ggplot2/index.html
.. _reshape2: https://cran.r-project.org/web/packages/reshape2/index.html
.. _Brew package manager: http://www.brew.sh

paleomix-1.2.12/docs/zonkey_pipeline/usage.rst

.. highlight:: Yaml

.. _zonkey_usage:

Pipeline usage
==============

The Zonkey pipeline is run on the command-line using the command 'paleomix zonkey', which gives access to a handful of commands (the angle-bracket placeholders below denote the arguments described for each command):

.. code-block:: bash

    $ paleomix zonkey
    USAGE:
      paleomix zonkey run <panel> <sample.bam> [<destination>]
      paleomix zonkey run <panel> <nuclear.bam> <mitochondrial.bam> <destination>
      paleomix zonkey run <panel> <samples.txt> [<destination>]
      paleomix zonkey dryrun <panel> [...]
      paleomix zonkey mito <panel> <destination>

Briefly, it is possible to run the pipeline on a single sample by specifying the location of `BAM alignments`_ against the Equus caballus reference nuclear genome (equCab2, see `UCSC`_), and / or against the horse mitochondrial genome (using either the standard mitochondrial sequence NC\_001640.1, see `NCBI`_, or a mitochondrial genome of one of the samples included in the reference panel, as described below). The individual commands allow for different combinations of alignment strategies:

**paleomix zonkey run <panel> <sample.bam> [<destination>]**
    Runs the Zonkey pipeline on a single BAM alignment <sample.bam>, which is expected to contain a nuclear and / or a mitochondrial alignment. If <destination> is specified, a directory at that location is created, and the resulting output saved there. If <destination> is not specified, the default location is chosen by replacing the file-extension of the alignment file (typically '.bam') with '.zonkey'.

**paleomix zonkey run <panel> <nuclear.bam> <mitochondrial.bam> <destination>**
    This command allows for the combined analyses of the nuclear and mitochondrial genomes, in cases where these alignments have been carried out separately. In this case, specifying a <destination> location is mandatory.

**paleomix zonkey run <panel> <samples.txt> [<destination>]**
    It is possible to run the pipeline on multiple samples at once, by specifying a list of BAM files (here <samples.txt>), which lists a sample name and one or two BAM files per line, with each column separated by tabs. A destination may (optionally) be specified, as when specifying a single BAM file (see above).

**paleomix zonkey dryrun <panel> [...]**
    The 'dryrun' command is equivalent to the 'run' command, but does not actually carry out the analytical steps; this command is useful for testing for problems before executing the pipeline, such as missing or outdated software requirements (see :ref:`zonkey_requirements`).

**paleomix zonkey mito <panel> <destination>**
    The 'mito' command is included to create a :ref:`bam_pipeline` project template for mapping FASTQ reads against the mitochondrial genomes of the samples included in the Zonkey reference panel (see Prerequisites below for a list of samples).

These possibilities are described in further detail below.

Prerequisites
-------------

All invocations of the Zonkey pipeline take the path to a 'panel' file as their first argument.
This file is the reference panel providing the genetic information necessary for performing species and/or hybrid identification, and currently includes representatives of all extant equid species. The reference panel thereby allows for the identification of first generation hybrids between any living equine species, whether within or between the caballine and non-caballine groups. For a more detailed description of the reference panel, the species included in this panel, and instructions for where to download the latest version of the file, please refer to the :ref:`zonkey_panel` section.

Secondly, the pipeline requires either one or two BAM files per sample, representing alignments against nuclear and / or mitochondrial genomes as described above. The analyses carried out by the Zonkey pipeline depend on the contents of the BAM alignment file provided for a given sample, and are presented below.

Single sample analysis
----------------------

For a single sample, the pipeline may be invoked by providing the path to the reference panel file followed by the path to one or two BAM files belonging to that sample, as well as a (mostly optional) destination directory. For these examples, we will assume that the reference panel is saved in the file 'database.tar'; that the BAM file 'nuclear.bam' contains an alignment against the equCab2 reference genome; that the BAM file 'mitochondrial.bam' contains an alignment against the corresponding mitochondrial reference genome (GenBank Accession Nb. NC_001640.1); and that the BAM file 'combined.bam' contains an alignment against both the nuclear and mitochondrial genomes. If so, the pipeline may be invoked as follows:

.. code-block:: bash

    # Case 1a: Analyse nuclear genome; results are placed in 'nuclear.zonkey'
    $ paleomix zonkey run database.tar nuclear.bam

    # Case 1b: Analyse nuclear genome; results are placed in 'my_results'
    $ paleomix zonkey run database.tar nuclear.bam my_results

    # Case 2a: Analyse mitochondrial genome; results are placed in 'mitochondrial.zonkey'
    $ paleomix zonkey run database.tar mitochondrial.bam

    # Case 2b: Analyse mitochondrial genome; results are placed in 'my_results'
    $ paleomix zonkey run database.tar mitochondrial.bam my_results

    # Case 3: Analyse both nuclear and mitochondrial genomes, placing results in 'my_results'
    $ paleomix zonkey run database.tar nuclear.bam mitochondrial.bam my_results

    # Case 4a: Analyse both nuclear and mitochondrial genomes; results are placed in 'combined.zonkey'
    $ paleomix zonkey run database.tar combined.bam

    # Case 4b: Analyse both nuclear and mitochondrial genomes; results are placed in 'my_results'
    $ paleomix zonkey run database.tar combined.bam my_results

.. note::
    The filenames used here have been chosen purely to illustrate each operation, and do not affect the operation of the pipeline.

As shown above, the pipeline will place the resulting output files in a directory named after the input file by default. This behavior, however, can be overridden by the user by specifying a destination directory (cases 1b, 2b, and 4b). When specifying two input files, however, it is required to manually specify the directory in which to store the output files (case 3).

The resulting report may be accessed in the output directory under the name 'report.html', which contains summary statistics and figures for the analyses performed for the sample. The structure of the directory containing the output files is described further in the :ref:`zonkey_filestructure` section.
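If in doubt whether a given BAM file contains nuclear alignments, mitochondrial alignments, or both, the reference sequences recorded in the BAM header may be inspected before running the pipeline. The following is merely a quick check using samtools, applied here to the 'combined.bam' file used in the examples above:

.. code-block:: bash

    # List the reference sequences (@SQ lines) recorded in the BAM header
    $ samtools view -H combined.bam | grep '^@SQ'

An @SQ line matching the mitochondrial reference (e.g. the 'MT' sequence from equCab2, or NC\_001640.1) indicates that mitochondrial analyses are possible, while @SQ lines matching the equCab2 chromosomes indicate that nuclear analyses are possible.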
Multi-sample analysis
---------------------

As noted above, it is possible to analyze multiple, different samples in one go. This is accomplished by providing a text file containing a tab-separated table of samples, in which the first column specifies the name of the sample, while the second and third columns specify the location of the one or two BAM alignments associated with that sample. The following example shows one such file, corresponding to cases 1 - 4 described above:

.. code-block:: bash

    $ cat samples.txt
    case_1 nuclear.bam
    case_2 mitochondrial.bam
    case_3 nuclear.bam mitochondrial.bam
    case_4 combined.bam

Processing of these samples is then carried out as shown above:

.. code-block:: bash

    # Case 5a) Analyse 4 samples; results are placed in 'samples.zonkey'
    $ paleomix zonkey run database.tar samples.txt

    # Case 5b) Analyse 4 samples; results are placed in 'my_results'
    $ paleomix zonkey run database.tar samples.txt my_results

The resulting directory contains a 'summary.html' file, providing an overview of all samples processed in the analyses, with links to the individual, per-sample reports, as well as a sub-directory for each sample, corresponding to the directory obtained by running individual analyses on each of the samples. The structure of the directory containing the output files is further described in the :ref:`zonkey_filestructure` section.

.. note::
    Only upper-case and lower-case letters (a-z, and A-Z), as well as numbers (0-9), and underscores (_) are allowed in sample names.

Rooting TreeMix trees
---------------------

By default, the Zonkey pipeline does not attempt to root TreeMix trees; this is because the out-group specified *must* form a monophyletic clade; if this is not the case (e.g. if the clade containing the two reference horse samples becomes paraphyletic due to the test sample nesting with one of them), TreeMix will fail to run to completion. It may therefore be preferable to run the pipeline without specifying an outgroup, and then specify the outgroup in a second run, once the placement of the sample is known.

This is accomplished using the --treemix-outgroup command-line option, with the samples forming the out-group given as a comma-separated list. For example, assuming that the following TreeMix tree was generated for our sample:

.. image:: ../_static/zonkey/incl_ts_0_tree_unrooted.png

If so, we may wish to root on the clade containing the caballine specimens (all other command-line arguments omitted for simplicity):

.. code-block:: bash

    $ paleomix zonkey run ... --treemix-outgroup Sample,HPrz,HCab

This yields a tree rooted using this group as the outgroup:

.. image:: ../_static/zonkey/incl_ts_0_tree_rooted.png

.. note::
    Rooting of the tree will be handled automatically in future versions of the Zonkey pipeline.

Mapping against mitochondrial genomes
-------------------------------------

In order to identify the species of the sire and dam, respectively, for F1 hybrids, the Zonkey pipeline allows for the construction of a maximum likelihood phylogeny using RAxML [Stamatakis2006]_, based on the mitochondrial genomes of the reference panel (see Prerequisites, above) and a consensus sequence derived from the mitochondrial alignment provided for the sample being investigated. The resulting phylogeny is presented rooted on the mid-point:
.. image:: ../_static/zonkey/mito_phylo.png

As noted above, this requires that the sample has been mapped against the mitochondrial reference genome NC\_001640.1 (see `NCBI`_), corresponding to the 'MT' mitochondrial genome included with the equCab2 reference sequence (see `UCSC`_). In addition, it is possible to carry out mapping against the mitochondrial genomes of the samples included in the Zonkey reference panel, by using the :ref:`bam_pipeline`. This is accomplished by running the Zonkey 'mito' command, which writes a simple BAM pipeline makefile template to a given directory, along with a directory containing the FASTA sequences of the reference mitochondrial genomes::

    $ paleomix zonkey mito my_mapping/

Please refer to the :ref:`bam_pipeline` documentation if you wish to use the BAM pipeline to perform the mapping itself. Once your data has been mapped against any or all of these mitochondrial genomes, the preferred BAM file (e.g. the alignment with the highest coverage) may be included in the analyses as described above.

.. _NCBI: https://www.ncbi.nlm.nih.gov/nuccore/5835107
.. _UCSC: https://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Horse&db=0
.. _BAM alignments: http://samtools.github.io/hts-specs/SAMv1.pdf

paleomix-1.2.12/examples -> paleomix/resources/examples

paleomix-1.2.12/licenses/gpl.txt

                    GNU GENERAL PUBLIC LICENSE
                       Version 3, 29 June 2007

 Copyright (C) 2007 Free Software Foundation, Inc. <http://fsf.org/>
 Everyone is permitted to copy and distribute verbatim copies
 of this license document, but changing it is not allowed.

                            Preamble

The GNU General Public License is a free, copyleft license for software and other kinds of works.

The licenses for most software and other practical works are designed to take away your freedom to share and change the works. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change all versions of a program--to make sure it remains free software for all its users. We, the Free Software Foundation, use the GNU General Public License for most of our software; it applies also to any other work released this way by its authors. You can apply it to your programs, too.

When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for them if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs, and that you know you can do these things.

To protect your rights, we need to prevent others from denying you these rights or asking you to surrender the rights. Therefore, you have certain responsibilities if you distribute copies of the software, or if you modify it: responsibilities to respect the freedom of others.

For example, if you distribute copies of such a program, whether gratis or for a fee, you must pass on to the recipients the same freedoms that you received. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights.
Developers that use the GNU GPL protect your rights with two steps: (1) assert copyright on the software, and (2) offer you this License giving you legal permission to copy, distribute and/or modify it. For the developers' and authors' protection, the GPL clearly explains that there is no warranty for this free software. For both users' and authors' sake, the GPL requires that modified versions be marked as changed, so that their problems will not be attributed erroneously to authors of previous versions. Some devices are designed to deny users access to install or run modified versions of the software inside them, although the manufacturer can do so. This is fundamentally incompatible with the aim of protecting users' freedom to change the software. The systematic pattern of such abuse occurs in the area of products for individuals to use, which is precisely where it is most unacceptable. Therefore, we have designed this version of the GPL to prohibit the practice for those products. If such problems arise substantially in other domains, we stand ready to extend this provision to those domains in future versions of the GPL, as needed to protect the freedom of users. Finally, every program is threatened constantly by software patents. States should not allow patents to restrict development and use of software on general-purpose computers, but in those that do, we wish to avoid the special danger that patents applied to a free program could make it effectively proprietary. To prevent this, the GPL assures that patents cannot be used to render the program non-free. The precise terms and conditions for copying, distribution and modification follow. TERMS AND CONDITIONS 0. Definitions. "This License" refers to version 3 of the GNU General Public License. "Copyright" also means copyright-like laws that apply to other kinds of works, such as semiconductor masks. "The Program" refers to any copyrightable work licensed under this License. Each licensee is addressed as "you". "Licensees" and "recipients" may be individuals or organizations. To "modify" a work means to copy from or adapt all or part of the work in a fashion requiring copyright permission, other than the making of an exact copy. The resulting work is called a "modified version" of the earlier work or a work "based on" the earlier work. A "covered work" means either the unmodified Program or a work based on the Program. To "propagate" a work means to do anything with it that, without permission, would make you directly or secondarily liable for infringement under applicable copyright law, except executing it on a computer or modifying a private copy. Propagation includes copying, distribution (with or without modification), making available to the public, and in some countries other activities as well. To "convey" a work means any kind of propagation that enables other parties to make or receive copies. Mere interaction with a user through a computer network, with no transfer of a copy, is not conveying. An interactive user interface displays "Appropriate Legal Notices" to the extent that it includes a convenient and prominently visible feature that (1) displays an appropriate copyright notice, and (2) tells the user that there is no warranty for the work (except to the extent that warranties are provided), that licensees may convey the work under this License, and how to view a copy of this License. If the interface presents a list of user commands or options, such as a menu, a prominent item in the list meets this criterion. 1. 
Source Code. The "source code" for a work means the preferred form of the work for making modifications to it. "Object code" means any non-source form of a work. A "Standard Interface" means an interface that either is an official standard defined by a recognized standards body, or, in the case of interfaces specified for a particular programming language, one that is widely used among developers working in that language. The "System Libraries" of an executable work include anything, other than the work as a whole, that (a) is included in the normal form of packaging a Major Component, but which is not part of that Major Component, and (b) serves only to enable use of the work with that Major Component, or to implement a Standard Interface for which an implementation is available to the public in source code form. A "Major Component", in this context, means a major essential component (kernel, window system, and so on) of the specific operating system (if any) on which the executable work runs, or a compiler used to produce the work, or an object code interpreter used to run it. The "Corresponding Source" for a work in object code form means all the source code needed to generate, install, and (for an executable work) run the object code and to modify the work, including scripts to control those activities. However, it does not include the work's System Libraries, or general-purpose tools or generally available free programs which are used unmodified in performing those activities but which are not part of the work. For example, Corresponding Source includes interface definition files associated with source files for the work, and the source code for shared libraries and dynamically linked subprograms that the work is specifically designed to require, such as by intimate data communication or control flow between those subprograms and other parts of the work. The Corresponding Source need not include anything that users can regenerate automatically from other parts of the Corresponding Source. The Corresponding Source for a work in source code form is that same work. 2. Basic Permissions. All rights granted under this License are granted for the term of copyright on the Program, and are irrevocable provided the stated conditions are met. This License explicitly affirms your unlimited permission to run the unmodified Program. The output from running a covered work is covered by this License only if the output, given its content, constitutes a covered work. This License acknowledges your rights of fair use or other equivalent, as provided by copyright law. You may make, run and propagate covered works that you do not convey, without conditions so long as your license otherwise remains in force. You may convey covered works to others for the sole purpose of having them make modifications exclusively for you, or provide you with facilities for running those works, provided that you comply with the terms of this License in conveying all material for which you do not control copyright. Those thus making or running the covered works for you must do so exclusively on your behalf, under your direction and control, on terms that prohibit them from making any copies of your copyrighted material outside their relationship with you. Conveying under any other circumstances is permitted solely under the conditions stated below. Sublicensing is not allowed; section 10 makes it unnecessary. 3. Protecting Users' Legal Rights From Anti-Circumvention Law. 
No covered work shall be deemed part of an effective technological measure under any applicable law fulfilling obligations under article 11 of the WIPO copyright treaty adopted on 20 December 1996, or similar laws prohibiting or restricting circumvention of such measures. When you convey a covered work, you waive any legal power to forbid circumvention of technological measures to the extent such circumvention is effected by exercising rights under this License with respect to the covered work, and you disclaim any intention to limit operation or modification of the work as a means of enforcing, against the work's users, your or third parties' legal rights to forbid circumvention of technological measures. 4. Conveying Verbatim Copies. You may convey verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice; keep intact all notices stating that this License and any non-permissive terms added in accord with section 7 apply to the code; keep intact all notices of the absence of any warranty; and give all recipients a copy of this License along with the Program. You may charge any price or no price for each copy that you convey, and you may offer support or warranty protection for a fee. 5. Conveying Modified Source Versions. You may convey a work based on the Program, or the modifications to produce it from the Program, in the form of source code under the terms of section 4, provided that you also meet all of these conditions: a) The work must carry prominent notices stating that you modified it, and giving a relevant date. b) The work must carry prominent notices stating that it is released under this License and any conditions added under section 7. This requirement modifies the requirement in section 4 to "keep intact all notices". c) You must license the entire work, as a whole, under this License to anyone who comes into possession of a copy. This License will therefore apply, along with any applicable section 7 additional terms, to the whole of the work, and all its parts, regardless of how they are packaged. This License gives no permission to license the work in any other way, but it does not invalidate such permission if you have separately received it. d) If the work has interactive user interfaces, each must display Appropriate Legal Notices; however, if the Program has interactive interfaces that do not display Appropriate Legal Notices, your work need not make them do so. A compilation of a covered work with other separate and independent works, which are not by their nature extensions of the covered work, and which are not combined with it such as to form a larger program, in or on a volume of a storage or distribution medium, is called an "aggregate" if the compilation and its resulting copyright are not used to limit the access or legal rights of the compilation's users beyond what the individual works permit. Inclusion of a covered work in an aggregate does not cause this License to apply to the other parts of the aggregate. 6. Conveying Non-Source Forms. 
You may convey a covered work in object code form under the terms of sections 4 and 5, provided that you also convey the machine-readable Corresponding Source under the terms of this License, in one of these ways: a) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by the Corresponding Source fixed on a durable physical medium customarily used for software interchange. b) Convey the object code in, or embodied in, a physical product (including a physical distribution medium), accompanied by a written offer, valid for at least three years and valid for as long as you offer spare parts or customer support for that product model, to give anyone who possesses the object code either (1) a copy of the Corresponding Source for all the software in the product that is covered by this License, on a durable physical medium customarily used for software interchange, for a price no more than your reasonable cost of physically performing this conveying of source, or (2) access to copy the Corresponding Source from a network server at no charge. c) Convey individual copies of the object code with a copy of the written offer to provide the Corresponding Source. This alternative is allowed only occasionally and noncommercially, and only if you received the object code with such an offer, in accord with subsection 6b. d) Convey the object code by offering access from a designated place (gratis or for a charge), and offer equivalent access to the Corresponding Source in the same way through the same place at no further charge. You need not require recipients to copy the Corresponding Source along with the object code. If the place to copy the object code is a network server, the Corresponding Source may be on a different server (operated by you or a third party) that supports equivalent copying facilities, provided you maintain clear directions next to the object code saying where to find the Corresponding Source. Regardless of what server hosts the Corresponding Source, you remain obligated to ensure that it is available for as long as needed to satisfy these requirements. e) Convey the object code using peer-to-peer transmission, provided you inform other peers where the object code and Corresponding Source of the work are being offered to the general public at no charge under subsection 6d. A separable portion of the object code, whose source code is excluded from the Corresponding Source as a System Library, need not be included in conveying the object code work. A "User Product" is either (1) a "consumer product", which means any tangible personal property which is normally used for personal, family, or household purposes, or (2) anything designed or sold for incorporation into a dwelling. In determining whether a product is a consumer product, doubtful cases shall be resolved in favor of coverage. For a particular product received by a particular user, "normally used" refers to a typical or common use of that class of product, regardless of the status of the particular user or of the way in which the particular user actually uses, or expects or is expected to use, the product. A product is a consumer product regardless of whether the product has substantial commercial, industrial or non-consumer uses, unless such uses represent the only significant mode of use of the product. 
"Installation Information" for a User Product means any methods, procedures, authorization keys, or other information required to install and execute modified versions of a covered work in that User Product from a modified version of its Corresponding Source. The information must suffice to ensure that the continued functioning of the modified object code is in no case prevented or interfered with solely because modification has been made. If you convey an object code work under this section in, or with, or specifically for use in, a User Product, and the conveying occurs as part of a transaction in which the right of possession and use of the User Product is transferred to the recipient in perpetuity or for a fixed term (regardless of how the transaction is characterized), the Corresponding Source conveyed under this section must be accompanied by the Installation Information. But this requirement does not apply if neither you nor any third party retains the ability to install modified object code on the User Product (for example, the work has been installed in ROM). The requirement to provide Installation Information does not include a requirement to continue to provide support service, warranty, or updates for a work that has been modified or installed by the recipient, or for the User Product in which it has been modified or installed. Access to a network may be denied when the modification itself materially and adversely affects the operation of the network or violates the rules and protocols for communication across the network. Corresponding Source conveyed, and Installation Information provided, in accord with this section must be in a format that is publicly documented (and with an implementation available to the public in source code form), and must require no special password or key for unpacking, reading or copying. 7. Additional Terms. "Additional permissions" are terms that supplement the terms of this License by making exceptions from one or more of its conditions. Additional permissions that are applicable to the entire Program shall be treated as though they were included in this License, to the extent that they are valid under applicable law. If additional permissions apply only to part of the Program, that part may be used separately under those permissions, but the entire Program remains governed by this License without regard to the additional permissions. When you convey a copy of a covered work, you may at your option remove any additional permissions from that copy, or from any part of it. (Additional permissions may be written to require their own removal in certain cases when you modify the work.) You may place additional permissions on material, added by you to a covered work, for which you have or can give appropriate copyright permission. 
Notwithstanding any other provision of this License, for material you add to a covered work, you may (if authorized by the copyright holders of that material) supplement the terms of this License with terms: a) Disclaiming warranty or limiting liability differently from the terms of sections 15 and 16 of this License; or b) Requiring preservation of specified reasonable legal notices or author attributions in that material or in the Appropriate Legal Notices displayed by works containing it; or c) Prohibiting misrepresentation of the origin of that material, or requiring that modified versions of such material be marked in reasonable ways as different from the original version; or d) Limiting the use for publicity purposes of names of licensors or authors of the material; or e) Declining to grant rights under trademark law for use of some trade names, trademarks, or service marks; or f) Requiring indemnification of licensors and authors of that material by anyone who conveys the material (or modified versions of it) with contractual assumptions of liability to the recipient, for any liability that these contractual assumptions directly impose on those licensors and authors. All other non-permissive additional terms are considered "further restrictions" within the meaning of section 10. If the Program as you received it, or any part of it, contains a notice stating that it is governed by this License along with a term that is a further restriction, you may remove that term. If a license document contains a further restriction but permits relicensing or conveying under this License, you may add to a covered work material governed by the terms of that license document, provided that the further restriction does not survive such relicensing or conveying. If you add terms to a covered work in accord with this section, you must place, in the relevant source files, a statement of the additional terms that apply to those files, or a notice indicating where to find the applicable terms. Additional terms, permissive or non-permissive, may be stated in the form of a separately written license, or stated as exceptions; the above requirements apply either way. 8. Termination. You may not propagate or modify a covered work except as expressly provided under this License. Any attempt otherwise to propagate or modify it is void, and will automatically terminate your rights under this License (including any patent licenses granted under the third paragraph of section 11). However, if you cease all violation of this License, then your license from a particular copyright holder is reinstated (a) provisionally, unless and until the copyright holder explicitly and finally terminates your license, and (b) permanently, if the copyright holder fails to notify you of the violation by some reasonable means prior to 60 days after the cessation. Moreover, your license from a particular copyright holder is reinstated permanently if the copyright holder notifies you of the violation by some reasonable means, this is the first time you have received notice of violation of this License (for any work) from that copyright holder, and you cure the violation prior to 30 days after your receipt of the notice. Termination of your rights under this section does not terminate the licenses of parties who have received copies or rights from you under this License. If your rights have been terminated and not permanently reinstated, you do not qualify to receive new licenses for the same material under section 10. 9. 
Acceptance Not Required for Having Copies. You are not required to accept this License in order to receive or run a copy of the Program. Ancillary propagation of a covered work occurring solely as a consequence of using peer-to-peer transmission to receive a copy likewise does not require acceptance. However, nothing other than this License grants you permission to propagate or modify any covered work. These actions infringe copyright if you do not accept this License. Therefore, by modifying or propagating a covered work, you indicate your acceptance of this License to do so. 10. Automatic Licensing of Downstream Recipients. Each time you convey a covered work, the recipient automatically receives a license from the original licensors, to run, modify and propagate that work, subject to this License. You are not responsible for enforcing compliance by third parties with this License. An "entity transaction" is a transaction transferring control of an organization, or substantially all assets of one, or subdividing an organization, or merging organizations. If propagation of a covered work results from an entity transaction, each party to that transaction who receives a copy of the work also receives whatever licenses to the work the party's predecessor in interest had or could give under the previous paragraph, plus a right to possession of the Corresponding Source of the work from the predecessor in interest, if the predecessor has it or can get it with reasonable efforts. You may not impose any further restrictions on the exercise of the rights granted or affirmed under this License. For example, you may not impose a license fee, royalty, or other charge for exercise of rights granted under this License, and you may not initiate litigation (including a cross-claim or counterclaim in a lawsuit) alleging that any patent claim is infringed by making, using, selling, offering for sale, or importing the Program or any portion of it. 11. Patents. A "contributor" is a copyright holder who authorizes use under this License of the Program or a work on which the Program is based. The work thus licensed is called the contributor's "contributor version". A contributor's "essential patent claims" are all patent claims owned or controlled by the contributor, whether already acquired or hereafter acquired, that would be infringed by some manner, permitted by this License, of making, using, or selling its contributor version, but do not include claims that would be infringed only as a consequence of further modification of the contributor version. For purposes of this definition, "control" includes the right to grant patent sublicenses in a manner consistent with the requirements of this License. Each contributor grants you a non-exclusive, worldwide, royalty-free patent license under the contributor's essential patent claims, to make, use, sell, offer for sale, import and otherwise run, modify and propagate the contents of its contributor version. In the following three paragraphs, a "patent license" is any express agreement or commitment, however denominated, not to enforce a patent (such as an express permission to practice a patent or covenant not to sue for patent infringement). To "grant" such a patent license to a party means to make such an agreement or commitment not to enforce a patent against the party. 
If you convey a covered work, knowingly relying on a patent license, and the Corresponding Source of the work is not available for anyone to copy, free of charge and under the terms of this License, through a publicly available network server or other readily accessible means, then you must either (1) cause the Corresponding Source to be so available, or (2) arrange to deprive yourself of the benefit of the patent license for this particular work, or (3) arrange, in a manner consistent with the requirements of this License, to extend the patent license to downstream recipients. "Knowingly relying" means you have actual knowledge that, but for the patent license, your conveying the covered work in a country, or your recipient's use of the covered work in a country, would infringe one or more identifiable patents in that country that you have reason to believe are valid. If, pursuant to or in connection with a single transaction or arrangement, you convey, or propagate by procuring conveyance of, a covered work, and grant a patent license to some of the parties receiving the covered work authorizing them to use, propagate, modify or convey a specific copy of the covered work, then the patent license you grant is automatically extended to all recipients of the covered work and works based on it. A patent license is "discriminatory" if it does not include within the scope of its coverage, prohibits the exercise of, or is conditioned on the non-exercise of one or more of the rights that are specifically granted under this License. You may not convey a covered work if you are a party to an arrangement with a third party that is in the business of distributing software, under which you make payment to the third party based on the extent of your activity of conveying the work, and under which the third party grants, to any of the parties who would receive the covered work from you, a discriminatory patent license (a) in connection with copies of the covered work conveyed by you (or copies made from those copies), or (b) primarily for and in connection with specific products or compilations that contain the covered work, unless you entered into that arrangement, or that patent license was granted, prior to 28 March 2007. Nothing in this License shall be construed as excluding or limiting any implied license or other defenses to infringement that may otherwise be available to you under applicable patent law. 12. No Surrender of Others' Freedom. If conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot convey a covered work so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not convey it at all. For example, if you agree to terms that obligate you to collect a royalty for further conveying from those to whom you convey the Program, the only way you could satisfy both those terms and this License would be to refrain entirely from conveying the Program. 13. Use with the GNU Affero General Public License. Notwithstanding any other provision of this License, you have permission to link or combine any covered work with a work licensed under version 3 of the GNU Affero General Public License into a single combined work, and to convey the resulting work. 
The terms of this License will continue to apply to the part which is the covered work, but the special requirements of the GNU Affero General Public License, section 13, concerning interaction through a network will apply to the combination as such. 14. Revised Versions of this License. The Free Software Foundation may publish revised and/or new versions of the GNU General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies that a certain numbered version of the GNU General Public License "or any later version" applies to it, you have the option of following the terms and conditions either of that numbered version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the GNU General Public License, you may choose any version ever published by the Free Software Foundation. If the Program specifies that a proxy can decide which future versions of the GNU General Public License can be used, that proxy's public statement of acceptance of a version permanently authorizes you to choose that version for the Program. Later license versions may give you additional or different permissions. However, no additional obligations are imposed on any author or copyright holder as a result of your choosing to follow a later version. 15. Disclaimer of Warranty. THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 16. Limitation of Liability. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MODIFIES AND/OR CONVEYS THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. 17. Interpretation of Sections 15 and 16. If the disclaimer of warranty and limitation of liability provided above cannot be given local legal effect according to their terms, reviewing courts shall apply local law that most closely approximates an absolute waiver of all civil liability in connection with the Program, unless a warranty or assumption of liability accompanies a copy of the Program in return for a fee. END OF TERMS AND CONDITIONS How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to the public, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. 
It is safest to attach them to the start of each source file to most effectively state the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. <one line to give the program's name and a brief idea of what it does.> Copyright (C) <year> <name of author> This program is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program. If not, see <http://www.gnu.org/licenses/>. Also add information on how to contact you by electronic and paper mail. If the program does terminal interaction, make it output a short notice like this when it starts in an interactive mode: <program> Copyright (C) <year> <name of author> This program comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, your program's commands might be different; for a GUI interface, you would use an "about box". You should also get your employer (if you work as a programmer) or school, if any, to sign a "copyright disclaimer" for the program, if necessary. For more information on this, and how to apply and follow the GNU GPL, see <http://www.gnu.org/licenses/>. The GNU General Public License does not permit incorporating your program into proprietary programs. If your program is a subroutine library, you may consider it more useful to permit linking proprietary applications with the library. If this is what you want to do, use the GNU Lesser General Public License instead of this License. But first, please read <http://www.gnu.org/philosophy/why-not-lgpl.html>. paleomix-1.2.12/licenses/mit.txt000066400000000000000000000017771314402124200165500ustar00rootroot00000000000000Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.
paleomix-1.2.12/misc/000077500000000000000000000000001314402124200143375ustar00rootroot00000000000000paleomix-1.2.12/misc/setup_bam_pipeline_example.makefile.yaml000066400000000000000000000010151314402124200243530ustar00rootroot00000000000000# -*- mode: Yaml; -*- Options: Platform: Illumina QualityOffset: 33 SplitLanesByFilenames: no CompressionFormat: gz Aligners: Program: Bowtie2 Bowtie2: MinQuality: 0 --very-sensitive: PCRDuplicates: no RescaleQualities: no ExcludeReads: - Paired Features: [] Prefixes: rCRS: Path: 000_prefixes/rCRS.fasta ExampleProject: Synthetic_Sample_1: ACGATA: Lane_2: 000_data/ACGATA_L2_R{Pair}_*.fastq.gz GCTCTG: Lane_2: 000_data/GCTCTG_L2_R1_*.fastq.gz paleomix-1.2.12/misc/setup_bam_pipeline_example.sh000066400000000000000000000045371314402124200222600ustar00rootroot00000000000000#!/bin/bash # # Copyright (c) 2013 Mikkel Schubert # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in all # copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. # set -o nounset # Fail on unset variables set -o errexit # Fail on uncaught non-zero returncodes set -o pipefail # Fail if a command in a chain of pipes fails SP_SEED=${RANDOM} rm -rv 000_data mkdir -p 000_data for barcode in ACGATA GCTCTG TGCTCA; do python $(dirname $0)/synthesize_reads.py 000_prefixes/rCRS.fasta 000_data/ \ --library-barcode=${barcode} \ --specimen-seed=${SP_SEED} \ --lanes-reads-mu=2500 \ --lanes-reads-sigma=500 \ --lanes-reads-per-file=1000 \ --lanes=2 \ --damage done rm -v 000_data/GCTCTG_L*R2*.gz rm -v 000_data/TGCTCA_L1_R2*.gz bam_pipeline run $(dirname $0)/setup_bam_pipeline_example.makefile.yaml --destination .
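# NOTE: The 'bam_pipeline run' invocation above produces trimmed reads and
# per-lane alignments under ExampleProject/; the commands below keep a subset
# of those files as pre-trimmed FASTQ and pre-aligned BAM input for the
# example data set, and the remaining intermediate output is then removed.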
mkdir -p 000_data/ACGATA_L2/ mv ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/reads.singleton.truncated.gz 000_data/ACGATA_L2/reads.singleton.truncated.gz mv ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/reads.collapsed.gz 000_data/ACGATA_L2/reads.collapsed.gz mv ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/reads.collapsed.truncated.gz 000_data/ACGATA_L2/reads.collapsed.truncated.gz mv ExampleProject/rCRS/Synthetic_Sample_1/GCTCTG/Lane_2/single.minQ0.bam 000_data/GCTCTG_L2.bam rm -v 000_data/ACGATA_L2_R*.fastq.gz rm -v 000_data/GCTCTG_L2_R1_*.fastq.gz rm -rv ExampleProject paleomix-1.2.12/misc/setup_phylo_pipeline_example.sh000077500000000000000000000041071314402124200226530ustar00rootroot00000000000000#!/bin/bash # # Copyright (c) 2013 Mikkel Schubert # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in all # copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. 
# set -o nounset # Fail on unset variables set -o errexit # Fail on uncaught non-zero returncodes set -o pipefail # Fail if a command in a chain of pipes fails rm -rvf alignment/000_reads mkdir -p alignment/000_reads for PREFIX in `ls alignment/000_prefixes/*.fasta | grep -v rCRS`; do SP_SEED=${RANDOM} NAME=$(echo ${PREFIX} | sed -e's#alignment/000_prefixes/##' -e's#\..*##') mkdir -p alignment/000_reads/${NAME/*\//}/ ./synthesize_reads.py ${PREFIX} alignment/000_reads/${NAME}/ \ --specimen-seed=${SP_SEED} \ --lanes-reads-mu=50000 \ --lanes-reads-sigma=500 \ --lanes-reads-per-file=10000 \ --reads-len=50 \ --lanes=1 done # These links would not survive the package installation, so they are set up here ln -sf ../../alignment/000_prefixes/ phylogeny/data/prefixes ln -sf ../../alignment phylogeny/data/samples # Create link to reference sequence mkdir -p phylogeny/data/refseqs ln -sf ../../../alignment/000_prefixes/rCRS.fasta phylogeny/data/refseqs/rCRS.rCRS.fasta paleomix-1.2.12/misc/skeleton.py000066400000000000000000000025411314402124200165370ustar00rootroot00000000000000#!/usr/bin/python # -*- coding: utf-8 -*- # Copyright (c) 2014 Mikkel Schubert # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in # all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. import sys import argparse def parse_args(argv): parser = argparse.ArgumentParser() return parser.parse_args(argv) def main(argv): return 0 if __name__ == '__main__': sys.exit(main(sys.argv[1:])) paleomix-1.2.12/misc/synthesize_reads.py000077500000000000000000000362241314402124200203040ustar00rootroot00000000000000#!/usr/bin/python -3 # # Copyright (c) 2013 Mikkel Schubert # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in # all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT.
IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. # from __future__ import print_function import sys import math import gzip import random from optparse import \ OptionParser, \ OptionGroup from paleomix.common.sequences import \ reverse_complement from paleomix.common.formats.fasta import \ FASTA from paleomix.common.utilities import \ fragment from paleomix.common.sampling import \ weighted_sampling def _dexp(lambda_value, position): return lambda_value * math.exp(-lambda_value * position) def _rexp(lambda_value, rng): return - math.log(rng.random()) / lambda_value def toint(value): return int(round(value)) # Adapter added to the 5' end of the forward strand (read from 5' ...) PCR1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC%sATCTCGTATGCCGTCTTCTGCTTG" # Adapter added to the 5' end of the reverse strand (read from 3' ...): # rev. compl of the forward PCR2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT" def _get_indel_length(indel_lambda, rng): return 1 + toint(_rexp(indel_lambda, rng)) def _get_weighted_choices(rng, sub_rate, indel_rate): choices_by_nt = {} for src_nt in "ACGT": choices = "ACGTID" probs = [sub_rate / 4] * 4 # ACGT probs += [indel_rate / 2] * 2 # ID probs[choices.index(src_nt)] = 1 - sum(probs) + sub_rate / 4 choices_by_nt[src_nt] = weighted_sampling(choices, probs, rng) return choices_by_nt def _mutate_sequence(rng, choices, refseq, indel_lambda=0): position = 0 sequence, positions = [], [] while position < len(refseq): ref_nt = refseq[position] if ref_nt not in "ACGT": read_nt = rng.choice("ACGT") else: read_nt = choices[ref_nt].next() if read_nt == "D": for _ in xrange(_get_indel_length(indel_lambda, rng)): position += 1 elif read_nt == "I": for _ in xrange(_get_indel_length(indel_lambda, rng)): sequence.append(rng.choice("ACGT")) positions.append(position) else: sequence.append(read_nt) positions.append(position) position += 1 return "".join(sequence), positions class Specimen(object): """Represents a specimen, from which samples are derived. 
These are mutated by the introduction of changes (substitutions and indels) into the sequence. """ def __init__(self, options, filename): genome = list(FASTA.from_file(filename)) assert len(genome) == 1, len(genome) self._genome = genome[0].sequence.upper() self._sequence = None self._positions = None self._annotations = None self._mutate(options) def _mutate(self, options): rng = random.Random(options.specimen_seed) choices = _get_weighted_choices(rng, options.specimen_sub_rate, options.specimen_indel_rate) self._sequence, self._positions = \ _mutate_sequence(rng, choices, self._genome, options.specimen_indel_lambda) @property def sequence(self): return self._sequence @property def positions(self): return self._positions @property def annotations(self): return self._annotations class Sample(object): def __init__(self, options, specimen): self._specimen = specimen self._random = random.Random(options.sample_seed) self._options = options frac_endog = self._random.gauss(options.sample_endog_mu, options.sample_endog_sigma) self._frac_endog = min(1, max(0.01, frac_endog)) self._endog_id = 0 self._contam_id = 0 def get_fragment(self): """Returns a DNA fragment, representing either a fragment of the sample genome, or a randomly generated DNA sequence representing contaminant DNA that is not related to the species.""" if self._random.random() <= self._frac_endog: return self._get_endogenous_sequence() return self._get_contaminant_sequence() def _get_contaminant_sequence(self): length = self._get_frag_len() sequence = [self._random.choice("ACGT") for _ in xrange(length)] self._contam_id += 1 name = "Seq_junk_%i" % (self._contam_id,) return (False, name, "".join(sequence)) def _get_endogenous_sequence(self): length = self._get_frag_len() max_position = len(self._specimen.sequence) - length position = self._random.randint(0, max_position) strand = self._random.choice(("fw", "rv")) sequence = self._specimen.sequence[position:position + length] real_pos = self._specimen.positions[position] if strand == "rv": sequence = reverse_complement("".join(sequence)) self._endog_id += 1 name = "Seq_%i_%i_%i_%s" % (self._endog_id, real_pos, length, strand) return (True, name, sequence) def _get_frag_len(self): length = toint(self._random.gauss(self._options.sample_frag_len_mu, self._options.sample_frag_len_sigma)) return max(self._options.sample_frag_len_min, min(self._options.sample_frag_len_max, length)) class Damage(object): def __init__(self, options, sample): self._options = options self._sample = sample self._random = random.Random(options.damage_seed) self._rates = self._calc_damage_rates(options) def get_fragment(self): is_endogenous, name, sequence = self._sample.get_fragment() if is_endogenous and self._options.damage: sequence = self._damage_sequence(sequence) return (name, sequence) def _damage_sequence(self, sequence): result = [] length = len(sequence) for (position, nucleotide) in enumerate(sequence): if nucleotide == "C": if self._random.random() < self._rates[position]: nucleotide = "T" elif nucleotide == "G": rv_position = length - position - 1 if self._random.random() < self._rates[rv_position]: nucleotide = "A" result.append(nucleotide) return "".join(result) @classmethod def _calc_damage_rates(cls, options): rate = options.damage_lambda rates = [_dexp(rate, position) for position in range(options.sample_frag_len_max)] return rates class Library(object): def __init__(self, options, sample): self._options = options self._sample = sample self._cache = [] self._rng = random.Random(options.library_seed) self.barcode = 
options.library_barcode if self.barcode is None: self.barcode = "".join(self._rng.choice("ACGT") for _ in range(6)) assert len(self.barcode) == 6, options.library_barcode pcr1 = PCR1 % (self.barcode,) self.lanes = self._generate_lanes(options, self._rng, sample, pcr1) @classmethod def _generate_lanes(cls, options, rng, sample, pcr1): lane_counts = [] for _ in xrange(options.lanes_num): lane_counts.append(toint(rng.gauss(options.lanes_reads_mu, options.lanes_reads_sigma))) reads = cls._generate_reads(options, rng, sample, sum(lane_counts), pcr1) lanes = [] for count in lane_counts: lanes.append(Lane(options, reads[:count])) reads = reads[count:] return lanes @classmethod def _generate_reads(cls, options, rng, sample, minimum, pcr1): reads = [] while len(reads) < minimum: name, sequence = sample.get_fragment() cur_forward = sequence + pcr1 cur_reverse = reverse_complement(sequence) + PCR2 # Number of PCR copies -- minimum 1 num_dupes = toint(_rexp(options.library_pcr_lambda, rng)) + 1 for dupe_id in xrange(num_dupes): cur_name = "%s_%s" % (name, dupe_id) reads.append((cur_name, cur_forward, cur_reverse)) rng.shuffle(reads) return reads class Lane(object): def __init__(self, options, reads): rng = random.Random() choices = _get_weighted_choices(rng, options.reads_sub_rate, options.reads_indel_rate) self._sequences = [] for (name, forward, reverse) in reads: forward, _ = _mutate_sequence(rng, choices, forward, options.reads_indel_lambda) if len(forward) < options.reads_len: forward += "A" * (options.reads_len - len(forward)) elif len(forward) > options.reads_len: forward = forward[:options.reads_len] reverse, _ = _mutate_sequence(rng, choices, reverse, options.reads_indel_lambda) if len(reverse) < options.reads_len: reverse += "T" * (options.reads_len - len(reverse)) elif len(reverse) > options.reads_len: reverse = reverse[:options.reads_len] self._sequences.append((name, "".join(forward), "".join(reverse))) @property def sequences(self): return self._sequences def parse_args(argv): parser = OptionParser() group = OptionGroup(parser, "Specimen") group.add_option("--specimen-seed", default=None, help="Seed used to initialize the 'specimen', for the " "creation of a random genotype. 
Set to a specific " "value if runs are to be done for the same " "genotype.") group.add_option("--specimen-sub-rate", default=0.005, type=float) group.add_option("--specimen-indel-rate", default=0.0005, type=float) group.add_option("--specimen-indel-lambda", default=0.9, type=float) parser.add_option_group(group) group = OptionGroup(parser, "Samples from specimens") group.add_option("--sample-seed", default=None) group.add_option("--sample-frag-length-mu", dest="sample_frag_len_mu", default=100, type=int) group.add_option("--sample-frag-length-sigma", dest="sample_frag_len_sigma", default=30, type=int) group.add_option("--sample-frag-length-min", dest="sample_frag_len_min", default=0, type=int) group.add_option("--sample-frag-length-max", dest="sample_frag_len_max", default=500, type=int) group.add_option("--sample-endogenous_mu", dest="sample_endog_mu", default=0.75, type=float) group.add_option("--sample-endogenous_sigma", dest="sample_endog_sigma", default=0.10, type=float) parser.add_option_group(group) group = OptionGroup(parser, "Post mortem damage of samples") group.add_option("--damage", dest="damage", default=False, action="store_true") group.add_option("--damage-seed", dest="damage_seed", default=None) group.add_option("--damage-lambda", dest="damage_lambda", default=0.25, type=float) parser.add_option_group(group) group = OptionGroup(parser, "Libraries from samples") group.add_option("--library-seed", dest="library_seed", default=None) group.add_option("--library-pcr-lambda", dest="library_pcr_lambda", default=3, type=float) group.add_option("--library-barcode", dest="library_barcode", default=None) parser.add_option_group(group) group = OptionGroup(parser, "Lanes from libraries") group.add_option("--lanes", dest="lanes_num", default=3, type=int) group.add_option("--lanes-reads-mu", dest="lanes_reads_mu", default=10000, type=int) group.add_option("--lanes-reads-sigma", dest="lanes_reads_sigma", default=2500, type=int) group.add_option("--lanes-reads-per-file", dest="lanes_per_file", default=2500, type=int) parser.add_option_group(group) group = OptionGroup(parser, "Reads from lanes") group.add_option("--reads-sub-rate", dest="reads_sub_rate", default=0.005, type=float) group.add_option("--reads-indel-rate", dest="reads_indel_rate", default=0.0005, type=float) group.add_option("--reads-indel-lambda", dest="reads_indel_lambda", default=0.9, type=float) group.add_option("--reads-length", dest="reads_len", default=100, type=int) parser.add_option_group(group) options, args = parser.parse_args(argv) if len(args) != 2: sys.stderr.write("Usage: %s <fasta> <destination>\n" % sys.argv[0]) return None, None return options, args def main(argv): options, args = parse_args(argv) if not options: return 1 print("Generating %i lane(s) of synthetic reads ...\nDISCLAIMER: For " "demonstration of PALEOMIX only; the synthetic data is not " "biologically meaningful!" 
% (options.lanes_num,)) specimen = Specimen(options, args[0]) sample = Sample(options, specimen) damage = Damage(options, sample) library = Library(options, damage) for (lnum, lane) in enumerate(library.lanes, start=1): fragments = fragment(options.lanes_per_file, lane.sequences) for (readsnum, reads) in enumerate(fragments, start=1): templ = "%s%s_L%i_R%%s_%02i.fastq.gz" % (args[1], library.barcode, lnum, readsnum) print(" Writing %s" % (templ % "{Pair}",)) with gzip.open(templ % 1, "w") as out_1: with gzip.open(templ % 2, "w") as out_2: for (name, seq_1, seq_2) in reads: out_1.write("@%s%s/1\n%s\n" % (library.barcode, name, seq_1)) out_1.write("+\n%s\n" % ("I" * len(seq_1),)) out_2.write("@%s%s/2\n%s\n" % (library.barcode, name, seq_2)) out_2.write("+\n%s\n" % ("H" * len(seq_2),)) if __name__ == '__main__': sys.exit(main(sys.argv[1:])) paleomix-1.2.12/paleomix/000077500000000000000000000000001314402124200152225ustar00rootroot00000000000000paleomix-1.2.12/paleomix/__init__.py000066400000000000000000000036641314402124200173440ustar00rootroot00000000000000#!/usr/bin/python # # Copyright (c) 2012 Mikkel Schubert # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in # all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. 
# __version_info__ = (1, 2, 12) __version__ = '%i.%i.%i' % __version_info__ def run(command=None): """Main entry-point for setuptools""" import sys import paleomix.main argv = [] if command is not None: argv.append(command) argv.extend(sys.argv[1:]) return paleomix.main.main(argv) def run_bam_pipeline(): """Legacy entry-point for setuptools""" return run("bam_pipeline") def run_gtf_to_bed(): """Legacy entry-point for setuptools""" return run("gtf_to_bed") def run_phylo_pipeline(): """Legacy entry-point for setuptools""" return run("phylo_pipeline") def run_rmdup_collapsed(): """Legacy entry-point for setuptools""" return run("rmdup_collapsed") def run_trim_pipeline(): """Legacy entry-point for setuptools""" return run("trim_pipeline") paleomix-1.2.12/paleomix/atomiccmd/000077500000000000000000000000001314402124200171625ustar00rootroot00000000000000paleomix-1.2.12/paleomix/atomiccmd/__init__.py000066400000000000000000000021621314402124200212740ustar00rootroot00000000000000#!/usr/bin/python # # Copyright (c) 2012 Mikkel Schubert # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in all # copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. # paleomix-1.2.12/paleomix/atomiccmd/builder.py000066400000000000000000000565001314402124200211700ustar00rootroot00000000000000#!/usr/bin/python # # Copyright (c) 2012 Mikkel Schubert # # Permission is hereby granted, free of charge, to any person obtaining a copy # of this software and associated documentation files (the "Software"), to deal # in the Software without restriction, including without limitation the rights # to use, copy, modify, merge, publish, distribute, sublicense, and/or sell # copies of the Software, and to permit persons to whom the Software is # furnished to do so, subject to the following conditions: # # The above copyright notice and this permission notice shall be included in # all copies or substantial portions of the Software. # # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. # """Tools for passing CLI options to AtomicCmds used by Nodes. 
The module contains 1 class and 2 decorators, which may be used in conjunction to create Node classes for which the call carried out by AtomicCmds may be modified by the end user, without explicit support added to the init function of the class. The basic outline of such a class is as follows: class ExampleNode(CommandNode): @create_customizable_cli_parameters def customize(self, ...): # Use passed parameters to create AtomicCmdBuilder obj builder = AtomicCmdBuilder(...) builder.set_option(...) # Return dictionary of AtomicCmdBuilder objects and any # additional parameters required to run the Node. return {"command" : builder, "example" : ...} @use_customizable_cli_parameters def __init__(self, parameters): # Create AtomicCmd object using (potentially tweaked) parameters command = parameters.command.finalize() # Do something with a parameter passed to customize description = "<ExampleNode: %s>" % parameters.example CommandNode.__init__(command = command, description = description, ...) This class can then be used in two ways: 1) Without doing any explicit modifications to the CLI calls: >> node = ExampleNode(...) 2) Retrieving and tweaking AtomicCmdBuilder before creating the Node: >> params = ExampleNode.customize(...) >> params.command.set_option(...) >> node = params.build_node() """ import os import types import inspect import subprocess import collections from paleomix.atomiccmd.command import \ AtomicCmd from paleomix.common.utilities import \ safe_coerce_to_tuple import paleomix.common.versions as versions class AtomicCmdBuilderError(RuntimeError): """Error raised by AtomicCmdBuilder.""" class AtomicCmdBuilder(object): """AtomicCmdBuilder is a class used to allow step-wise construction of an AtomicCmd object. This allows the user of a Node to modify the behavior of the called programs using some CLI parameters, without explicit support for these in the Node API. Some limitations are in place, to help catch cases where overwriting or adding a flag would break the Node. The system call is constructed in the following manner: $ The components are defined as follows: - The minimal call needed to invoke the current program. Typically this is just the name of the executable, but may be a more complex set of values for nested calls (e.g. java/scripts).