pax_global_header: comment=cf6880a83ad0c0c8a81ddae14518d581615f32f3

paleomix-1.3.8/.coveragerc
[run]
branch = True

[report]
skip_covered = True

paleomix-1.3.8/.gitignore
*.swp
.\#*
\#*
*~
__pycache__
cov.xml
.coverage*
MANIFEST

# Packages
dist/
build/
sdist/
*.egg/
*.egg-info/
.eggs/
.tox/
.vscode/
.mypy_cache/

# Misc
venv
docs/_build/

paleomix-1.3.8/CHANGES.md
# Changelog

## [1.3.8] - 2023-05-19

### Fixed
- Added 'genotyping' alias to match phylo pipeline documentation (issue #48).
- Fixed options for BWA `aln` being applied to `samse` and `sampe` (issue #49).

## [1.3.7] - 2022-08-22

### Added
- Added example to the BAM pipeline YAML template, showing how to increase the
  maximum allowed Phred score for AdapterRemoval. This is needed due to the
  value being capped at 41 by default, lower than the maximum observed in some
  modern data.

### Fixed
- Fixed regression in config file parsing that would cause failure if no value
  was specified for an option.
- Fixed error message not being printed correctly when attempting to use
  Phred+64 data with BWA mem/bwasw.
- Fixed regressions that prevented the use of "regions of interest" in the BAM
  pipeline.
- Fixed failure when using `--list-output-files` and auxiliary files were
  missing or dependencies were unmet. Output files are now printed.

## [1.3.6] - 2021-11-28

### Added
- Added explicit support for the AdapterRemoval `--trim5p` and `--trim3p`
  options, which may take one or two values (as a list); see the sketch below.

### Changed
- User options for AdapterRemoval are no longer restricted by a whitelist.
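For illustration, a minimal sketch of how such AdapterRemoval options can be
given in the `Options` section of a BAM pipeline makefile (YAML). The values
shown are purely illustrative, and `--qualitymax` is included only as an
example of the kind of Phred-score override described for v1.3.7 above:

    Options:
      AdapterRemoval:
        # Raise the maximum allowed Phred score (capped at 41 by default;
        # see the v1.3.7 entry above)
        --qualitymax: 42
        # A single value is applied to both mates ...
        --trim5p: 5
        # ... or a list of two values: mate 1, then mate 2
        --trim3p:
          - 0
          - 5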

## [1.3.5] - 2021-10-12

### Added
- Added command-line option to reduce/turn off validation with picard
  ValidateSamFile.

## [1.3.4] - 2021-09-23

### Added
- Added support for the `--collapse-conservatively` AdapterRemoval option.

### Changed
- Avoid creating log files on invalid command-line arguments.
- The directory for the log-file is created automatically if it does not
  exist.
- No longer prints a stack-trace if the user terminates a pipeline with
  Ctrl + C.
- Log-level command-line options are now case insensitive.
- The default number of threads used by AdapterRemoval, Bowtie2, and BWA is
  now scaled based on the available number of cores instead of defaulting to
  1 thread.
- Less exhaustive validation of .bai index files using picard
  ValidateSamFile. The overhead of validating these files was excessive in
  light of the small benefit.

### Fixed
- Fixed regression causing certain options to not be applied when mapping
  with BWA.
- Fixed --log-level not having an effect.
- Fixed possible infinite recursion when using lazily created log-files.
- Fixed BAM pipeline failing if the mapDamage feature was not explicitly set.
- Fixed default values of 0 or 1 not being listed in command-line help text.

## [1.3.3] - 2021-04-06

### Fixed
- Fixed regression in the BAM pipeline summary node, causing failure if there
  were zero hits or reads.
- Fixed BAM validation always being run in big-genome mode, resulting in some
  checks being disabled despite being applicable.

## [1.3.2] - 2020-09-03

### Added
- Added `-v` and `--version` to all command-line tools.
- Added the default values (if any) to the help-strings of all command-line
  options.

### Changed
- Decoupled `--log-level` from command-line logging. Changed the default
  log-level to ERROR and made it apply to automatically generated log files
  as well.

### Fixed
- Fixed the pipeline failing on `jre_options` (now `jre-option`) in config
  files.
- Fixed the pipeline failing on empty options in config files from PALEOMIX
  v1.2.x.
- Fixed Bowtie2 using command-line options from the BWA makefile section.
- Fixed conda installation instructions and environment file for PALEOMIX
  1.3.x.

## [1.3.1] - 2020-09-01

### Fixed
- Updated shebangs to 'python3'. Patch courtesy of Andreas Tille.
- Added minimal support for previously removed command-line options, to
  prevent the pipelines from failing when used with old configuration files.

## [1.3.0] - 2020-08-31

PALEOMIX v1.3.0 is a major maintenance release, with the goal of porting
PALEOMIX to Python 3 and preparing for further work to update and expand the
pipeline. A number of deprecated tools and options have been removed, as has
support for very old versions of tools used by the pipelines.

Existing makefiles are compatible with PALEOMIX 1.3.0 with a few notable
exceptions:

- BAM pipeline support for the GATK Indel Realigner has been removed, and the
  options 'RealignedBAM' and 'RawBam' no longer have any effect. These are
  now simply ignored and a "raw" BAM is always produced. The Indel Realigner
  tool was removed from GATK as of GATK4 (released 2018), and continued
  support is not deemed worthwhile due to the minor benefit from running the
  Indel Realigner.
- BAM pipeline support for generating PCR duplicate histogram files for use
  with PreSeq has been removed. The option is simply ignored.
- BAM pipeline support for the AdapterRemoval options --pcr1 and --pcr2 has
  been removed, as these options are long deprecated and will be removed from
  AdapterRemoval. Use the --adapter1 and --adapter2 options as described in
  the BAM pipeline documentation.
- Phylo pipeline options for BCFTools must be updated to replace the option
  invoking the consensus caller ("-g") with "-c", or with "-m" for the
  multiallelic caller.
- The Phylo pipeline genotyping methods 'Random Sampling' and 'Reference
  Sequence' are no longer supported.

Please open an issue if features or options important to your work have been
removed.

### Added
- The BAM and Phylo pipelines print warnings when deprecated/removed options
  are used.
- A log-file is automatically created if errors are encountered during
  run-time.

### Changed
- Converted project from Python 2.7 to Python 3.5+.
- Removed internal copy of pyyaml and added dependency on ruamel.yaml.
- Command-line output was changed to a simpler, log-like output using
  coloredlogs.
- Bumped minimum version requirements for most tools used by the pipelines;
  minimum versions were largely informed by availability in Debian stretch.
- Changed naming of BAM index files created by the BAM pipeline from
  'filename.bai' to 'filename.bam.bai' in order to match the behavior of
  standard tools (e.g. samtools).
- The filenames of input FASTQ files are now used in the intermediate
  file-structure, with the goal of making the pipeline more robust to changes
  in input files.
- The pipeline no longer fails if a command generates more files than
  expected; instead this merely triggers a warning.
- Moved PCR duplicate filtering and rescaling to 'Features' in BAM pipeline
  makefiles.

### Fixed
- Fixed spurious warnings from pysam (htslib) when opening BAMs without index
  files.
### Removed
- Removed limited support for 32 bit systems.
- Removed the 'cat' command.
- Removed the 'duphist' command and the corresponding BAM pipeline feature.
- Removed the 'ena' command.
- Removed the 'sample_pileup' command.
- Removed the 'retable' command. A more performant standalone version can be
  found at https://github.com/MikkelSchubert/retable
- Removed the bam_pipeline 'remap' command.
- Removed entry-points other than the 'paleomix' command; that is to say the
  stand-alone 'bam_pipeline', 'phylo_pipeline', etc. commands.
- Removed data for the original publication of PALEOMIX. The instructions in
  that publication are outdated and cannot be carried out for current
  versions of PALEOMIX.
- Removed support for configuration files with per-host sections. Files are
  now assumed to contain only one set of command-line options.
- Removed the --to-dot option for pipelines.
- Removed keyboard shortcuts for modifying pipeline behavior during runtime.
- Removed undocumented options from Zonkey.
- Removed undocumented codeml support from the Phylo pipeline.
- Removed the 'Random Sampling' and 'Reference Sequence' genotyping methods.
- Removed makefile metadata (filename, hash, mtime) from BAM pipeline summary
  reports.
- Removed support for compressing intermediate FASTQ files using bzip2. Reads
  are now always compressed using gzip.
- Removed the ability to merge FASTQ files with the SplitLanesByFilenames
  option. Files are now always split, meaning that individual FASTQ files or
  pairs are mapped.
- Removed support for indel realignment using GATK due to its removal from
  GATK.
- Removed creation of FASTA sequence dictionaries as they were only needed by
  GATK.
- Removed support for labels for BAM pipeline prefixes.

## [1.2.14] - 2019-12-01

### Changed
- Improved handling of K-groups in zonkey database files.
- Changed the BAM pipeline version requirement for GATK to < v4.0, as the
  Indel Realigner has been removed in GATK v4.0.

### Fixed
- Fixed version detection of GATK for v4.0 (issue #23).

## [1.2.13.8] - 2019-10-27

### Changed
- Zonkey now identifies nuclear chromosomes by size instead of name; this is
  done to better handle FASTAs downloaded from different sources.

## [1.2.13.7] - 2019-10-15

### Fixed
- Fixed handling of digit-only chromosome names in Zonkey.
- Remove dashes from Zonkey MT genomes when running the 'mito' command.

## [1.2.13.6] - 2019-10-13

### Fixed
- Handle .*miss files created by some versions of plink in Zonkey.

## [1.2.13.5] - 2019-09-29

### Fixed
- Ignore the ValidateSamFile REF_SEQ_TOO_LONG_FOR_BAI warning when processing
  genomes with contigs too large for BAI index files.

## [1.2.13.4] - 2019-03-25

### Fixed
- Improved detection of Picard versions in cases where 'java' outputs
  additional text.

## [1.2.13.3] - 2018-11-01

### Fixed
- Fixed validation/read counting of pre-trimmed reads not including the mate
  1 files of paired-end reads. This resulted in the 'seq_retained_reads'
  count being half the expected value.

## [1.2.13.2] - 2018-04-22

### Fixed
- Additional fixes to divisions by zero in summary calculations.
- Fixed 'empty file' message if a FASTA file ends with an empty sequence.
- Renamed the pre-trimmed FASTQ validation/statistics file, to avoid failure
  if an older run was resumed.

## [1.2.13.1] - 2018-03-25

### Fixed
- Fixed divisions by zero if empty files are listed as pre-trimmed reads.

## [1.2.13] - 2018-03-25

### Added
- Added 'retable' command for pretty-printing whitespace separated data in
  the format previously used by the BAM pipeline.
- Basic statistics are collected for pre-trimmed reads in the BAM pipeline.

### Changed
- BAM pipeline tables are now saved as tab separated columns. The old
  pretty-printed format may be produced by running the 'retable' tool on the
  resulting files.
- Memory usage for the 'coverage' and 'depths' commands was reduced when
  using very big BED files.

### Fixed
- Fixed input / output files not being listed in 'pipe.errors' files.
- Use the same max open files limit for picard (ulimit -n minus headroom)
  when determining if the default should be changed and as the final value.
- Removed explicit test for the JRE version, which was failing on some
  (valid) runtimes. Java programs are still checked prior to running
  pipelines.
- Fixed changes to recent versions of Pysam breaking the alignment step in
  the BAM pipeline.
- Fixed various test failures occurring in different environments.
- Fixed validation of pre-trimmed FASTQ files in BAM pipelines stopping early
  if empty files are encountered.

### Removed
- Removed automatic migration of configuration files created by PALEOMIX
  versions prior to v1.2.0.
- Previously deprecated support for existing BAM files was removed from the
  BAM pipeline.

## [1.2.12] - 2017-08-13

### Fixed
- Fixed input / output files not being listed in 'pipe.errors' files.
- Use the same max open files limit for picard (ulimit -n minus headroom)
  when determining if the default should be changed and as the final value.

### Added
- The 'vcf_to_fasta' command now supports VCFs containing haploid genotype
  calls, courtesy of Graham Gower.

### Changed
- Require Pysam version 0.10.0 or later.

## [1.2.11] - 2017-06-09

### Fixed
- Fixed unhandled exception if a FASTA file for a prefix is missing in a BAM
  pipeline makefile.
- Fixed the 'RescaleQualities' option not being respected for non-global
  options in BAM pipeline makefiles.

## [1.2.10] - 2017-05-29

### Added
- Preliminary support for CSI indexed BAM files, required for genomes with
  chromosomes > 2^29 - 1 bp in size. Support is still missing in HTSJDK, so
  GATK cannot currently be used with such genomes. CSI indexing is enabled
  automatically when required.

### Fixed
- Reference sequences placed in the current directory no longer cause the BAM
  pipeline to complain about non-writable directories.
- The maximum number of temporary files used by picard will no longer be
  increased above the default value used by the picard tools.

### Changed
- The 'Status' of processes terminated by the pipeline will now be reported
  as 'Automatically terminated by PALEOMIX'. This is to help differentiate
  between processes that failed or were killed by an external source, and
  processes that were cleaned up by the pipeline itself.
- Pretty-printing of commands shown when commands fail has been revised to
  make it more readable, including explicit descriptions when output is piped
  from one process to another and vice versa.
- Commands are now shown in a format more suitable for running on the
  command-line, instead of as a Python list, when a node fails. Pipes are
  still specified separately.
- Improved error messages for missing programs during version checks, and for
  exceptions raised when calling Popen during version checks.
- Strip MC tags from reads with unmapped mates during cleanup; this is
  required since Picard (v2.9.0) ValidateSamFile considers such tags invalid.

## [1.2.9] - 2017-05-01

### Fixed
- Improved handling of BAM tags to prevent unintended type changes.
- Fixed 'rmdup_collapsed' underreporting the number of duplicate reads (in
  the 'XP' tag) when duplicates with different CIGAR strings were processed.

### Changed
- PCR duplicates detected for collapsed reads using 'rmdup_collapsed' are now
  identified based on alignments that include clipped bases. This matches the
  behavior of the Picard 'MarkDuplicates' command.
- Depending on work-load, 'rmdup_collapsed' may now run up to twice as fast.

## [1.2.8] - 2017-04-28

### Added
- Added FILTER entry for the 'F' filter used in vcf_filter. This corresponds
  to heterozygous sites where the allele frequency was not determined.
- Added 'dupcheck' command. This command roughly corresponds to the
  DetectInputDuplication step that is part of the BAM pipeline, and attempts
  to identify duplicate data (not PCR duplicates) by locating reads mapped to
  the same position, with the same name, sequence, and quality scores.
- Added link to the sample data used in the publication to the Zonkey
  documentation.

### Changed
- Only letters, numbers, and '-', '_', and '.' are allowed in sample-names
  used in Zonkey, in order to prevent invalid filenames and certain programs
  breaking on whitespace. Trailing whitespace is stripped.
- Show more verbose output when building Zonkey pipelines.
- Picard tools version 1.137 or later is now required by the BAM pipeline.
  This is necessary as newer BAM files (header version 1.5) would fail to
  validate when using earlier versions of Picard tools.

### Fixed
- Fixed validation nodes failing on output paths without a directory.
- Fixed possible uncaught exceptions when terminating cat commands used by
  FASTQ validation nodes, resulting in loss of error messages.
- Fixed makefile validation failing with an unhandled TypeError if unhashable
  types were found in unexpected locations, for example a dict found where a
  subset of strings were allowed. These now result in a proper MakeFileError.
- Fixed user options in the 'BWA' section of the BAM pipeline makefiles not
  being correctly applied when using the 'mem' or the 'bwasw' algorithms.
- Fixed some unit tests failing when the environment caused getlogin to fail.

## [1.2.7] - 2017-01-03

### Added
- PALEOMIX now includes the 'Zonkey' pipeline, a pipeline for detecting
  equine F1 hybrids from archeological remains. Usage is described in the
  documentation.

### Changed
- The wrongly named per-sample option 'Gender' in the phylogenetic pipeline
  makefile has been replaced with a 'Sex' option. This does not break
  backwards compatibility, and makefiles using the old name will still work
  correctly.
- The 'RescaleQualities' option has been merged with the 'mapDamage' Feature
  in the BAM pipeline makefile. The 'mapDamage' feature now takes the options
  'plot', 'model', and 'rescale', allowing more fine-grained control (see the
  sketch below).

### Fixed
- Fixed the phylogenetic pipeline complaining about missing sample genders
  (now sex) if no regions of interest had been specified. The pipeline will
  now complain about there being no regions of interest, instead.
- The 'random sampling' genotyper would misinterpret mapping qualities 10
  (encoded as '+') and 12 (encoded as '-') as indels, resulting in the
  genotyping failing. These mapping qualities are now correctly ignored.
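As a rough illustration of the merged feature described above, the
'mapDamage' entry under 'Features' in a BAM pipeline makefile might look as
follows; the chosen value is hypothetical:

    Options:
      Features:
        # One of 'plot' (plots only), 'model' (plots and damage models), or
        # 'rescale' (models and rescaled base qualities)
        mapDamage: rescale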
## [1.2.6] - 2016-10-12

### Changed
- PALEOMIX now uses the 'setproctitle' module for better compatibility;
  installing / upgrading PALEOMIX using pip (or equivalent tools) should
  automatically install this dependency.

### Fixed
- mapDamage plots should not require indexed BAMs; this fixed missing file
  errors for some makefile configurations.
- The version check for java now works correctly for OpenJDK JVMs.
- Pressing 'l' or 'L' to list the currently running tasks now correctly
  reports the total runtime of the pipeline, rather than 0s.
- Fixed broken version-check in setup.py breaking on versions of Python older
  than 2.7, preventing a meaningful message (patch by beeso018).
- The total runtime is now correctly reported when pressing the 'l' key
  during execution of a pipeline.
- The logger will automatically create the output directory if this does not
  already exist; previously logged messages could cause the pipeline to fail,
  even if these were not in themselves fatal.
- Executables required for version checks are now included in the prior
  checks for missing executables, to avoid version-checks failing due to
  missing executables.

### Added
- PALEOMIX will attempt to automatically limit the per-process maximum number
  of file-handles used when invoking Picard tools, in order to prevent
  failures due to exceeding the system limits (ulimit -n).

## [1.2.5] - 2015-06-06

### Changed
- Improved information capture when a node raises an unexpected exception,
  mainly for nodes implementing their own 'run' function (not CommandNodes).
- Improved printing of the state of output files when using the command-line
  option --list-output-files. Outdated files are now always listed as
  outdated, where previously these could be listed as 'Missing' if the task
  in question was queued to be run next.
- Don't attempt to validate prefixes when running 'trim_pipeline'; note that
  the structure of the Prefix section of the makefile still has to be valid.
- Reverted commit normalizing the strand of unmapped reads.
- The commands 'paleomix coverage' and 'paleomix depths' now accept records
  lacking read-group information by default; these are recorded as '<NA>' in
  the sample and library columns. It is further possible to ignore all
  read-group information using the --ignore-readgroups command-line option.
- The 'bam_pipeline mkfile' command now does limited validation of input
  'SampleSheet.csv' files, prints generated targets sorted alphabetically,
  and automatically generates unique names for identically named lanes.
  Finally, the target template is not included when automatically generating
  a makefile.
- The 'coverage' and 'depth' commands are now capable of processing files
  containing reads with and without read-groups, without requiring the use of
  the --ignore-readgroups command-line option. Furthermore, reads for which
  the read-group is missing in the BAM header are treated as if no read-group
  was specified for that read.
- The 'coverage' and 'depth' commands now check that input BAM files are
  sorted during startup and while processing a file.
- Normalized information printed by different progress UIs (--progress-ui),
  and included the maximum number of threads allowed.
- Restructured CHANGELOG based on http://keepachangelog.com/

### Fixed
- Fixed mislabeling of BWA nodes; all were labeled as 'SE'.
- Terminate read duplication checks when reaching the trailing, unmapped
  reads; this fixes uncontrolled memory growth when an alignment produces a
  large number of unmapped reads.
- Fixed the pipeline demanding the existence of files from lanes that had
  been entirely excluded due to ExcludeReads settings.
- Fixed some tasks needlessly depending on BAM files being indexed (e.g.
  depth histograms of a single BAM), resulting in missing file errors for
  certain makefile configurations.
- Fixed the per-prefix scan for duplicate input data not being run if no BAMs
  were set to be generated in the makefile, i.e. if both 'RawBAM' and
  'RealignedBAM' were set to 'off'.

### Deprecated
- Removed the BAM file from the bam_pipeline example, and added a deprecation
  warning; support for including preexisting BAMs will be removed in a future
  version of PALEOMIX.

## [1.2.4] - 2015-03-14

### Added
- Included PATH in the 'pipe.errors' file, to assist debugging of failed
  nodes.

### Fixed
- Fixed regression causing 'fixmate' not to be run on paired-end reads. This
  would occasionally cause paired-end mapping to fail during validation.

## [1.2.3] - 2015-03-11

### Added
- Added the ability for the pipelines to output the list of input files
  required for a given makefile, excluding any file built by the pipeline
  itself. Use the --list-input-files command-line option to view these.

### Changed
- Updated the 'bam_pipeline' makefile template; prefixes and targets are
  described more explicitly, and values for the prefix are commented out by
  default. The 'Label' option is no longer included in the template, as it is
  considered deprecated.
- Allow the 'trim_pipeline' to be run on a makefile without any prefixes;
  this eases use of this pipeline in the case where a mapping is not wanted.
- Improved handling of unmapped reads in 'paleomix cleanup'; additional flags
  (in particular 0x2; proper alignment) are now cleared if the mate is
  unmapped, and unmapped reads are always represented on the positive strand
  (clearing 0x4 and / or 0x20).

## [1.2.2] - 2015-03-10

### Added
- Documented work-arounds for a problem caused when upgrading an old version
  of PALEOMIX (< 1.2.0) by using 'pip' to install a newer version, in which
  all command-line aliases invoke the same tool.
- Added an expanded description of PALEOMIX to the README file.
- The tool 'paleomix vcf_filter' can now clear any existing value in the
  FILTER column, and only record the result of running the filters
  implemented by this tool. This behavior may be enabled by running
  vcf_filter with the command-line option '--reset-filter yes'.

### Changed
- Improved parsing of 'depths' histograms when running the phylogenetic
  pipeline genotyping step with 'MaxDepth: auto'; mismatches between the
  sample name in the table and in the makefile now only cause a warning,
  allowing for the common case where depths were manually recalculated (and
  --target was not set), or where files were renamed.
- The tool 'paleomix rmdup_collapsed' now assumes that ALL single-end reads
  (flag 0x1 not set) are collapsed. This ensures that pre-collapsed reads
  used in the pipeline are correctly filtered. Furthermore, reads without
  quality scores will be filtered, but only selected as the unique
  representative for a set of potential duplicates if no reads have quality
  scores. In that case, a random read is selected among the candidates.

### Fixed
- Fixed failure during mapping when using SAMTools v1.x.

## [1.2.1] - 2015-03-08

### Changed
- Removed dependency on BEDTools from the phylogenetic pipeline.
- Changed paleomix.__version__ to follow PEP 0396.

### Fixed
- Stop 'phylo_pipeline makefile' from always printing the help text.
- Fixed bug causing the phylo_pipeline to throw an exception if no additional
  command-line arguments were given.
- Allow simulation of reads for the phylogenetic pipeline example to be
  executed when PALEOMIX is run from a virtual environment.
## [1.2.0] - 2015-02-24

This is a major revision of PALEOMIX, mainly focused on reworking the
internals of the PALEOMIX framework, as well as cleaning up several warts in
the BAM pipeline. As a result, the default makefile has changed in a number
of ways, but backwards compatibility is still retained with older makefiles,
with one exception: where previously 'FilterUnmappedReads' would only be in
effect when 'MinQuality' was set to 0, this option is now independent of the
'MinQuality' option. In addition, it is now possible to install PALEOMIX via
PyPI, as described in the (partially) updated documentation now hosted on
ReadTheDocs.

### Changed
- Initial version of updated documentation hosted on ReadTheDocs, to replace
  the documentation previously hosted on the repository wiki.
- mapDamage files and models are now only kept in the
  {Target}.{Prefix}.mapDamage folder to simplify the file-structure;
  consequently, re-scaling can be re-done with different parameters by
  re-running the model step in these folders.
- Reworked BWA backtrack mapping to be carried out in two steps; this
  requires saving the .sai files (and hence more disk-space used by
  intermediate files, which can be removed afterwards), but allows better
  control over thread and memory usage.
- Validate paths in BAM makefiles, to ensure that these can be parsed, and
  that these do not contain keys other than '{Pair}'.
- The mapping-quality filter in the BAM pipeline / 'cleanup' command now only
  applies to mapped reads; consequently, setting a non-zero mapq value and
  setting 'FilterUnmappedReads' to 'no' will not result in unmapped reads
  being filtered.
- Improved the cleanup of BAM records following mapping, to better ensure
  that the resulting records follow the recommendations in the SAM spec. with
  regards to what fields / flags are set.
- Configuration files are now expected to be located in ~/.paleomix or
  /etc/paleomix rather than ~/.pypeline and /etc/pypeline. To ensure
  backwards compatibility, ~/.pypeline will be migrated when a pipeline is
  first run, and replaced with a symbolic link to the new location.
  Furthermore, files in /etc/pypeline are still read, but settings in
  /etc/paleomix take precedence.
- When parsing GTF files with 'gtf_to_bed', use either the attribute
  'gene_type' or 'gene_biotype', defaulting to the value 'unknown_genetype'
  if neither attribute can be found; also support reading of gz / bz2 files.
- The "ExcludeReads" section of the BAM pipeline makefile is now a dictionary
  rather than a list of strings (sketched at the end of this entry).
  Furthermore, 'Singleton' reads are now considered separately from
  'Single'-end reads, and may be excluded independently of those. This does
  not break backwards compatibility, but as a consequence 'Single' includes
  both single-end and singleton reads when using old makefiles.
- Added command-line option --nth-sample to the 'vcf_to_fasta' command,
  allowing FASTA construction from multi-sample VCFs; furthermore, if no BED
  file is specified, the entire genotype is constructed assuming that the VCF
  header is present.
- Modified the FASTA indexing node so that SAMTools v0.1.x and v1.x can be
  used (added a workaround for a missing feature in v1.x).
- The "Features" section of the BAM pipeline makefile is likewise now a
  dictionary rather than a list of strings (also sketched below), and spaces
  have been removed from feature names. This does not break backwards
  compatibility.
- EXaML v3.0+ is now required; the name of the examl parser executable is
  required to be 'parse-examl' (previously expected to be 'examlParser'),
  following the name used by EXaML v3.0+.
- Pysam v0.8.3+ is now required.
- AdapterRemoval v2.1.5+ is now required; it is now possible to provide a
  list of adapter sequences using --adapter-list, and to specify the number
  of threads used by AdapterRemoval via the --adapterremoval-max-threads
  command-line option.
- Renamed the module from 'pypeline' to 'paleomix' to avoid conflicts.
- Improved handling of FASTQ paths containing wildcards in the BAM pipeline,
  including additional checks to catch unequal numbers of files for
  paired-end reads.
- Switched to setuptools in preparation for PyPI registration.
- Avoid separate indexing of intermediate BAMs when possible, reducing the
  total number of steps required for typical runs.
- Restructured tests, removing (mostly unused) node tests.
- Reworked sub-command handling to enable migration to setuptools, and
  improved the safety of invoking these from the pipeline itself.
- The output of "trim_pipeline mkfile" now includes the section for
  AdapterRemoval, which was previously mistakenly omitted.
- Increased the speed of the checks for duplicate input data (i.e. the same
  FASTQ record(s) included multiple times in one or more files) by ~4x.

### Added
- PALEOMIX v1.2.0 is now available via PyPI ('pip install paleomix').
- Added the command 'paleomix ena', which is designed to ease the preparation
  of FASTQ reads previously recorded in a BAM pipeline makefile for
  submission to the European Nucleotide Archive; this command is currently
  unstable, and not available by default (see comments in 'main.py').
- Exposed the 'bam_pipeline remap' command, which eases re-mapping the hits
  identified against one prefix against other prefixes.
- Added validation of BED files supplied to the BAM pipeline, and expanded
  validation of BED files supplied to the phylogenetic pipeline, to catch
  some cases that may cause unexpected behavior or failure during runtime.
- Support SAMTools v1.x in the BAM pipeline; note, however, that the
  phylogenetic pipeline still requires SAMTools v0.1.19, due to major changes
  to BCFTools 1.x, which is not yet supported.
- Modified 'bam_cleanup' to support SAMTools 1.x; SAMTools v0.1.19 or v1.x+
  is henceforth required by this tool.
- The gender 'NA' may now be used for samples for which no filtering of sex
  chromosomes is to be carried out, and defaults to an empty set of
  chromosomes unless explicitly overridden.
- Pipeline examples are now available following installation via the commands
  "bam_pipeline example" and "phylo_pipeline example", which copy the example
  files to a folder specified by the user.
- Added the ability to specify the maximum number of threads used by GATK;
  currently only applicable for training of the indel realigner.

### Fixed
- Ensured that only a single header is generated when using multiple threads
  during genotyping, in order to avoid issues with programs unable to handle
  multiple headers.
- Information / error messages are now more consistently logged to stderr, to
  better ensure that results printed to stdout are not mixed with those.
- Fixed a bug which could cause the data duplication detection to fail when
  unmapped reads were included.
- Fixed default values not being shown for 'vcf_filter --help'.
- Fixed 'vcf_filter' when using pysam v0.8.4; it would raise an exception due
  to changes to the VCF record class.

### Removed
- Removed the 'paleomix zip' command, as this is no longer needed thanks to
  built-in gzip / bzip2 support in AdapterRemoval v2.
- Removed the command-line options --allow-missing-input-files,
  --list-orphan-files, --target, and --list-targets.
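To illustrate the dictionary-based 'ExcludeReads' and 'Features' sections
introduced in this release, a minimal sketch of the corresponding makefile
layout is shown below; the keys and values given here are illustrative
examples rather than a complete listing:

    Options:
      ExcludeReads:
        Single: no
        # Singletons may now be excluded independently of single-end reads
        Singleton: no
      Features:
        RealignedBAM: yes
        Coverage: yes
        Depths: yes
        Summary: yes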
## [1.1.1] - 2015-10-10

### Changed
- Detect the presence of carriage-returns ('\r') in FASTA files used as
  prefixes; these cause issues with some tools, and files should be converted
  using e.g. 'dos2unix' first.

### Fixed
- Minor fix to the help-text displayed as part of running information.

### Deprecated
- AdapterRemoval v1.x is now considered deprecated, and support will be
  dropped shortly. Please upgrade to v2.1 or later, which can be found at
  https://github.com/MikkelSchubert/adapterremoval

### Removed
- Dropped support for Picard tools versions prior to 1.124; this was
  necessitated by Picard tools merging into a single jar for all commands.
  This jar (picard.jar) is expected to be located in the --jar-root folder.

## [1.1.0] - 2015-09-08

### Added
- Check that regions of interest specified in the PhylogeneticInference
  section correspond to those specified earlier in the makefile.
- Added the ability for the genotyping step to automatically read MaxReadDepth
  values from depth-histograms generated by the BAM pipeline.
- Added support for the BWA algorithms "bwasw" and "mem", which are
  recommended for longer sequencing reads. The default remains the
  "backtrack" algorithm.
- Include a list of filters in the 'vcf_filter' output, and renamed these to
  be compatible with GATK (using ':' instead of '=').
- Support for genotyping an entire BAM (once, and only once), even if only a
  set of regions are to be called; this is useful in the context of larger
  projects, and when multiple overlapping regions are to be genotyped.
- Added validation of FASTA files for the BAM pipeline, in order to catch
  several types of errors that may lead to failure during mapping.
- Added options to the BAM / Phylo pipelines for writing a Dot-file of the
  full dependency tree of a pipeline.
- Added the ability to change the number of threads, and more, while the
  pipeline is running. Currently, already running tasks are not terminated if
  the maximum number of threads is decreased. Press 'h' during runtime to
  list commands.
- Support for AdapterRemoval v2.
- Allow the -Xmx option for Java to be overridden by the user.

### Changed
- Prohibit whitespace and parentheses in prefix paths; these cause problems
  with Bowtie2, due to the wrapper script used by this program.
- Allow "*" as the name for prefixes, when selecting prefixes by wildcards.
- Reworked the genotyping step to improve performance when genotyping sparse
  regions (e.g. genes), and to allow transparent parallelization.
- Require BWA 0.5.9, 0.5.10, 0.6.2, or 0.7.9+ for BWA backtrack; other
  versions have never been tested, or are known to contain bugs that result
  in invalid BAM files.
- The memory limit is no longer increased for 32-bit JREs by default, as the
  value used by the pipeline exceeded the maximum for this architecture.
- Improved verification of singleton-filtering settings in makefiles.
- Reworked the 'sample_pileup' command to reduce the memory usage for larger
  regions (e.g. entire chromosomes) by an order of magnitude. Also fixed some
  inconsistency in the calculation of distance to indels, resulting in some
  changes in results.
- Changed 'gtf_to_bed' to group by the gene biotype, instead of the source.

### Fixed
- Fixed a bug preventing new tasks from being started immediately after a
  task had failed; new tasks would only be started once a task had finished,
  or no running tasks were left.
- Fixed the MaxDepth calculation being limited to depths in the range
  0 .. 200.
- Added a workaround for a bug in Pysam, which caused parsing of some GTF
  files to fail if these contained unquoted values (e.g.
"exon_number 2;"). - Fixed bug causing some tasks to not be re-run if the input file changed. - Fixed off-by-one error for coverages near the end of regions / contigs. - Ensure that the correct 'paleomix' wrapper script is called when invoking the various other tools, even if this is not located in the current PATH. - Parse newer SAMTools / BCFTools version strings, so that a meaningful version check failure can be reported, as these versions are not supported yet due to missing functionality. - Fix potential deadlock in the genotyping tool, which could occur if either of the invoked commands failed to start or crashed / were killed during execution. - Fixed error in which summary files could not be generated if two (or more) prefixes using the same label contained contigs with overlapping names but different sizes. - Fixed problems calculating coverage, depths, and others, when when using a user-provided BED without a name column. - Improved termination of child-processes, when the pipeline is interrupted. ### Deprecated - The 'mkfile' command has been renamed to 'makefile' for both pipelines; the old command is still supported, but considered deprecated. ### Removed - Dropped support for the "verbose" terminal output due to excessive verbosity (yes, really). The new default is "running" (previously called "quiet"), which shows a list of currently running nodes at every update. ## [1.0.1] - 2014-04-30 ### Added - Add 'paleomix' command, which provides interface for the various tools included in the PALEOMIX pipeline; this reduces the number of executables exposed by the pipeline, and allows for prerequisite checks to be done in one place. - Added warning if HomozygousContigs contains contigs not included in any of the prefixes specified in the makefile. ### Changed - Reworking version checking; add checks for JRE version (1.6+), for GATK (to check that the JRE can run it), and improved error messages for unidentified and / or outdated versions, and reporting of version numbers and requirements. - Dispose of hsperfdata_* folders created by certain JREs when using a custom temporary directory, when running Picard tools. - Cleanup of error-message displayed if Pysam version is outdated. - Ensure that file-handles are closed in the main process before subprocess execution, to ensure that these recieve SIGPIPE upon broken pipes. - Improvements to handling of implicit empty lists in makefiles; it is now no longer required to explicitly specify an empty list. Thus, the following is equivalent assuming that the pipeline expects a list: ExplicitEmptyList: [] ImplicitEmptyList: - Tweak makefile templates; the phylo makefile now specifies Male/Female genders with chrM and chrX; for the BAM pipeline the ROIs sub-tree and Label is commented out by default, as these are optional. - Reduced start-up time for bigger pipelines. ### Fixed - Fix manifest, ensuring that all files are included in source distribution. - Fix regression in coverage / depths, which would fail if invoked for specific regions of interest. - Fix bug preventing Padding from being set to zero when genotyping. ## [1.0.0] - 2014-04-16 ### Changed - Switching to more traditional version-number tracking. 

[1.3.8]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.7...v1.3.8
[1.3.7]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.6...v1.3.7
[1.3.6]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.5...v1.3.6
[1.3.5]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.4...v1.3.5
[1.3.4]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.3...v1.3.4
[1.3.3]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.2...v1.3.3
[1.3.2]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.1...v1.3.2
[1.3.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.3.0...v1.3.1
[1.3.0]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.14...v1.3.0
[1.2.14]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.8...v1.2.14
[1.2.13.8]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.7...v1.2.13.8
[1.2.13.7]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.6...v1.2.13.7
[1.2.13.6]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.5...v1.2.13.6
[1.2.13.5]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.4...v1.2.13.5
[1.2.13.4]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.3...v1.2.13.4
[1.2.13.3]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.2...v1.2.13.3
[1.2.13.2]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13.1...v1.2.13.2
[1.2.13.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.13...v1.2.13.1
[1.2.13]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.12...v1.2.13
[1.2.12]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.11...v1.2.12
[1.2.11]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.10...v1.2.11
[1.2.10]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.9...v1.2.10
[1.2.9]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.8...v1.2.9
[1.2.8]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.7...v1.2.8
[1.2.7]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.6...v1.2.7
[1.2.6]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.5...v1.2.6
[1.2.5]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.4...v1.2.5
[1.2.4]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.3...v1.2.4
[1.2.3]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.2...v1.2.3
[1.2.2]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.1...v1.2.2
[1.2.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.2.0...v1.2.1
[1.2.0]: https://github.com/MikkelSchubert/paleomix/compare/v1.1.1...v1.2.0
[1.1.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.1.0...v1.1.1
[1.1.0]: https://github.com/MikkelSchubert/paleomix/compare/v1.0.1...v1.1.0
[1.0.1]: https://github.com/MikkelSchubert/paleomix/compare/v1.0.0...v1.0.1
[1.0.0]: https://github.com/MikkelSchubert/paleomix/compare/v1.0.0-RC...v1.0.0

paleomix-1.3.8/LICENSE
Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in
all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.

paleomix-1.3.8/MANIFEST.in
include .coveragerc
include CHANGES.md
include LICENSE
include MANIFEST.in
include paleomix_environment.yaml
include paleomix/yaml/CHANGES
include paleomix/yaml/LICENSE
include paleomix/yaml/PKG-INFO
include paleomix/yaml/README
include pylint.conf
include README.rst
include tox.ini

# Non source files
recursive-include docs *
recursive-include misc *
recursive-include paleomix/resources *
recursive-include tests *

paleomix-1.3.8/README.rst
**********************
The PALEOMIX pipelines
**********************

The PALEOMIX pipelines are a set of pipelines and tools designed to aid the
rapid processing of High-Throughput Sequencing (HTS) data: The BAM pipeline
processes de-multiplexed reads from one or more samples, through sequence
processing and alignment, to generate BAM alignment files useful in
downstream analyses; the Phylogenetic pipeline carries out genotyping and
phylogenetic inference on BAM alignment files, either produced using the BAM
pipeline or generated elsewhere; and the Zonkey pipeline carries out a suite
of analyses on low coverage equine alignments, in order to detect the
presence of F1-hybrids in archaeological assemblages. In addition, PALEOMIX
aids in metagenomic analysis of the extracts.

The pipelines have been designed with ancient DNA (aDNA) in mind, and include
several features especially useful for the analyses of ancient samples, but
can all be used for the processing of modern samples, in order to ensure
consistent data processing.

----------------------
Installation and usage
----------------------

Detailed instructions can be found in the `documentation
<https://paleomix.readthedocs.io/>`_ for PALEOMIX. For questions, bug
reports, and/or suggestions, please use the `GitHub tracker
<https://github.com/MikkelSchubert/paleomix/issues>`_ or contact Mikkel
Schubert at `MikkelSch@gmail.com <mailto:MikkelSch@gmail.com>`_.

---------
Citations
---------

The PALEOMIX pipelines have been published in Nature Protocols; if you make
use of PALEOMIX in your work, then please cite

  Schubert M, Ermini L, Sarkissian CD, Jónsson H, Ginolhac A, Schaefer R,
  Martin MD, Fernández R, Kircher M, McCue M, Willerslev E, and Orlando L.
  "**Characterization of ancient and modern genomes by SNP detection and
  phylogenomic and metagenomic analysis using PALEOMIX**". Nat Protoc. 2014
  May;9(5):1056-82. doi: `10.1038/nprot.2014.063
  <https://doi.org/10.1038/nprot.2014.063>`_. Epub 2014 Apr 10. PubMed PMID:
  `24722405 <https://www.ncbi.nlm.nih.gov/pubmed/24722405>`_.

The Zonkey pipeline has been published in Journal of Archaeological Science;
if you make use of this pipeline in your work, then please cite

  Schubert M, Mashkour M, Gaunitz C, Fages A, Seguin-Orlando A, Sheikhi S,
  Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Chuang R, Ermini L, Gamba C,
  Weinstock J, Vedat O, and Orlando L. "**Zonkey: A simple, accurate and
  sensitive pipeline to genetically identify equine F1-hybrids in
  archaeological assemblages**". Journal of Archaeological Science. 2017 Feb;
  78:147-157. doi: `10.1016/j.jas.2016.12.005
  <https://doi.org/10.1016/j.jas.2016.12.005>`_.

paleomix-1.3.8/docs/Makefile
# Minimal makefile for Sphinx documentation
#

# You can set these variables from the command line.
SPHINXOPTS    =
SPHINXBUILD   = sphinx-build
SOURCEDIR     = .
BUILDDIR      = _build

# Put it first so that "make" without argument is like "make help".
help:
	@$(SPHINXBUILD) -M help "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

.PHONY: help Makefile

# Catch-all target: route all unknown targets to Sphinx using the new
# "make mode" option. $(O) is meant as a shortcut for $(SPHINXOPTS).
%: Makefile
	@$(SPHINXBUILD) -M $@ "$(SOURCEDIR)" "$(BUILDDIR)" $(SPHINXOPTS) $(O)

paleomix-1.3.8/docs/_static/zonkey/incl_ts_0_tree_rooted.png
[binary PNG image data omitted]
0@&.u15IENDB`paleomix-1.3.8/docs/_static/zonkey/incl_ts_0_tree_unrooted.png000066400000000000000000000423101443173665200245670ustar00rootroot00000000000000PNG  IHDR@@Na pHYs+tEXtSoftwareGPL Ghostscript 9.19 IDATxl}J")Ҥ:fD:Rڃhm_{@E5D0_T-;(¡} Kw+Z&{E|9.6`gvwv[hPSJ)-)'^r1;d5拟=Immm P@=0@,`X"Db% K 0@,`X"Db% K 0@,`X"Db% K 0@,`X"Db% K 0@,`X"Db% K 0@,`X"Db% K 0@,`X"Db% K 0@,`X"Db% K 0@,`X"DbґVut/>B_Z:0S:8h<,ԁ+<^z__x<+o?/|6{|`X\)=}ç틗_ V{to]vX\{>|kˏ@,{r-[~A~˯ /kO8=xE;^&<]xs83<4M0l. kO}bTu !sBg>G]tߒfҙgG6``Yօ &''K;c'bf"oKO5u`(X];y<8>|ze[WZyŇMG> 2X}40 w=UxnڏʗC#Cw=td|OʿNR Vkkm뺮5x//]Jե%}6Yx=,ɯ?ُWƆ>=k/S@[a֦04M<۶M o~g`i尴6[g;dW~td`hnG_}>=>|ZYX]J3 ~OVǀJi8F7Wo _lm?@0 s܎r,!RFP4FGm;N>{bzǎa3s]WiA@z@SSS)@( Pf`-`YV:^ԇXSɍ 76 3qG4UU 5CԔeYt(E}?Ɔd,!6eYLFk^!f`RBcU.D 35ǡ543(}_>uTFXr\XTU5 V:3(E۶0Wr#EEllluyM,RUղ,&d9,bd/UUTL^*ĉi@ "Vq̭|>/`]"DEź"Dk ,J<|]1hƺ"}]^eYXAL}SZWmQXdr]P(X{z.c]vCEi_{`U(bvmϺ"lG5[+vXd)uE؎늆a (6UJ늊d2UUǡ1!TɷJ늆ar9_uEɑj:D*󺮷]ם=w\6e7.<_q'~[/:#u`O}Lkȸ^:W0ZՁ+YY~Kq{nW=[X]zf3ȤRBթ`8af\ۼxէ _93_yv#{OWҟx.^9aU>|z|tClGE&j|]qbb0_{ƇO?_z{?lv\r6?Jo>%hgDi[񘰸^,_zsW_{&\bnG| _|g^_YGҟ_/GZ=ў&0۶>y;y|LDrL:046t:0$ k҃ko:@`I!wT|ܵ/>6 ^ kJO'e,.-M΄çύ|Rޞ|1l4,!FfllC؟+yo>yGƇ?tîV+cCN=&c0 Ei(/z?+?G_{맟+{sRƇOӑե?ܱU,NL.-. vV򯽹~kl~_v Eb,2l "=05,%`G Ȝ8qC!"} 9hMFEceeCd!D+h,1@$*80@,`i YhЊ@>o q?sGBE`yyݻ$ 7nz,XܹS,[= H,o EE`nn0 hJ1=4vPr~Z=HJ+@3`x:f ;B Pennc!"B Z=H 젂 PE188H@`U4:h0d2BEQ[=H JMCE@^yttJzh@䦗 0UUggg[="H @ʧ\U4ΑV dy[/ 򇯭|_zݻOLR:xOWJX'[1}cv G30F_~np<~Cݟ{s]{^ۢ13`(B0K=~Dκ<}W*=㵻ttmHSjv Ai͍ͷ>ܕ B</#dVU(a @ Au @Ϸ4{`Z󼑑DoW*RQ[ʹXG"WцC7\@$dL`cbppa1v[*4:-y&R;g`,//70P(2Xvl!"}?!`ێm8Q*,30a !Mx'nm8{=L4Y>o(nȍ1 4Yə~ n4Ƙh22LG<XvlQ(J5F%30JB4f`I^2 =|K:U{ NL[Bկ "MyeY_xqkki#@,Ui!)Hg27n/RTQ $'gLD4JRaAd BZuر|>a63zTFFFP0 =+ 3-۶UjXVZJUUISpaظ3 t NjLKUUCgZ iay2TUa\NaYiD#qdo54M"LkGXT\.'?O"KӴ L> 90 mۖko쫂C(J$GEhk.U۳sKVL:WUUu˲d}x?K=n4???^(GhW3q.\P^nnPs)Md_L&c<{bbKؾ7FFFjGh5eMOOdSU4͉R7xM4MSQ*4rAWs)UU<Uxe6U7Y^04 0 I%ٶNDž)@.L ݒ$6%,;*l*CK6mJQ}(JRi2TUd_"%V'kl[!ѓFsJTbYVVpG`SOtym4Vr]… |3$UU'&&={IRZETJBшQm"d3=]nTjar yW?jQUga{he2yP\UY׎CC}Bق%-Ahŗmۮz }oEQZd yfΪy7555331j*30ˎ5118NHs'''^! Ghu]4wlۆCRUUBl߰0MSCd}}B iZPh(I$E+yRh !dhɆPL&8DH^UU#-Gh麾 S[[[C RWEax >V12 z&0[$m8$$_VCkppPufZH#jr6%ccc\TѶV&jЀIӎz D- IQ\-Bo`{-MӘE" C -`7U_R~xw>3Xqv;>sp}bZI?9na,#-M,Q߽\#oګ_~K VovscG?͍Wh'Luݨ- v q-t:]c͍o`KWw !Gߔvscw7.ykYûMI$m8$EQ:Z@$t]_XXxpO%=|_O)痖BւGv{~Nƽ`{C>Uo֍"(Ȉy嫈ϿSC]BKwv|~>9;{m?X;{Rr/s;K@߽}NvB- uREh躞f5M#Hض]1̇Jr7{ozںzȽ}<߼uں R\w:;]ME{ҫIXnv -]׹ROQq.ݼuH o]̇޸'枇yx?Xڛ.DAVLZ@Sw-f?rTquGe7S\#.㵻 hPQ$yB ӎw_Zkw:z𵕛[?^+Y~ɮ?|zW*~ɟ4ok&qhar0Zr C+NZ@\7nnlw+onlw޾o\[7dЈ (+277Ӷʎ%Z@VIסOvۥ[[p|33=4`{9c"t]m[H.u#Рy(] =4om5#Fmy2|'VM#pH6ɫ(y+++2&&&-C!6R3v -q-FUӸJw5/%%5ImP8S~<}4h/RZ㚦Z:UR}\U0 s܉' O|ʉ',r>`aeyaP:`*(R(jA=\MZ$G ,?|.Z^^@^{G<4mccիa5MK!4m(J:rl[[[2 _f`ar2mMlghi(kDx WL *$h(JB*0U+JsBWRf`۩ZjiL&)h։Ɯ uB("+5g}opp '3ӆC "vմIߴ lg+++y!>d{']כZI$+fQkհql{nJٶmfy^u;+oyZK(G} }lddddG`aG" É 44x}=}^qLI;C='2 !*6Wp 4E)o$~>ZLLpfS`E]Y\Y 7>Kޗ?J7׋b>0M^/4MuUUu]`GvmK3T!5s==e!b޺;_<5KB#\t@d7?I/*L4+N@@Q/is[+uu* 1u+ż .s{jȀzwcA$#nDVyW6,KUUa(a*D˲4M};XGFoxE_~Pw~P"vhvOy dVi&o^:Xk%ŋ+Ms못}e'գxXօ = ,PdHI"PQyלΚWb&dGO|Gx٩Kjpq%D`GRH:8nH8Gdv? 09 ;2 %#zz"F*K\Ncpg;ԣXHFY`PUrBWVFHƉ̇s P(4ɛ#qdhdRFҹd|ǔh6YJzyV@$#܎ ~֮G0]}̲Im`]ޣdnhvSSSdIFmFJaU'NhS7,m#{bhgɨBg`57eYy-b;w∝RxMtdA)F0]7#4M3 h(:q4v168<ϻxbw(:q4r&E'\x~j7\26vzFЉcX$uu]'O>zdx? 
(p*mm_yBDhdؑTek c;ћfffdCEQ<<Z({`t⸧?MaAy^yaEQ䊢fZ"Npc}N40%ĻQ쁭۸*n52dJVc"u]0͗KmF15qL&c6dM,!vuڊeY.\FC4u3Ms;=.!%04MBD>==fY`qigb#Za}dna&I+/!}YnmAJj)tdl>z:>nF9==mF$cv8uEfUmOMM^'4u]8XwO5x&~ =b۶a38ɘmsE d# ͓ia4N2P9K{DEEfbKgM+vAGW/,􈊊8q0,F ;Fd%#f`~eYeYӶm7E^l/?~F3+ՠc+H{6t`z9ӒPawV[W.6?]~Q:>?ԭ5!$KFmEq9FAK0*e؎^}=})ydBYPrG9u+rY .;7 Vqѓ,;vB0i:[2h3cJ)y^#$3uS=Ùڕ\뤖 _6Jsǽ#f1\^\ !bʸX7/7|g;BkjE6ހ;|8QRQum\7SF)Cq+'X.dݾVGci# 'pX;mv i}x@XLjc%<Z#]{,/?rFBVyd]'5QWqͿuus|P=\',ORq@GJFbo#ً0 ,',l[W3 }gCeHv)mmûApdUWl#zf fk>CszDE;ΣbyL=o\6{u%מG֌@Hmmmz OA䣿(G1]>11QyaXlCY5ϴdGAIؑX\Y1??_:f'''e(rU$$$6oG?wX7˲iChd,8# ꨤ4MYe$J2Ll傖'V'{Dm4lPЎbR~@GІ ### Z"&Od>^Y1::Z#ۭG4_2z"HjHg(hdz{`Q(hdX$Z]F/K60Db+7HK"b_LwI-i$#Rw#H;כ3؊s'ɘ1 XB9AGFO$#ވ d/=dq@="I |4QP]2D֔/B(S2f`b=Z` {Y^^ߠdK PJ߄w}fMx-d"s۟&x?ϛY{[zHdfT! !lBP7%#R VQu=ϳ,9q;%DIUUu/\y^^b%&nG^MFa[#DRGpN{mtv ?py6a]2,=EQq]W&JщC|@MıiLLL$DRf`ՌN;2 Ct%zLKlf`~[M-u]yDB+;7!;tKz ͸/!Jٶf}oXŒQAl]`$fa+@ˤZ=ƻiu'9~/D@4EQ `]8YSyӡ@%#\m`bzvLpfB+Kio@%#?b׽ iɼu]!M^rq\{O !Dtj;neci;A٩ciX>;>@B$%m.K oQii^%`Ug۶ad6%DAG d)籖=%bz!;{GhCZB Ǿ+8NDm(Y&hñO !0m(YFRK4 GLӬXuuz`(Xb^TpXE5Ǒ>FZy,`m8UԄ\B?;u+=5q[QQ\wBCݕ"Eo;{Ng?9gvBpsܺngagX;cV.;]'!o?rvj).N}g̮Zi%<0 5 #GO|S~5m[P{d@îdߙz(X^ϰޤ >TF뺦imz ug5gmCyM#8I LdZ=:RsQJ! n6;;!"q@IJɳ 0 0@HJ ![=@d`<@GIJ 6$%h&)Ft tm8$"w+}߲n=L"}{a(DJX"L}CD)|͛7=޹[n5yHJzx`yիW>.@1?{ԧ>8NEKr\OR>O%b$)&PU}43 '~@%($۶뺚qW"8dPiV:iMMMan}KDu]4u]^hs[B,gaRB袲b'K|7 CQqw)3Yimz8=0$+;4Ms]h[*!z8`;(UvLLLXaGb5Tv@J uR(8NǗ*;4V`mr9X=kff&r"ie EQl>/=}=aNOO??~ALxUUIL6xՁ:Wǟ|`woBEov_^~5l}I:3k'?-^i3$z A.勇ΟyR. !~.Q^{HtGksK?R6`P%Db% K 0@,`X"Di+~W~3tkDt?>99e'Q7ryHIP=]קv,-fYVk" 0@,`X"DbR" v Bt0 44Q9UUU(-њtDb% K 0@,1DqT*vn\\}a>'KO5q}Y~ԡee2挧g`y!a^f-˲ў߫BA1==mf OU>0 5M[^^BX1K-?<-˺pB«A7BPgffjs;wvM]\UT|!POuzzz||to6T|>?>>Xy]jss]wd秚f>ثn۶ݵ5 C4i樎4Ś`yy)o~/v?R?  Ø䓯EOUoF7{|4mnbNbYiiF!lVgX:.V [#j&g Wy'PŶm6jW/^LRTJJh`w0 ŞvJ,PSuGQ-fTT}VAX9IDATߟ.l@VJB&]Ӯi"MFFFΟ??88X(DYu{Q?7CA>UT''';?.//LOOvRGkI3W;Aȋk{>FSm꟪\3C}vf:^% K 0@,`X"Db% K 0@,`X"DbT˘oY8#ql.4M5M}_U(2::*o[y"Jmmm5;j@;4MNr9q0 @Q Ð J !<+a(*ϧi0JY^^˟4͊{GGGMtg~~>͖Ud2oL/״OK2BMӔySZBQK7ʿ}0r"˕VUU]^^.,V_,!{illb3lO)"Pe||<YbYV',=SzNB! 1oʕ+u`|?w]J[hmk611!Kϼ{u}bbBޞ{WWW#y#@kTyeYlg}_n5?;;iZ>7M4MyQ&mw``f`2122Ri^eٶ]~PQMTU}zzZQl6[*B?77Wz{e~rMRJrx/8SLrt\.纮<'UWȩ^i˭b0H؞S(u` h UUK KDb% K 0@,`X"Db% K 0@,4arNIENDB`paleomix-1.3.8/docs/_static/zonkey/mito_phylo.png000066400000000000000000000307171443173665200221520ustar00rootroot00000000000000PNG  IHDR@@Na pHYs+tEXtSoftwareGPL Ghostscript 9.19 IDATx}L\K3t0/gZٱ:ZMJ l*nU+eob+ݫ&wqI;3DWR1Dc3 @^f8'%uBlimޛ3 i6c[VUK1cDl4n6ڛ}%k cBpl74>3_;3pl{ 7~scZ~43xgι7g`Vno fm|f}fji'jNv5am?0;su)_f)QND%wwӷoD܃۱plJ6g`pn-+{~y(HDR"`)0@J %HDR"`)0@J %HDR"`)0@J %R U f}200l6ClJ@8흚'cccB@ g`]GGpdrrRUǏ !,K___aaCcV D+((㏍^d1::mTWW !Z[[]㊢h6s!6$H BD"zL& ClbItijj4zTSS#())1z!HjIJi*$5200mĒ"(FQ!XAArClbIw_l6{<: eeeSSS#KY02_̱p׮m:?| 7^ߞTciޱwehplJxl˿<0 `NMȕYT|9sz܂B׭f{o!7SGj! XC9SƦ~-] ؕQk٩Ow[!`$v5G:t=~MroϽuK;>3^72ˉ;MBplSG 3;gXo Į_#?裄S{Z|o?=! 
paleomix-1.3.8/docs/acknowledgements.rst

================
Acknowledgements
================

The PALEOMIX pipeline has been developed by researchers at the `Orlando Group`_ at the `Centre for GeoGenetics`_, University of Copenhagen, Denmark. Its development was supported by the Danish Council for Independent Research, Natural Sciences (FNU); the Danish National Research Foundation (DNRF94); a Marie-Curie Career Integration grant (FP7 CIG-293845); the Lundbeck foundation (R52-A5062).

.. _Orlando Group: http://geogenetics.ku.dk/research_groups/palaeomix_group/
.. _Centre for GeoGenetics: http://geogenetics.ku.dk/

paleomix-1.3.8/docs/bam_pipeline/configuration.rst

.. highlight:: ini

.. _bam_configuration:

Configuring the BAM pipeline
============================

The BAM pipeline supports a number of command-line options (see `paleomix bam run --help`). These options may be set directly on the command-line (e.g. using `--max-threads 16`), but it is also possible to set default values for such options.

This is accomplished by writing options in `~/.paleomix/bam_pipeline.ini`, such as::

    max-threads = 16
    bowtie2-max-threads = 1
    bwa-max-threads = 1
    jar-root = /home/username/install/jar_root
    jre-option = -Xmx4g
    log-level = warning
    temp-root = /tmp/username/bam_pipeline

Options in the configuration file correspond directly to command-line options for the BAM pipeline, with leading dashes removed. For example, the command-line option `--max-threads` becomes `max-threads` in the configuration file.

Options specified on the command-line take precedence over those in the `bam_pipeline.ini` file. For example, if `max-threads` is set to 4 in the `bam_pipeline.ini` file, but the pipeline is run using `paleomix bam run --max-threads 10`, then the max threads value is set to 10.
paleomix-1.3.8/docs/bam_pipeline/filestructure.rst

.. highlight:: Yaml

.. _bam_filestructure:

File structure
==============

The following section explains the file structure of the BAM pipeline example project (see :ref:`examples`), which results if that project is executed::

    ExampleProject:
      Synthetic_Sample_1:
        ACGATA:
          Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
          Lane_2:
            Singleton: data/ACGATA_L2/reads.singleton.truncated.gz
            Collapsed: data/ACGATA_L2/reads.collapsed.gz
            CollapsedTruncated: data/ACGATA_L2/reads.collapsed.truncated.gz
        GCTCTG:
          Lane_1: data/GCTCTG_L1_R1_*.fastq.gz
        TGCTCA:
          Options:
            BWA:
              MinQuality: 30
          Lane_1: data/TGCTCA_L1_R1_*.fastq.gz
          Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

Once executed, this example is expected to generate the following result files, depending on which options are enabled:

* ExampleProject.rCRS.bam
* ExampleProject.rCRS.bam.bai
* ExampleProject.rCRS.coverage
* ExampleProject.rCRS.depths
* ExampleProject.rCRS.mapDamage/
* ExampleProject.summary

As well as a folder containing intermediate results:

* ExampleProject/

.. warning::
    Please be aware that the internal file structure of PALEOMIX may change between major revisions (e.g. v1.1 to v1.2), but is not expected to change between minor revisions (v1.1.1 to v1.1.2). Consequently, if you wish to re-run an old project with the PALEOMIX pipeline, it is recommended to either use the same version of PALEOMIX, or to remove the folder containing intermediate files before starting (see below), in order to ensure that analyses are re-run from scratch.

Primary results
---------------

These files are the main results generated by the PALEOMIX pipeline:

**ExampleProject.rCRS.bam** and **ExampleProject.rCRS.bam.bai**
    Final BAM file and its index file (.bai), created using "samtools index". If rescaling has been enabled, this BAM will contain reads processed by mapDamage.

**ExampleProject.rCRS.mapDamage/**
    Per-library analyses generated using mapDamage2.0. If rescaling or modeling is enabled, these folders also include the model files generated for each library. See the `mapDamage2.0 documentation`_ for a description of these files.

**ExampleProject.rCRS.coverage**
    Coverage statistics generated using the 'paleomix coverage' command. These include per-sample, per-library, and per-contig / chromosome breakdowns.

**ExampleProject.rCRS.depths**
    Depth histogram generated using the 'paleomix depths' command. As with the coverage, this information is broken down by sample, library, and contig / chromosome.

**ExampleProject.summary**
    A summary table, which is created for each target if enabled in the makefile. This table contains a summary of the project, including the number / types of reads processed, average coverage, and other statistics broken down by prefix, sample, and library.

.. warning::
    Some statistics will be missing if pre-trimmed reads are included in the makefile, since PALEOMIX relies on the output from the adapter trimming software to collect these values.

Intermediate results
--------------------

The BAM pipeline uses a simple file structure that corresponds to the structure of targets in the makefile. A folder is created for each target in the makefile (here "ExampleProject"). This folder contains a folder for the processed FASTQ reads, and a folder for each prefix, as well as some additional files used in certain analytical steps (see below):

.. code-block:: bash

    $ ls ExampleProject/
    reads/
    rCRS/
    [...]

Processed reads
^^^^^^^^^^^^^^^

Each of these folders contains a directory structure that corresponds to that of the makefiles. In addition, named folders are generated from each input FASTQ file or pair of FASTQ files:

.. code-block:: bash

    ExampleProject/
        reads/
            Synthetic_Sample_1/
                ACGATA/
                    Lane_1/
                        ACGATA_L1_Rx_01.fastq.gz/
                        ACGATA_L1_Rx_02.fastq.gz/
                        ACGATA_L1_Rx_03.fastq.gz/
                        ACGATA_L1_Rx_04.fastq.gz/
                        [...]

The contents of the lane folders contain the output of AdapterRemoval, with most filenames corresponding to the read-types listed in the makefile under the option "ExcludeReads":

.. code-block:: bash

    $ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
    reads.settings                # Settings / statistics file generated by AdapterRemoval
    reads.discarded.gz            # Low-quality or short reads
    reads.truncated.gz            # Single-ended reads following adapter-removal
    reads.collapsed.gz            # Paired-ended reads collapsed into single reads
    reads.collapsed.truncated.gz  # Collapsed reads trimmed at either termini
    reads.pair1.truncated.gz      # The first mate read of paired reads
    reads.pair2.truncated.gz      # The second mate read of paired reads
    reads.singleton.truncated.gz  # Paired-ended reads for which one mate was discarded

If the reads were pre-trimmed (as is the case for Lane_2 of the library ACGATA), then a single file is generated to signal that the reads have been validated (attempting to detect invalid quality scores and/or file formats):

.. code-block:: bash

    $ ls ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/
    reads.pretrimmed.validated

The .validated file is an empty file marking the successful validation of pre-trimmed reads. If the validation fails with a false positive, creating this file for the lane in question allows one to bypass the validation step.

Mapped reads
^^^^^^^^^^^^

The file structure used for mapped reads is similar to that described for the trimmed reads, but includes a larger number of files. Using lane "Lane_1" of library "ACGATA" as an example, the following files are created in each folder for that lane, with each type of reads (collapsed, collapsedtruncated, paired, and single) represented depending on the lane type (SE or PE):

.. code-block:: bash

    $ ls ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1/ACGATA_L1_Rx_01.fastq.gz/
    collapsed.bam        # The mapped reads in BAM format
    collapsed.bam.bai    # Index file used for accessing the .bam file
    collapsed.coverage   # Coverage statistics
    collapsed.sai        # Intermediate alignment file generated by the BWA backtrack algorithm
    collapsed.validated  # Log-file from Picard ValidateSamFile marking that the .bam file has been validated
    [...]

For each library, two sets of files are created in the folder corresponding to the sample; these correspond to the way in which duplicates are filtered, with one method for "normal" reads (paired and single-ended reads), and one method for "collapsed" reads (taking advantage of the fact that both external coordinates of the mapping are informative). Note however, that "collapsedtruncated" reads are included among normal reads, as at least one of the external coordinates is unreliable for these. Thus, the following files are observed:

.. code-block:: bash

    ExampleProject/
        rCRS/
            Synthetic_Sample_1/
                ACGATA.duplications_checked
                ACGATA.rmdup.*.bam
                ACGATA.rmdup.*.bam.bai
                ACGATA.rmdup.*.coverage
                ACGATA.rmdup.*.validated

With the exception of the "duplications_checked" file, these correspond to the files created in the lane folder. The "duplications_checked" file marks the successful completion of a validation step which attempts to detect data duplication due to the inclusion of the same reads / files multiple times (not to be confused with PCR duplicates).

If rescaling is enabled, a set of files is created for each library, containing the BAM file generated using the mapDamage2.0 quality rescaling functionality, but otherwise corresponding to the files described above:

.. code-block:: bash

    ExampleProject/
        rCRS/
            Synthetic_Sample_1/
                ACGATA.rescaled.bam
                ACGATA.rescaled.bam.bai
                ACGATA.rescaled.coverage
                ACGATA.rescaled.validated

Finally, the resulting BAMs for each library (rescaled or not) are merged and validated. This results in the creation of the following files in the target folder:

.. code-block:: bash

    ExampleProject/
        rCRS.validated               # Signifies that the final BAM has been validated
        rCRS.duplications_checked    # Similar to above, but catches duplicates across samples / libraries
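Which of the primary result files listed above are generated is controlled by the 'Features' section of the makefile (see :ref:`bam_makefile`). As a rough sketch, the settings below would produce all of the result files listed for the example project; the output associated with each feature is noted in the comments:

.. code-block:: yaml

    Features:
      mapDamage: plot  # ExampleProject.rCRS.mapDamage/
      Coverage: yes    # ExampleProject.rCRS.coverage
      Depths: yes      # ExampleProject.rCRS.depths
      Summary: yes     # ExampleProject.summary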
.. _mapDamage2.0 documentation: http://ginolhac.github.io/mapDamage/\#a7
.. _preseq: http://smithlabresearch.org/software/preseq/

paleomix-1.3.8/docs/bam_pipeline/index.rst

.. _bam_pipeline:

BAM Pipeline
============

**Table of Contents:**

.. toctree::

   overview.rst
   requirements.rst
   configuration.rst
   usage.rst
   makefile.rst
   filestructure.rst

The BAM Pipeline is a pipeline designed for the processing of demultiplexed high-throughput sequencing (HTS) data, primarily data generated on Illumina high-throughput sequencing platforms. The pipeline carries out trimming of adapter sequences, filtering of low-quality reads, merging of overlapping mate-pairs to reduce the error rate, mapping of reads against one or more reference genomes / sequences, filtering of PCR duplicates, analyses and correction of post-mortem DNA damage, estimation of coverage, depth-of-coverage histograms, and more.

To ensure the correctness of the results, the pipeline invokes frequent validation of intermediate results and attempts to detect common errors in input files. To allow tailoring of the process to the needs of individual projects, many features may be disabled, and the behavior of most programs can be tweaked to suit the specifics of a given project.
paleomix-1.3.8/docs/bam_pipeline/makefile.rst

.. highlight:: YAML

.. _bam_makefile:

Makefile description
====================

.. contents::

The following sections review the options available in the BAM pipeline makefiles. As described in the :ref:`bam_usage` section, a default makefile may be generated using the 'paleomix bam\_pipeline makefile' command. For clarity, the location of options in subsections is specified by concatenating the names using '\:\:' as a separator. Thus, in the following (simplified) example, the 'UseSeed' option (line 13) would be referred to as 'Options \:\: Aligners \:\: BWA \:\: UseSeed':

.. code-block:: yaml
    :emphasize-lines: 13
    :linenos:

    Options:

      # Settings for aligners supported by the pipeline
      Aligners:
        # Choice of aligner software to use, either "BWA" or "Bowtie2"
        Program: BWA

        # Settings for mappings performed using BWA
        BWA:
          # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
          # Post-mortem damage localizes to the seed region, which BWA expects to
          # have few errors (sets "-l"). See http://pmid.us/22574660
          UseSeed: yes

Specifying command-line options
-------------------------------

For several programs it is possible to directly specify command-line options; this is accomplished in one of 3 ways. Firstly, for command-line options that take a single value, this is accomplished simply by specifying the option and value as any other option. For example, if we wish to supply the option --mm 5 to AdapterRemoval, then we would list it as "--mm: 5" (all other options omitted for brevity)::

    AdapterRemoval:
      --mm: 5

For options that do not take any values, such as the AdapterRemoval `--trimns` option (enabling the trimming of Ns in the reads), these are specified either as "--trimns: ", with the value left blank, or as "--trimns: yes". The following are therefore equivalent::

    AdapterRemoval:
      --trimns:      # Method 1
      --trimns: yes  # Method 2

In some cases the BAM pipeline will enable features by default, but still allow these to be overridden. In those cases, the feature can be disabled by setting the value to 'no' (without quotes), as shown here::

    AdapterRemoval:
      --trimns: no

If you need to provide the text "yes" or "no" as the value for an option, it is necessary to put these in quotes::

    --my-option: "yes"
    --my-option: "no"

In some cases it is possible or even necessary to specify an option multiple times. Due to the way YAML works, this is not possible to do directly. Instead, the pipeline allows multiple instances of the same option by providing these as a list::

    --my-option:
      - "yes"
      - "no"
      - "maybe"

The above will be translated into calling the program in question with the options "--my-option yes --my-option no --my-option maybe".

Options
-------

By default, the 'Options' section of the makefile contains the following:

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lines: 2-107

General Options
^^^^^^^^^^^^^^^

**Options \:\: Platform**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 7
    :lines: 7-8

The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The `SAM/BAM specification`_ lists the valid platforms, which currently include 'CAPILLARY', 'HELICOS', 'ILLUMINA', 'IONTORRENT', 'LS454', 'ONT', 'PACBIO', and 'SOLID'.

**Options \:\: QualityOffset**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 9
    :lines: 9-13

The QualityOffset option refers to the starting ASCII value used to encode `Phred quality-scores`_ in user-provided FASTQ files, with the possible values of 33, 64, and 'Solexa'. For most modern data, this will be 33, corresponding to ASCII characters in the range '!' to 'J'. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range '@' to 'h', and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from ';' to 'h'. For a visual representation of this, refer to the Wikipedia article linked above.
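For example, if a project consists of older FASTQ files encoded with Phred+64 quality scores, the offset could be overridden as sketched below; the remaining options are assumed to be left at their defaults:

.. code-block:: yaml

    Options:
      Platform: Illumina
      # The FASTQ files in this (hypothetical) project use Phred+64 scores,
      # i.e. ASCII characters in the range '@' to 'h'
      QualityOffset: 64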
Adapter Trimming
^^^^^^^^^^^^^^^^

The "AdapterRemoval" subsection allows for options that are applied when AdapterRemoval is applied to the FASTQ reads supplied by the user. For a more detailed description of command-line options, please refer to the `AdapterRemoval documentation`_. A few particularly important options are described here:

**Options \:\: AdapterRemoval \:\: --adapter1** and **Options \:\: AdapterRemoval \:\: --adapter2**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 17
    :lines: 17-19

These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination. Thus, the sequence provided for --adapter1 is expected to be found in the mate 1 reads, and the sequence specified for --adapter2 is expected to be found in the mate 2 reads. In both cases, these should be specified in the orientation in which they appear in these files (i.e. it should be possible to grep the files for these, assuming that the reads were long enough, and treating Ns as wildcards). It is very important that these be specified correctly. Please refer to the `AdapterRemoval documentation`_ for more information.

.. note::
    As of AdapterRemoval version 2.1, it is possible to use multiple threads to speed up trimming of adapter sequences. This is accomplished not by setting the --threads command-line option in the makefile, but by supplying the --adapterremoval-max-threads option to the BAM pipeline itself:

    .. code-block:: bash

        $ paleomix bam run makefile.yaml --adapterremoval-max-threads 2

**Options \:\: AdapterRemoval \:\: --mm**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 22
    :lines: 22

Sets the fraction of mismatches allowed when aligning reads / adapter sequences. If the specified value (MM) is greater than 1, the rate is calculated as 1 / MM; otherwise the value is used directly. To set, replace the default value as desired::

    --mm: 3    # Maximum mismatch rate of 1 / 3
    --mm: 5    # Maximum mismatch rate of 1 / 5
    --mm: 0.2  # Maximum mismatch rate of 1 / 5

**Options \:\: AdapterRemoval \:\: --minlength**

The minimum length required after read merging, adapter trimming, and base-quality trimming; resulting reads shorter than this length are discarded, and thereby excluded from further analyses by the pipeline. A value of at least 25 bp is recommended to cut down on the rate of spurious alignments; if possible, a value of 30 bp may be used to greatly reduce the fraction of spurious alignments, with smaller gains for greater minimums [Schubert2012]_.

.. warning::
    The default value used by PALEOMIX for `--minlength` (25 bp) differs from the default value for AdapterRemoval (15 bp). Thus, if a minimum length of 15 bp is desired, it is necessary to explicitly state so in the makefile; simply commenting out this command-line argument is not sufficient.
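For instance, to revert to AdapterRemoval's own 15 bp minimum rather than the PALEOMIX default, the option must be stated explicitly; a minimal sketch, with all other options left unchanged:

.. code-block:: yaml

    Options:
      AdapterRemoval:
        # Explicitly use the AdapterRemoval default, rather than the 25 bp
        # default applied by the BAM pipeline
        --minlength: 15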
**Options \:\: AdapterRemoval \:\: --collapse**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 25
    :lines: 25

If enabled, AdapterRemoval will attempt to combine overlapping paired-end reads into a single (potentially longer) sequence. This has at least two advantages, namely that longer reads allow for less ambiguous alignments against the target reference genome, and that the fidelity of the overlapping region (potentially the entire read) is improved by selecting the highest quality base when discrepancies are observed. The names of reads thus merged are prefixed with either 'M\_' or 'MT\_', with the latter marking reads that have been trimmed from the 5' or 3' termini following collapse, and which therefore do not represent the full insert. To disable this behavior, set the option to 'no' (without quotes)::

    --collapse: yes  # Option enabled
    --collapse: no   # Option disabled

.. note::
    This option may be combined with the 'ExcludeReads' option (see below) to either eliminate or select for short inserts, depending on the expectations from the experiment. I.e. for ancient samples, where most inserts should be short enough to allow collapsing (< 2x read length - 11, by default), excluding paired (uncollapsed) and singleton reads may help reduce the fraction of exogenous DNA mapped.

**Options \:\: AdapterRemoval \:\: --trimns**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 26
    :lines: 26

If set to 'yes' (without quotes), AdapterRemoval will trim uncalled bases ('N') from the 5' and 3' ends of the reads. Trimming will stop at the first called base ('A', 'C', 'G', or 'T'). If both --trimns and --trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' ends of the reads. To disable, set the option to 'no' (without quotes)::

    --trimns: yes  # Option enabled
    --trimns: no   # Option disabled

**Options \:\: AdapterRemoval \:\: --trimqualities**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 27
    :lines: 27

If set to 'yes' (without quotes), AdapterRemoval will trim low-quality bases from the 5' and 3' ends of the reads. Trimming will stop at the first base whose (Phred encoded) quality score is greater than the minimum quality specified using the command-line option --minquality; this value defaults to 2. If both --trimns and --trimqualities are enabled, then consecutive stretches of Ns and / or low-quality bases are trimmed from the 5' and 3' ends of the reads. To disable, set the option to 'no' (without quotes)::

    --trimqualities: yes  # Option enabled
    --trimqualities: no   # Option disabled
Short read aligners
^^^^^^^^^^^^^^^^^^^

This section allows selection between the supported short read aligners (currently BWA [Li2009a]_ and Bowtie2 [Langmead2012]_), as well as setting options for these, individually:

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 29
    :lines: 29-32

To select a mapping program, set the 'Program' option appropriately::

    Program: BWA      # Using BWA to map reads
    Program: Bowtie2  # Using Bowtie2 to map reads

Short read aligners - BWA
"""""""""""""""""""""""""

The following options are applied only when running the BWA short read aligner; see the section "Short read aligners" above for how to select this aligner.

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 35
    :lines: 35-49

**Options \:\: Aligners \:\: BWA \:\: Algorithm**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 36
    :lines: 36-38

The mapping algorithm to use; options are 'backtrack' (corresponding to 'bwa aln'), 'bwasw', and 'mem'. Additional command-line options may be specified for these. Algorithms are selected as follows::

    Algorithm: backtrack  # 'Backtrack' algorithm, using the command 'bwa aln'
    Algorithm: bwasw      # 'SW' algorithm for long queries, using the command 'bwa bwasw'
    Algorithm: mem        # 'mem' algorithm, using the command 'bwa mem'

.. warning::
    The alignment algorithms 'bwasw' and 'mem' currently cannot be used with input data that is encoded using QualityOffset 64 or 'Solexa'. This is a limitation of PALEOMIX, and will be resolved in future versions. In the mean time, this can be circumvented by converting FASTQ reads to the standard quality-offset 33, using for example `seqtk`_.

**Options \:\: Aligners \:\: BWA \:\: MinQuality**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 39
    :lines: 39-40

Specifies the minimum mapping quality of alignments produced by BWA. Any aligned read with a quality score below this value is removed during the mapping process. Note that while unmapped reads have a quality of zero, these are not excluded by a non-zero 'MinQuality' value. To filter unmapped reads, use the option 'FilterUnmappedReads' (see below). To set this option, replace the default value with the desired minimum::

    MinQuality: 0   # Keep all hits
    MinQuality: 25  # Keep only hits where mapping-quality >= 25

**Options \:\: Aligners \:\: BWA \:\: FilterUnmappedReads**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 41
    :lines: 41-42

Specifies whether or not unmapped reads (reads not aligned to a target sequence) are to be retained in the resulting BAM files. If set to 'yes' (without quotes), all unmapped reads are discarded during the mapping process, while setting the option to 'no' (without quotes) retains these reads in the BAM. By convention, paired reads in which one mate is unmapped are assigned the same chromosome and position, while no chromosome / position is assigned to unmapped single-end reads. To change this setting, replace the value with either 'yes' or 'no' (without quotes)::

    FilterUnmappedReads: yes  # Remove unmapped reads during alignment
    FilterUnmappedReads: no   # Keep unmapped reads

**Options \:\: Aligners \:\: BWA \:\: \***

Additional command-line options may be specified for the selected alignment algorithm, as described in the "Specifying command-line options" section above; see also the examples listed for Bowtie2 below. Note that for the 'backtrack' algorithm, it is only possible to specify options for the 'bwa aln' call.
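As an illustration, an ancient-DNA oriented configuration might combine several of the options described above; this is only a sketch, and suitable values depend on the individual project:

.. code-block:: yaml

    Aligners:
      Program: BWA
      BWA:
        Algorithm: backtrack
        # Keep only confidently mapped hits
        MinQuality: 25
        FilterUnmappedReads: yes
        # Disable the seed to avoid penalizing reads with post-mortem damage
        # near the 5' terminus (see the makefile comments above)
        UseSeed: no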
Short read aligners - Bowtie2
"""""""""""""""""""""""""""""

The following options are applied only when running the Bowtie2 short read aligner; see the section "Short read aligners" above for how to select this aligner.

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 51
    :lines: 51-65

**Options \:\: Aligners \:\: Bowtie2 \:\: MinQuality**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 53
    :lines: 53-54

See 'Options \:\: Aligners \:\: BWA \:\: MinQuality' above.

**Options \:\: Aligners \:\: Bowtie2 \:\: FilterUnmappedReads**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 55
    :lines: 55-56

See 'Options \:\: Aligners \:\: BWA \:\: FilterUnmappedReads' above.

**Options \:\: Aligners \:\: Bowtie2 \:\: \***

Additional command-line options may be specified for Bowtie2, as described in the "Specifying command-line options" section above. Please refer to the `Bowtie2 documentation`_ for more information about available command-line options.

mapDamage options
^^^^^^^^^^^^^^^^^

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 67
    :lines: 67-71

This subsection is used to specify options for mapDamage2.0 when plotting *post-mortem* DNA damage, when building models of the *post-mortem* damage, and when rescaling quality scores to account for this damage. In order to enable plotting, modeling, or rescaling of quality scores, please see the 'mapDamage' option in the 'Features' section below.

.. note::
    It may be worthwhile to tweak mapDamage parameters before building a model of *post-mortem* DNA damage; this may be accomplished by running the pipeline with the 'mapDamage' feature set to 'plot' (with or without quotes), inspecting the plots generated per library, and then tweaking parameters as appropriate, before setting 'mapDamage' to 'model' (with or without quotes). Should you wish to change the modeling and rescaling parameters after having already run the pipeline with rescaling enabled, simply remove the mapDamage files generated for the relevant libraries (see the :ref:`bam_filestructure` section).

.. warning::
    Rescaling requires a certain minimum number of C>T and G>A substitutions before it is possible to construct a model of *post-mortem* DNA damage. If mapDamage fails with an error indicating that "DNA damage levels are too low", then it is necessary to disable rescaling for that library to continue.

**Options \:\: mapDamage \:\: --downsample**

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 69
    :lines: 69-71

By default the BAM pipeline only samples 100k reads for use in constructing mapDamage plots; in our experience, this is sufficient for accurate plots and models. If no downsampling is to be done, this value can be set to 0 to disable this feature::

    --downsample: 100000   # Sample 100 thousand reads
    --downsample: 1000000  # Sample 1 million reads
    --downsample: 0        # No downsampling

**Options \:\: mapDamage \:\: \***

Additional command-line options may be supplied to mapDamage, just like the `--downsample` parameter, as described in the "Specifying command-line options" section above. These are used during plotting and rescaling (if enabled).
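Following the note above, a project could thus first be run with plotting only, and switched to model building once the plots look reasonable; a minimal sketch of the relevant pieces of the makefile, with `--downsample` as the only mapDamage option set:

.. code-block:: yaml

    Options:
      mapDamage:
        # Use every hit when building plots / models
        --downsample: 0
      Features:
        # Change 'plot' to 'model' (or 'rescale') once the per-library
        # plots have been inspected and the parameters tweaked
        mapDamage: plot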
Excluding read-types
^^^^^^^^^^^^^^^^^^^^

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 73
    :lines: 73-87

During the adapter-trimming and read-merging step, AdapterRemoval will generate a selection of different read types. This option allows certain read-types to be excluded from further analyses. In particular, it may be useful to exclude non-collapsed (paired and singleton) reads when processing (ancient) DNA for which only short inserts are expected, since this may help exclude exogenous DNA (a sketch is shown after the list below). The following read types are currently recognized:

*Single*
    Single-end reads; these are the (trimmed) reads generated from supplying single-end FASTQ files to the pipeline.

*Paired*
    Paired-end reads; these are the (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only the subset of paired reads for which *both* mates were retained, and which were not merged into a single read (if --collapse is set for AdapterRemoval).

*Singleton*
    Paired-end reads; these are (trimmed) reads generated from supplying paired-end FASTQ files to the pipeline, but covering only those reads in which one of the two mates was discarded due to either the `--maxns`, the `--minlength`, or the `--maxlength` options supplied to AdapterRemoval. Consequently, these reads are mapped and PCR-duplicate filtered in single-end mode.

*Collapsed*
    Paired-end reads for which the sequences overlapped, and which were consequently merged by AdapterRemoval into a single sequence (enabled by the --collapse command-line option). These sequences are expected to represent the complete insert, and while they are mapped in single-end mode, PCR duplicate filtering is carried out in a manner that treats these as paired reads. Note that all collapsed reads are tagged by prefixing the read name with 'M\_'.

*CollapsedTruncated*
    Paired-end reads (like *Collapsed*) which were trimmed due to the `--trimqualities` or the `--trimns` command-line options supplied to AdapterRemoval. Consequently, and as these sequences do not represent the entire insert, these reads are mapped and PCR-duplicate filtered in single-end mode. Note that all collapsed, truncated reads are tagged by prefixing the read name with 'MT\_'.

To enable / disable exclusion of a read type, set the value for the appropriate type to 'yes' or 'no' (without quotes)::

    Singleton: no   # Singleton reads are NOT excluded
    Singleton: yes  # Singleton reads are excluded
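Following the short-insert rationale above, a sketch of an 'ExcludeReads' section for a hypothetical ancient DNA project might exclude the non-collapsed paired-end read types while keeping everything else:

.. code-block:: yaml

    ExcludeReads:
      Single: no
      # Exclude non-collapsed paired-end reads, for which the insert size is
      # uncertain, as suggested above for short-insert (ancient) samples
      Paired: yes
      Singleton: yes
      Collapsed: no
      CollapsedTruncated: no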
Optional features
^^^^^^^^^^^^^^^^^

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 89
    :lines: 89-107

This section lists several optional features, in particular determining which BAM files and which summary statistics are generated when running the pipeline. Currently, the following options are available:

*PCRDuplicates*
    This option determines how the BAM pipeline handles PCR duplicates following the mapping of trimmed reads. At present, 3 possible options are available. The first option is 'filter', which corresponds to running Picard MarkDuplicates and 'paleomix rmdup_collapsed' on the input files, and removing any read determined to be a PCR duplicate. The second option, 'mark', functions like the 'filter' option, except that reads are not removed from the output; instead the read flag is marked using the 0x400 bit (see the `SAM/BAM specification`_ for more information), in order to allow downstream tools to identify these as duplicates. The final option is 'no' (without quotes), in which case no PCR duplicate detection / filtering is carried out on the aligned reads, which is useful for data generated using amplification-free sequencing.

*mapDamage*
    The 'mapDamage' option accepts four possible values: 'no', 'plot', 'model', and 'rescale'. The default value ('plot') will cause mapDamage to be run in order to generate simple plots of the *post-mortem* DNA damage rates, as well as base composition plots, and more. If set to 'model', mapDamage will firstly generate the plots described for 'plot', but also construct models of the DNA damage parameters, as described in [Jonsson2013]_. Note that a minimum amount of DNA damage is required to be present in order to build these models. If the option is set to 'rescale', both plots and models will be constructed using mapDamage, and in addition, the quality scores of bases will be downgraded based on how likely they are to represent *post-mortem* DNA damage (see above).

*Coverage*
    If enabled, a table summarizing the number of hits, the number of aligned bases, bases inserted, and bases deleted, as well as the mean coverage, is generated for each reference sequence, stratified by sample, library, and contig.

*Depths*
    If enabled, a table containing a histogram of the depth of coverage, ranging from 0 to 200, is generated for each reference sequence, stratified by sample, library, and contig. These files may further be used by the Phylogenetic pipeline, in order to automatically select a maximum read depth during SNP calling (see the :ref:`phylo_usage` section for more information).

*Summary*
    If enabled, a single summary table will be generated per target, containing information about the number of reads processed, hits and fraction of PCR duplicates (per prefix and per library), and much more.

For a description of where files are placed, refer to the :ref:`bam_filestructure` section. It is possible to run the BAM pipeline without any of these options enabled, and this may be useful in certain cases (if only the statistics or per-library BAMs are needed). To enable / disable a feature, set the value for that feature to 'yes' or 'no' (without quotes)::

    Summary: no   # Do NOT generate a per-target summary table
    Summary: yes  # Generate a per-target summary table
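As a sketch, a hypothetical project where only the filtered BAMs and the summary table are of interest might disable the remaining features as follows:

.. code-block:: yaml

    Features:
      PCRDuplicates: filter
      # Damage plots and coverage / depth statistics are not needed here
      mapDamage: no
      Coverage: no
      Depths: no
      Summary: yes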
Prefixes
--------

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 110
    :lines: 110-126

Reference genomes used for mapping are specified by listing these (one or more) in the 'Prefixes' section. Each reference genome is associated with a name (used in summary statistics and as part of the resulting filenames), and the path to a FASTA file which contains the reference genome. Several other options are also available, but only the name and the 'Path' value are required, as shown here for several examples::

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, an optional label, and an optional
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      MyPrefix1:
        # Path to FASTA file containing reference genome; must end with '.fasta'
        Path: /path/to/genomes/file_1.fasta
      MyPrefix2:
        Path: /path/to/genomes/file_2.fasta
      MyPrefix3:
        Path: /path/to/genomes/AE008922_1.fasta

Each sample in the makefile is mapped against each prefix, and BAM files are generated according to the enabled 'Features' (see above). In addition to the path, it is possible to specify 'RegionsOfInterest', which are described below.

Regions of interest
^^^^^^^^^^^^^^^^^^^

It is possible to specify one or more "regions of interest" for a particular reference genome. Doing so results in coverage and depth tables being generated for those regions (if these features are enabled, see above), as well as additional information in the summary table (if enabled).

Such regions are specified using a BED file containing one or more regions; in particular, the first three columns (contig name, 0-based start coordinate, and 1-based end coordinate) are required, with the 4th column (the region name) being optional. Strand information (the 6th column) is not used, but must still be valid according to the BED format.

Statistics are merged by the names specified in the BED file, or by the contig names if no names were specified. Thus, it is important to ensure that names are unique if individual statistics are desired for every region.

Specifying regions of interest is accomplished by providing a name and a path for each set of regions of interest under the 'RegionsOfInterest' section for a given prefix::

    # Produce additional coverage / depth statistics for a set of
    # regions defined in a BED file; if no names are specified for the
    # BED records, results are named after the chromosome / contig.
    RegionsOfInterest:
      MyRegions: /path/to/my_regions.bed
      MyOtherRegions: /path/to/my_other_regions.bed

The following is a simple example of such a BED file, for an alignment against the rCRS (`NC_012920.1`_)::

    NC_012920_1    3306    4262    region_a
    NC_012920_1    4469    5510    region_b
    NC_012920_1    5903    7442    region_a

In this case, the resulting tables will contain information about two different regions, namely 'region_a' (2495 bp, resulting from merging the two individual regions specified), and 'region_b' (1041 bp). The order of lines in this file does not matter.
Adding multiple prefixes
^^^^^^^^^^^^^^^^^^^^^^^^

In cases where it is necessary to map samples against a large number of reference genomes, it may become impractical to add these to the makefile by hand. It is therefore possible to specify the location of the reference genomes via a path containing wild-cards, and to let the BAM pipeline collect these automatically. For the following example, we assume that we have a folder at '/path/to/genomes' which contains our reference genomes:

.. code-block:: bash

    $ ls /path/to/genomes
    AE000516_2.fasta
    AE004091_2.fasta
    AE008922_1.fasta
    AE008923_1.fasta

To automatically add these four reference genomes to the makefile, we would add a prefix as follows::

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, an optional label, and an optional
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      MyGenomes*:
        # Path to .fasta file containing a set of reference sequences.
        Path: /path/to/genomes/*.fasta

There are two components to this, namely the name of the pseudo-prefix, which *must* end with a star (\*), and the path, which may contain one or more wild-cards. If the prefix name does not end with a star, the BAM pipeline will simply treat the path as a regular path. In this particular case, the BAM pipeline will perform the equivalent of 'ls /path/to/genomes/\*.fasta', and then add each file it has located, using the filename without extensions as the name of the prefix. In other words, the above is equivalent to the following::

    # Map of prefixes by name, each having a Path key, which specifies the
    # location of the BWA/Bowtie2 index, an optional label, and an optional
    # set of regions for which additional statistics are produced.
    Prefixes:
      # Name of the prefix; is used as part of the output filenames
      AE000516_2:
        Path: /path/to/genomes/AE000516_2.fasta
      AE004091_2:
        Path: /path/to/genomes/AE004091_2.fasta
      AE008922_1:
        Path: /path/to/genomes/AE008922_1.fasta
      AE008923_1:
        Path: /path/to/genomes/AE008923_1.fasta

.. note::
    The name provided for the pseudo-prefix (here 'MyGenomes') is not used by the pipeline, and can instead be used to document the nature of the files being included.

.. warning::
    Just like regular prefixes, it is required that the filename of the reference genome ends with '.fasta'. However, the pipeline will attempt to add *any* file found using the provided path with wildcards, and care should therefore be taken to avoid including non-FASTA files. For example, if the path '/path/to/genomes/\*' was used instead of '/path/to/genomes/\*.fasta', this would cause the pipeline to abort due to the inclusion of (for example) non-FASTA index files generated at this location by the pipeline itself.
Targets
-------

.. literalinclude:: makefile.yaml
    :language: yaml
    :linenos:
    :lineno-start: 129
    :lines: 129-

In the BAM pipeline, the term 'Target' is used to refer not to a particular sample (though in typical usage a target includes just one sample), but rather to one or more samples to be processed together, generating one BAM file per prefix (see above). A sample included in a target may likewise contain one or more libraries, for each of which one or more sets of FASTQ reads are specified. The following simplified example, derived from the makefile constructed as part of the :ref:`bam_usage` section, exemplifies this:

.. code-block:: yaml
    :linenos:

    # Target name; all output files use this name as a prefix
    MyFilename:
      # Sample name; used to tag data for segregation in downstream analyses
      MySample:
        # Library name; used to tag data for segregation in downstream analyses
        TGCTCA:
          # Lane / run names and paths to FASTQ files
          Lane_1: data/TGCTCA_L1_*.fastq.gz
          Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

*Target name*
    The first section of this target (line 2, 'MyFilename') constitutes the target name. This name is used as part of summary statistics and, more importantly, determines the first part of the names of files generated as part of the processing of data specified for this target. Thus, in this example, all files and folders generated during the processing of this target will start with 'MyFilename'; for example, the summary table normally generated from running the pipeline will be placed in a file named 'MyFilename.summary'.

*Sample name*
    The subsections listed in the 'Target' section (line 4, 'MySample') constitute the (biological) samples included in this target; in the vast majority of analyses, you will have only a single sample per target, and in that case it is considered good practice to use the same name for both the target and the sample. A single target can, however, contain any number of samples, the data for which are tagged according to the names given in the makefile, using the SAM/BAM read-group ('RG') tags. A sketch of such a multi-sample target is shown below, following this list.

*Library name*
    The subsections listed in the 'Sample' section (line 6, 'TGCTCA') constitute the sequencing libraries constructed during the extraction and library building for the current sample. For modern samples, there is typically only a single library per sample, but more complex sequencing projects (modern and ancient) may involve any number of libraries constructed from one or more extracts. It is very important that libraries be listed correctly (see below).

    .. warning::
        Note that the BAM pipeline imposes the restriction that each library name specified for a target must be unique, even if these are located in two different samples. This restriction may be removed in future versions of PALEOMIX.

*Lane name*
    The subsections of each library are used to specify the names of individual lanes. Each library may include any number of lanes, and each lane may include any number of FASTQ files. Paths listed here may include wildcards (\*) or the special value `{Pair}`, which is used to indicate that the lane is paired-end. During runtime, `{Pair}` is replaced with `1` and `2` before searching using wildcards, and the pipeline expects the resulting filenames to correspond to mate 1 and mate 2 reads, respectively.

In addition to these target (sub)sections, it is possible to specify 'Options' for individual targets, samples, and libraries, similarly to how this is done globally at the top of the makefile. This is described below.

.. warning::
    It is very important that lanes are assigned to their corresponding libraries in the makefile; while it is possible to simply record every sequencing run / lane under a single library and run the pipeline like that, this will result in several unintended side effects: Firstly, the BAM pipeline uses the library information to ensure that PCR duplicates are filtered correctly. Wrongly grouping together lanes will result in the loss of sequences which are not, in fact, PCR duplicates, while wrongly splitting a library into multiple entries will result in PCR duplicates not being correctly identified across these. Furthermore, mapDamage analyses make use of this information to carry out various analyses on a per-library basis, which may similarly be negatively impacted by the incorrect specification of libraries.
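The following is a minimal sketch of a target containing two samples; the sample and library names are hypothetical, and each library name is kept unique across samples, as required by the restriction noted above:

.. code-block:: yaml

    MyProject:
      Sample_A:
        # Library names must be unique, even across different samples
        Library_A1:
          Lane_1: data/sample_a_lib1_R{Pair}_*.fastq.gz
      Sample_B:
        Library_B1:
          Lane_1: data/sample_b_lib1_R{Pair}_*.fastq.gz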
Including already trimmed reads
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In some cases it is useful to include FASTQ reads that have already been trimmed for adapter sequences. While this is not recommended in general, as it may introduce systematic bias if some data has been processed differently than the remaining FASTQ reads, the BAM pipeline makes it simple to incorporate both 'raw' and trimmed FASTQ reads, and to ensure that these integrate in the pipeline.

To include already trimmed reads, these are specified as values belonging to a lane, using the same names for read-types as in the 'ExcludeReads' option (see above). The following minimal example demonstrates this:

.. code-block:: yaml
    :linenos:

    MyFilename:
      MySample:
        ACGATA:
          # Regular lane, containing reads that are not already trimmed
          Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz

          # Lane containing pre-trimmed reads of each type
          Lane_2:
            # Single-end reads
            Single: /path/to/single_end_reads.fastq.gz

            # Paired-end reads where one mate has been discarded
            Singleton: /path/to/singleton_reads.fastq.gz

            # Paired end reads; note that the {Pair} key is required,
            # just like with raw, paired-end reads
            Paired: /path/to/paired_end_{Pair}.fastq.gz

            # Paired-end reads merged into a single sequence
            Collapsed: /path/to/collapsed.fastq.gz

            # Paired-end reads merged into a single sequence, and then truncated
            CollapsedTruncated: /path/to/collapsed_truncated.fastq.gz

The above example shows how each type of reads is to be listed, but it is not necessary to specify more than a single type of pre-trimmed reads in the makefile.

.. note::
    Including already trimmed reads currently results in the absence of some summary statistics in the .summary file, namely the number of raw reads, as well as trimming statistics, since the BAM pipeline currently relies on AdapterRemoval to collect these statistics.

Overriding global settings
^^^^^^^^^^^^^^^^^^^^^^^^^^

In addition to the main 'Options' section, it is possible to override options at the target, sample, and library level. This allows, for example, different adapter sequences to be specified for each library generated for a sample, or options that should only be applied to a particular sample among several included in a makefile. The following demonstration uses the makefile constructed as part of the :ref:`bam_usage` section as the base:

.. code-block:: yaml
    :linenos:
    :emphasize-lines: 2-7, 10-14, 20-23

    MyFilename:
      # These options apply to all samples with this filename
      Options:
        # In this example, we override the default adapter sequences
        AdapterRemoval:
          --adapter1: AGATCGGAAGAGC
          --adapter2: AGATCGGAAGAGC

      MySample:
        # These options apply to libraries 'ACGATA', 'GCTCTG', and 'TGCTCA'
        Options:
          # In this example, we assume that FASTQ files for our libraries
          # include Phred quality scores encoded with offset 64.
          QualityOffset: 64

        ACGATA:
          Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz

        GCTCTG:
          # These options apply to 'Lane_1' in the 'GCTCTG' library
          Options:
            # It is possible to override options we have previously overridden
            QualityOffset: 33

          Lane_1: data/GCTCTG_L1_*.fastq.gz

        TGCTCA:
          Lane_1: data/TGCTCA_L1_*.fastq.gz
          Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

In this example, we have overridden options in 3 places:

* The first place (lines 2 - 7) will be applied to *all* samples, libraries, and lanes in this target, unless subsequently overridden. In this example, we have set a new pair of adapter sequences, which we wish to use for these data.

* The second place (lines 10 - 14) applies to the sample 'MySample' that we have included in this target, and consequently applies to all libraries specified for this sample ('ACGATA', 'GCTCTG', and 'TGCTCA'). In most cases you will only have a single sample, and so it will not make a difference whether or not you override options for the entire target (e.g. lines 3 - 8) or just for that sample (e.g. lines 11 - 15).

* Finally, the third place (lines 20 - 23) demonstrates how options can be overridden for a particular library. In this example, we have chosen to override an option (for this library only!) that we previously overrode for that sample (the 'QualityOffset' option).

.. note::
    It is currently not possible to override options for a single lane; it is only possible to override options for all lanes in a library.

.. warning::
    Only the `mapDamage` and the `PCRDuplicates` features can be overridden for individual targets, samples, or libraries. Other features can be overridden per target, but not per sample or per library.
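As an illustration of the latter warning, a per-library override of the 'PCRDuplicates' feature might be sketched as follows; the library name is hypothetical, and is assumed to derive from amplification-free sequencing:

.. code-block:: yaml

    MyFilename:
      MySample:
        ACGATA:
          Options:
            Features:
              # This (hypothetical) library was not amplified, so no PCR
              # duplicate filtering is carried out for it
              PCRDuplicates: no

          Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz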
.. _AdapterRemoval documentation: https://github.com/MikkelSchubert/adapterremoval
.. _Bowtie2 documentation: http://bowtie-bio.sourceforge.net/bowtie2/manual.shtml
.. _NC_012920.1: http://www.ncbi.nlm.nih.gov/nuccore/251831106
.. _Phred quality-scores: https://en.wikipedia.org/wiki/FASTQ_format#Quality
.. _SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
.. _seqtk: https://github.com/lh3/seqtk

paleomix-1.3.8/docs/bam_pipeline/makefile.yaml

# -*- mode: Yaml; -*-
# Default options.
# Can also be specific for a set of samples, libraries, and lanes,
# by including the "Options" hierarchy at the same level as those
# samples, libraries, or lanes below.
Options:
  # Sequencing platform, see SAM/BAM reference for valid values
  Platform: Illumina
  # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
  # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
  # specify 'Solexa', to handle reads on the Solexa scale. This is
  # used during adapter-trimming and sequence alignment
  QualityOffset: 33

  # Settings for trimming of reads, see AdapterRemoval man-page
  AdapterRemoval:
    # Set and uncomment to override defaults adapter sequences
    # --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
    # --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
    # Some BAM pipeline defaults differ from AR defaults;
    # To override, change these value(s):
    --mm: 3
    --minlength: 25
    # Extra features enabled by default; change 'yes' to 'no' to disable
    --collapse: yes
    --trimns: yes
    --trimqualities: yes

  # Settings for aligners supported by the pipeline
  Aligners:
    # Choice of aligner software to use, either "BWA" or "Bowtie2"
    Program: BWA

    # Settings for mappings performed using BWA
    BWA:
      # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
      # for a description of each algorithm (defaults to 'backtrack')
      Algorithm: backtrack
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
      # Post-mortem damage localizes to the seed region, which BWA expects to
      # have few errors (sets "-l"). See http://pmid.us/22574660
      UseSeed: yes
      # Additional command-line options may be specified below. For 'backtrack' these
      # are applied to the "bwa aln". See Bowtie2 for more examples.
      # -n: 0.04

    # Settings for mappings performed using Bowtie2
    Bowtie2:
      # Filter aligned reads with a mapping quality (Phred) below this value
      MinQuality: 0
      # Filter reads that did not map to the reference sequence
      FilterUnmappedReads: yes
      # Examples of how to add additional command-line options
      # --trim5: 5
      # --trim3: 5
      # Note that the colon is required, even if no value is specified
      --very-sensitive:
      # Example of how to specify multiple values for an option
      # --rg:
      #   - CN:SequencingCenterNameHere
      #   - DS:DescriptionOfReadGroup

  # Command-line options for mapDamage; use long-form options(--length not -l):
  mapDamage:
    # By default, the pipeline will downsample the input to 100k hits
    # when running mapDamage; remove to use all hits
    --downsample: 100000

  # Set to 'yes' exclude a type of trimmed reads from alignment / analysis;
  # possible read-types reflect the output of AdapterRemoval
  ExcludeReads:
    # Exclude single-end reads (yes / no)?
    Single: no
    # Exclude non-collapsed paired-end reads (yes / no)?
    Paired: no
    # Exclude paired-end reads for which the mate was discarded (yes / no)?
    Singleton: no
    # Exclude overlapping paired-ended reads collapsed into a single sequence
    # by AdapterRemoval (yes / no)?
    Collapsed: no
    # Like 'Collapsed', but only for collapsed reads truncated due to the
    # presence of ambiguous or low quality bases at read termini (yes / no).
    CollapsedTruncated: no

  # Optional steps to perform during processing.
  Features:
    # If set to 'filter', PCR duplicates are removed from the output files; if set to
    # 'mark', PCR duplicates are flagged with bit 0x400, and not removed from the
    # output files; if set to 'no', the reads are assumed to not have been amplified.
    PCRDuplicates: filter
    # Set to 'no' to disable mapDamage; set to 'plots' to build basic mapDamage plots;
    # set to 'model' to build plots and post-mortem damage models; and set to 'rescale'
    # to build plots, models, and BAMs with rescaled quality scores. All analyses are
    # carried out per library.
    mapDamage: plot
    # Generate coverage information for the final BAM and for each 'RegionsOfInterest'
    # specified in 'Prefixes' (yes / no).
    Coverage: yes
    # Generate histograms of number of sites with a given read-depth, from 0 to 200,
    # for each BAM and for each 'RegionsOfInterest' specified in 'Prefixes' (yes / no).
    Depths: yes
    # Generate summary table for each target (yes / no)
    Summary: yes

# Map of prefixes by name, each having a Path, which specifies the location of the
# BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
Prefixes:
  # Replace 'NAME_OF_PREFIX' with name of the prefix; this name is used in summary
  # statistics and as part of output filenames.
  NAME_OF_PREFIX:
    # Replace 'PATH_TO_PREFIX' with the path to .fasta file containing the references
    # against which reads are to be mapped. Using the same name as filename is strongly
    # recommended (e.g. /path/to/Human_g1k_v37.fasta should be named 'Human_g1k_v37').
    Path: PATH_TO_PREFIX.fasta

    # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a .bed file
    # listing extra regions for which coverage / depth statistics should be calculated;
    # if no names are specified for the BED records, results are named after the
    # chromosome / contig. Replace 'NAME' with the desired name for these regions.
    # RegionsOfInterest:
    #   NAME: PATH_TO_BEDFILE

# Mapping targets are specified using the following structure. Replace 'NAME_OF_TARGET'
# with the desired prefix for filenames.
NAME_OF_TARGET:
  # Replace 'NAME_OF_SAMPLE' with the name of this sample.
  NAME_OF_SAMPLE:
    # Replace 'NAME_OF_LIBRARY' with the name of this library.
    NAME_OF_LIBRARY:
      # Replace 'NAME_OF_LANE' with the lane name (e.g. the barcode) and replace
      # 'PATH_WITH_WILDCARDS' with the path to the FASTQ files to be trimmed and mapped
      # for this lane (may include wildcards).
      NAME_OF_LANE: PATH_WITH_WILDCARDS
   3. The BAM is sorted using `samtools sort`, indexed using `samtools index`, and validated using Picard `ValidateSamFile`.

   4. Finally, the records are updated using `samtools calmd` to ensure consistent reporting of the number of mismatches relative to the reference genome (BAM tag 'NM').

4. Filtering of duplicates, recalculation (rescaling) of quality scores, and validation

   1. If enabled, PCR duplicates are filtered using Picard `MarkDuplicates` for SE and PE reads, and using `paleomix rmdup_collapsed` for collapsed reads (see the :ref:`other_tools` section). PCR filtering is carried out per library.

   2. If mapDamage-based rescaling of quality scores is enabled, quality scores of bases that are potentially the result of *post-mortem* DNA damage are recalculated using a damage model built using mapDamage2.0 [Jonsson2013]_.

   3. The resulting BAMs are indexed and validated using Picard `ValidateSamFile`. Mapped reads at each position of the alignments are compared using the query name, sequence, and qualities; if a match is found, it is assumed to represent a duplication of input data (see :ref:`troubleshooting_bam`).

5. Generation of final BAMs

   Each BAM from the previous step is merged into a final BAM file.

6. Statistics

   1. If the `Summary` feature is enabled, a single summary table is generated for each target. This table summarizes the input data in terms of the raw number of reads, the number of reads following filtering / collapsing, the fraction of reads mapped to each prefix, the fraction of reads filtered as duplicates, and more.

   2. Coverage statistics and depth histograms are calculated for the intermediate and final BAM files using `paleomix coverage` and `paleomix depths`, if enabled. Statistics are calculated genome-wide and for any regions of interest specified by the user.

   3. If mapDamage is enabled, mapDamage plots are generated; if modeling or rescaling is enabled, a model of the post-mortem DNA damage is also generated.

paleomix-1.3.8/docs/bam_pipeline/requirements.rst

.. highlight:: Bash

.. _bam_requirements:

Software requirements
=====================

In addition to the requirements listed in the :ref:`installation` section, the BAM pipeline requires several other pieces of software. The version numbers indicate the oldest supported version of each program:

* `AdapterRemoval`_ v2.2.0 [Schubert2016]_
* `SAMTools`_ v1.3.1 [Li2009b]_
* `Picard Tools`_ v1.137

The Picard Tools JAR-file (`picard.jar`) is expected to be located in `~/install/jar_root` by default, but this behavior may be changed using either the `--jar-root` command-line option, or via the global configuration file (see section :ref:`bam_configuration`)::

    $ mkdir -p ~/install/jar_root
    $ wget -O ~/install/jar_root/picard.jar https://github.com/broadinstitute/picard/releases/download/2.23.3/picard.jar

Running Picard requires a Java Runtime Environment (i.e. the `java` command). Please refer to your distro's documentation for how to install a JRE.

Furthermore, one or both of the following sequence aligners must be installed:

* `Bowtie2`_ v2.3.0 [Langmead2012]_
* `BWA`_ v0.7.15 [Li2009a]_

mapDamage is required by default, but can be disabled on a per-project basis:

* `mapDamage`_ v2.2.1 [Jonsson2013]_

If mapDamage is used to perform rescaling of post-mortem DNA damage, then the GNU Scientific Library (GSL) and the R packages listed in the mapDamage installation instructions are required; these include `inline`, `gam`, `Rcpp`, `RcppGSL`, and `ggplot2`.
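If any of these R packages are missing, they can typically be installed from within R. The following one-liner is a minimal sketch, assuming that R itself and the GSL development files have already been installed (the package names are those listed above)::

    $ Rscript -e 'install.packages(c("inline", "gam", "Rcpp", "RcppGSL", "ggplot2"))'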
Use the following commands to verify that these packages have been correctly installed::

    $ gsl-config
    Usage: gsl-config [OPTION]
    ...

    $ mapDamage --check-R-packages
    All R packages are present

On Debian-based systems, most of these dependencies can be installed using the following command::

    $ sudo apt-get install adapterremoval samtools bowtie2 bwa mapdamage

Testing the pipeline
--------------------

An example project is included with the BAM pipeline, and it is recommended to run this project in order to verify that the pipeline and required applications have been correctly installed. See the :ref:`examples` section for a description of how to run this example project.

.. note::
    The example project does not carry out rescaling using mapDamage by default. If you wish to test that the requirements for the mapDamage rescaling feature have been installed correctly, then change the following line

    .. code-block:: yaml

        mapDamage: plot

    to

    .. code-block:: yaml

        mapDamage: rescale

In case of errors, please consult the :ref:`troubleshooting` section.

.. _AdapterRemoval: https://github.com/MikkelSchubert/adapterremoval
.. _Bowtie2: http://bowtie-bio.sourceforge.net/bowtie2/
.. _BWA: http://bio-bwa.sourceforge.net/
.. _mapDamage: http://ginolhac.github.io/mapDamage/
.. _SAMTools: https://samtools.github.io
.. _Picard Tools: http://broadinstitute.github.io/picard/

paleomix-1.3.8/docs/bam_pipeline/usage.rst

.. highlight:: Yaml

.. _bam_usage:

Pipeline usage
==============

The following describes, step by step, the process of setting up a project for mapping FASTQ reads against a reference sequence using the BAM pipeline. For a detailed description of the configuration file (makefile) used by the BAM pipeline, please refer to the section :ref:`bam_makefile`, and for a detailed description of the files generated by the pipeline, please refer to the section :ref:`bam_filestructure`.

The BAM pipeline is invoked using the `paleomix` command, which offers access to the pipelines and to other tools included with PALEOMIX (see section :ref:`other_tools`). For the purpose of these instructions, we will make use of a tiny FASTQ data set included with the PALEOMIX pipeline, consisting of synthetic FASTQ reads simulated against the human mitochondrial genome.

To follow along, first create a local copy of the example data-set:

.. code-block:: bash

    $ paleomix bam example .

This will create a folder named `bam_pipeline` in the current folder, which contains the example FASTQ reads and a 'makefile' showcasing various features of the BAM pipeline (`makefile.yaml`). We will make use of a subset of the data, but we will not make use of the makefile.

The data we will use consists of 3 simulated ancient DNA libraries (independent amplifications), for which either one or two lanes have been simulated:

+-------------+------+------+-----------------------------+
| Library     | Lane | Type | Files                       |
+-------------+------+------+-----------------------------+
| ACGATA      | 1    | PE   | data/ACGATA\_L1\_*.fastq.gz |
+-------------+------+------+-----------------------------+
| GCTCTG      | 1    | SE   | data/GCTCTG\_L1\_*.fastq.gz |
+-------------+------+------+-----------------------------+
| TGCTCA      | 1    | SE   | data/TGCTCA\_L1\_*.fastq.gz |
+-------------+------+------+-----------------------------+
|             | 2    | PE   | data/TGCTCA\_L2\_*.fastq.gz |
+-------------+------+------+-----------------------------+
.. warning::
    The BAM pipeline largely relies on the existence of final and intermediate files in order to detect if a given analytical step has been carried out. Changes made to a makefile after the pipeline has already been run (even if not run to completion) may therefore not cause analytical steps affected by these changes to be re-run. If changes are to be made at such a point, it is typically necessary to manually remove affected intermediate files before running the pipeline again. See the section :ref:`bam_filestructure` for more information about the layout of files generated by the pipeline.

Creating a makefile
-------------------

As described in the :ref:`introduction`, the BAM pipeline operates based on 'makefiles', which serve to specify the location and structure of input data (samples, libraries, lanes, etc.), which tasks are to be run, and which settings are to be used. The makefiles are written using the human-readable YAML format, which may be edited using any regular text editor.

For a brief introduction to the YAML format, please refer to the :ref:`yaml_intro` section, and for a detailed description of the BAM pipeline makefile, please refer to section :ref:`bam_makefile`.

To start a new project, we must first generate a makefile template using the following command, which for the purpose of this tutorial we place in the example folder:

.. code-block:: bash

    $ cd bam_pipeline/
    $ paleomix bam makefile > makefile.yaml

Once you open the resulting file (`makefile.yaml`) in your text editor of choice, you will find that BAM pipeline makefiles are split into three major sections, representing 1) the default options; 2) the reference genomes against which reads are to be mapped; and 3) the input files for the samples which are to be processed. In a typical project, we will need to review the default options, add one or more reference genomes which we wish to target, and list the input data to be processed.

Default options
^^^^^^^^^^^^^^^

The makefile starts with an `Options` section, which is applied to every set of input-files in the makefile unless explicitly overwritten for a given sample (this is described in the :ref:`bam_makefile` section). For the most part, the default values should be suitable for any given project, but special attention should be paid to the following options (double colons are used to separate subsections):

**Options \:\: Platform**
    The sequencing platform used to generate the sequencing data; this information is recorded in the resulting BAM file, and may be used by downstream tools. The `SAM/BAM specification`_ lists the valid platforms, which currently include `CAPILLARY`, `HELICOS`, `ILLUMINA`, `IONTORRENT`, `LS454`, `ONT`, `PACBIO`, and `SOLID`.

**Options \:\: QualityOffset**
    The QualityOffset option refers to the starting ASCII value used to encode `Phred quality-scores`_ in user-provided FASTQ files, with the possible values of 33, 64, and `Solexa`. For most modern data, this will be 33, corresponding to ASCII characters in the range `!` to `J`. Older data is often encoded using the offset 64, corresponding to ASCII characters in the range `@` to `h`, and more rarely using Solexa quality-scores, which represent a different scheme than Phred scores, and which occupy the range of ASCII values from `;` to `h`. For a visual representation of this, refer to the `Phred quality-scores`_ page.
.. warning::
    By default, the adapter trimming software used by PALEOMIX expects quality-scores no greater than 41, corresponding to the ASCII character `J` when encoded using offset 33. If the input-data contains quality-scores greater than this value, then it is necessary to specify the maximum value using the `--qualitymax` command-line option. See below.

.. warning::
    Presently, quality-offsets other than 33 are not supported when using the BWA `mem` or the BWA `bwasw` algorithms. To use these algorithms with quality-offset 64 data, it is therefore necessary to first convert these data to offset 33. This can be accomplished using the `seqtk`_ tool.

**Options \:\: AdapterRemoval \:\: --adapter1** and **Options \:\: AdapterRemoval \:\: --adapter2**
    These two options are used to specify the adapter sequences used to identify and trim reads that contain adapter contamination using AdapterRemoval. Thus, the sequence provided for `--adapter1` is expected to be found in the mate 1 reads, and the sequence specified for `--adapter2` is expected to be found in the mate 2 reads. In both cases, these should be specified in the orientation in which they appear in these files (i.e. it should be possible to grep the files for these sequences, assuming that the reads were long enough, if you treat Ns as wildcards).

.. warning::
    It is very important that the correct adapter sequences are used. Please refer to the `AdapterRemoval documentation`_ for more information and for help identifying the adapters for paired-end reads.

**Aligners \:\: Program**
    The short read alignment program to use to map the (trimmed) reads to the reference genome. Currently, users may choose between `BWA` and `Bowtie2`, with additional options available for each program.

**Aligners \:\: \* \:\: MinQuality**
    The minimum mapping quality of hits to retain during the mapping process. If this option is set to a non-zero value, any hits with a mapping quality below this value are removed from the resulting BAM file (this option does not apply to unmapped reads). If the final BAM should contain all reads in the input files, this option must be set to 0, and the `FilterUnmappedReads` option set to `no`.

**Aligners \:\: BWA \:\: UseSeed**
    Enable/disable the use of a seed region when mapping reads using the BWA `backtrack` alignment algorithm (the default). Disabling this option may yield some improvements in the alignment of highly damaged ancient DNA, at the cost of significantly increasing the running time. As such, this option is not recommended for modern samples [Schubert2012]_.

For the purpose of the example project, we need only change a few options. Since the reads were simulated using a Phred score offset of 33, there is no need to change the `QualityOffset` option, and since the simulated adapter sequences match the adapters that AdapterRemoval searches for by default, we do not need to set either of `--adapter1` or `--adapter2`. We will use the default mapping program (BWA) and algorithm (`backtrack`), but change the minimum mapping quality to 30 (corresponding to an error probability of 0.001). Changing the minimum quality is accomplished by locating the `Aligners` section of the makefile, and changing the `MinQuality` value from 0 to 30 (line 40):
.. code-block:: yaml
    :emphasize-lines: 12
    :linenos:
    :lineno-start: 29

      # Settings for aligners supported by the pipeline
      Aligners:
        # Choice of aligner software to use, either "BWA" or "Bowtie2"
        Program: BWA

        # Settings for mappings performed using BWA
        BWA:
          # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
          # for a description of each algorithm (defaults to 'backtrack')
          Algorithm: backtrack
          # Filter aligned reads with a mapping quality (Phred) below this value
          MinQuality: 30
          # Filter reads that did not map to the reference sequence
          FilterUnmappedReads: yes
          # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
          # Post-mortem damage localizes to the seed region, which BWA expects to
          # have few errors (sets "-l"). See http://pmid.us/22574660
          UseSeed: yes

Since the data we will be mapping represents (simulated) ancient DNA, we will furthermore set the `UseSeed` option to `no` (line 46), in order to recover a small additional amount of alignments during mapping (see [Schubert2012]_):

.. code-block:: yaml
    :emphasize-lines: 18
    :linenos:
    :lineno-start: 29

      # Settings for aligners supported by the pipeline
      Aligners:
        # Choice of aligner software to use, either "BWA" or "Bowtie2"
        Program: BWA

        # Settings for mappings performed using BWA
        BWA:
          # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
          # for a description of each algorithm (defaults to 'backtrack')
          Algorithm: backtrack
          # Filter aligned reads with a mapping quality (Phred) below this value
          MinQuality: 30
          # Filter reads that did not map to the reference sequence
          FilterUnmappedReads: yes
          # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
          # Post-mortem damage localizes to the seed region, which BWA expects to
          # have few errors (sets "-l"). See http://pmid.us/22574660
          UseSeed: no

Once this is done, we can proceed to specify the location of the reference genome(s) that we wish to map our reads against.

Reference genomes (prefixes)
----------------------------

Mapping is carried out using one or more reference genomes (or other sequences) in the form of FASTA files, which are indexed for use in read mapping (automatically, by the pipeline) using either the `bwa index` or `bowtie2-build` commands. Since the sequence alignment indexes are generated at the location of these files, reference genomes are also referred to as "prefixes" in the documentation. In other words, using BWA as an example, the PALEOMIX pipeline will generate an index (prefix) of the reference genome using a command corresponding to the following:

.. code-block:: bash

    $ bwa index prefixes/my_genome.fasta

In addition to the BWA / Bowtie2 index, several other related files are also automatically generated, including a FASTA index file (`.fai`), which are required for various operations of the pipeline. These are similarly located in the same folder as the reference FASTA file. For a more detailed description, please refer to the :ref:`bam_filestructure` section.

.. warning::
    Since the pipeline automatically carries out indexing of the FASTA files, it requires write-access to the folder containing the FASTA files. If this is not possible, one may simply create a local folder containing symbolic links to the original FASTA file(s), and point the makefile to this location. All automatically generated files will then be placed in this location.

Specifying which FASTA file to align sequences against is accomplished by listing these in the `Prefixes` section of the makefile.
For example, assuming that we had a FASTA file named `my_genome.fasta` located in the `my_prefixes` folder, the following might be used::

    Prefixes:
      my_genome:
        Path: my_prefixes/my_genome.fasta

The name of the prefix (here `my_genome`) will be used to name the resulting files and in the various tables that are generated by the pipeline. Typical names include `hg19`, `EquCab20`, and other standard abbreviations for reference genomes, accession numbers, and the like. Multiple prefixes can be specified, but each name MUST be unique::

    Prefixes:
      my_genome:
        Path: my_prefixes/my_genome.fasta
      my_other_genome:
        Path: my_prefixes/my_other_genome.fasta

In the case of this example project, we will be mapping our data against the revised Cambridge Reference Sequence (rCRS) for the human mitochondrial genome, which is included in the examples folder under `prefixes`, as a file named `rCRS.fasta`. To add it to the makefile, locate the `Prefixes` section located below the `Options` section, and update it as described above (lines 115 and 119):

.. code-block:: yaml
    :emphasize-lines: 6,10
    :linenos:
    :lineno-start: 110

    # Map of prefixes by name, each having a Path, which specifies the location of the
    # BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
    Prefixes:
      # Replace 'NAME_OF_PREFIX' with the name of the prefix; this name is used in summary
      # statistics and as part of output filenames.
      rCRS:
        # Replace 'PATH_TO_PREFIX' with the path to the .fasta file containing the references
        # against which reads are to be mapped. Using the same name as the filename is strongly
        # recommended (e.g. /path/to/Human_g1k_v37.fasta should be named 'Human_g1k_v37').
        Path: prefixes/rCRS.fasta

Once this is done, we may specify the input data that we want the pipeline to process.

Specifying read data
--------------------

A single makefile may be used to process one or more samples, and to generate one or more BAM files and supplementary statistics. In this project we will only deal with a single sample, which we accomplish by creating our own section at the end of the makefile.

The first step is to determine the name for the files generated by the BAM pipeline. Specifically, we will specify a name which is prefixed to all output generated for our sample (here named `MyFilename`), by adding the following line to the end of the makefile:

.. code-block:: yaml
    :linenos:
    :lineno-start: 129

    # You can also add comments like these to document your experiment
    MyFilename:

This first name, or grouping, is referred to as the target, and typically corresponds to the name of the sample being processed, though any name may do. The actual sample-name is specified next (it is possible, but uncommon, for a single target to contain multiple samples); it is used in tables of summary statistics and is recorded in the resulting BAM files. This is accomplished by adding another line below the target name:

.. code-block:: yaml
    :linenos:
    :lineno-start: 129

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:

Similarly, we need to specify the name of each library in our data set. By convention, I often use the index used to construct the library as the library name (which allows for easy identification), but any name may be used for a library, provided that it is unique to that sample.
As described near the start of this document, we are dealing with 3 libraries:

+-------------+------+------+-----------------------------+
| Library     | Lane | Type | Files                       |
+-------------+------+------+-----------------------------+
| ACGATA      | 1    | PE   | data/ACGATA\_L1\_*.fastq.gz |
+-------------+------+------+-----------------------------+
| GCTCTG      | 1    | SE   | data/GCTCTG\_L1\_*.fastq.gz |
+-------------+------+------+-----------------------------+
| TGCTCA      | 1    | SE   | data/TGCTCA\_L1\_*.fastq.gz |
+-------------+------+------+-----------------------------+
|             | 2    | PE   | data/TGCTCA\_L2\_*.fastq.gz |
+-------------+------+------+-----------------------------+

It is important to correctly specify the libraries, since the pipeline will not only use this information for summary statistics and record it in the resulting BAM files, but will also carry out filtering of PCR duplicates (and other analyses) on a per-library basis. Wrongly grouping together data will therefore result either in the loss of useful alignments wrongly identified as PCR duplicates, or in the inclusion of reads that should have been filtered as PCR duplicates.

The library names are added below the name of the sample (`MySample`), in a similar manner to the sample itself:

.. code-block:: yaml
    :linenos:
    :lineno-start: 129

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:
        GCTCTG:
        TGCTCA:

The final step involves specifying the location of the raw FASTQ reads that should be processed for each library, and consists of specifying one or more "lanes" of reads, each of which must be given a unique name. For single-end reads, this is accomplished simply by providing a path (with optional wildcards) to the location of the file(s). For example, for lane 1 of library GCTCTG, the files are located at `data/GCTCTG\_L1\_*.fastq.gz`:

.. code-block:: bash

    $ ls data/GCTCTG_L1_*.fastq.gz
    data/GCTCTG_L1_R1_01.fastq.gz
    data/GCTCTG_L1_R1_02.fastq.gz
    data/GCTCTG_L1_R1_03.fastq.gz

We simply specify these paths for each of the single-end lanes, here using the lane number to name these (as above, this name is used to tag the data in the resulting BAM file):

.. code-block:: yaml
    :linenos:
    :lineno-start: 129

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:
        GCTCTG:
          Lane_1: data/GCTCTG_L1_*.fastq.gz
        TGCTCA:
          Lane_1: data/TGCTCA_L1_*.fastq.gz

Specifying the location of paired-end data is slightly more complex, since the pipeline needs to be able to locate both files in a pair. This is accomplished by making the assumption that paired-end files are numbered as either mate 1 or mate 2, as shown here for 4 pairs of files with the common _R1 and _R2 labels:

.. code-block:: bash

    $ ls data/ACGATA_L1_*.fastq.gz
    data/ACGATA_L1_R1_01.fastq.gz
    data/ACGATA_L1_R1_02.fastq.gz
    data/ACGATA_L1_R1_03.fastq.gz
    data/ACGATA_L1_R1_04.fastq.gz
    data/ACGATA_L1_R2_01.fastq.gz
    data/ACGATA_L1_R2_02.fastq.gz
    data/ACGATA_L1_R2_03.fastq.gz
    data/ACGATA_L1_R2_04.fastq.gz

Knowing that the files contain a number specifying which file in a pair they correspond to, we can then construct a path that includes the keyword `{Pair}` in place of that number. For the above example, that path would be `data/ACGATA\_L1\_R{Pair}_*.fastq.gz` (corresponding to `data/ACGATA\_L1\_R[12]_*.fastq.gz`):
.. code-block:: yaml
    :linenos:
    :lineno-start: 129

    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:
          Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
        GCTCTG:
          Lane_1: data/GCTCTG_L1_*.fastq.gz
        TGCTCA:
          Lane_1: data/TGCTCA_L1_*.fastq.gz
          Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

.. note::
    While the paths given here are relative to the location where the pipeline is run, it is also possible to provide absolute paths, should the files be located in an entirely different location.

.. note::
    At the time of writing, the PALEOMIX pipeline supports uncompressed, gzipped, and bzipped FASTQ reads. It is not necessary to use any particular file extension for these, as the compression method (if any) is detected automatically.

The final makefile
------------------

Once we've completed the steps described above, the resulting makefile should look like the following, shown here with the modifications that we've made highlighted:

.. code-block:: yaml
    :emphasize-lines: 40,46,115,119,129-
    :linenos:

    # -*- mode: Yaml; -*-
    # Default options.
    # Can also be specific for a set of samples, libraries, and lanes,
    # by including the "Options" hierarchy at the same level as those
    # samples, libraries, or lanes below.
    Options:
      # Sequencing platform, see SAM/BAM reference for valid values
      Platform: Illumina
      # Quality offset for Phred scores, either 33 (Sanger/Illumina 1.8+)
      # or 64 (Illumina 1.3+ / 1.5+). For Bowtie2 it is also possible to
      # specify 'Solexa', to handle reads on the Solexa scale. This is
      # used during adapter-trimming and sequence alignment
      QualityOffset: 33

      # Settings for trimming of reads, see AdapterRemoval man-page
      AdapterRemoval:
        # Set and uncomment to override default adapter sequences
        # --adapter1: AGATCGGAAGAGCACACGTCTGAACTCCAGTCACNNNNNNATCTCGTATGCCGTCTTCTGCTTG
        # --adapter2: AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT
        # Some BAM pipeline defaults differ from AR defaults;
        # To override, change these value(s):
        --mm: 3
        --minlength: 25
        # Extra features enabled by default; change 'yes' to 'no' to disable
        --collapse: yes
        --trimns: yes
        --trimqualities: yes

      # Settings for aligners supported by the pipeline
      Aligners:
        # Choice of aligner software to use, either "BWA" or "Bowtie2"
        Program: BWA

        # Settings for mappings performed using BWA
        BWA:
          # One of "backtrack", "bwasw", or "mem"; see the BWA documentation
          # for a description of each algorithm (defaults to 'backtrack')
          Algorithm: backtrack
          # Filter aligned reads with a mapping quality (Phred) below this value
          MinQuality: 30
          # Filter reads that did not map to the reference sequence
          FilterUnmappedReads: yes
          # May be disabled ("no") for aDNA alignments with the 'aln' algorithm.
          # Post-mortem damage localizes to the seed region, which BWA expects to
          # have few errors (sets "-l"). See http://pmid.us/22574660
          UseSeed: no
          # Additional command-line options may be specified below. For 'backtrack'
          # these are applied to the "bwa aln" call. See Bowtie2 for more examples.
          # -n: 0.04

        # Settings for mappings performed using Bowtie2
        Bowtie2:
          # Filter aligned reads with a mapping quality (Phred) below this value
          MinQuality: 0
          # Filter reads that did not map to the reference sequence
          FilterUnmappedReads: yes
          # Examples of how to add additional command-line options
          # --trim5: 5
          # --trim3: 5
          # Note that the colon is required, even if no value is specified
          --very-sensitive:
          # Example of how to specify multiple values for an option
          # --rg:
          #   - CN:SequencingCenterNameHere
          #   - DS:DescriptionOfReadGroup

      # Command-line options for mapDamage; use long-form options (--length, not -l):
      mapDamage:
        # By default, the pipeline will downsample the input to 100k hits
        # when running mapDamage; remove to use all hits
        --downsample: 100000

      # Set to 'yes' to exclude a type of trimmed reads from alignment / analysis;
      # possible read-types reflect the output of AdapterRemoval
      ExcludeReads:
        # Exclude single-end reads (yes / no)?
        Single: no
        # Exclude non-collapsed paired-end reads (yes / no)?
        Paired: no
        # Exclude paired-end reads for which the mate was discarded (yes / no)?
        Singleton: no
        # Exclude overlapping paired-ended reads collapsed into a single sequence
        # by AdapterRemoval (yes / no)?
        Collapsed: no
        # Like 'Collapsed', but only for collapsed reads truncated due to the
        # presence of ambiguous or low quality bases at read termini (yes / no).
        CollapsedTruncated: no

      # Optional steps to perform during processing.
      Features:
        # If set to 'filter', PCR duplicates are removed from the output files; if set to
        # 'mark', PCR duplicates are flagged with bit 0x400, and not removed from the
        # output files; if set to 'no', the reads are assumed to not have been amplified.
        PCRDuplicates: filter
        # Set to 'no' to disable mapDamage; set to 'plot' to build basic mapDamage plots;
        # set to 'model' to build plots and post-mortem damage models; and set to 'rescale'
        # to build plots, models, and BAMs with rescaled quality scores. All analyses are
        # carried out per library.
        mapDamage: plot
        # Generate coverage information for the final BAM and for each 'RegionsOfInterest'
        # specified in 'Prefixes' (yes / no).
        Coverage: yes
        # Generate histograms of the number of sites with a given read-depth, from 0 to 200,
        # for each BAM and for each 'RegionsOfInterest' specified in 'Prefixes' (yes / no).
        Depths: yes
        # Generate summary table for each target (yes / no)
        Summary: yes


    # Map of prefixes by name, each having a Path, which specifies the location of the
    # BWA/Bowtie2 index, and optional regions for which additional statistics are produced.
    Prefixes:
      # Replace 'NAME_OF_PREFIX' with the name of the prefix; this name is used in summary
      # statistics and as part of output filenames.
      rCRS:
        # Replace 'PATH_TO_PREFIX' with the path to the .fasta file containing the references
        # against which reads are to be mapped. Using the same name as the filename is strongly
        # recommended (e.g. /path/to/Human_g1k_v37.fasta should be named 'Human_g1k_v37').
        Path: prefixes/rCRS.fasta

        # (Optional) Uncomment and replace 'PATH_TO_BEDFILE' with the path to a .bed file
        # listing extra regions for which coverage / depth statistics should be calculated;
        # if no names are specified for the BED records, results are named after the
        # chromosome / contig. Replace 'NAME' with the desired name for these regions.
        # RegionsOfInterest:
        #   NAME: PATH_TO_BEDFILE


    # You can also add comments like these to document your experiment
    MyFilename:
      MySample:
        ACGATA:
          Lane_1: data/ACGATA_L1_R{Pair}_*.fastq.gz
        GCTCTG:
          Lane_1: data/GCTCTG_L1_*.fastq.gz
        TGCTCA:
          Lane_1: data/TGCTCA_L1_*.fastq.gz
          Lane_2: data/TGCTCA_L2_R{Pair}_*.fastq.gz

With this makefile in hand, the pipeline may be executed using the following command:

.. code-block:: bash

    $ paleomix bam run makefile.yaml

The pipeline will run as many simultaneous processes as there are cores in the current system, but this behavior may be changed using the `--max-threads` command-line option. Use the `--help` command-line option to view additional options available when running the pipeline.

By default, output files are placed in the same folder as the makefile, but this behavior may be changed by setting the `--destination` command-line option. For this project, these files include the following:

.. code-block:: bash

    $ ls -d MyFilename*
    MyFilename
    MyFilename.rCRS.coverage
    MyFilename.rCRS.depths
    MyFilename.rCRS.mapDamage
    MyFilename.summary

The files include a table of the average coverage, a histogram of the per-site coverage (depths), a folder containing one set of mapDamage plots per library, and the final BAM file and its index (the `.bam.bai` file), as well as a table summarizing the entire analysis. For a more detailed description of the files generated by the pipeline, please refer to the :ref:`bam_filestructure` section; should problems occur during the execution of the pipeline, then please verify that the makefile is correctly filled out as described above, and refer to the :ref:`troubleshooting_bam` section.

.. note::
    The first item, `MyFilename`, is a folder containing the intermediate files generated while running the pipeline. These files are required due to the many steps involved in a typical analysis, and also allow the pipeline to resume should the process be interrupted. This folder will typically take up 3-4x the disk-space used by the final BAM file(s), and can safely be removed once the pipeline has run to completion, in order to reduce disk-usage.

.. _SAM/BAM specification: http://samtools.sourceforge.net/SAM1.pdf
.. _seqtk: https://github.com/lh3/seqtk
.. _Phred quality-scores: https://en.wikipedia.org/wiki/FASTQ_format#Quality
.. _AdapterRemoval documentation: https://github.com/MikkelSchubert/adapterremoval

paleomix-1.3.8/docs/conf.py

# -*- coding: utf-8 -*-
#
# Configuration file for the Sphinx documentation builder.
#
# This file does only contain a selection of the most common options. For a
# full list see the documentation:
# http://www.sphinx-doc.org/en/master/config

# -- Path setup --------------------------------------------------------------

# If extensions (or modules to document with autodoc) are in another directory,
# add these directories to sys.path here. If the directory is relative to the
# documentation root, use os.path.abspath to make it absolute, like shown here.
#
# import os
# import sys
# sys.path.insert(0, os.path.abspath('.'))

# -- Project information -----------------------------------------------------

project = "PALEOMIX"
copyright = "2015, Mikkel Schubert"
author = "Mikkel Schubert"

# The short X.Y version
version = "1.3"
# The full version, including alpha/beta/rc tags
release = "1.3.8"


# -- General configuration ---------------------------------------------------

# If your documentation needs a minimal Sphinx version, state it here.
# # needs_sphinx = '1.0' # Add any Sphinx extension module names here, as strings. They can be # extensions coming with Sphinx (named 'sphinx.ext.*') or your custom # ones. extensions = [] # Add any paths that contain templates here, relative to this directory. templates_path = [] # The suffix(es) of source filenames. # You can specify multiple suffix as a list of string: # # source_suffix = ['.rst', '.md'] source_suffix = ".rst" # The master toctree document. master_doc = "index" # The language for content autogenerated by Sphinx. Refer to documentation # for a list of supported languages. # # This is also used if you do content translation via gettext catalogs. # Usually you set "language" from the command line for these cases. language = None # List of patterns, relative to source directory, that match files and # directories to ignore when looking for source files. # This pattern also affects html_static_path and html_extra_path. exclude_patterns = ["_build", "Thumbs.db", ".DS_Store"] # The name of the Pygments (syntax highlighting) style to use. pygments_style = None # Disable smart-quotes to prevent conversion of double-dashes into long-dashes smartquotes = False # Use `blah` for inline code / shell snippets default_role = "literal" # -- Options for HTML output ------------------------------------------------- # The theme to use for HTML and HTML Help pages. See the documentation for # a list of builtin themes. # html_theme = "classic" # Theme options are theme-specific and customize the look and feel of a theme # further. For a list of options available for each theme, see the # documentation. # # html_theme_options = {} # Add any paths that contain custom static files (such as style sheets) here, # relative to this directory. They are copied after the builtin static files, # so a file named "default.css" will overwrite the builtin "default.css". html_static_path = ["_static"] # Custom sidebar templates, must be a dictionary that maps document names # to template names. # # The default sidebars (for documents that don't match any pattern) are # defined by theme itself. Builtin themes are using these templates by # default: ``['localtoc.html', 'relations.html', 'sourcelink.html', # 'searchbox.html']``. # # html_sidebars = {} # -- Options for HTMLHelp output --------------------------------------------- # Output file base name for HTML help builder. htmlhelp_basename = "PALEOMIXdoc" # -- Options for LaTeX output ------------------------------------------------ latex_elements = { # The paper size ('letterpaper' or 'a4paper'). # # 'papersize': 'letterpaper', # The font size ('10pt', '11pt' or '12pt'). # # 'pointsize': '10pt', # Additional stuff for the LaTeX preamble. # # 'preamble': '', # Latex figure (float) alignment # # 'figure_align': 'htbp', } # Grouping the document tree into LaTeX files. List of tuples # (source start file, target name, title, # author, documentclass [howto, manual, or own class]). latex_documents = [ (master_doc, "PALEOMIX.tex", "PALEOMIX Documentation", "Mikkel Schubert", "manual"), ] # -- Options for manual page output ------------------------------------------ # One entry per manual page. List of tuples # (source start file, name, description, authors, manual section). man_pages = [(master_doc, "paleomix", "PALEOMIX Documentation", [author], 1)] # -- Options for Texinfo output ---------------------------------------------- # Grouping the document tree into Texinfo files. 
# List of tuples
# (source start file, target name, title, author,
#  dir menu entry, description, category)
texinfo_documents = [
    (
        master_doc,
        "PALEOMIX",
        "PALEOMIX Documentation",
        author,
        "PALEOMIX",
        "Pipelines and tools for the processing of ancient and modern HTS data.",
        "Miscellaneous",
    ),
]


# -- Options for Epub output -------------------------------------------------

# Bibliographic Dublin Core info.
epub_title = project

# The unique identifier of the text. This can be a ISBN number
# or the project homepage.
#
# epub_identifier = ''

# A unique identification for the text.
#
# epub_uid = ''

# A list of files that should not be packed into the epub file.
epub_exclude_files = ["search.html"]

paleomix-1.3.8/docs/examples.rst

.. _examples:

Example projects and data-sets
==============================

The PALEOMIX pipeline contains small example projects for the larger pipelines, which are designed to be executed in a short amount of time, and to help verify that the pipelines have been correctly installed.

.. _examples_bam:

BAM Pipeline example project
----------------------------

The example project for the BAM pipeline involves the processing of a small data set consisting of (simulated) ancient sequences derived from the human mitochondrial genome. The runtime of this project on a typical desktop or laptop ranges from around 1 minute to around 1 hour (when full modeling of ancient DNA damage patterns is enabled).

To access this example project, use the 'example' command for the BAM pipeline to copy the project files to a given directory (here, the current directory)::

    $ paleomix bam example .
    $ cd bam_pipeline
    $ paleomix bam run makefile.yaml

The output generated by the pipeline is described in the :ref:`bam_filestructure` section. Please see the :ref:`troubleshooting` section if you run into problems running the pipeline.

.. _examples_phylo:

Phylogenetic Pipeline example project
-------------------------------------

The example project for the phylogenetic pipeline involves the processing and mapping of a small data set consisting of (simulated) sequences derived from the human and primate mitochondrial genomes, followed by the genotyping of gene sequences and the construction of a maximum likelihood phylogeny. Since this example project starts from raw reads, it requires that the BAM pipeline has been correctly installed (as described in section :ref:`bam_requirements`). The runtime of this project on a typical desktop or laptop ranges from around 30 minutes to around 1 hour.

To access this example project, use the 'example' command for the phylogenetic pipeline to copy the project files to a given directory (here, the current directory), and then run the 'setup.sh' script in the root directory to generate the data set::

    $ paleomix phylo example .
    $ cd phylo_pipeline
    $ ./setup.sh

Once the example data has been generated, the two pipelines may be executed::

    $ cd alignment
    $ paleomix bam run makefile.yaml
    $ cd ../phylogeny
    $ paleomix phylo genotype+msa+phylogeny makefile.yaml

The output generated by the pipeline is described in the :ref:`phylo_filestructure` section. Please see the :ref:`troubleshooting` section if you run into problems running the pipeline.
.. _examples_zonkey:

Zonkey Pipeline example project
-------------------------------

The example project for the Zonkey pipeline is based on a synthetic hybrid between a Domestic donkey and an Arabian horse (obtained from [Orlando2013]_), using a low number of reads (1200). The runtime of these examples on a typical desktop or laptop ranges from around 30 minutes to around 1 hour, depending on your local configuration.

To access this example project, download the Zonkey reference database (see the 'Prerequisites' section of the :ref:`zonkey_usage` page for instructions), and use the 'example' command for zonkey to copy the project files to a given directory. Here, the current directory is used; to place the example files in a different location, simply replace the '.' with the full path to the desired directory::

    $ paleomix zonkey example database.tar .
    $ cd zonkey_pipeline

The example directory contains 3 BAM files: one containing a nuclear alignment ('nuclear.bam'); one containing a mitochondrial alignment ('mitochondrial.bam'); and one containing a combined nuclear and mitochondrial alignment ('combined.bam'). In addition, a sample table is included, which shows how multiple samples may be specified and processed at once. Each of these may be run as follows::

    # Process only the nuclear BAM;
    # by default, results are saved in 'nuclear.zonkey'
    $ paleomix zonkey run database.tar nuclear.bam

    # Process only the mitochondrial BAM;
    # by default, results are saved in 'mitochondrial.zonkey'
    $ paleomix zonkey run database.tar mitochondrial.bam

    # Process both the nuclear and the mitochondrial BAMs;
    # note that it is necessary to specify an output directory
    $ paleomix zonkey run database.tar nuclear.bam mitochondrial.bam results

    # Process the combined nuclear and mitochondrial BAM;
    # by default, results are saved in 'combined.zonkey'
    $ paleomix zonkey run database.tar combined.bam

    # Process multiple samples; the table corresponds to the four
    # cases listed above.
    $ paleomix zonkey run database.tar samples.txt

Please see the :ref:`troubleshooting` section if you run into problems running the pipeline. The output generated by the pipeline is described in the :ref:`zonkey_filestructure` section.

paleomix-1.3.8/docs/index.rst

Welcome to PALEOMIX's documentation!
====================================

The PALEOMIX pipelines are a set of pipelines and tools designed to aid the rapid processing of High-Throughput Sequencing (HTS) data: The BAM pipeline processes demultiplexed reads from one or more samples, through sequence processing and alignment, to generate BAM alignment files useful in downstream analyses; the Phylogenetic pipeline carries out genotyping and phylogenetic inference on BAM alignment files, either produced using the BAM pipeline or generated elsewhere; and the Zonkey pipeline carries out a suite of analyses on low coverage equine alignments, in order to detect the presence of F1-hybrids in archaeological assemblages. The pipelines were originally designed with ancient DNA (aDNA) in mind, and include several features especially useful for the analyses of ancient samples, but can all be used for the processing of modern samples.

The PALEOMIX pipelines have been published in Nature Protocols; if you make use of PALEOMIX in your work, then please cite

    Schubert M, Ermini L, Sarkissian CD, Jónsson H, Ginolhac A, Schaefer R, Martin MD, Fernández R, Kircher M, McCue M, Willerslev E, and Orlando L.
"**Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX**". Nat Protoc. 2014 May;9(5):1056-82. doi: `10.1038/nprot.2014.063 `_. Epub 2014 Apr 10. PubMed PMID: `24722405 `_. The Zonkey pipeline has been published in Journal of Archaeological Science; if you make use of this pipeline in your work, then please cite Schubert M, Mashkour M, Gaunitz C, Fages A, Seguin-Orlando A, Sheikhi S, Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Chuang R, Ermini L, Gamba C, Weinstock J, Vedat O, and Orlando L. "**Zonkey: A simple, accurate and sensitive pipeline to genetically identify equine F1-hybrids in archaeological assemblages**". Journal of Archaeological Science. 2007 Feb; 78:147-157. doi: `10.1016/j.jas.2016.12.005 `_. For questions, bug reports, and/or suggestions, please use the PALEOMIX `GitHub tracker `_. **Table of Contents:** .. toctree:: :maxdepth: 2 introduction.rst installation.rst bam_pipeline/index.rst phylo_pipeline/index.rst zonkey_pipeline/index.rst other_tools.rst examples.rst troubleshooting/index.rst yaml.rst acknowledgements.rst related.rst references.rst Indices and tables ================== * :ref:`genindex` * :ref:`search` paleomix-1.3.8/docs/installation.rst000066400000000000000000000117631443173665200175470ustar00rootroot00000000000000.. highlight:: Bash .. _installation: Installation ============ The following instructions will install PALEOMIX for the current user, but does not include specific programs required by the pipelines. For pipeline specific instructions, refer to the requirements sections for the :ref:`BAM `, the :ref:`Phylogentic `, and the :ref:`Zonkey ` pipeline. The recommended way of installing PALEOMIX is by use of the `pip`_ package manager for Python 3. If pip is not installed, then please consult the documentation for your operating system. For Debian based operating systems, pip may be installed as follows:: $ sudo apt-get install python3-pip In addition, some libraries used by PALEOMIX may require additional development files, namely those for `zlib`, `libbz2`, `liblzma`, and for Python 3:: $ sudo apt-get install libz-dev libbz2-dev liblzma-dev python3-dev Once all requirements have been installed, PALEOMIX may be installed using `pip`:: $ python3 -m pip install paleomix==1.3.7 To verify that the installation was carried out correctly, run the command `paleomix`:: $ paleomix PALEOMIX - pipelines and tools for NGS data analyses Version: v1.3.7 ... If you have not previously used pip, then you may need to add the pip `bin` folder to your `PATH` and restart your terminal before running the `paleomix` command:: $ echo 'export PATH=~/.local/bin:$PATH' >> ~/.bashrc Self-contained installation --------------------------- The recommended method for installing PALEOMIX is using a virtual environment. Doing so allows different versions of PALEOMIX to be installed simultaneously and ensures that PALEOMIX and its dependencies are not affected by the addition or removal of other python modules. This installation method requires the `venv` module. On Debian based systems, this module must be installed separately:: $ sudo apt-get install python3-venv Once `venv` is installed, creation of a virtual environment and installation of PALEOMIX may be carried out as shown here:: $ python3 -m venv venv $ ./venv/bin/pip install paleomix==v1.3.7 Following successful completion of these commands, the `paleomix` executable will be accessible in the `./venv/bin/` folder. 
However, as this folder also contains a copy of Python itself, it is not recommended to add it to your `PATH`. Instead, simply link the `paleomix` executable to a folder in your `PATH`. This can be accomplished as follows::

    $ mkdir -p ~/.local/bin/
    $ ln -s ${PWD}/venv/bin/paleomix ~/.local/bin/

If ~/.local/bin is not already in your PATH, then it can be added as follows::

    $ echo 'export PATH=~/.local/bin:$PATH' >> ~/.bashrc

Upgrading an existing installation
----------------------------------

Upgrading an existing installation of PALEOMIX, installed using the methods described above, may also be accomplished using pip. To upgrade a regular installation, simply run `pip install` with the `--upgrade` option::

    $ pip install --upgrade paleomix

To upgrade a self-contained installation, simply call the `pip` executable in that environment::

    $ ./venv/bin/pip install --upgrade paleomix

Conda installation
------------------

`Conda`_ can be used to automatically set up a self-contained environment that includes the software required by PALEOMIX. To install `conda` and set it up to use the `bioconda`_ bioinformatics repository, follow the instructions on the bioconda website `here`_. Next, run the following commands to download the conda environment template for this release of PALEOMIX and to create a new conda environment named `paleomix` using that template::

    $ curl -fL https://github.com/MikkelSchubert/paleomix/releases/download/v1.3.8/paleomix_environment.yaml > paleomix_environment.yaml
    $ conda env create -n paleomix -f paleomix_environment.yaml

You can now activate the paleomix environment with::

    $ conda activate paleomix

PALEOMIX requires that the Picard JAR file can be found in a specific location, so we can symlink the version in your conda environment into the correct place::

    (paleomix) $ mkdir -p ~/install/jar_root/
    (paleomix) $ ln -s ~/*conda*/envs/paleomix/share/picard-*/picard.jar ~/install/jar_root/

.. note::
    If you installed conda in a different location, then you can obtain the location of the `paleomix` environment by running `conda env list`.

Once completed, you can test that the environment works correctly using the pipeline test commands described in :ref:`examples`. To deactivate the paleomix environment, simply run::

    $ conda deactivate

If you ever need to remove the entire environment, run the following command::

    $ conda env remove -n paleomix

.. _bioconda: https://bioconda.github.io
.. _conda: https://docs.conda.io/projects/conda/en/latest/index.html
.. _here: https://bioconda.github.io/user/install.html#install-conda
.. _pip: https://pip.pypa.io/en/stable/
.. _Pysam: https://github.com/pysam-developers/pysam/
.. _Python: http://www.python.org/

paleomix-1.3.8/docs/introduction.rst

.. _introduction:

============
Introduction
============

The PALEOMIX pipeline is a set of pipelines and tools designed to enable the rapid processing of High-Throughput Sequencing (HTS) data from modern and ancient samples. Currently, PALEOMIX consists of two pipelines and one protocol described in [Schubert2014]_, as well as one pipeline described in [Schubert2017]_:

* **The BAM pipeline** operates on de-multiplexed NGS reads, and carries out the steps necessary to produce high-quality alignments against a reference sequence, ultimately producing one or more annotated BAM files [Schubert2014]_.
* **The Phylogenetic pipeline** carries out genotyping, multiple sequence alignment, and phylogenetic inference on a set of regions derived from one or more BAM files, such as those BAM files produced using the BAM pipeline [Schubert2014]_.

* **The Metagenomic protocol** is a protocol describing how to carry out metagenomic analyses on reads processed by the BAM pipeline, allowing for the characterisation of the metagenomic population of ancient samples. This protocol makes use of tools included with PALEOMIX [Schubert2014]_.

* **The Zonkey pipeline** is a pipeline for the detection of F1 hybrids in equids, based on low coverage nuclear genomes (as few as thousands of aligned reads) and mitochondrial DNA [Schubert2017]_.

All pipelines operate through a mix of standard bioinformatics tools, such as SAMTools [Li2009b]_ and BWA [Li2009a]_, in addition to custom scripts written to support the pipelines. The automated pipelines have been designed to run analytical steps in parallel where possible, and to run with minimal user-intervention. To guard against incomplete runs and to allow easy debugging of failures, all analyses are run in individual temporary folders, all output is logged, and result files are only merged into the destination upon successful completion of the given task.

In order to facilitate automatic execution, and to ensure that analyses are documented and can be replicated easily, the BAM and the Phylogenetic pipelines make use of configuration files (henceforth "makefiles") in `YAML`_ format. These are text files which describe a project in terms of input files, settings for programs run as part of the pipeline, and which steps to run. For an overview of the YAML format, refer to the included introduction to :ref:`yaml_intro`, or to the official `YAML`_ website. For a thorough discussion of the makefiles used by either pipeline, please refer to the respective sections of the documentation (*i.e.* the :ref:`BAM <bam_makefile>` and :ref:`Phylogenetic <phylo_makefile>` pipeline sections).

.. _YAML: http://www.yaml.org

paleomix-1.3.8/docs/other_tools.rst

.. _other_tools:

Other tools
===========

On top of the pipelines described in the major sections of the documentation, PALEOMIX comes bundled with several other, smaller tools, all accessible via the `paleomix` command. These tools are (briefly) described in this section.

paleomix coverage
-----------------

Calculates coverage for a BAM file, either for the entire genome or for specific regions.

paleomix depths
---------------

Calculates a depth histogram for a BAM file, either for the entire genome or for specific regions.

paleomix rmdup_collapsed
------------------------

Filters PCR duplicates for merged/collapsed paired-ended reads, such as those generated by AdapterRemoval with the `--collapse` option enabled. Unlike `SAMtools rmdup` or `Picard MarkDuplicates`, this tool identifies duplicates based on both the 5' and the 3' alignment coordinates of individual reads.

paleomix vcf_filter
-------------------

Quality filters for VCF records, similar to `vcfutils.pl varFilter`.

paleomix vcf_to_fasta
---------------------

The 'paleomix vcf\_to\_fasta' command is used to generate FASTA sequences from a VCF file, based either on a set of BED coordinates provided by the user, or for the entire genome covered by the VCF file. By default, heterozygous SNPs are represented using IUPAC codes.
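All of the above tools are invoked through the `paleomix` command. As a rough sketch of typical invocations (the file names below are hypothetical, and the exact arguments accepted by each tool should be confirmed via its `--help` option)::

    # Coverage and depth statistics for an indexed BAM file
    $ paleomix coverage sample.bam sample.coverage
    $ paleomix depths sample.bam sample.depths

    # Filter PCR duplicates among collapsed reads, removing them from the output
    $ paleomix rmdup_collapsed --remove-duplicates < sample.bam > filtered.bam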
paleomix-1.3.8/docs/phylo_pipeline/configuration.rst

.. highlight:: ini

.. _phylo_configuration:

Configuring the phylogenetic pipeline
=====================================

The Phylo pipeline supports a number of command-line options (see `paleomix phylo --help`). These options may be set directly on the command-line (e.g. using `--max-threads 16`), but it is also possible to set default values for such options. This is accomplished by writing options in `~/.paleomix/phylo_pipeline.ini`::

    max-threads = 16
    examl-max-threads = 7
    log-level = warning
    temp-root = /tmp/username/phylo_pipeline

Options in the configuration file correspond directly to command-line options for the Phylo pipeline, with leading dashes removed. For example, the command-line option `--max-threads` becomes `max-threads` in the configuration file.

Options specified on the command-line take precedence over those in the `phylo_pipeline.ini` file. For example, if `max-threads` is set to 4 in the `phylo_pipeline.ini` file, but the pipeline is run using `paleomix phylo --max-threads 10`, then the max threads value is set to 10.

paleomix-1.3.8/docs/phylo_pipeline/filestructure.rst

.. highlight:: Yaml

.. _phylo_filestructure:

File structure
==============

TODO

paleomix-1.3.8/docs/phylo_pipeline/index.rst

.. _phylo_pipeline:

Phylogenetic Pipeline
=====================

**Table of Contents:**

.. toctree::

   overview.rst
   requirements.rst
   configuration.rst
   usage.rst
   makefile.rst
   filestructure.rst

.. warning::
    This section of the documentation is currently undergoing a complete rewrite, and may therefore be incomplete in places.

The Phylogenetic pipeline is a pipeline designed for the processing of (one or more) BAMs, in order to carry out genotyping of a set of regions of interest. Following genotyping, multiple sequence alignment may optionally be carried out (this is required if indels were called), and phylogenetic inference may be done on the regions of interest, using a supermatrix approach through ExaML.

Regions of interest, as defined for the Phylogenetic pipeline, are simply any set of regions in a reference sequence, and may span anything from a few short genomic regions, to the complete exome of complex organisms (tens of thousands of genes), and even entire genomes. The Phylogenetic pipeline is designed for ease of use in conjunction with the BAM pipeline, but it can be used on arbitrary BAM files, provided that these follow the expected naming scheme (see the :ref:`phylo_usage` section).

paleomix-1.3.8/docs/phylo_pipeline/makefile.rst

.. highlight:: Bash

.. _phylo_makefile:

Makefile description
====================

TODO

TODO: Describe how to use 'MaxDepth: auto' with custom region, by creating new

paleomix-1.3.8/docs/phylo_pipeline/overview.rst

Overview of analytical steps
============================

During a typical analysis, the Phylogenetic pipeline will proceed through the following steps:

1. Genotyping
   1. SNPs are called on the provided regions using SAMTools, and the resulting SNPs are filtered using the 'paleomix vcf_filter' tool.

   2. FASTA sequences are constructed for the regions of interest, using the filtered SNPs generated above, one FASTA file per set of regions and per sample.

2. Multiple sequence alignment

   1. Per-sample files generated in step 1 are collected, and used to build unaligned multi-FASTA files, one per region of interest.

   2. If enabled, multiple-sequence alignment is carried out on these files using MAFFT, to generate aligned multi-FASTA files.

3. Phylogenetic inference

   Following construction of (aligned) multi-FASTA sequences, phylogenetic inference may be carried out using a partitioned maximum likelihood approach via ExaML.
paleomix-1.3.8/docs/phylo_pipeline/requirements.rst000066400000000000000000000051021443173665200225770ustar00rootroot00000000000000.. highlight:: Bash

.. _phylo_requirements:

Software requirements
=====================

Depending on the parts of the Phylogenetic pipeline used, different programs are required. The following lists which programs are required for each part of the pipeline, as well as the minimum version required:

Genotyping
----------

* `SAMTools`_ v1.3.1 [Li2009b]_
* `BCFTools`_ v1.4.0 [Li2009b]_
* `Tabix`_ v1.3.1

Both the 'tabix' and the 'bgzip' executable from the Tabix package must be installed. On Debian based distros, these tools can be installed as follows::

    $ sudo apt-get install bcftools samtools tabix

Multiple Sequence Alignment
---------------------------

* `MAFFT`_ v7.307 [Katoh2013]_

Note that the pipeline requires the algorithm-specific MAFFT commands (e.g. 'mafft-ginsi', 'mafft-fftnsi') to be available. These are automatically created by the 'make install' command. On Debian based distros, MAFFT can be installed as follows::

    $ sudo apt-get install mafft

The various mafft-\* binaries may not be added to your PATH by default on Debian. If that is the case, then they can be included as follows::

    $ echo 'export PATH=/usr/lib/mafft/bin/:$PATH' >> ~/.bashrc

Phylogenetic Inference
----------------------

* `RAxML`_ v8.2.9 [Stamatakis2006]_
* `ExaML`_ v3.0.21

The pipeline expects a single-threaded binary named 'raxmlHPC' for RAxML. The pipeline expects the ExaML binary to be named 'examl', and the parser binary to be named 'parse-examl'. Compiling and running ExaML requires an MPI implementation (e.g. `OpenMPI`_), even if ExaML is run single-threaded.

Both programs offer a variety of makefiles suited for different server-architectures and use-cases. If in doubt, use the Makefile.SSE3.gcc makefiles, which are compatible with most modern systems::

    $ make -f Makefile.SSE3.gcc

RAxML and MPI (mpirun/mpicc) can be installed as follows on Debian based distros::

    $ sudo apt-get install raxml mpi-default-bin mpi-default-dev

Testing the pipeline
--------------------

An example project is included with the phylogenetic pipeline, and it is recommended to run this project in order to verify that the pipeline and required applications have been correctly installed. See the :ref:`examples` section for a description of how to run this example project.

.. _EXaML: https://github.com/stamatak/ExaML
.. _MAFFT: http://mafft.cbrc.jp/alignment/software/
.. _OpenMPI: http://www.open-mpi.org/
.. _RAxML: https://github.com/stamatak/standard-RAxML
.. _SAMTools: https://github.com/samtools/samtools
.. _BCFTools: https://github.com/samtools/bcftools
.. _Tabix: https://github.com/samtools/htslib
paleomix-1.3.8/docs/phylo_pipeline/usage.rst000066400000000000000000000260261443173665200211700ustar00rootroot00000000000000.. highlight:: Yaml

.. _phylo_usage:

Pipeline usage
==============

The 'phylo\_pipeline mkfile' command can be used to create a makefile template, as with the 'bam\_pipeline mkfile' command (see section :ref:`bam_usage`). This makefile is used to specify the samples, regions of interest (to be analysed), and options for the various programs:

.. code-block:: bash

    $ paleomix phylo mkfile > makefile.yaml

Note that filenames are not specified explicitly with this pipeline, but are instead inferred from the names of samples, prefixes, etc. as described below.

To execute the pipeline, a command corresponding to the step to be invoked is used (see below):

.. code-block:: bash

    $ paleomix phylo <STEP> [OPTIONS] <makefile>

Samples
-------

The phylogenetic pipeline expects a number of samples to be specified. Each sample has a name and a sex::

    Samples:
      <GROUP>:
        SAMPLE_NAME:
          Sex: ...

Sex is required, and is used to filter SNPs at homozygous sex chromosomes (e.g. chrX and chrY for male humans). Any names may be used, and the value can simply be set to e.g. 'NA' in case this feature is not used.

Groups are optional, and may be used either for the sake of the reader, or to refer to a group of samples in lists of samples, e.g. when excluding samples from a subsequent step, when filtering singletons, or when rooting phylogenetic trees (see below).

For a given sample with name S, and a prefix with name P, the pipeline will expect files to be located at ./data/samples/*S*.*P*.bam.

Regions of interest
-------------------

Analysis is carried out for a set of "Regions of Interest", which is defined as a set of named regions specified using BED files::

    RegionsOfInterest:
      NAME:
        Prefix: NAME_OF_PREFIX
        ProteinCoding: yes/no
        IncludeIndels: yes/no

The options 'ProteinCoding' and 'IncludeIndels' take the values 'yes' and 'no' (without quotation marks), and determine the behavior when calling indels. If 'IncludeIndels' is set to yes, indels are included in the consensus sequence, and if 'ProteinCoding' is set to yes, only indels that are a multiple of 3bp long are included.

The name and the prefix determine the location of the expected BED file and the FASTA file for the prefix: For a region of interest named R, and a prefix named P, the pipeline will expect the BED file to be located at ./data/regions/P.R.bed. The prefix file is expected to be located at ./data/prefixes/P.fasta

Genotyping
----------

Genotyping is done by building a pileup using samtools and calling SNPs / indels using bcftools. The command used for full genotyping is similar to the following command:

.. code-block:: bash

    $ samtools mpileup [OPTIONS] | bcftools view [OPTIONS] -

In addition, SNPs / indels are filtered using the script 'vcf_filter', which is included with the pipeline. This script implements the filters found in "vcfutils.pl varFilter", with some additions.

Options for either method, including both the "samtools mpileup" and the "bcftools view" command, are set using the **Genotyping** section of the makefile, and may be set for all regions of interest (default behavior) or for each set of regions of interest::

    Genotyping:
      Defaults:
        ...

The 'Defaults' key specifies that the options given here apply to all regions of interest; in addition to this key, the name of each set of regions of interest may be used, to set specific values for one set of regions vs. another set.
Thus, assuming regions of interest 'ROI\_a' and 'ROI\_b', options may be set as follows::

    Genotyping:
      Defaults:
        ...

      ROI_a:
        ...

      ROI_b:
        ...

For each set of regions of interest named ROI, the final settings are derived by first taking the Defaults, and then overwriting values using the value taken from the ROI section (if one such exists). The following shows how to change values in Defaults for a single ROI::

    Genotyping:
      Defaults:
        --switch: value_a

      ROI_N:
        --switch: value_b

In the above, all ROI except "ROI\_N" will use the switch with 'value\_a', while "ROI\_N" will use 'value\_b'. Executing the 'genotyping' step is described below.

Finally, note the "Padding" option; this option specifies a number of bases to include around each interval in a set of regions of interest. The purpose of this padding is to allow filtering of SNPs based on the distance from indels, in the case where the indels are outside the intervals themselves.

Multiple sequence alignment
---------------------------

Multiple sequence alignment (MSA) is currently carried out using MAFFT, if enabled. Note that it is still necessary to run the MSA command (see below), even if the multiple sequence alignment itself is disabled (for example in the case where indels are not called in the genotyping step). This is because the MSA step is responsible for generating both the unaligned multi-FASTA files, and the aligned multi-FASTA files. It is necessary to run the 'genotyping' step prior to running the MSA step (see above).

It is possible to select among the various MAFFT algorithms using the "Algorithm" key, and additionally to specify command-line options for the selected algorithm::

    MultipleSequenceAlignment:
      Defaults:
        Enabled: yes

        MAFFT:
          Algorithm: G-INS-i
          --maxiterate: 1000

Currently supported algorithms are as follows (as described on the `MAFFT website`_):

* mafft - The basic program (mafft)
* auto - Equivalent to command 'mafft --auto'
* fft-ns-1 - Equivalent to the command 'fftns --retree 1'
* fft-ns-2 - Equivalent to the command 'fftns'
* fft-ns-i - Equivalent to the command 'fftnsi'
* nw-ns-i - Equivalent to the command 'nwnsi'
* l-ins-i - Equivalent to the command 'linsi'
* e-ins-i - Equivalent to the command 'einsi'
* g-ins-i - Equivalent to the command 'ginsi'

Command line options are specified as key / value pairs, as shown above for the --maxiterate option, in the same manner that options are specified for the genotyping section. Similarly, options may be specified for all regions of interest ("Defaults"), or using the name of a set of regions of interest, in order to set options for only that set of regions.

Phylogenetic inference
----------------------

Maximum likelihood phylogenetic inference is carried out using the ExaML program. A named phylogeny consists of (subsets of) one or more sets of regions of interest, with individual regions partitioned according to some scheme, and rooted on the midpoint of the tree or on one or more taxa::

    PhylogeneticInference:
      PHYLOGENY_NAME:
        ExcludeSamples:
          ...

        RootTreesOn:
          ...

        PerGeneTrees: yes/no

        RegionsOfInterest:
          REGIONS_NAME:
            Partitions: "111"
            SubsetRegions: SUBSET_NAME

        ExaML:
          Replicates: 1
          Bootstraps: 100
          Model: GAMMA

A phylogeny may exclude any number of samples specified in the Samples section, by listing them under ExcludeSamples. Furthermore, if groups have been specified for samples (e.g. "<GROUP>"), then these may be used as a short-hand for multiple samples, by using the name of the group including the angle-brackets ("<GROUP>"), as illustrated below.
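For example, assuming that a group named "<Ancient>" was defined in the Samples section, samples could be excluded both individually and as a group. The group and sample names below are hypothetical, and the exact layout should be checked against the template produced by 'paleomix phylo mkfile'::

    PhylogeneticInference:
      MyPhylogeny:
        ExcludeSamples:
          - <Ancient>   # Excludes every sample belonging to this group
          - Sample_X    # Excludes a single, named sample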
Rooting is determined using the RootTreesOn option; if this option is not set, then the resulting trees are rooted on the midpoint of the tree, otherwise they are rooted on the clade containing all the given taxa. If the given taxa do not form a monophyletic clade, then rooting is done on the monophyletic clade containing the given taxa.

If PerGeneTrees is set to yes, a tree is generated for every named feature in the regions of interest (e.g. genes), otherwise a super-matrix is created based on all features in all the regions of interest specified for the current phylogeny.

Each phylogeny may include one or more sets of regions of interest, specified under "RegionsOfInterest", using the same names as those specified under the Project section.

Each feature in a set of regions of interest may be partitioned according to a position-specific scheme. These schemes are specified using a string of numbers (0-9), which is then applied across the selected sequences to determine the model for each position. For example, for the scheme "012" and a given nucleotide sequence, models are applied as follows::

    AAGTAACTTCACCGTTGTGA
    01201201201201201201

Thus, the default partitioning scheme ("111") will use the same model for all positions, and is equivalent to the schemes "1", "11", "1111", etc. Similarly, a per-codon-position scheme may be accomplished using "123" or a similar string.

In addition to numbers, the character 'X' may be used to exclude specific positions in an alignment. E.g. to exclude the third position in codons, use a string like "11X". Alternatively, Partitions may be set to 'no' to disable per-feature partitions; instead a single partition is used per set of regions of interest.

The options in the ExaML section specify the number of bootstrap trees to generate from the original supermatrix, the number of phylogenetic inferences to carry out on the original supermatrix (replicates), and the model used (c.f. the ExaML documentation).

The name (PHYLOGENY_NAME) is used to determine the location of the resulting files, by default ./results/TITLE/phylogenies/NAME/. If per-gene trees are generated, an additional two folders are used, namely the name of the regions of interest, and the name of the gene / feature. For each phylogeny, the following files are generated:

**alignments.partitions**: List of partitions used when running ExaML; the "reduced" file contains the same list of partitions, after empty columns (no called bases) have been excluded.

**alignments.phy**: Super-matrix used in conjunction with the list of partitions when calling ExaML; the "reduced" file contains the same matrix, but with empty columns (no bases called) excluded.

**alignments.reduced.binary**: The reduced supermatrix / partitions in the binary format used by ExaML.

**bootstraps.newick**: List of bootstrap trees in Newick format, rooted as specified in the makefile.

**replicates.newick**: List of phylogenies inferred from the full super-matrix, rooted as specified in the makefile.

**replicates.support.newick**: List of phylogenies inferred from the full super-matrix, with support values calculated using the bootstrap trees, and rooted as specified in the makefile.

Executing the pipeline
----------------------

The phylogenetic pipeline is executed similarly to the BAM pipeline, except that a command is provided for each step ('genotyping', 'msa', and 'phylogeny'):

.. code-block:: bash

    $ paleomix phylo <STEP> [OPTIONS] <makefile>

Thus, to execute the genotyping step, the following command is used:
.. code-block:: bash

    $ paleomix phylo genotyping [OPTIONS] <makefile>

In addition, it is possible to run multiple steps by joining these with the plus-symbol. To run both the 'genotyping' and 'msa' steps at the same time, use the following command:

.. code-block:: bash

    $ paleomix phylo genotyping+msa [OPTIONS] <makefile>

.. _MAFFT website: http://mafft.cbrc.jp/alignment/software/algorithms/algorithms.html
paleomix-1.3.8/docs/references.rst000066400000000000000000000076061443173665200171660ustar00rootroot00000000000000==========
References
==========

.. [Alexander2009] Alexander *et al*. "**Fast model-based estimation of ancestry in unrelated individuals**". Genome Res. 2009 Sep;19(9):1655-64. doi:10.1101/gr.094052.109

.. [Chang2015] Chang *et al*. "**Second-generation PLINK: rising to the challenge of larger and richer datasets**". Gigascience. 2015 Feb 25;4:7. doi:10.1186/s13742-015-0047-8

.. [DerSarkissian2015] Der Sarkissian *et al*. "**Evolutionary Genomics and Conservation of the Endangered Przewalski's Horse**". Curr Biol. 2015 Oct 5;25(19):2577-83. doi:10.1016/j.cub.2015.08.032

.. [Jonsson2013] Jónsson *et al*. "**mapDamage2.0: fast approximate Bayesian estimates of ancient DNA damage parameters**". Bioinformatics. 2013 Jul 1;29(13):1682-4. doi:10.1093/bioinformatics/btt193

.. [Jonsson2014] Jónsson *et al*. "**Speciation with gene flow in equids despite extensive chromosomal plasticity**". PNAS. 2014 Dec 30;111(52):18655-60. doi:10.1073/pnas.1412627111

.. [Katoh2013] Katoh and Standley. "**MAFFT multiple sequence alignment software version 7: improvements in performance and usability**". Mol Biol Evol. 2013 Apr;30(4):772-80. doi:10.1093/molbev/mst010

.. [Langmead2012] Langmead and Salzberg. "**Fast gapped-read alignment with Bowtie 2**". Nat Methods. 2012 Mar 4;9(4):357-9. doi:10.1038/nmeth.1923

.. [Li2009a] Li and Durbin. "**Fast and accurate short read alignment with Burrows-Wheeler transform**". Bioinformatics. 2009 Jul 15;25(14):1754-60. doi:10.1093/bioinformatics/btp324

.. [Li2009b] Li *et al*. "**The Sequence Alignment/Map format and SAMtools**". Bioinformatics. 2009 Aug 15;25(16):2078-9. doi:10.1093/bioinformatics/btp352

.. [Orlando2013] Orlando *et al*. "**Recalibrating Equus evolution using the genome sequence of an early Middle Pleistocene horse**". Nature. 2013 Jul;499(7456):74-78. doi:10.1038/nature12323

.. [Paradis2004] Paradis *et al*. "**APE: Analyses of Phylogenetics and Evolution in R language**". Bioinformatics. 2004 Jan 22;20(2):289-90. doi:10.1093/bioinformatics/btg412

.. [Patterson2006] Patterson *et al*. "**Population structure and eigenanalysis**". PLoS Genet. 2006 Dec;2(12):e190. doi:10.1371/journal.pgen.0020190

.. [Peltzer2016] Peltzer *et al*. "**EAGER: efficient ancient genome reconstruction**". Genome Biology. 2016 Mar 9;17:60. doi:10.1186/s13059-016-0918-z

.. [Pickrell2012] Pickrell and Pritchard. "**Inference of population splits and mixtures from genome-wide allele frequency data**". PLoS Genet. 2012;8(11):e1002967. doi:10.1371/journal.pgen.1002967

.. [Price2006] Price *et al*. "**Principal components analysis corrects for stratification in genome-wide association studies**". Nat Genet. 2006 Aug;38(8):904-9. Epub 2006 Jul 23. doi:10.1038/ng1847

.. [Schubert2012] Schubert *et al*. "**Improving ancient DNA read mapping against modern reference genomes**". BMC Genomics. 2012 May 10;13:178. doi:10.1186/1471-2164-13-178

.. [Schubert2014] Schubert *et al*.
"**Characterization of ancient and modern genomes by SNP detection and phylogenomic and metagenomic analysis using PALEOMIX**". Nature Protocols. 2014 May;9(5):1056-82. doi:10.1038/nprot.2014.063 .. [Schubert2016] Schubert *et al*. "**AdapterRemoval v2: rapid adapter trimming, identification, and read merging**". BMC Research Notes, 12;9(1):88 .. [Schubert2017] Schubert *et al*. "**Zonkey: A simple, accurate and sensitive pipeline to genetically identify equine F1-hybrids in archaeological assemblages**". Journal of Archaeological Science. 2017 Feb; 78:147-157. doi: 10.1016/j.jas.2016.12.005. .. [Stamatakis2006] Stamatakis. "**RAxML-VI-HPC: maximum likelihood-based phylogenetic analyses with thousands of taxa and mixed models**". Bioinformatics. 2006 Nov 1;22(21):2688-90. .. [Wickham2007] Wickham. "**Reshaping Data with the reshape Package**". Journal of Statistical Software. 2007 21(1). .. [Wickham2009] Wichham. "**ggplot2: Elegant Graphics for Data Analysis**". Springer-Verlag New York 2009. ISBN:978-0-387-98140-6paleomix-1.3.8/docs/related.rst000066400000000000000000000013301443173665200164530ustar00rootroot00000000000000.. _related_tools: Related Tools ============= **Pipelines:** * EAGER - Efficient Ancient GEnome Reconstruction (`website `_; [Peltzer2016]_) EAGER provides an intuitive and user-friendly way for researchers to address two problems in current ancient genome reconstruction projects; firstly, EAGER allows users to efficiently preprocess, map and analyze ancient genomic data using a standardized general framework for small to larger genome reconstruction projects. Secondly, EAGER provides a user-friendly interface that allows users to run EAGER without needing to fully understand all the underlying technical details. *(Description paraphrased from the EAGER website)* paleomix-1.3.8/docs/release.rst000066400000000000000000000007231443173665200164600ustar00rootroot00000000000000Release checklist ----------------- * Update changelog * Update version in `paleomix/__init__.py` and in `docs/conf.py` * Update version in paleomix_environment.yaml Publish to PyPi --------------- * git clone https://github.com/MikkelSchubert/paleomix.git * cd paleomix * tox * git clean -fdx * check-manifest * python3 setup.py sdist * twine upload -r testpypi dist/* * twine upload dist/* Publish to github ----------------- * Tag release * git push --tags paleomix-1.3.8/docs/troubleshooting/000077500000000000000000000000001443173665200175335ustar00rootroot00000000000000paleomix-1.3.8/docs/troubleshooting/bam_pipeline.rst000066400000000000000000000305421443173665200227150ustar00rootroot00000000000000.. _troubleshooting_bam: Troubleshooting the BAM Pipeline ================================ Troubleshooting BAM pipeline makefiles -------------------------------------- **Path included multiple times in target**: This message is triggered if the same target includes one more more input files more than once:: Error reading makefiles: MakefileError: Path included multiple times in target: - Record 1: Name: ExampleProject, Sample: Synthetic_Sample_1, Library: ACGATA, Barcode: Lane_1_001 - Record 2: Name: ExampleProject, Sample: Synthetic_Sample_1, Library: ACGATA, Barcode: Lane_3_001 - Canonical path 1: /home/username/temp/bam_example/data/ACGATA_L1_R1_01.fastq.gz - Canonical path 2: /home/username/temp/bam_example/data/ACGATA_L1_R2_01.fastq.gz This may be caused by using too broad wildcards, or simple mistakes. The message indicates the lane in which the files were included, as well as the "canonical" (i.e. 
following the resolution of symbolic links, etc.) path to each of the files. To resolve this issue, ensure that each input file is only included once for a given target.

**Target name used multiple times**: If running multiple makefiles in the same folder, it is important that the names given to targets in each makefile are unique, as the pipeline will otherwise mix files between different projects (see the section :ref:`bam_filestructure` for more information). The PALEOMIX pipeline attempts to detect this, and prevents the pipeline from running in this case::

    Error reading makefiles:
      MakefileError:
        Target name 'ExampleProject' used multiple times; output files would be clobbered!

**OutOfMemoryException (Picard Tools):** By default, the BAM pipeline will limit the amount of heap-space used by Java programs to 4GB (on 64-bit systems; JVM defaults are used on 32-bit systems), which may prove insufficient in some instances. This will result in the failing program terminating with a stacktrace, such as the following::

    Exception in thread "main" java.lang.OutOfMemoryError
        at net.sf.samtools.util.SortingLongCollection.<init>(SortingLongCollection.java:101)
        at net.sf.picard.sam.MarkDuplicates.generateDuplicateIndexes(MarkDuplicates.java:443)
        at net.sf.picard.sam.MarkDuplicates.doWork(MarkDuplicates.java:115)
        at net.sf.picard.cmdline.CommandLineProgram.instanceMain(CommandLineProgram.java:158)
        at net.sf.picard.sam.MarkDuplicates.main(MarkDuplicates.java:97)

To resolve this issue, increase the maximum amount of heap-space used using the "--jre-option" command-line option; this permits the passing of options to the Java Runtime Environment (JRE). For example, to increase the maximum to 8 GB, run the BAM pipeline as follows::

    $ paleomix bam run --jre-option=-Xmx8g [...]

Troubleshooting AdapterRemoval
------------------------------

The AdapterRemoval task will attempt to verify the quality-offset specified in the makefile; if the contents of the file do not match the expected offset (i.e. contain quality scores that fall outside the range expected for that offset; see http://en.wikipedia.org/wiki/FASTQ_format\#Encoding), the task will be aborted.

**Incorrect quality offsets specified in makefile**: In cases where the sequence data can be determined to contain FASTQ records with a different quality offset than that specified in the makefile, the task will be aborted with a message corresponding to the following::

    'ExampleProject/reads/Synthetic_Sample_1/TGCTCA/Lane_1_002/reads.*'>: Error occurred running command:
      Error(s) running Node:
        Temporary directory: '/path/to/temp/folder'

        FASTQ file contains quality scores with wrong quality score offset (33);
        expected reads with quality score offset 64. Ensure that the 'QualityOffset'
        specified in the makefile corresponds to the input.
        Filename = data/TGCTCA_L1_R1_02.fastq.gz

Please verify the format of the input file, and update the makefile to use the correct QualityOffset before starting the pipeline.

**Input file contains mixed FASTQ quality scores**: In cases where the sequence data can be determined to contain FASTQ records with quality scores corresponding to both of the possible offsets (for example both "!"
and "a"), the task will be aborted with the message corresponding to the following example:: 'ExampleProject/reads/Synthetic_Sample_1/TGCTCA/Lane_1_002/reads.*'>: Error occurred running command: Error(s) running Node: Temporary directory: '/path/to/temp/folder' FASTQ file contains quality scores with both quality offsets (33 and 64); file may be unexpected format or corrupt. Please ensure that this file contains valid FASTQ reads from a single source. Filename = 'data/TGCTCA_L1_R1_02.fastq.gz' This error would suggest that the input-file contains a mix of FASTQ records from multiple sources, e.g. resulting from the concatenation of multiple sets of data. If so, make use of the original data, and ensure that the quality score offset set for each is set correctly. **Input file does not contain quality scores**: If the input files does not contain any quality scores (e.g. due to malformed FASTQ records), the task will terminate, as these are required by the AdapterRemoval program. Please ensure that the input files are valid FASTQ files before proceeding. Input files in FASTA format / not in FASTQ format: If the input file can be determined to be in FASTA format, or otherwise be determined to not be in FASTQ format, the task will terminate with the following message:: 'ExampleProject/reads/Synthetic_Sample_1/TGCTCA/Lane_1_002/reads.*'>: Error occurred running command: Error(s) running Node: Temporary directory: '/path/to/temp/folder' Input file appears to be in FASTA format (header starts with '>', expected '@'), but only FASTQ files are supported. Filename = 'data/TGCTCA_L1_R1_02.fastq.gz' Note that the pipeline only supports FASTQ files as input for the trimming stage, and that these have to be either uncompressed, gzipped, or bzipped. Other compression schemes are not supported at this point in time. Troubleshooting BWA ------------------- **BWA prefix generated using different version of BWA / corrupt index**: Between versions 0.5 and 0.6, BWA changed the binary format used to store the index sequenced produced using the command "bwa index". Version 0.7 is compatible with indexes generated using v0.6. The pipeline will attempt to detect the case where the current version of BWA does not correspond to the version used to generate the index, and will terminate if that is the case. As the two formats contain both contain files with the same names, the two formats cannot co-exist in the same location. Thus to resolve this issue, either create a new index in a new location, and update the makefile to use that location, or delete the old index files (path/to/prefix.fasta.*), and re-index it by using the command "bwa index path/to/prefix.fasta", or by simply re-starting the pipeline. However, because the filenames used by v0.6+ is a subset of the filenames used by v0.5.x, it is possible to accidentally end up with a prefix that appears to be v0.5.x to the pipeline, but in fact contains a mix of v0.5.x and v0.6+ files. This situation, as well as corruption of the index, may result in the following errors: 1. [bwt_restore_sa] SA-BWT inconsistency: seq_len is not the same 2. [bns_restore_core] fail to open file './rCRS.fasta.nt.ann' 3. Segmentation faults when running 'bwa aln'; these are reported as "SIGSEGV" in the file pipe.errors If this occurs, removing the old prefix files and generating a new index is advised (see above). 
Troubleshooting validation of BAM files
---------------------------------------

**Both mates are marked as second / first of pair**: This error message may occur during validation of the final BAM, if the input files specified for different libraries contained duplicate reads (*not* PCR duplicates). In that case, the final BAM will contain multiple copies of the same data, thereby risking a significant bias in downstream analyses. The following demonstrates this problem, using a contrived example based on the examples/bam_example project included with the pipeline::

    $ paleomix bam run makefile.yaml
    [...]
    : Error occurred running command:
      Error(s) running Node:
        Return-codes: [1]
        Temporary directory: '/path/to/temp/folder'

Picard's ValidateSamFile prints the error messages to STDOUT, the location of which is indicated above::

    $ cat '/tmp/bam_pipeline/9a5beba9-1b24-4494-836e-62a85eb74bf3/rCRS.validated'
    ERROR: Record 684, Read name Seq_101_1324_104_rv_0\2, Both mates are marked as second of pair
    ERROR: Record 6810, Read name Seq_1171_13884_131_fw_0\2, Both mates are marked as second of pair

To identify the source of the problems, the problematic reads may be extracted from the BAM file::

    $ samtools view ExampleProject.rCRS.bam|grep -w "^Seq_101_1324_104_rv_0"
    Seq_101_1324_104_rv_0\2 131 NC_012920_1 1325 60 100M = 1325 -1 [...]
    Seq_101_1324_104_rv_0\2 131 NC_012920_1 1325 60 100M = 1325 1 [...]
    Seq_101_1324_104_rv_0\1 16 NC_012920_1 1327 37 51M2D49M * 0 0 [...]
    Seq_101_1324_104_rv_0\1 89 NC_012920_1 1327 60 51M2D49M * 0 0 [...]

Note that both mate pairs are duplicated, with slight variations in the flags. The source of the reads may be determined using the "RG" tags (not shown here), which for files produced by the pipeline correspond to the library names. Once these are known, the corresponding FASTQ files may be examined to determine the source of the duplicate reads. This problem should normally be detected early in the pipeline, as checks for the inclusion of duplicate data have been implemented (see below).

**Read ... found in multiple files**: In order to detect the presence of data that has been included multiple times, e.g. due to incorrect merging of data, the pipeline looks for alignments with identical names, sequences and quality scores. If such reads are found, the following error is reported::

    : Error occurred running command:
      Read 'Seq_junk_682_0' found in multiple files:
        - 'ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1_002/paired.minQ0.bam'
        - 'ExampleProject/rCRS/Synthetic_Sample_1/ACGATA/Lane_1_001/paired.minQ0.bam'

      This indicates that the same data files have been included multiple times in the
      project. Please review the input files used in this project, to ensure that each
      set of data is included only once.

The message given indicates which files (and hence which samples, libraries, and lanes) were affected, as described in section :ref:`bam_filestructure`. If only a single file is given, this suggests that the reads were included multiple times in that one file.

This problem may result from the accidental concatenation of files provided to the pipeline, or from multiple copies of the same files being included in the wildcards specified in the makefile. As including the same sequencing reads multiple times is bound to bias downstream analyses (if it does not cause validation failure, see the sub-section above), this must be fixed before the pipeline is re-started.
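Since this error is typically caused by the same file being included twice under different names or paths, comparing checksums of the input files can help locate the offending copies. The following is a rough sketch using GNU coreutils; the glob is hypothetical and should be adjusted to match the layout of your project:

.. code-block:: bash

    # Print groups of FASTQ files with byte-for-byte identical contents;
    # '-w 32' restricts the comparison to the leading 32-character MD5 digest
    $ md5sum data/*.fastq.gz | sort | uniq -D -w 32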
paleomix-1.3.8/docs/troubleshooting/common.rst000066400000000000000000000153511443173665200215610ustar00rootroot00000000000000.. highlight:: Bash

.. _troubleshooting_common:

Troubleshooting general problems
================================

If a command fails while the pipeline is running (e.g. mapping, genotyping, validation of BAMs, etc.), the pipeline will print a message to the command-line and write a message to a log-file. The location of the log-file may be specified using the --log-file command-line option, but if --log-file is not specified, a time-stamped log-file is generated in the temporary folder specified using the --temp-root command-line option, and the location of this log-file is printed by the pipeline during execution::

    $ 2014-01-07 09:46:19 Pipeline; 1 failed, 202 done of 203 tasks:
      Log-file located at '/path/to/temp/folder/bam_pipeline.20140107_094554_00.log'
    [...]

Most error-messages will involve a message in the following form::

    : Error occurred running command:
      Error(s) running Node:
        Return-codes: [1]
        Temporary directory: '/path/to/temp/folder'

The task that failed was the validation of the BAM 'ExampleProject.rCRS.bam' using Picard ValidateSamFile, which terminated with return-code 1. For each command involved in a given task ('node'), the command-line (as the list passed to `Popen <http://docs.python.org/2.7/library/subprocess.html>`_), return code, and the current working directory (CWD) are shown. In addition, STDOUT and STDERR are always either piped to files, or to a different command. In the example given, STDOUT is piped to the file 'rCRS.validated', while STDERR is piped to the file 'pipe_java_4454836272.stderr'. The asterisk in 'STDERR*' indicates that this filename was generated by the pipeline itself, and that this file is only kept if the command failed.

To determine the cause of the failure (indicated by the non-zero return-code), examine the output of each command involved in the node. Normally, messages relating to failures may be found in the STDERR file, but in some cases (and in this case) the cause is found in the STDOUT file::

    $ cat /path/to/temp/folder/rCRS.validated
    ERROR: Record 87, Read name [...], Both mates are marked as second of pair
    ERROR: Record 110, Read name [...], Both mates are marked as first of pair
    [...]

This particular error indicates that the same reads have been included multiple times in the makefile (see section :ref:`troubleshooting_bam`). Normally it is necessary to consult the documentation of the specified program in order to determine the cause of the failure.

In addition, the pipeline performs a number of checks during startup, which may result in the following issues being detected:

**Required file does not exist, and is not created by a node**: Before starting, the BAM and Phylogenetic pipelines check for the presence of all required files. Should one or more files be missing, and the missing file is NOT created by the pipeline itself, an error similar to the following will be raised::

    $ paleomix bam run makefile.yaml
    [...]
    Errors detected during graph construction (max 20 shown):
      Required file does not exist, and is not created by a node:
        Filename: prefix/rCRS.fasta
        Dependent node(s): [...]

This typically happens if the Makefile contains typos, or if the required files have been moved since the last time the makefile was executed. To proceed, it is necessary to determine the current location of the files in question, and/or update the makefile.
**Required executables are missing**: Before starting to execute a makefile, the pipeline will check that the requisite programs are installed, and verify that the installed versions meet the minimum requirements. Should an executable be missing, an error similar to the following will be issued, and the pipeline will not run::

    $ paleomix bam run makefile.yaml
    [...]
    Errors detected during graph construction (max 20 shown):
      Required executables are missing:
        bwa

In that case, please verify that all required programs are installed (see sections TODO) and ensure that these are accessible via the current user's PATH (i.e. can be executed on the command-line using just the executable name).

**Version requirement not met**: In addition to checking for the presence of required executables (including java JARs), the version of each program is checked. Should the version of the program not be compatible with the pipeline (e.g. because it is too old), the following error is raised::

    $ paleomix bam run makefile.yaml
    [...]
    Version requirement not met for 'Picard CreateSequenceDictionary.jar';
    please refer to the PALEOMIX documentation for more information.

        Executable:    /Users/mischu/bin/bwa
        Call:          bwa
        Version:       v0.5.7.x
        Required:      v0.5.19.x or v0.5.110.x or v0.6.2.x or at least v0.7.9.x

If so, please refer to the documentation for the pipeline in question, and install/update the program to the version required by the pipeline. Note that the executable MUST be accessible via the PATH variable. If multiple versions of a program are installed, the version required by the pipeline must come first in the PATH, which may be verified by using the "which" command::

    $ which -a bwa
    /home/username/bin/bwa
    /usr/local/bin/bwa

**Java Runtime Environment outdated / UnsupportedClassVersionError**: If the version of the Java Runtime Environment (JRE) is too old, the pipeline may fail to run with the following message::

    The version of the Java Runtime Environment on this system is too old;
    please check the requirement for the program and upgrade your version
    of Java. See the documentation for more information.

Alternatively, Java programs may fail with a message similar to the following, as reported in the pipe_*.stderr file (abbreviated)::

    Exception in thread "main" java.lang.UnsupportedClassVersionError:
    org/broadinstitute/sting/gatk/CommandLineGATK : Unsupported major.minor version 51.0
        at [...]

To solve this problem, you will need to upgrade your copy of Java.
paleomix-1.3.8/docs/troubleshooting/index.rst000066400000000000000000000006121443173665200213730ustar00rootroot00000000000000.. _troubleshooting:

Troubleshooting
===============

.. toctree::

   install.rst
   common.rst
   bam_pipeline.rst
   phylo_pipeline.rst
   zonkey_pipeline.rst

For troubleshooting of individual pipelines, please see the BAM pipeline :ref:`troubleshooting_bam` section, the Phylo pipeline :ref:`troubleshooting_phylo` section, and the Zonkey pipeline :ref:`troubleshooting_zonkey` section.
paleomix-1.3.8/docs/troubleshooting/install.rst000066400000000000000000000023151443173665200217340ustar00rootroot00000000000000.. highlight:: Bash

.. _troubleshooting_install:

Troubleshooting the installation
=================================

**Pysam / Cython installation fails with "Python.h: No such file or directory" or "pyconfig.h: No such file or directory"**: Installation of Pysam and Cython requires that Python development files are installed.
On Debian based distributions, for example, this may be accomplished by running the following command::

    $ sudo apt-get install python-dev

**Pysam installation fails with "zlib.h: No such file or directory"**: Installation of Pysam requires that "libz" development files are installed. On Debian based distributions, for example, this may be accomplished by running the following command::

    $ sudo apt-get install libz-dev

**Command not found when attempting to run 'paleomix'**: By default, the PALEOMIX executables ('paleomix', etc.) are installed in ~/.local/bin. You must ensure that this path is included in your PATH::

    $ export PATH=$PATH:~/.local/bin

To automatically apply this setting on subsequent logins (assuming that you are using Bash), run the following command::

    $ echo "export PATH=\$PATH:~/.local/bin" >> ~/.bash_profile
paleomix-1.3.8/docs/troubleshooting/phylo_pipeline.rst000066400000000000000000000003131443173665200233020ustar00rootroot00000000000000.. _troubleshooting_phylo:

Troubleshooting the Phylogenetic Pipeline
=========================================

TODO

TODO: MaxDepth not found in depth files

.. --target must match name used in makefile
paleomix-1.3.8/docs/troubleshooting/zonkey_pipeline.rst000066400000000000000000000001521443173665200234670ustar00rootroot00000000000000.. _troubleshooting_zonkey:

Troubleshooting the Zonkey Pipeline
===================================

TODO
paleomix-1.3.8/docs/yaml.rst000066400000000000000000000143311443173665200160020ustar00rootroot00000000000000.. highlight:: YAML

.. _yaml_intro:

YAML usage in PALEOMIX
======================

`YAML`_ is a simple markup language adopted for use in configuration files by the pipelines included in PALEOMIX. YAML was chosen because it is a plain-text format that is easy to read and write by hand. Since YAML files are plain-text, they may be edited using any standard text editor, with the following caveats:

* YAML exclusively uses spaces for indentation, not tabs; attempting to use tabs in YAML files will cause failures when the file is read by the pipelines.

* YAML is case-sensitive; an option such as `QualityOffset` is not the same as `qualityoffset`.

* It is strongly recommended that all files be named using the `.yaml` file-extension; setting the extension helps ensure proper handling by editors that natively support the YAML format.

Only a subset of YAML features are actually used by PALEOMIX, which are described below. These include **mappings**, by which values are identified by names; **lists** of values; and **numbers**, **text-strings**, and **true** / **false** values, typically representing program options, file-paths, and the like. In addition, comments prefixed by the hash-sign (`#`) are frequently used to provide documentation.

Comments
--------

Comments are specified by prefixing unquoted text with the hash-sign (`#`); all comments are ignored, and have no effect on the operation of the program. Comments are used solely to document the YAML files used by the pipelines::

    # This is a comment; the next line contains both a value and a comment:
    123  # Comments may be placed on the same line as values.

For the purpose of PALEOMIX reading this YAML code, the above is equivalent to the following YAML code::

    123

As noted above, this only applies to unquoted text, and the following is therefore not a comment, but rather a text-string::

    "# this is not a comment"

Comments are used in the following sections to provide context.
Numbers (integers and floats)
-----------------------------

Numbers in YAML files include whole numbers (integers) as well as real numbers (floating point numbers). Numbers are mostly used for program options, such as a minimum read length, and typically involve whole numbers, but a few options do involve real numbers. Numbers may be written as follows::

    # This is an integer:
    123

    # This is a float:
    123.5

    # This is a float written using scientific notation:
    1.235e2

Truth-values (booleans)
-----------------------

Truth values (*true* and *false*) are frequently used to enable or disable options in PALEOMIX configuration files. Several synonyms are available, which help improve readability. More specifically, all of the following values are interpreted as *true* by the pipelines::

    true
    yes
    on

And similarly, the following values are all interpreted as *false*::

    false
    no
    off

Template files included with the pipelines mostly use `yes` and `no`, but any of the above corresponding values may be used. Note however that none of these values are quoted: If single or double-quotations were used, then these values would be read as text rather than truth-values, as described next.

Text (strings)
--------------

Text, or strings, is the most commonly used type of value in the PALEOMIX YAML files, being used both for labels and for option values, including paths to files to use in an analysis::

    "Example"
    "This is a longer string"
    'This is also a string'
    "/path/to/my/files/reads.fastq"

For the most part it is not necessary to use quotation marks, and the above could instead be written as follows::

    Example
    This is a longer string
    This is also a string
    /path/to/my/files/reads.fastq

However, it is important to make sure that values that are intended to be used as strings are not misinterpreted as a different type of value. For example, without the quotation marks the following values would be interpreted as numbers or truth-values::

    "true"
    "20090212"
    "17e13"

Mappings
--------

Mappings associate a value with a label (key), and are used for the majority of options. A mapping is simply a label followed by a colon, and then the value associated with that label::

    MinimumQuality: 17
    EnableFoo: no
    NameOfTest: "test 17"

In PALEOMIX configuration files, labels are always strings, and are normally not quoted. However, in some cases, such as when using numerical labels, it may be useful to quote the labels::

    "A Label": on
    "12032016": "CPT"

Sections (mappings in mappings)
-------------------------------

In addition to mapping to a single value, a mapping may also itself contain one or more mappings::

    Top level:
      Second level: 'a value'
      Another value: true

Mappings can be nested any number of times, which is used in this manner to create sections and sub-sections in configuration files, grouping related options together::

    Options:
      Options for program:
        Option1: yes
        Option2: 17

      Another program:
        Option1: /path/to/file.fastq
        Option2: no

Note that the two mappings belonging to the `Options` mapping are both indented the same number of spaces, which is what allows the program to figure out which values belong to what label. It is therefore important to keep indentation consistent.

Lists of values
---------------

In some cases, it is possible to specify zero or more values with labels.
This is accomplished using lists, which consist of values prefixed with a dash::

    Section:
      - First value
      - Second value
      - Third value

Note that the indentation of each item must be the same, similar to how the indentation of sub-sections must be the same (see above).

Full example
------------

The following showcases the basic structure of a YAML document, as used by the pipelines::

    # This is a comment; this line is completely ignored
    This is a section:
      This is a subsection:
        # This subsection contains 3 label / value pairs:
        First label: "First value"
        Second label: 2
        Third label: 3.14

      This is just another label: "Value!"

    This is a section containing a list:
      - The first item
      - The second item

.. _YAML: http://www.yaml.org
paleomix-1.3.8/docs/zonkey_pipeline/000077500000000000000000000000001443173665200175105ustar00rootroot00000000000000
paleomix-1.3.8/docs/zonkey_pipeline/configuration.rst000066400000000000000000000045701443173665200231170ustar00rootroot00000000000000.. highlight:: ini

.. _zonkey_configuration:

Configuring the Zonkey pipeline
===============================

Unlike the :ref:`bam_pipeline` and the :ref:`phylo_pipeline`, the :ref:`zonkey_pipeline` does not make use of makefiles. However, the pipeline does expose a number of options, including the maximum number of threads used, various program parameters, and more. These may be set using the corresponding command-line options (e.g. --max-threads to set the maximum number of threads used during runtime). However, it is also possible to set default values for such options, including on a per-host basis. This is accomplished by executing the following command, in order to generate a configuration file at ~/.paleomix/zonkey.ini:

.. code-block:: bash

    $ paleomix zonkey --write-config

The resulting file contains a list of options which can be overwritten::

    [Defaults]
    max_threads = 1
    log_level = warning
    treemix_k = 0
    admixture_replicates = 1
    ui_colors = on
    downsample_to = 1000000

These values will be used by the pipeline, unless the corresponding option is also supplied on the command-line. I.e. if "max_threads" is set to 4 in the "zonkey.ini" file, but the pipeline is run using "paleomix zonkey run --max-threads 10", then the max threads value is set to 10.

.. note::
    Options in the configuration file correspond directly to command-line options for the Zonkey pipeline, with two significant differences: The leading dashes (--) are removed and any remaining dashes are changed to underscores (_); as an example, the command-line option --max-threads becomes max\_threads in the configuration file, as shown above.

It is furthermore possible to set specific options depending on the current host-name. Assuming that the pipeline was run on multiple servers sharing a single home directory, one might set the maximum number of threads on a per-server basis as follows::

    [Defaults]
    max_threads = 32

    [BigServer]
    max_threads = 64

    [SmallServer]
    max_threads = 16

The names used (here "BigServer" and "SmallServer") should correspond to the hostname, i.e. the value returned by the "hostname" command:

.. code-block:: bash

    $ hostname
    BigServer

Any value set in the section matching the name of the current host will take precedence over the 'Defaults' section, but can still be overridden by specifying the same option on the command-line, as described above.
paleomix-1.3.8/docs/zonkey_pipeline/filestructure.rst000066400000000000000000000067511443173665200231520ustar00rootroot00000000000000.. highlight:: Bash
.. _zonkey_filestructure:

File structure
==============

The following section explains the file structure for results generated by the Zonkey pipeline, based on the results generated when analyzing the example files included with the pipeline (see :ref:`examples_zonkey`).

Single sample analysis
----------------------

The following is based on running case 4a, as described in the :ref:`zonkey_usage` section. More specifically, the example in which the analyses are carried out on a BAM alignment file containing both nuclear and mitochondrial alignments::

    # Case 4a: Analyses both nuclear and mitochondrial genome; results are placed in 'combined.zonkey'
    $ paleomix zonkey run database.tar combined.bam

As noted in the comment, executing this command places the results in the directory 'combined.zonkey'. For a completed analysis, the results directory is expected to contain a (HTML) report and a directory containing each of the figures generated by the pipeline:

* report.css
* report.html
* figures/

The report may be opened with any modern browser. Each figure displayed in the report is also available as a PDF file, accessed by clicking on a given figure in the report, or directly in the figures/ sub-directory.

Analysis result files
^^^^^^^^^^^^^^^^^^^^^

In addition, the following directories are generated by the analytical steps, and contain the various files used by or generated by the programs run as part of the Zonkey pipeline:

* admixture/
* mitochondria/
* pca/
* plink/
* treemix/

In general, files in these directories are sorted by the prefix 'incl\_ts' and the prefix 'excl\_ts', which indicate that sites containing transitions (C<->T and G<->A) have been included or excluded from the analyses, respectively. For a detailed description of the files generated by each analysis, please refer to the documentation for the respective programs used in said analyses.

Additionally, the results directory is expected to contain a 'temp' directory. This directory may safely be removed following the completion of a Zonkey run, but should be empty unless one or more analytical steps have failed.

Multi-sample analysis
---------------------

When multiple samples are processed at once, as described in case 5 (:ref:`zonkey_usage`), results are written to a single, shared results directory. This directory will contain a summary report for all samples, as well as a sub-directory for each sample listed in the table of samples provided when running the pipeline. Thus, for the samples table shown in case 5::

    $ cat my_samples.txt
    example1 combined.bam
    example2 nuclear.bam
    example3 mitochondrial.bam
    example4 nuclear.bam mitochondrial.bam

    # Case 5a: Analyses the four samples; results are placed in 'my_samples.zonkey'
    $ paleomix zonkey run database.tar my_samples.txt

The results directory is expected to contain the following files and directories:

* summary.html
* summary.css
* example1/
* example2/
* example3/
* example4/

The summary report may be opened with any modern browser, and offers a quick overview of all samples processed as part of this analysis. The individual report for each sample may furthermore be accessed by clicking on the header corresponding to the name of a given sample.

The per-sample directories correspond exactly to the result directories that would have been generated if each sample was processed by itself (see above), except that only a single 'temp' directory, located in the root of the results directory, is used.
paleomix-1.3.8/docs/zonkey_pipeline/index.rst000066400000000000000000000030641443173665200213540ustar00rootroot00000000000000.. _zonkey_pipeline:

Zonkey Pipeline
===============

**Table of Contents:**

.. toctree::

   overview.rst
   requirements.rst
   configuration.rst
   usage.rst
   panel.rst
   filestructure.rst

The Zonkey Pipeline is an easy-to-use pipeline designed for the analyses of low-coverage, ancient DNA derived from historical equid samples, with the purpose of determining the species of the sample, as well as determining possible hybridization between horses, zebras, and asses (see :ref:`zonkey_usage`). This is accomplished by comparing one or more samples aligned against the *Equus caballus* 2.0 reference sequence with a reference panel of modern equids, including wild and domesticated equids. The reference panel is further described in the :ref:`zonkey_panel` section.

The Zonkey pipeline has been published in Journal of Archaeological Science; if you make use of this pipeline in your work, then please cite

  Schubert M, Mashkour M, Gaunitz C, Fages A, Seguin-Orlando A, Sheikhi S, Alfarhan AH, Alquraishi SA, Al-Rasheid KAS, Chuang R, Ermini L, Gamba C, Weinstock J, Vedat O, and Orlando L. "**Zonkey: A simple, accurate and sensitive pipeline to genetically identify equine F1-hybrids in archaeological assemblages**". Journal of Archaeological Science. 2017 Feb; 78:147-157. doi: `10.1016/j.jas.2016.12.005 <https://doi.org/10.1016/j.jas.2016.12.005>`_.

The sequencing data used in the Zonkey publication is available on `ENA`_ under the accession number `PRJEB15037`_.

.. _ENA: https://www.ebi.ac.uk/ena/
.. _PRJEB15037: https://www.ebi.ac.uk/ena/data/view/PRJEB15037
paleomix-1.3.8/docs/zonkey_pipeline/overview.rst000066400000000000000000000104611443173665200221120ustar00rootroot00000000000000Overview of analytical steps
============================

Briefly, the Zonkey pipeline can run admixture tests on pre-defined species categories (asses, horses, and zebras) to evaluate the ancestry proportions found in the test samples. F1-hybrids are expected to show a balanced mixture of two species ancestries, although this balance can deviate from the 50:50 expectation in cases where limited genetic information is available. This is accomplished using ADMIXTURE [Alexander2009]_.

The Zonkey pipeline additionally builds maximum likelihood phylogenetic trees, using RAxML [Stamatakis2006]_ for mitochondrial sequence data and using TreeMix [Pickrell2012]_ for autosomal data. In the latter case, phylogenetic affinities are reconstructed twice: first considering no migration edges and secondly allowing for one migration edge. This allows for fine-grained testing of admixture between the sample and any of the species represented in the reference panel.

In cases where an admixture signal is found, the location of the sample in the mitochondrial tree allows for the identification of the maternal species contributing to the hybrid being examined. For equids, this is essential to distinguish between the possible hybrid forms, such as distinguishing between mules (|female| horse x |male| donkey F1-hybrid) and hinnies (|male| horse x |female| donkey F1-hybrid).

Analyses are presented in HTML reports, one per sample and one summary report when analyzing multiple samples. Figures are generated in both PNG and PDF format in order to facilitate use in publications (see :ref:`zonkey_filestructure`).

Individual analytical steps
---------------------------

During a typical analysis, the Zonkey pipeline will proceed through the following major analytical steps:
1. Analyzing nuclear alignments:

   1. Input BAMs are indexed using the equivalent of 'samtools index'.

   2. Nucleotides at sites overlapping SNPs in the reference panel are sampled to produce a pseudo-haploid sequence, one in which transitions are included and one in which transitions are excluded, in order to account for the presence of *post-mortem* deamination causing base substitutions. The resulting tables are processed using PLINK to generate the prerequisite files for further analyses.

   3. PCA plots are generated using SmartPCA from the EIGENSOFT suite of tools for both panels of SNPs (including and excluding transitions).

   4. Admixture estimates are carried out using ADMIXTURE, with a partially supervised approach, by assigning each sample in the reference panel to one of either two groups (caballine and non-caballine equids) or three groups (asses, horses, and zebras), and processing the SNP panels including and excluding transitions. The input sample is not assigned to a group.

   5. Migration edges are modeled using TreeMix, assuming either 0 or 1 migration edge; analyses are carried out on both the SNP panel including transitions and on the SNP panel excluding transitions.

   6. PNG and PDF figures are generated for each analytical step; in addition, the per-chromosome coverage of the nuclear genome is plotted.

2. Analyzing mitochondrial alignments:

   1. Input BAMs are indexed using the equivalent of 'samtools index'.

   2. The majority nucleotide at each position in the BAM is determined, and the resulting sequence is added to the mitochondrial reference multiple sequence alignment included in the reference panel.

   3. A maximum likelihood phylogeny is inferred using RAxML, and the resulting tree is drawn, rooted on the midpoint of the phylogeny.

3. Generating reports and summaries:

   1. An HTML report is generated for each sample, summarizing the data used and presenting (graphically) the results of each analysis carried out above. All figures are available as PNG and PDF (each figure in the report links to its PDF equivalent).

   2. If multiple samples were processed, a summary of all samples is generated, which presents the major results in an abbreviated form.

.. note::
    While the above shows an ordered list of steps, the pipeline may interleave individual steps during runtime, and may execute multiple steps in parallel when running in multi-threaded mode (see :ref:`zonkey_usage` for how to run the Zonkey pipeline using multiple threads).

.. |male| unicode:: U+02642 .. MALE
.. |female| unicode:: U+02640 .. FEMALE
paleomix-1.3.8/docs/zonkey_pipeline/panel.rst000066400000000000000000000361211443173665200213440ustar00rootroot00000000000000.. _zonkey_panel:

Reference Panel
===============

The :ref:`zonkey_pipeline` operates using a reference panel of SNPs generated from a selection of extant equid species, including the domestic horse (Equus caballus) and the Przewalski's wild horse (Equus ferus przewalski); within African asses, the domestic donkey (Equus asinus) and the Somali wild ass (Equus africanus); within Asian asses, the onager (Equus hemionus) and the Tibetan kiang (Equus kiang); and, within zebras, the plains zebra (Equus quagga), the mountain zebra (Equus hartmannae), and the Grevy's zebra (Equus grevyi).

These samples were obtained from [Orlando2013]_, [DerSarkissian2015]_, and in particular from [Jonsson2014]_, which published genomes of every remaining extant equid species.
The reference panel has been generated using alignments against the Equus caballus reference nuclear genome (equCab2, via `UCSC`_) and the horse mitochondrial genome (NC\_001640.1, via `NCBI`_). The exact samples used to create the latest version of the reference panel are described below.


Obtaining the reference panel
-----------------------------

The latest version of the Zonkey reference panel (dated 2016-11-01) may be downloaded via the following website:

https://github.com/MikkelSchubert/zonkey/releases/

Once this reference panel has been downloaded, it is strongly recommended that you decompress it using the 'bunzip2' command, since this speeds up several analytical steps (at the cost of about 600 MB of additional disk usage). To decompress the reference panel, simply run 'bunzip2' on the file, as shown here:

.. code-block:: bash

    $ bunzip2 database.tar.bz2

.. warning::
    Do not untar the reference panel. The Zonkey pipeline currently expects data files to be stored in a tar archive, and will not work if files have been extracted into a folder. This may change in the future.

Once this has been done, the Zonkey pipeline may be used as described in the :ref:`zonkey_usage` section.


Samples used in the reference panel
-----------------------------------

The following samples have been used in the construction of the latest version of the reference panel:

====== =================== ====== =========== =============================
Group  Species             Sex    Sample Name Publication
====== =================== ====== =========== =============================
Horses *E. caballus*       Male   FM1798      doi:10.1016/j.cub.2015.08.032
.      *E. przewalskii*    Male   SB281       doi:10.1016/j.cub.2015.08.032
Asses  *E. a. asinus*      Male   Willy       doi:10.1038/nature12323
.      *E. kiang*          Female KIA         doi:10.1073/pnas.1412627111
.      *E. h. onager*      Male   ONA         doi:10.1073/pnas.1412627111
.      *E. a. somaliensis* Female SOM         doi:10.1073/pnas.1412627111
Zebras *E. q. boehmi*      Female BOE         doi:10.1073/pnas.1412627111
.      *E. grevyi*         Female GRE         doi:10.1073/pnas.1412627111
.      *E. z. hartmannae*  Female HAR         doi:10.1073/pnas.1412627111
====== =================== ====== =========== =============================


Constructing a reference panel
==============================

The following section describes the format used for the reference panel in Zonkey. It is intended for people who are interested in constructing their own reference panels for a set of species.

.. warning::
    At the time of writing, the number of ancestral groups is hardcoded to 2 and 3 groups; support for any number of ancestral groups is planned. Contact me if this is something you need, and I'll prioritize adding this to the Zonkey pipeline.

It is important to note that a reference panel is created relative to a single reference genome. For example, for the equine reference panel, all alignments and positions are listed relative to the EquCab2.0 reference genome.

The reference panel consists of a number of files, which are described below:


settings.yaml
-------------

The settings file is a simple YAML-markup file, which specifies global options that apply to the reference panel. The current settings file looks as follows:

.. code-block:: yaml

    # Database format; is incremented when the format changes
    Format: 1

    # Revision number; is incremented when the database (but not format) changes
    Revision: 20161101

    # Arguments passed to plink
    Plink: "--horse"

    # Number of chromosomes; required for e.g. PCA analyses
    NChroms: 31

    # N bases of padding used for mitochondrial sequences; the last N bases are
    # expected to be the same as the first N bases, in order to allow alignments
    # at this region of the genome, and are combined to generate final consensus.
    MitoPadding: 31

    # The minimum distance between SNPs, assuming an even distribution of SNPs
    # across the genome. Used when --treemix-k is set to 'auto', which is the
    # default behavior. Value from McCue 2012 (doi:10.1371/journal.pgen.1002451).
    SNPDistance: 150000
The *Format* option defines the panel format, and reflects the version of the Zonkey pipeline that supports this panel. It should therefore not be changed unless the format, as described on this page, is changed.

The *Revision* option reflects the version of a specific reference panel, and should be updated every time data or settings in the reference panel are changed. The equid reference panel simply uses the date at which a given version was created as the revision number.

The *Plink* option lists specific options passed to plink. In the above, this includes just the '--horse' option, which specifies the number of chromosomes expected for the horse genome and for data aligned against the horse genome.

The *NChroms* option specifies the number of autosomal chromosomes for the reference genome used to construct the reference panel. This is required for running PCA, but will likely be removed in the future (it is redundant due to contigs.txt).

The *MitoPadding* option is used for the mitochondrial reference sequences, and specifies that some number of the bases at the end of the sequences are identical to the first bases in the sequence. Such duplication (or padding) is used to enable alignments spanning the break introduced when representing a circular genome as a FASTA sequence. If no such padding has been used, then this may simply be set to 0.

The *SNPDistance* option is used to calculate the number of SNPs per block when the --treemix-k option is set to 'auto' (the default behavior). This option assumes that SNPs are evenly distributed across the genome, and calculates the block size based on the number of SNPs covered for a given sample.


contigs.txt
-----------

The 'contigs.txt' file contains a table describing the chromosomes included in the Zonkey analyses:

.. code-block:: text

    ID  Size       Checksum  Ns
    1   185838109  NA        2276254
    2   120857687  NA        1900145
    3   119479920  NA        1375010
    4   108569075  NA        1172002
    5   99680356   NA        1937819
    X   124114077  NA        2499591

The *ID* column specifies the name of the chromosome. Note that these names are expected to be either numerical (i.e. 1, 2, 21, 31) or sex chromosomes (X or Y).

The *Size* column must correspond to the length of the chromosome in the reference genome. The *Ns* column, on the other hand, allows for the number of uncalled bases in the reference to be specified. This value is subtracted from the chromosome size when calculating the relative coverage for sex determination.

The *Checksum* column should contain the MD5 sum calculated for the reference sequence, or 'NA' if not available. If specified, this value is intended to be compared with the MD5 sums listed in the headers of BAM files analyzed by the Zonkey pipeline, to ensure that the correct reference sequence is used.

.. note::
    This checksum check is currently not supported, but will be added soon.

.. note::
    The mitochondrial genome is not included in this table; only list the autosomes to be analyzed.
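For convenience, the values in this table can be derived mechanically from the reference genome. The following is a minimal Python sketch (not part of PALEOMIX); the input filename 'equCab2.fasta', the set of contigs to keep, and the use of MD5 sums of the upper-case sequence (mirroring the 'M5' checksums found in BAM headers) are assumptions made for this example:

.. code-block:: python

    import hashlib

    def contig_rows(filename, keep=("1", "2", "3", "4", "5", "X")):
        name = None
        with open(filename) as handle:
            for line in handle:
                line = line.strip()
                if line.startswith(">"):
                    if name in keep:
                        yield (name, size, md5.hexdigest(), ns)
                    # Contig names are taken as the first word after the '>'
                    name = line[1:].split()[0]
                    size = ns = 0
                    md5 = hashlib.md5()
                elif name is not None:
                    sequence = line.upper()
                    size += len(sequence)
                    ns += sequence.count("N")
                    md5.update(sequence.encode("utf-8"))
        if name in keep:
            yield (name, size, md5.hexdigest(), ns)

    print("ID\tSize\tChecksum\tNs")
    for row in contig_rows("equCab2.fasta"):
        print("%s\t%i\t%s\t%i" % row)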
samples.txt
-----------

The 'samples.txt' table should contain a list of all samples included in the reference panel, and provides various information about these, the most important of which is the ancestral groups to which a given sample belongs:

.. code-block:: text

    ID    Group(3)  Group(2)      Species         Sex     SampleID  Publication
    ZBoe  Zebra     NonCaballine  E. q. boehmi    Female  BOE       doi:10.1073/pnas.1412627111
    AOna  Ass       NonCaballine  E. h. onager    Male    ONA       doi:10.1073/pnas.1412627111
    HPrz  Horse     Caballine     E. przewalskii  Male    SB281     doi:10.1016/j.cub.2015.08.032

The *ID* column is used as the name of the sample in the text, tables, and figures generated when running the Zonkey pipeline. It is advised to keep this name short and preferably descriptive of the group to which the sample belongs.

The *Group(2)* and *Group(3)* columns specify the ancestral groups to which the sample belongs, when considering either 2 or 3 ancestral groups. Note that Zonkey currently only supports 2 and 3 ancestral groups (see above).

The *Species*, *Sex*, *SampleID*, and *Publication* columns are meant to contain extra information about the samples, used in the reports generated by the Zonkey pipeline, and are not used directly by the pipeline.


mitochondria.fasta
------------------

The 'mitochondria.fasta' file is expected to contain a multi-sequence alignment involving two different sets of sequences. Firstly, it must contain one or more reference sequences against which the input mitochondrial alignments have been carried out. In addition, it should contain at least one sequence per species in the reference panel.

Zonkey will compare the reference sequences (whether or not subtracting the amount of padding specified in the 'settings.yaml' file) against the contigs in the input BAM, in order to identify mitochondrial sequences. The Zonkey pipeline then uses the alignment of the reference sequence identified to place the sample into the multi-sequence alignment.

By default, all sequences in the 'mitochondria.fasta' file are included in the mitochondrial phylogeny. However, reference sequences can be excluded by adding an 'EXCLUDE' label after the sequence name:

.. code-block:: text

    >5835107Eq_mito3 EXCLUDE
    gttaatgtagcttaataatat-aaagcaaggcactgaaaatgcctagatgagtattctta

Sequences thus marked are not used for the phylogenetic inference itself.


simulations.txt
---------------

The 'simulations.txt' file contains the results of analyzing simulated data sets, in order to generate an empirical distribution of deviations from the expected admixture values.

.. code-block:: text

    NReads  K  Sample1    Sample2       HasTS  Percentile  Value
    1000    2  Caballine  NonCaballine  FALSE  0.000       7.000000e-06
    1000    2  Caballine  NonCaballine  FALSE  0.001       1.973480e-04
    1000    2  Caballine  NonCaballine  FALSE  0.002       2.683880e-04
    1000    2  Caballine  NonCaballine  FALSE  0.003       3.759840e-04
    1000    2  Caballine  NonCaballine  FALSE  0.004       4.595720e-04
    1000    2  Caballine  NonCaballine  FALSE  0.005       5.518900e-04
    1000    2  Caballine  NonCaballine  FALSE  0.006       6.591180e-04

The *NReads* column specifies the number of sequence alignments used in the simulated sample (e.g. 1000, 10000, 100000, and 1000000). Zonkey uses these simulations for different numbers of reads to establish lower and upper bounds on the empirical p-values when running Zonkey: the lower bound is selected using the largest *NReads* less than or equal to the number of reads analyzed, and the upper bound is selected using the smallest *NReads* greater than or equal to the number of reads analyzed.

The *K* column lists the number of ancestral groups specified when the sample was analyzed; in the equine reference panel, this is either 2 or 3.

The *Sample1* and *Sample2* columns list the two ancestral groups from which the synthetic hybrid was produced. The order in which these are listed does not matter.

The *HasTS* column specifies whether transitions were included (TRUE) or excluded (FALSE).

The *Percentile* column specifies the percentage of simulations with a *Value* less than or equal to the current *Value*. The *Value* column lists the absolute observed deviation from the expected admixture proportion (i.e. 0.5).

There is currently no way to generate this table automatically, but some support for doing so is planned. Note also that Zonkey can be run using a hidden option '--admixture-only', which skips all analyses but those required in order to run ADMIXTURE on the data, and thereby makes it trivial to run ADMIXTURE exactly as it would be run by Zonkey. For example:

.. code-block:: bash

    $ paleomix zonkey run --admixture-only database.tar simulation.bam
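To illustrate how these bounds might be used, the following minimal sketch (not part of Zonkey) looks up the empirical percentile of an observed admixture deviation in a 'simulations.txt'-style table; the exact lookup performed by Zonkey may differ, and filtering on the *Sample1* / *Sample2* columns is omitted for brevity:

.. code-block:: python

    import csv

    def empirical_percentiles(filename, nreads, k, has_ts, observed):
        # Collect (NReads, Percentile, Value) rows matching K and HasTS
        rows = []
        with open(filename) as handle:
            for row in csv.DictReader(handle, delimiter="\t"):
                if int(row["K"]) == k and (row["HasTS"] == "TRUE") == has_ts:
                    rows.append((int(row["NReads"]),
                                 float(row["Percentile"]),
                                 float(row["Value"])))

        # Bracket the observed number of reads between two NReads bins
        sizes = sorted({size for size, _, _ in rows})
        lower = max((s for s in sizes if s <= nreads), default=min(sizes))
        upper = min((s for s in sizes if s >= nreads), default=max(sizes))

        # For each bin, find the largest percentile with Value <= observed
        bounds = {}
        for bin_size in (lower, upper):
            hits = [p for s, p, v in rows if s == bin_size and v <= observed]
            bounds[bin_size] = max(hits, default=0.0)
        return bounds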
genotypes.txt
-------------

The 'genotypes.txt' file contains a table of heterozygous sites relative to the reference sequence used for the reference panel.

.. warning::
    Columns in the 'genotypes.txt' file are expected to be in the exact order shown below.

.. code-block:: text

    Chrom  Pos   Ref  AAsi;AKia;AOna;ASom;HCab;HPrz;ZBoe;ZGre;ZHar
    1      1094  A    CAACAAAAA
    1      1102  G    AGGAGGGGG
    1      1114  A    AAAAAAAGA
    1      1126  C    CCCCCCCYC
    1      1128  C    CCCCCCCGC
    1      1674  T    GGGGTTGGG
    1      1675  G    GCCGGGGGG

The *Chrom* column is expected to contain only those contigs / chromosomes listed in the 'contigs.txt' file; the *Pos* column contains the 1-based positions of the variable sites relative to the reference sequence.

The *Ref* column contains the nucleotide observed in the reference sequence at the current position; it is currently not used, and may be removed in future versions of Zonkey.

The final column contains the nucleotides observed for every sample named in 'samples.txt': the header lists the sample names joined by semi-colons, and each row contains a single-letter nucleotide for each of these samples, encoded using IUPAC codes (i.e. A equals AA, W equals AT). The equine reference panel does not include sites that were not called in every sample, but including such sites is possible by setting the nucleotide to 'N' for the sample with missing data.


Packaging the files
-------------------

The reference panel is distributed as a tar archive. For best performance, the files should be laid out so that the genotypes.txt file is the last file in the archive. This may be accomplished with the following command:

.. code-block:: bash

    $ tar cvf database.tar settings.yaml contigs.txt samples.txt mitochondria.fasta simulations.txt examples genotypes.txt

The tar file may be compressed for distribution (bzip2 or gzip), but should be used uncompressed for best performance.

.. _NCBI: https://www.ncbi.nlm.nih.gov/nuccore/5835107
.. _UCSC: https://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Horse&db=0
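The layout requirements above are easy to get wrong when rebuilding a panel, so it may be helpful to verify the finished archive programmatically. A minimal sketch (not part of Zonkey), assuming the archive is named 'database.tar':

.. code-block:: python

    import tarfile

    REQUIRED = {"settings.yaml", "contigs.txt", "samples.txt",
                "mitochondria.fasta", "simulations.txt", "genotypes.txt"}

    with tarfile.open("database.tar") as archive:
        # Member order matters: genotypes.txt should come last
        names = [m.name for m in archive.getmembers() if m.isfile()]
        missing = REQUIRED - set(names)
        if missing:
            raise SystemExit("missing files: %s" % ", ".join(sorted(missing)))
        elif names[-1] != "genotypes.txt":
            raise SystemExit("genotypes.txt should be last for best performance")
        print("database.tar looks OK")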
paleomix-1.3.8/docs/zonkey_pipeline/requirements.rst

.. highlight:: bash

.. _zonkey_requirements:

Software Requirements
=====================

In addition to the requirements listed for the PALEOMIX pipeline itself in the :ref:`installation` section, the Zonkey pipeline requires that other pieces of software be installed:

* RScript from the `R`_ package, v3.3.3
* SmartPCA from the `EIGENSOFT`_ package, v13050+ [Patterson2006]_, [Price2006]_
* `ADMIXTURE`_ v1.23 [Alexander2009]_
* `PLINK`_ v1.07 [Chang2015]_
* `RAxML`_ v8.2.9 [Stamatakis2006]_
* `SAMTools`_ v1.3.1 [Li2009b]_
* `TreeMix`_ v1.12 [Pickrell2012]_

The following R packages are required in order to carry out the plotting:

* `RColorBrewer`_
* `ape`_ [Paradis2004]_
* `ggplot2`_ [Wickham2009]_
* `ggrepel`_
* `reshape2`_ [Wickham2007]_

The R packages may be installed using the following commands::

    $ R
    > install.packages(c('RColorBrewer', 'ape', 'ggrepel', 'ggplot2', 'reshape2'))


Zonkey reference panel
----------------------

Running the Zonkey pipeline requires a reference panel containing the information needed for hybrid identification. A detailed description of the reference panel, and instructions for where to download the latest version, can be found in the :ref:`zonkey_panel` section.


Testing the pipeline
--------------------

An example project is included with the Zonkey pipeline, and it is recommended to run this project in order to verify that the pipeline and required applications have been correctly installed. See the :ref:`examples_zonkey` section for a description of how to run this example project.

.. _ADMIXTURE: https://www.genetics.ucla.edu/software/admixture/
.. _EIGENSOFT: http://www.hsph.harvard.edu/alkes-price/software/
.. _PLINK: https://www.cog-genomics.org/plink2
.. _R: http://www.r-base.org/
.. _RAxML: https://github.com/stamatak/standard-RAxML
.. _RColorBrewer: https://cran.r-project.org/web/packages/RColorBrewer/index.html
.. _SAMTools: https://samtools.github.io
.. _TreeMix: http://pritchardlab.stanford.edu/software.html
.. _ape: https://cran.r-project.org/web/packages/ape/index.html
.. _ggrepel: https://cran.r-project.org/web/packages/ggrepel/index.html
.. _ggplot2: https://cran.r-project.org/web/packages/ggplot2/index.html
.. _reshape2: https://cran.r-project.org/web/packages/reshape2/index.html
.. _Brew package manager: http://www.brew.sh
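As a quick sanity check before running the pipeline, the following sketch (not part of PALEOMIX) verifies that the required executables can be found on the current PATH; the executable names shown here are typical, but may differ depending on how each package was installed:

.. code-block:: python

    import shutil

    # Typical executable names; adjust these to match your installation
    TOOLS = ("Rscript", "smartpca", "admixture", "plink",
             "raxmlHPC", "samtools", "treemix")

    for tool in TOOLS:
        print("OK     " if shutil.which(tool) else "MISSING", tool)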
paleomix-1.3.8/docs/zonkey_pipeline/usage.rst

.. highlight:: Yaml

.. _zonkey_usage:

Pipeline usage
==============

The Zonkey pipeline is run on the command-line using the command 'paleomix zonkey', which gives access to a handful of commands:

.. code-block:: bash

    $ paleomix zonkey
    USAGE:
    paleomix zonkey run <panel> <sample.bam> [<destination>]
    paleomix zonkey run <panel> <nuclear.bam> <mitochondrial.bam> <destination>
    paleomix zonkey run <panel> <samples.txt> [<destination>]
    paleomix zonkey dryrun <panel> [...]
    paleomix zonkey mito <panel> <destination>

Briefly, it is possible to run the pipeline on a single sample by specifying the location of `BAM alignments`_ against the Equus caballus reference nuclear genome (equCab2, see `UCSC`_), and / or against the horse mitochondrial genome (using either the standard mitochondrial sequence NC\_001640.1, see `NCBI`_, or a mitochondrial genome of one of the samples included in the reference panel, as described below). The individual commands allow for different combinations of alignment strategies:

**paleomix zonkey run <panel> <sample.bam> [<destination>]**
    Runs the Zonkey pipeline on a single BAM alignment <sample.bam>, which is expected to contain a nuclear and / or a mitochondrial alignment. If <destination> is specified, a directory at that location is created, and the resulting output saved there. If <destination> is not specified, the default location is chosen by replacing the file-extension of the alignment file (typically '.bam') with '.zonkey'.

**paleomix zonkey run <panel> <nuclear.bam> <mitochondrial.bam> <destination>**
    This command allows for the combined analyses of the nuclear and mitochondrial genomes, in cases where these alignments have been carried out separately. In this case, specifying a <destination> location is mandatory.

**paleomix zonkey run <panel> <samples.txt> [<destination>]**
    It is possible to run the pipeline on multiple samples at once, by specifying a list of BAM files (here <samples.txt>), which lists a sample name and one or two BAM files per line, with columns separated by tabs. A destination may (optionally) be specified, as when specifying a single BAM file (see above).

**paleomix zonkey dryrun <panel> [...]**
    The 'dryrun' command is equivalent to the 'run' command, but does not actually carry out the analytical steps; this command is useful to test for problems before executing the pipeline, such as missing or outdated software requirements (see :ref:`zonkey_requirements`).

**paleomix zonkey mito <panel> <destination>**
    The 'mito' command is included to create a :ref:`bam_pipeline` project template for mapping FASTQ reads against the mitochondrial genomes of the samples included in the Zonkey reference panel (see Prerequisites below for a list of samples).

These possibilities are described in further detail below.


Prerequisites
-------------

All invocations of the Zonkey pipeline take the path to a `panel` file as their first argument. This file contains the SNP panel necessary for performing hybrid identification and currently includes representatives of all extant equid species. For a more detailed description of the reference panel, and instructions for where to download the latest version of the file, please refer to the :ref:`zonkey_panel` section.

Secondly, the pipeline requires either one or two BAM files per sample, representing alignments against nuclear and / or mitochondrial genomes as described above. The analyses carried out by the Zonkey pipeline depend on the contents of the BAM alignment file provided for a given sample, and are presented below.


Single sample analysis
----------------------

For a single sample, the pipeline may be invoked by providing the path to the reference panel file, followed by the path to one or two BAM files belonging to that sample, as well as a (mostly optional) destination directory.

For these examples, we will assume that the reference panel is saved in the file 'database.tar', that the BAM file 'nuclear.bam' contains an alignment against the equCab2 reference genome, that the BAM file 'mitochondrial.bam' contains an alignment against the corresponding mitochondrial reference genome (Genbank Accession Nb. NC_001640.1), and that the BAM file 'combined.bam' contains an alignment against both the nuclear and mitochondrial genomes. If so, the pipeline may be invoked as follows:
.. code-block:: bash

    # Case 1a: Analyse nuclear genome; results are placed in 'nuclear.zonkey'
    $ paleomix zonkey run database.tar nuclear.bam

    # Case 1b: Analyse nuclear genome; results are placed in 'my_results'
    $ paleomix zonkey run database.tar nuclear.bam my_results

    # Case 2a: Analyse mitochondrial genome; results are placed in 'mitochondrial.zonkey'
    $ paleomix zonkey run database.tar mitochondrial.bam

    # Case 2b: Analyse mitochondrial genome; results are placed in 'my_results'
    $ paleomix zonkey run database.tar mitochondrial.bam my_results

    # Case 3: Analyse both nuclear and mitochondrial genomes, placing results in 'my_results'
    $ paleomix zonkey run database.tar nuclear.bam mitochondrial.bam my_results

    # Case 4a: Analyse both nuclear and mitochondrial genomes; results are placed in 'combined.zonkey'
    $ paleomix zonkey run database.tar combined.bam

    # Case 4b: Analyse both nuclear and mitochondrial genomes; results are placed in 'my_results'
    $ paleomix zonkey run database.tar combined.bam my_results

.. note::
    The filenames used here have been chosen purely to illustrate each operation, and do not affect the operation of the pipeline.

As shown above, the pipeline will by default place the resulting output files in a directory named after the input file. This behavior, however, can be overridden by the user by specifying a destination directory (cases 1b, 2b, and 4b). When specifying two input files, however, it is required to manually specify the directory in which to store the output files (case 3).

The resulting report may be accessed in the output directory under the name 'report.html', which contains summary statistics and figures for the analyses performed for the sample. The structure of the directory containing the output files is described further in the :ref:`zonkey_filestructure` section.


Multi-sample analysis
---------------------

As noted above, it is possible to analyze multiple, different samples in one go. This is accomplished by providing a text file containing a table of samples, with columns separated by tabs. The first column in this table specifies the name of the sample, while the second and third columns specify the location of one or two BAM alignments associated with that sample. The following example shows one such file, corresponding to cases 1 - 4 described above (a sketch for generating and validating such a table follows this section):

.. code-block:: bash

    $ cat samples.txt
    case_1    nuclear.bam
    case_2    mitochondrial.bam
    case_3    nuclear.bam    mitochondrial.bam
    case_4    combined.bam

Processing of these samples is then carried out as shown above:

.. code-block:: bash

    # Case 5a: Analyse 4 samples; results are placed in 'samples.zonkey'
    $ paleomix zonkey run database.tar samples.txt

    # Case 5b: Analyse 4 samples; results are placed in 'my_results'
    $ paleomix zonkey run database.tar samples.txt my_results

The resulting directory contains a 'summary.html' file, providing an overview of all samples processed in the analyses, with links to the individual per-sample reports, as well as a sub-directory for each sample, corresponding to that obtained from running individual analyses on each of the samples. The structure of the directory containing the output files is further described in the :ref:`zonkey_filestructure` section.

.. note::
    Only upper-case and lower-case letters (a-z, and A-Z), numbers (0-9), and underscores (_) are allowed in sample names.
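As promised above, the following minimal sketch (not part of Zonkey) writes such a table and enforces the sample-name rules; the sample names and BAM paths are assumptions made for this example:

.. code-block:: python

    import re

    # Hypothetical samples: one or two BAM files per sample name
    SAMPLES = {
        "case_1": ["nuclear.bam"],
        "case_3": ["nuclear.bam", "mitochondrial.bam"],
    }

    with open("samples.txt", "w") as handle:
        for name, bam_files in sorted(SAMPLES.items()):
            if not re.fullmatch(r"[a-zA-Z0-9_]+", name):
                raise ValueError("invalid sample name %r" % (name,))
            elif not 1 <= len(bam_files) <= 2:
                raise ValueError("expected 1 or 2 BAMs for %r" % (name,))
            handle.write("\t".join([name] + bam_files) + "\n")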
Rooting TreeMix trees
---------------------

By default, the Zonkey pipeline does not attempt to root TreeMix trees; this is because the specified out-group *must* form a monophyletic clade; if this is not the case (e.g. if the clade containing the two reference horse samples becomes paraphyletic due to the test sample nesting with one of them), TreeMix will fail to run to completion. It may therefore be preferable to run the pipeline without specifying an outgroup, and then to specify the outgroup in a second run, once the placement of the sample is known.

This is accomplished using the --treemix-outgroup command-line option, with the samples forming the out-group given as a comma-separated list. For example, assuming that the following TreeMix tree was generated for our sample:

.. image:: ../_static/zonkey/incl_ts_0_tree_unrooted.png

If so, we may wish to root on the caballine specimens (all other command-line arguments omitted for simplicity):

.. code-block:: bash

    $ paleomix zonkey run ... --treemix-outgroup Sample,HPrz,HCab

This yields a tree rooted using this group as the outgroup:

.. image:: ../_static/zonkey/incl_ts_0_tree_rooted.png

.. note::
    Rooting of the tree will be handled automatically in future versions of the Zonkey pipeline.


Mapping against mitochondrial genomes
-------------------------------------

In order to identify the species of the sire and dam, respectively, for F1 hybrids, the Zonkey pipeline allows for the construction of a maximum likelihood phylogeny using RAxML [Stamatakis2006]_, based on the mitochondrial genomes of the reference panel (see Prerequisites, above) and a consensus sequence derived from the mitochondrial alignment provided for the sample being investigated. The resulting phylogeny is presented rooted on the mid-point:

.. image:: ../_static/zonkey/mito_phylo.png

As noted above, this requires that the sample has been mapped against the mitochondrial reference genome NC\_001640.1 (see `NCBI`_), corresponding to the 'MT' mitochondrial genome included with the equCab2 reference sequence (see `UCSC`_).

In addition, it is possible to carry out mapping against the mitochondrial genomes of the reference panel used in the Zonkey pipeline, by using the :ref:`bam_pipeline`. This is accomplished by running the Zonkey 'mito' command, which writes a simple BAM pipeline makefile template to a given directory, along with a directory containing the FASTA sequences of the reference mitochondrial genomes::

    $ paleomix zonkey mito database.tar output_folder/

Please refer to the :ref:`bam_pipeline` documentation if you wish to use the BAM pipeline to perform the mapping itself. Once your data has been mapped against any or all of these mitochondrial genomes, the preferred BAM file (e.g. the alignment with the highest coverage) may be included in the analyses as described above.

.. _NCBI: https://www.ncbi.nlm.nih.gov/nuccore/5835107
.. _UCSC: https://genome.ucsc.edu/cgi-bin/hgGateway?clade=mammal&org=Horse&db=0
.. _BAM alignments: http://samtools.github.io/hts-specs/SAMv1.pdf
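When several candidate mitochondrial alignments are available, one way to pick the 'preferred' BAM mentioned above is simply to compare the number of mapped reads. A minimal sketch (not part of Zonkey), assuming that pysam is available and that each BAM has been indexed:

.. code-block:: python

    import pysam

    def preferred_bam(paths):
        def mapped_reads(path):
            # The 'mapped' property requires an index (.bai) file
            with pysam.AlignmentFile(path) as handle:
                return handle.mapped

        return max(paths, key=mapped_reads)

    # Hypothetical filenames for mappings against two panel mitochondria
    print(preferred_bam(["mito_HCab.bam", "mito_HPrz.bam"]))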
paleomix-1.3.8/misc/setup_bam_pipeline_example.makefile.yaml

# -*- mode: Yaml; -*-
Options:
  Platform: Illumina
  QualityOffset: 33

  Aligners:
    Program: Bowtie2
    Bowtie2:
      MinQuality: 0
      --very-sensitive:

  PCRDuplicates: no

  ExcludeReads:
    - Paired

  Features: []

Prefixes:
  rCRS:
    Path: prefixes/rCRS.fasta

ExampleProject:
  Synthetic_Sample_1:
    ACGATA:
      Lane_2: data/ACGATA_L2_R{Pair}_*.fastq.gz
    GCTCTG:
      Lane_2: data/GCTCTG_L2_R1_*.fastq.gz

paleomix-1.3.8/misc/setup_bam_pipeline_example.sh

#!/bin/bash
#
# Copyright (c) 2013 Mikkel Schubert
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
set -o nounset  # Fail on unset variables
set -o errexit  # Fail on uncaught non-zero returncodes
set -o pipefail # Fail if a command in a chain of pipes fails

SP_SEED=${RANDOM}

rm -rv data
mkdir -p data

for barcode in ACGATA GCTCTG TGCTCA; do
    python $(dirname $0)/synthesize_reads.py prefixes/rCRS.fasta data/ \
        --library-barcode=${barcode} \
        --specimen-seed=${SP_SEED} \
        --lanes-reads-mu=2500 \
        --lanes-reads-sigma=500 \
        --lanes-reads-per-file=1000 \
        --lanes=2 \
        --damage
done

rm -v data/GCTCTG_L*R2*.gz
rm -v data/TGCTCA_L1_R2*.gz

paleomix bam run $(dirname $0)/setup_bam_pipeline_example.makefile.yaml --destination .

mkdir -p data/ACGATA_L2/
mv ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/reads.singleton.truncated.gz data/ACGATA_L2/reads.singleton.truncated.gz
mv ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/reads.collapsed.gz data/ACGATA_L2/reads.collapsed.gz
mv ExampleProject/reads/Synthetic_Sample_1/ACGATA/Lane_2/reads.collapsed.truncated.gz data/ACGATA_L2/reads.collapsed.truncated.gz
mv ExampleProject/rCRS/Synthetic_Sample_1/GCTCTG/Lane_2/single.minQ0.bam data/GCTCTG_L2.bam

rm -v data/ACGATA_L2_R*.fastq.gz
rm -v data/GCTCTG_L2_R1_*.fastq.gz
rm -rv ExampleProject
paleomix-1.3.8/misc/setup_phylo_pipeline_example.sh

#!/bin/bash
#
# Copyright (c) 2013 Mikkel Schubert
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
set -o nounset  # Fail on unset variables
set -o errexit  # Fail on uncaught non-zero returncodes
set -o pipefail # Fail if a command in a chain of pipes fails

rm -rvf alignment/reads
mkdir -p alignment/reads

for PREFIX in `ls alignment/prefixes/*.fasta | grep -v rCRS`; do
    SP_SEED=${RANDOM}
    NAME=$(echo ${PREFIX} | sed -e's#alignment/prefixes/##' -e's#\..*##')

    mkdir -p alignment/reads/${NAME/*\//}/
    ./synthesize_reads.py ${PREFIX} alignment/reads/${NAME}/ \
        --specimen-seed=${SP_SEED} \
        --lanes-reads-mu=50000 \
        --lanes-reads-sigma=500 \
        --lanes-reads-per-file=10000 \
        --reads-len=50 \
        --lanes=1
done

# These links would not survive the package installation, so setup here
ln -sf ../../alignment/prefixes/ phylogeny/data/prefixes
ln -sf ../../alignment phylogeny/data/samples

# Create link to reference sequence
mkdir -p phylogeny/data/refseqs
ln -sf ../../../alignment/prefixes/rCRS.fasta phylogeny/data/refseqs/rCRS.rCRS.fasta

paleomix-1.3.8/misc/skeleton.py

#!/usr/bin/python3
# -*- coding: utf-8 -*-
# Copyright (c) 2014 Mikkel Schubert
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
import sys
import argparse


def parse_args(argv):
    parser = argparse.ArgumentParser()

    return parser.parse_args(argv)


def main(argv):
    return 0


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
paleomix-1.3.8/misc/synthesize_reads.py

#!/usr/bin/python3
#
# Copyright (c) 2013 Mikkel Schubert
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
import argparse
import gzip
import math
import random
import sys

from paleomix.common.formats.fasta import FASTA
from paleomix.common.sampling import weighted_sampling
from paleomix.common.sequences import reverse_complement
from paleomix.common.utilities import fragment


def _dexp(lambda_value, position):
    return lambda_value * math.exp(-lambda_value * position)


def _rexp(lambda_value, rng):
    return -math.log(rng.random()) / lambda_value


def toint(value):
    return int(round(value))


# Adapter added to the 5' end of the forward strand (read from 5' ...)
PCR1 = "AGATCGGAAGAGCACACGTCTGAACTCCAGTCAC%sATCTCGTATGCCGTCTTCTGCTTG"

# Adapter added to the 5' end of the reverse strand (read from 3' ...):
# rev. compl of the forward
PCR2 = "AGATCGGAAGAGCGTCGTGTAGGGAAAGAGTGTAGATCTCGGTGGTCGCCGTATCATT"
def _get_indel_length(indel_lambda, rng):
    return 1 + toint(_rexp(indel_lambda, rng))


def _get_weighted_choices(rng, sub_rate, indel_rate):
    choices_by_nt = {}
    for src_nt in "ACGT":
        choices = "ACGTID"
        probs = [sub_rate / 4] * 4  # ACGT
        probs += [indel_rate / 2] * 2  # ID
        probs[choices.index(src_nt)] = 1 - sum(probs) + sub_rate / 4
        choices_by_nt[src_nt] = weighted_sampling(choices, probs, rng)
    return choices_by_nt


def _mutate_sequence(rng, choices, refseq, indel_lambda=0):
    position = 0
    sequence, positions = [], []
    while position < len(refseq):
        ref_nt = refseq[position]
        if ref_nt not in "ACGT":
            read_nt = rng.choice("ACGT")
        else:
            read_nt = next(choices[ref_nt])

        if read_nt == "D":
            for _ in range(_get_indel_length(indel_lambda, rng)):
                position += 1
        elif read_nt == "I":
            for _ in range(_get_indel_length(indel_lambda, rng)):
                sequence.append(rng.choice("ACGT"))
                positions.append(position)
        else:
            sequence.append(read_nt)
            positions.append(position)
            position += 1

    return "".join(sequence), positions


class Specimen:
    """Represents a specimen, from which samples are derived.

    These are mutated by the addition of changes to the sequence"""

    def __init__(self, options, filename):
        genome = list(FASTA.from_file(filename))
        assert len(genome) == 1, len(genome)

        self._genome = genome[0].sequence.upper()
        self._sequence = None
        self._positions = None
        self._annotations = None
        self._mutate(options)

    def _mutate(self, options):
        rng = random.Random(options.specimen_seed)
        choices = _get_weighted_choices(
            rng, options.specimen_sub_rate, options.specimen_indel_rate
        )
        self._sequence, self._positions = _mutate_sequence(
            rng, choices, self._genome, options.specimen_indel_lambda
        )

    @property
    def sequence(self):
        return self._sequence

    @property
    def positions(self):
        return self._positions

    @property
    def annotations(self):
        return self._annotations


class Sample:
    def __init__(self, options, specimen):
        self._specimen = specimen
        self._random = random.Random(options.sample_seed)
        self._options = options

        frac_endog = self._random.gauss(
            options.sample_endog_mu, options.sample_endog_sigma
        )
        self._frac_endog = min(1, max(0.01, frac_endog))
        self._endog_id = 0
        self._contam_id = 0

    def get_fragment(self):
        """Returns a DNA fragment, representing either a fragment of the
        sample genome, or a randomly generated DNA sequence representing
        contaminant DNA that is not related to the species."""
        if self._random.random() <= self._frac_endog:
            return self._get_endogenous_sequence()
        return self._get_contaminant_sequence()

    def _get_contaminant_sequence(self):
        length = self._get_frag_len()
        sequence = [self._random.choice("ACGT") for _ in range(length)]

        self._contam_id += 1
        name = "Seq_junk_%i" % (self._contam_id,)
        return (False, name, "".join(sequence))

    def _get_endogenous_sequence(self):
        length = self._get_frag_len()
        max_position = len(self._specimen.sequence) - length
        position = self._random.randint(0, max_position)
        strand = self._random.choice(("fw", "rv"))

        sequence = self._specimen.sequence[position : position + length]
        real_pos = self._specimen.positions[position]
        if strand == "rv":
            sequence = reverse_complement("".join(sequence))

        self._endog_id += 1
        name = "Seq_%i_%i_%i_%s" % (self._endog_id, real_pos, length, strand)
        return (True, name, sequence)

    def _get_frag_len(self):
        length = toint(
            self._random.gauss(
                self._options.sample_frag_len_mu, self._options.sample_frag_len_sigma
            )
        )

        return max(
            self._options.sample_frag_len_min,
            min(self._options.sample_frag_len_max, length),
        )
class Damage:
    def __init__(self, options, sample):
        self._options = options
        self._sample = sample
        self._random = random.Random(options.damage_seed)
        self._rates = self._calc_damage_rates(options)

    def get_fragment(self):
        is_endogenous, name, sequence = self._sample.get_fragment()
        if is_endogenous and self._options.damage:
            sequence = self._damage_sequence(sequence)
        return (name, sequence)

    def _damage_sequence(self, sequence):
        result = []
        length = len(sequence)
        for (position, nucleotide) in enumerate(sequence):
            if nucleotide == "C":
                if self._random.random() < self._rates[position]:
                    nucleotide = "T"
            elif nucleotide == "G":
                rv_position = length - position - 1
                if self._random.random() < self._rates[rv_position]:
                    nucleotide = "A"
            result.append(nucleotide)
        return "".join(result)

    @classmethod
    def _calc_damage_rates(cls, options):
        rate = options.damage_lambda
        rates = [
            _dexp(rate, position) for position in range(options.sample_frag_len_max)
        ]
        return rates


class Library:
    def __init__(self, options, sample):
        self._options = options
        self._sample = sample
        self._cache = []

        self._rng = random.Random(options.library_seed)
        self.barcode = options.library_barcode
        if self.barcode is None:
            self.barcode = "".join(self._rng.choice("ACGT") for _ in range(6))
        assert len(self.barcode) == 6, options.library_barcode

        pcr1 = PCR1 % (self.barcode,)
        self.lanes = self._generate_lanes(options, self._rng, sample, pcr1)

    @classmethod
    def _generate_lanes(cls, options, rng, sample, pcr1):
        lane_counts = []
        for _ in range(options.lanes_num):
            lane_counts.append(
                toint(rng.gauss(options.lanes_reads_mu, options.lanes_reads_sigma))
            )
        reads = cls._generate_reads(options, rng, sample, sum(lane_counts), pcr1)

        lanes = []
        for count in lane_counts:
            lanes.append(Lane(options, reads[:count]))
            reads = reads[count:]
        return lanes

    @classmethod
    def _generate_reads(cls, options, rng, sample, minimum, pcr1):
        reads = []
        while len(reads) < minimum:
            name, sequence = sample.get_fragment()
            cur_forward = sequence + pcr1
            cur_reverse = reverse_complement(sequence) + PCR2

            # Number of PCR copies -- minimum 1
            num_dupes = toint(_rexp(options.library_pcr_lambda, rng)) + 1
            for dupe_id in range(num_dupes):
                cur_name = "%s_%s" % (name, dupe_id)
                reads.append((cur_name, cur_forward, cur_reverse))

        rng.shuffle(reads)
        return reads


class Lane:
    def __init__(self, options, reads):
        rng = random.Random()
        choices = _get_weighted_choices(
            rng, options.reads_sub_rate, options.reads_indel_rate
        )

        self._sequences = []
        for (name, forward, reverse) in reads:
            forward, _ = _mutate_sequence(
                rng, choices, forward, options.reads_indel_lambda
            )
            if len(forward) < options.reads_len:
                forward += "A" * (options.reads_len - len(forward))
            elif len(forward) > options.reads_len:
                forward = forward[: options.reads_len]

            reverse, _ = _mutate_sequence(
                rng, choices, reverse, options.reads_indel_lambda
            )
            if len(reverse) < options.reads_len:
                reverse += "T" * (options.reads_len - len(reverse))
            elif len(reverse) > options.reads_len:
                reverse = reverse[: options.reads_len]

            self._sequences.append((name, "".join(forward), "".join(reverse)))

    @property
    def sequences(self):
        return self._sequences


def parse_args(argv):
    parser = argparse.ArgumentParser()
    parser.add_argument("fasta", help="Input FASTA file")
    parser.add_argument("output_prefix", help="Prefix for output filenames")

    group = parser.add_argument_group("Specimen")
    group.add_argument(
        "--specimen-seed",
        default=None,
        help="Seed used to initialize the 'specimen', for the "
        "creation of a random genotype. Set to a specific "
        "value if runs are to be done for the same "
        "genotype.",
    )
    group.add_argument("--specimen-sub-rate", default=0.005, type=float)
    group.add_argument("--specimen-indel-rate", default=0.0005, type=float)
    group.add_argument("--specimen-indel-lambda", default=0.9, type=float)

    group = parser.add_argument_group("Samples from specimens")
    group.add_argument("--sample-seed", default=None)
    group.add_argument(
        "--sample-frag-length-mu", dest="sample_frag_len_mu", default=100, type=int
    )
    group.add_argument(
        "--sample-frag-length-sigma", dest="sample_frag_len_sigma", default=30, type=int
    )
    group.add_argument(
        "--sample-frag-length-min", dest="sample_frag_len_min", default=0, type=int
    )
    group.add_argument(
        "--sample-frag-length-max", dest="sample_frag_len_max", default=500, type=int
    )
    group.add_argument(
        "--sample-endogenous_mu", dest="sample_endog_mu", default=0.75, type=float
    )
    group.add_argument(
        "--sample-endogenous_sigma", dest="sample_endog_sigma", default=0.10, type=float
    )

    group = parser.add_argument_group("Post mortem damage of samples")
    group.add_argument("--damage", dest="damage", default=False, action="store_true")
    group.add_argument("--damage-seed", dest="damage_seed", default=None)
    group.add_argument(
        "--damage-lambda", dest="damage_lambda", default=0.25, type=float
    )

    group = parser.add_argument_group("Libraries from samples")
    group.add_argument("--library-seed", dest="library_seed", default=None)
    group.add_argument(
        "--library-pcr-lambda", dest="library_pcr_lambda", default=3, type=float
    )
    group.add_argument("--library-barcode", dest="library_barcode", default=None)

    group = parser.add_argument_group("Lanes from libraries")
    group.add_argument("--lanes", dest="lanes_num", default=3, type=int)
    group.add_argument(
        "--lanes-reads-mu", dest="lanes_reads_mu", default=10000, type=int
    )
    group.add_argument(
        "--lanes-reads-sigma", dest="lanes_reads_sigma", default=2500, type=int
    )
    group.add_argument(
        "--lanes-reads-per-file", dest="lanes_per_file", default=2500, type=int
    )

    group = parser.add_argument_group("Reads from lanes")
    group.add_argument(
        "--reads-sub-rate", dest="reads_sub_rate", default=0.005, type=float
    )
    group.add_argument(
        "--reads-indel-rate", dest="reads_indel_rate", default=0.0005, type=float
    )
    group.add_argument(
        "--reads-indel-lambda", dest="reads_indel_lambda", default=0.9, type=float
    )
    group.add_argument("--reads-length", dest="reads_len", default=100, type=int)

    return parser.parse_args(argv)


def main(argv):
    options = parse_args(argv)
    print("Generating %i lane(s) of synthetic reads" % (options.lanes_num,))

    specimen = Specimen(options, options.fasta)
    sample = Sample(options, specimen)
    damage = Damage(options, sample)
    library = Library(options, damage)

    for (lnum, lane) in enumerate(library.lanes, start=1):
        fragments = fragment(options.lanes_per_file, lane.sequences)
        for (readsnum, reads) in enumerate(fragments, start=1):
            templ = "%s%s_L%i_R%%s_%02i.fastq.gz" % (
                options.output_prefix,
                library.barcode,
                lnum,
                readsnum,
            )

            print("  Writing %s" % (templ % "{Pair}",))
            with gzip.open(templ % 1, "wt") as out_1:
                with gzip.open(templ % 2, "wt") as out_2:
                    for (name, seq_1, seq_2) in reads:
                        out_1.write("@%s%s/1\n%s\n" % (library.barcode, name, seq_1))
                        out_1.write("+\n%s\n" % ("I" * len(seq_1),))

                        out_2.write("@%s%s/2\n%s\n" % (library.barcode, name, seq_2))
                        out_2.write("+\n%s\n" % ("H" * len(seq_2),))


if __name__ == "__main__":
    sys.exit(main(sys.argv[1:]))
paleomix-1.3.8/paleomix/__init__.py

#!/usr/bin/python3
#
# Copyright (c) 2012 Mikkel Schubert
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
__version_info__ = (1, 3, 8)
__version__ = "%i.%i.%i" % __version_info__

paleomix-1.3.8/paleomix/atomiccmd/__init__.py

#!/usr/bin/python3
#
# Copyright (c) 2012 Mikkel Schubert
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in all
# copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#

paleomix-1.3.8/paleomix/atomiccmd/builder.py

#!/usr/bin/python3
#
# Copyright (c) 2012 Mikkel Schubert
#
# Permission is hereby granted, free of charge, to any person obtaining a copy
# of this software and associated documentation files (the "Software"), to deal
# in the Software without restriction, including without limitation the rights
# to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
# copies of the Software, and to permit persons to whom the Software is
# furnished to do so, subject to the following conditions:
#
# The above copyright notice and this permission notice shall be included in
# all copies or substantial portions of the Software.
#
# THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
# IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
# FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
# AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
# LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
# OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
# SOFTWARE.
#
from paleomix.atomiccmd.command import AtomicCmd
from paleomix.common.utilities import safe_coerce_to_tuple


class AtomicCmdBuilderError(RuntimeError):
    """Error raised by AtomicCmdBuilder."""


class AtomicCmdBuilder:
    """AtomicCmdBuilder is a class used to allow step-wise construction of an
    AtomicCmd object. This allows the user of a Node to modify the behavior
    of the called programs using some CLI parameters, without explicit
    support for these in the Node API. Some limitations are in place, to
    help catch cases where overwriting or adding a flag would break the
    Node.

    The system call is constructed in the following manner:

        $ <call> ...

    The components are defined as follows:
     - The minimal call needed to invoke the current program. Typically this
       is just the name of the executable, but may be a more complex set of
       values for nested calls (e.g. java/scripts).
# # THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR # IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, # FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE # AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER # LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, # OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE # SOFTWARE. # from paleomix.atomiccmd.command import AtomicCmd from paleomix.common.utilities import safe_coerce_to_tuple class AtomicCmdBuilderError(RuntimeError): """Error raised by AtomicCmdBuilder.""" class AtomicCmdBuilder: """AtomicCmdBuilder is a class used to allow step-wise construction of an AtomicCmd object. This allows the user of a Node to modify the behavior of the called programs using some CLI parameters, without explicit support for these in the Node API. Some limitations are in place, to help catch cases where overwriting or adding a flag would break the Node. The system call is constructed in the following manner: $ The components are defined as follows: - The minimal call needed invoke the current program. Typically this is just the name of the executable, but may be a more complex set of values for nested calls (e.g. java/scripts).