pauvre-0.2.3/ 0000755 0026123 0000167 00000000000 14044135227 014173 5 ustar dschultz biolum 0000000 0000000 pauvre-0.2.3/MANIFEST.in 0000644 0026123 0000167 00000000030 13663657256 015743 0 ustar dschultz biolum 0000000 0000000 include scripts/test.sh
pauvre-0.2.3/PKG-INFO 0000644 0026123 0000167 00000002231 14044135227 015266 0 ustar dschultz biolum 0000000 0000000 Metadata-Version: 1.2
Name: pauvre
Version: 0.2.3
Summary: Tools for plotting Oxford Nanopore and other long-read data.
Home-page: https://github.com/conchoecia/pauvre
Author: Darrin Schultz
Author-email: dts@ucsc.edu
License: UNKNOWN
Description:
'pauvre' is a package for plotting Oxford Nanopore and other long read data.
The name means 'poor' in French, a play on words to the oft-used 'pore' prefix
for similar packages. This package was designed for python 3, but it might work in
python 2. You can visit the gitub page for more detailed information here:
https://github.com/conchoecia/pauvre
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3
pauvre-0.2.3/README.md 0000644 0026123 0000167 00000012765 13655436644 015503 0 ustar dschultz biolum 0000000 0000000 [](https://travis-ci.org/conchoecia/pauvre) [](https://zenodo.org/badge/latestdoi/112774670)
## Getting Started
```
pauvre custommargin -i custom.tsv --ycol length --xcol qual # Custom tsv input
```
## Table of Contents
- [Getting Started](#started)
- [Users' Guide](#uguide)
- [Installation](#installation)
- [Requirements](#reqs)
- [Install Instructions](#install)
- [Usage](#usage)
- [pauvre stats](#stats)
- [pauvre marginplot](#marginplot)
- [Basic Usage](#marginbasic)
- [Plot Adjustments](#marginadjustments)
- [Specialized Options](#marginspecialized)
- [Contributors](#contributors)
## Users' Guide
Pauvre is a plotting package originally designed to help QC the length and
quality distribution of Oxford Nanopore or PacBio reads. The main outputs
are marginplots. Now, `pauvre` also hosts other additional data plotting
scripts.
This package currently hosts five scripts for plotting and/or printing stats.
- `pauvre marginplot`
- takes a fastq file as input and outputs a marginal histogram with a heatmap.
- `pauvre custommargin`
- takes a tsv as input and outputs a marginal histogram with custom columns of your choice.
- `pauvre stats`
- Takes a fastq file as input and prints out a table of stats, including how many basepairs/reads there are for a length/mean quality cutoff.
- This is also automagically called when using `pauvre marginplot`
- `pauvre redwood`
- I am happy to introduce the redwood plot to the world as a method
of representing circular genomes. A redwood plot contains long
reads as "rings" on the inside, a gene annotation
"cambrium/phloem", and a RNAseq "bark". The input is `.bam` files
for the long reads and RNAseq data, and a `.gff` file for the
annotation. More details to follow as we document this program
better...
- `pauvre synteny`
- Makes a synteny plot of circular genomes. Finds the most
parsimonius rotation to display the synteny of all the input
genomes with the fewest crossings-over. Input is one `.gff` file
per circular genome and one directory of gene alignments.
## Installation
### Requirements
- You must have the following installed on your system to install this software:
- python 3.x
- matplotlib
- biopython
- pandas
- pillow
### Install Instructions
- Instructions to install on your mac or linux system. Not sure on
Windows! Make sure *python 3* is the active environment before
installing.
- `git clone https://github.com/conchoecia/pauvre.git`
- `cd ./pauvre`
- `pip3 install .`
- Or, install with pip
- `pip3 install pauvre`
## Usage
### `stats`
- generate basic statistics about the fastq file. For example, if I
want to know the number of bases and reads with AT LEAST a PHRED
score of 5 and AT LEAST a read length of 500, run the program as below
and look at the cells highlighted with ``.
- `pauvre stats --fastq miniDSMN15.fastq`
```
numReads: 1000
numBasepairs: 1029114
meanLen: 1029.114
medianLen: 875.5
minLen: 11
maxLen: 5337
N50: 1278
L50: 296
Basepairs >= bin by mean PHRED and length
minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
0 1029114 1010681 935366 429279 143948 25139 3668 2938 2000 0
500 984212 <968653> 904787 421307 142003 24417 3668 2938 2000 0
1000 659842 649319 616788 300948 103122 17251 2000 2000 2000 0
et cetera...
Number of reads >= bin by mean Phred+Len
minLen Q0 Q5 Q10 Q15 Q17.5 Q20 Q21.5 Q25 Q25.5 Q30
0 1000 969 865 366 118 22 3 2 1 0
500 873 <859> 789 347 113 20 3 2 1 0
1000 424 418 396 187 62 11 1 1 1 0
et cetera...
```
### `marginplot`
#### Basic Usage
- automatically calls `pauvre stats` for each fastq file
- Make the default plot showing the 99th percentile of longest reads
- `pauvre marginplot --fastq miniDSMN15.fastq`
- 
- Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a
lower maxlen and higher maxqual.
- `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq`
- 
#### Plot Adjustments
- Filter out reads with a mean quality less than 5, and a length
less than 800. Zoom in to plot only mean quality of at least 4 and
read length at least 500bp.
- `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4`
- 
#### Specialized Options
- Plot ONT 1D data with a large tail
- `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 .fastq`
- Get more resolution on lengths
- `pauvre marginplot --maxlen 100000 --lengthbin 5 .fastq`
- Turn off transparency if you just want a white background
- `pauvre marginplot --transparent False .fastq`
- Note: transparency is the default behavior
- 
## Contributors
@conchoecia (Darrin Schultz)
@mebbert (Mark Ebbert)
@wdecoster (Wouter De Coster)
pauvre-0.2.3/pauvre/ 0000755 0026123 0000167 00000000000 14044135227 015475 5 ustar dschultz biolum 0000000 0000000 pauvre-0.2.3/pauvre/__init__.py 0000644 0026123 0000167 00000000047 13167537132 017615 0 ustar dschultz biolum 0000000 0000000 from pauvre.version import __version__
pauvre-0.2.3/pauvre/bamparse.py 0000644 0026123 0000167 00000021713 13322176054 017646 0 ustar dschultz biolum 0000000 0000000 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# pauvre - just a pore plotting package
# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
# twitter @conchoecia
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see .
import pysam
import pandas as pd
import os
class BAMParse():
"""This class reads in a sam/bam file and constructs a pandas
dataframe of all the relevant information for the reads to pass on
and plot.
"""
def __init__(self, filename, chrid = None, start = None,
stop = None, doubled = None):
self.filename = filename
self.doubled = doubled
#determine if the file is bam or sam
self.filetype = os.path.splitext(self.filename)[1]
#throw an error if the file is not bam or sam
if self.filetype not in ['.bam']:
raise Exception("""You have provided a file with an extension other than
'.bam', please check your command-line arguments""")
#now make sure there is an index file for the bam file
if not os.path.exists("{}.bai".format(self.filename)):
raise Exception("""Your .bam file is there, but it isn't indexed and
there isn't a .bai file to go with it. Use
'samtools index .bam' to fix it.""")
#now open the file and just call it a sambam file
filetype_dict = {'.sam': '', '.bam': 'b'}
self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype]))
if chrid == None:
self.chrid = self.sambam.references[0]
else:
self.chrid = chrid
self.refindex = self.sambam.references.index(self.chrid)
self.seqlength = self.sambam.lengths[self.refindex]
self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength/2)
if start == None or stop == None:
self.start = 1
self.stop = self.true_seqlength
self.features = self.parse()
self.features.sort_values(by=['POS','MAPLEN'], ascending=[True, False] ,inplace=True)
self.features.reset_index(inplace=True)
self.features.drop('index', 1, inplace=True)
self.raw_depthmap = self.get_depthmap()
self.features_depthmap = self.get_features_depthmap()
def get_depthmap(self):
depthmap = [0] * (self.stop - self.start + 1)
for p in self.sambam.pileup(self.chrid, self.start, self.stop):
index = p.reference_pos
if index >= self.true_seqlength:
index -= self.true_seqlength
depthmap[index] += p.nsegments
return depthmap
def get_features_depthmap(self):
"""this method builds a more accurate pileup that is
based on if there is actually a mapped base at any
given position or not. better for long reads and RNA"""
depthmap = [0] * (self.stop - self.start + 1)
print("depthmap is: {} long".format(len(depthmap)))
for index, row in self.features.iterrows():
thisindex = row["POS"] - self.start
for thistup in row["TUPS"]:
b_type = thistup[1]
b_len = thistup[0]
if b_type == "M":
for j in range(b_len):
#this is necessary to reset the index if we wrap
# around to the beginning
if self.doubled and thisindex == len(depthmap):
thisindex = 0
depthmap[thisindex] += 1
thisindex += 1
elif b_type in ["S", "H", "I"]:
pass
elif b_type in ["D", "N"]:
thisindex += b_len
#this is necessary to reset the index if we wrap
# around to the beginning
if self.doubled and thisindex >= len(depthmap):
thisindex = thisindex - len(depthmap)
return depthmap
def parse(self):
data = {'POS': [], 'MAPQ': [], 'TUPS': [] }
for read in self.sambam.fetch(self.chrid, self.start, self.stop):
data['POS'].append(read.reference_start + 1)
data['TUPS'].append(self.cigar_parse(read.cigartuples))
data['MAPQ'].append(read.mapq)
features = pd.DataFrame.from_dict(data, orient='columns')
features['ALNLEN'] = features['TUPS'].apply(self.aln_len)
features['TRULEN'] = features['TUPS'].apply(self.tru_len)
features['MAPLEN'] = features['TUPS'].apply(self.map_len)
features['POS'] = features['POS'].apply(self.fix_pos)
return features
def cigar_parse(self, tuples):
"""
arguments:
a CIGAR string tuple list in pysam format
purpose:
This function uses the pysam cigarstring tuples format and returns
a list of tuples in the internal format, [(20, 'M'), (5, "I")], et
cetera. The zeroth element of each tuple is the number of bases for the
CIGAR string feature. The first element of each tuple is the CIGAR
string feature type.
There are several feature types in SAM/BAM files. See below:
'M' - match
'I' - insertion relative to reference
'D' - deletion relative to reference
'N' - skipped region from the reference
'S' - soft clip, not aligned but still in sam file
'H' - hard clip, not aligned and not in sam file
'P' - padding (silent deletion from padded reference)
'=' - sequence match
'X' - sequence mismatch
'B' - BAM_CBACK (I don't actually know what this is)
"""
# I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment
psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S',
5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'}
return [(value, psam_to_char[feature]) for feature, value in tuples]
def aln_len(self, TUPS):
"""
arguments:
a list of tuples output from the cigar_parse() function.
purpose:
This returns the alignment length of the read to the reference.
Specifically, it sums the length of all of the matches and deletions.
In effect, this number is length of the region of the reference sequence to
which the read maps. This number is probably the most useful for selecting
reads to visualize in the mapped read plot.
"""
return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']])
def map_len(self, TUPS):
"""
arguments:
a list of tuples output from the cigar_parse() function.
purpose:
This function returns the map length (all matches and deletions relative to
the reference), plus the unmapped 5' and 3' hard/soft clipped sequences.
This number is useful if you want to visualize how much 5' and 3' sequence
of a read did not map to the reference. For example, poor quality 5' and 3'
tails are common in Nanopore reads.
"""
return sum([pair[0] for pair in TUPS if pair[1] not in ['I']])
def tru_len(self, TUPS):
"""
arguments:
a list of tuples output from the cigar_parse() function.
purpose:
This function returns the total length of the read, including insertions,
deletions, matches, soft clips, and hard clips. This is useful for
comparing to the map length or alignment length to see what percentage of
the read aligned to the reference.
"""
return sum([pair[0] for pair in TUPS])
def fix_pos(self, start_index):
"""
arguments:
an int
purpose:
When using a doubled SAMfile, any reads that start after the first copy
of the reference risk running over the plotting window, causing the program
to crash. This function corrects for this issue by changing the start site
of the read.
Note: this will probably break the program if not using a double alignment
since no reads would map past half the length of the single reference
"""
if self.doubled:
if start_index > int(self.seqlength/2):
return start_index - int(self.seqlength/2) - 1
else:
return start_index
else:
return start_index
pauvre-0.2.3/pauvre/browser.py 0000644 0026123 0000167 00000036513 13622004260 017533 0 ustar dschultz biolum 0000000 0000000 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# pauvre - a pore plotting package
# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see .
# following this tutorial to install helvetica
# https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md
global hfont
hfont = {'fontname':'Helvetica'}
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
from matplotlib.colors import LinearSegmentedColormap, Normalize
import matplotlib.patches as patches
import gffutils
import pandas as pd
pd.set_option('display.max_columns', 500)
pd.set_option('display.width', 1000)
import numpy as np
import os
import pauvre.rcparams as rc
from pauvre.functions import GFFParse, print_images, timestamp
from pauvre import gfftools
from pauvre.lsi.lsi import intersection
from pauvre.bamparse import BAMParse
import progressbar
import platform
import sys
import time
# Biopython stuff
from Bio import SeqIO
import Bio.SubsMat.MatrixInfo as MI
class PlotCommand:
def __init__(self, plotcmd, REF):
self.ref = REF
self.style_choices = []
self.cmdtype = ""
self.path = ""
self.style = ""
self.options = ""
self._parse_cmd(plotcmd)
def _parse_cmd(self, plotcmd):
chunks = plotcmd.split(":")
if chunks[0] == "ref":
self.cmdtype = "ref"
if len(chunks) < 2:
self._len_error()
self.path = self.ref
self.style = chunks[1]
self.style_choices = ["normal", "colorful"]
self._check_style_choices()
if len(chunks) > 2:
self.options = chunks[2].split(",")
elif chunks[0] in ["bam", "peptides"]:
if len(chunks) < 3:
self._len_error()
self.cmdtype = chunks[0]
self.path = os.path.abspath(os.path.expanduser(chunks[1]))
self.style = chunks[2]
if self.cmdtype == "bam":
self.style_choices = ["depth", "reads"]
else:
self.style_choices = ["depth"]
self._check_style_choices()
if len(chunks) > 3:
self.options = chunks[3].split(",")
elif chunks[0] in ["gff3"]:
if len(chunks) < 2:
self._len_error()
self.cmdtype = chunks[0]
self.path = os.path.abspath(os.path.expanduser(chunks[1]))
if len(chunks) > 2:
self.options = chunks[2].split(",")
def _len_error(self):
raise IOError("""You selected {} to plot,
but need to specify the style at least.""".format(self.cmdtype))
def _check_style_choices(self):
if self.style not in self.style_choices:
raise IOError("""You selected {} style for
ref. You must select from {}. """.format(
self.style, self.style_choices))
global dna_color
dna_color = {"A": (81/255, 87/255, 251/255, 1),
"T": (230/255, 228/255, 49/255, 1),
"G": (28/255, 190/255, 32/255, 1),
"C": (220/255, 10/255, 23/255, 1)}
#these are the line width for the different cigar string flags.
# usually, only M, I, D, S, and H appear in bwa mem output
global widthDict
widthDict = {'M':0.45, # match
'I':0.9, # insertion relative to reference
'D':0.05, # deletion relative to reference
'N':0.1, # skipped region from the reference
'S':0.1, # soft clip, not aligned but still in sam file
'H':0.1, # hard clip, not aligned and not in sam file
'P':0.1, # padding (silent deletion from padded reference)
'=':0.1, # sequence match
'X':0.1} # sequence mismatch
global richgrey
richgrey = (60/255, 54/255, 69/255, 1)
def plot_ref(panel, chrid, start, stop, thiscmd):
panel.set_xlim([start, stop])
panel.set_ylim([-2.5, 2.5])
panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)])
if thiscmd.style == "colorful":
thisseq = ""
for record in SeqIO.parse(thiscmd.ref, "fasta"):
if record.id == chrid:
thisseq = record.seq[start-1: stop]
for i in range(len(thisseq)):
left = start + i
bottom = -0.5
width = 1
height = 1
rect = patches.Rectangle((left, bottom),
width, height,
linewidth = 0,
facecolor = dna_color[thisseq[i]] )
panel.add_patch(rect)
return panel
def safe_log10(value):
try:
logval = np.log10(value)
except:
logval = 0
return logval
def plot_bam(panel, chrid, start, stop, thiscmd):
bam = BAMParse(thiscmd.path)
panel.set_xlim([start, stop])
if thiscmd.style == "depth":
maxdepth = max(bam.features_depthmap)
maxdepthlog = safe_log10(maxdepth)
if "log" in thiscmd.options:
panel.set_ylim([-maxdepthlog, maxdepthlog])
panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)])
else:
panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)])
if "c" in thiscmd.options:
panel.set_ylim([-maxdepth, maxdepth])
else:
panel.set_ylim([0, maxdepth])
for i in range(len(bam.features_depthmap)):
left = start + i
width = 1
if "c" in thiscmd.options and "log" in thiscmd.options:
bottom = -1 * safe_log10(bam.features_depthmap[i])
height = safe_log10(bam.features_depthmap[i]) * 2
elif "c" in thiscmd.options and "log" not in thiscmd.options:
bottom = -bam.features_depthmap[i]
height = bam.features_depthmap[i] * 2
else:
bottom = 0
height = bam.features_depthmap[i]
if height > 0:
rect = patches.Rectangle((left, bottom),
width, height,
linewidth = 0,
facecolor = richgrey )
panel.add_patch(rect)
if thiscmd.style == "reads":
#If we're plotting reads, we don't need y-axis
panel.tick_params(bottom="off", labelbottom="off",
left ="off", labelleft = "off")
reads = bam.features.copy()
panel.set_xlim([start, stop])
direction = "for"
if direction == 'for':
bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
direction= 'rev'
elif direction == 'rev':
bav = {"by":['POS','MAPLEN'], "asc": [True, False]}
direction = 'for'
reads.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True)
reads.reset_index(drop=True, inplace=True)
depth_count = -1
plotind = start
while len(reads) > 0:
#depth_count -= 1
#print("len of reads is {}".format(len(reads)))
potential = reads.query("POS >= {}".format(plotind))
if len(potential) == 0:
readsindex = 0
#print("resetting plot ind from {} to {}".format(
# plotind, reads.loc[readsindex, "POS"]))
depth_count -= 1
else:
readsindex = int(potential.index.values[0])
#print("pos of potential is {}".format(reads.loc[readsindex, "POS"]))
plotind = reads.loc[readsindex, "POS"]
for TUP in reads.loc[readsindex, "TUPS"]:
b_type = TUP[1]
b_len = TUP[0]
#plotting params
# left same for all.
left = plotind
bottom = depth_count
height = widthDict[b_type]
width = b_len
plot = True
color = richgrey
if b_type in ["H", "S"]:
"""We don't plot hard or sort clips - like IGV"""
plot = False
pass
elif b_type == "M":
"""just plot matches normally"""
plotind += b_len
elif b_type in ["D", "P", "=", "X"]:
"""deletions get an especially thin line"""
plotind += b_len
elif b_type == "I":
"""insertions get a special purple bar"""
left = plotind - (b_len/2)
color = (200/255, 41/255, 226/255, 0.5)
elif b_type == "N":
"""skips for splice junctions, line in middle"""
bottom += (widthDict["M"]/2) - (widthDict["N"]/2)
plotind += b_len
if plot:
rect = patches.Rectangle((left, bottom),
width, height,
linewidth = 0,
facecolor = color )
panel.add_patch(rect)
reads.drop([readsindex], inplace=True)
reads.reset_index(drop = True, inplace=True)
panel.set_ylim([depth_count, 0])
return panel
def plot_gff3(panel, chrid, start, stop, thiscmd):
db = gffutils.create_db(thiscmd.path, ":memory:")
bottom = 0
genes_to_plot = [thing.id
for thing in db.region(
region=(chrid, start, stop),
completely_within=False)
if thing.featuretype == "gene" ]
#print("genes to plot are: " genes_to_plot)
panel.set_xlim([start, stop])
# we don't need labels on one of the axes
#panel.tick_params(bottom="off", labelbottom="off",
# left ="off", labelleft = "off")
ticklabels = []
for geneid in genes_to_plot:
plotnow = False
if "id" in thiscmd.options and geneid in thiscmd.options:
plotnow = True
elif "id" not in thiscmd.options:
plotnow = True
if plotnow:
ticklabels.append(geneid)
if db[geneid].strand == "+":
panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db,
bottom, text = None)
bottom += 1
else:
raise IOError("""Plotting things on the reverse strand is
not yet implemented""")
#print("tick labels are", ticklabels)
panel.set_ylim([0, len(ticklabels)])
yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))]
panel.set_yticks(yticks_vals)
print("bottom is: ", bottom)
print("len tick labels is: ", len(ticklabels))
print("intervals are: ", yticks_vals)
panel.set_yticklabels(ticklabels)
return panel
def browser(args):
rc.update_rcParams()
print(args)
# if the user forgot to add a reference, they must add one
if args.REF is None:
raise IOError("You must specify the reference fasta file")
# if the user forgot to add the start and stop,
# Print the id and the start/stop
if args.CHR is None or args.START is None or args.STOP is None:
print("""\n You have forgotten to specify the chromosome,
the start coordinate, or the stop coordinate to plot.
Try something like '-c chr1 --start 20 --stop 2000'.
Here is a list of chromosome ids and their lengths
from the provided reference. The minimum start coordinate
is one and the maximum stop coordinate is the length of
the chromosome.\n\nID\tLength""")
for record in SeqIO.parse(args.REF, "fasta"):
print("{}\t{}".format(record.id, len(record.seq)))
sys.exit(0)
if args.CMD is None:
raise IOError("You must specify a plotting command.")
# now we parse each set of commands
commands = [PlotCommand(thiscmd, args.REF)
for thiscmd in reversed(args.CMD)]
# set the figure dimensions
if args.ratio:
figWidth = args.ratio[0] + 1
figHeight = args.ratio[1] + 1
#set the panel dimensions
panelWidth = args.ratio[0]
panelHeight = args.ratio[1]
else:
figWidth = 7
figHeight = len(commands) + 2
#set the panel dimensions
panelWidth = 5
# panel margin x 2 + panel height = total vertical height
panelHeight = 0.8
panelMargin = 0.1
figure = plt.figure(figsize=(figWidth,figHeight))
#find the margins to center the panel in figure
leftMargin = (figWidth - panelWidth)/2
bottomMargin = ((figHeight - panelHeight)/2) + panelMargin
plot_dict = {"ref": plot_ref,
"bam": plot_bam,
"gff3": plot_gff3
#"peptides": plot_peptides
}
panels = []
for i in range(len(commands)):
thiscmd = commands[i]
if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \
or thiscmd.style == "depth" \
or "narrow" in thiscmd.options:
temp_panelHeight = 0.5
else:
temp_panelHeight = panelHeight
panels.append( plt.axes([leftMargin/figWidth, #left
bottomMargin/figHeight, #bottom
panelWidth/figWidth, #width
temp_panelHeight/figHeight]) #height
)
panels[i].tick_params(axis='both',which='both',\
bottom='off', labelbottom='off',\
left='on', labelleft='on', \
right='off', labelright='off',\
top='off', labeltop='off')
if thiscmd.cmdtype == "ref":
panels[i].tick_params(bottom='on', labelbottom='on')
#turn off some of the axes
panels[i].spines["top"].set_visible(False)
panels[i].spines["bottom"].set_visible(False)
panels[i].spines["right"].set_visible(False)
panels[i].spines["left"].set_visible(False)
panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR,
args.START, args.STOP,
thiscmd)
bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin)
# Print image(s)
if args.BASENAME is None:
file_base = 'browser_{}.png'.format(timestamp())
else:
file_base = args.BASENAME
path = None
if args.path:
path = args.path
transparent = args.transparent
print_images(
base_output_name=file_base,
image_formats=args.fileform,
dpi=args.dpi,
no_timestamp = kwargs["no_timestamp"],
path = path,
transparent=transparent)
def run(args):
browser(args)
pauvre-0.2.3/pauvre/custommargin.py 0000644 0026123 0000167 00000042013 13655434263 020570 0 ustar dschultz biolum 0000000 0000000 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# pauvre - just a pore PhD student's plotting package
# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see .
import ast
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.patches as mplpatches
from matplotlib.colors import LinearSegmentedColormap
import numpy as np
import pandas as pd
import os.path as opath
from sys import stderr
from pauvre.functions import print_images
from pauvre.marginplot import generate_panel
from pauvre.stats import stats
import pauvre.rcparams as rc
import sys
import logging
# logging
logger = logging.getLogger('pauvre')
def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
"""This helper method generates the histogram that is added to the panel.
In this case, horizontal = True applies to the mean quality histogram.
So, horizontal = False only applies to the length histogram.
"""
l_width = 0.0
f_color = (0.5, 0.5, 0.5)
e_color = (0, 0, 0)
if horizontal:
for step in np.arange(0, len(bin_values), 1):
left = bins[step]
bottom = 0
width = bins[step + 1] - bins[step]
height = bin_values[step]
hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
linewidth=l_width,
facecolor=f_color,
edgecolor=e_color)
panel.add_patch(hist_rectangle)
else:
for step in np.arange(0, len(bin_values), 1):
left = 0
bottom = bins[step]
width = bin_values[step]
height = bins[step + 1] - bins[step]
hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
linewidth=l_width,
facecolor=f_color,
edgecolor=e_color)
panel.add_patch(hist_rectangle)
def generate_histogram(panel, data_list, min_plot_val, max_plot_val,
bin_interval, hist_horizontal=True,
left_spine=True, bottom_spine=True,
top_spine=False, right_spine=False, x_label=None,
y_label=None):
bins = np.arange(0, max_plot_val, bin_interval)
bin_values, bins2 = np.histogram(data_list, bins)
# hist_horizontal is used for quality
if hist_horizontal:
panel.set_xlim([min_plot_val, max_plot_val])
panel.set_ylim([0, max(bin_values * 1.1)])
# and hist_horizontal == Fale is for read length
else:
panel.set_xlim([0, max(bin_values * 1.1)])
panel.set_ylim([min_plot_val, max_plot_val])
# Generate histogram bin patches, depending on whether we're plotting
# vertically or horizontally
_generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
panel.spines['left'].set_visible(left_spine)
panel.spines['bottom'].set_visible(bottom_spine)
panel.spines['top'].set_visible(top_spine)
panel.spines['right'].set_visible(right_spine)
if y_label is not None:
panel.set_ylabel(y_label)
if x_label is not None:
panel.set_xlabel(x_label)
def generate_square_map(panel, data_frame, plot_min_y, plot_min_x,
plot_max_y, plot_max_x, color,
xcol, ycol, **kwargs):
"""This generates the heatmap panels using squares. Everything is
quantized by ints.
"""
panel.set_xlim([plot_min_x, plot_max_x])
panel.set_ylim([plot_min_y, plot_max_y])
tempdf = data_frame[[xcol, ycol]]
data_frame = tempdf.astype(int)
querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
print(" - Filtering squares with {}".format(querystring))
square_this = data_frame.query(querystring)
querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
print(" - Filtering squares with {}".format(querystring))
square_this = square_this.query(querystring)
counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts')
for index, row in counts.iterrows():
x_pos = row[xcol]
y_pos = row[ycol]
thiscolor = color(row["counts"]/(counts["counts"].max()))
rectangle1=mplpatches.Rectangle((x_pos,y_pos),1,1,
linewidth=0,\
facecolor=thiscolor)
panel.add_patch(rectangle1)
all_counts = counts["counts"]
return all_counts
def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x,
plot_max_y, plot_max_x, color,
xcol, ycol, **kwargs):
panel.set_xlim([plot_min_x, plot_max_x])
panel.set_ylim([plot_min_y, plot_max_y])
querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol)
print(" - Filtering hexmap with {}".format(querystring))
hex_this = data_frame.query(querystring)
querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x)
print(" - Filtering hexmap with {}".format(querystring))
hex_this = hex_this.query(querystring)
# This single line controls plotting the hex bins in the panel
hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49,
linewidths=0.0, cmap=color)
for each in panel.spines:
panel.spines[each].set_visible(False)
counts = hex_vals.get_array()
return counts
def generate_legend(panel, counts, color):
# completely custom for more control
panel.set_xlim([0, 1])
panel.set_ylim([0, 1000])
panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
for i in np.arange(0, 1001, 1):
rgba = color(i / 1001)
alpha = rgba[-1]
facec = rgba[0:3]
hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
linewidth=0.0,
facecolor=facec,
edgecolor=(0, 0, 0),
alpha=alpha)
panel.add_patch(hist_rectangle)
panel.spines['top'].set_visible(False)
panel.spines['left'].set_visible(False)
panel.spines['bottom'].set_visible(False)
panel.yaxis.set_label_position("right")
panel.set_ylabel('count')
def custommargin(df, **kwargs):
rc.update_rcParams()
# 250, 231, 34 light yellow
# 67, 1, 85
# R=np.linspace(65/255,1,101)
# G=np.linspace(0/255, 231/255, 101)
# B=np.linspace(85/255, 34/255, 101)
# R=65/255, G=0/255, B=85/255
Rf = 65 / 255
Bf = 85 / 255
pdict = {'red': ((0.0, Rf, Rf),
(1.0, Rf, Rf)),
'green': ((0.0, 0.0, 0.0),
(1.0, 0.0, 0.0)),
'blue': ((0.0, Bf, Bf),
(1.0, Bf, Bf)),
'alpha': ((0.0, 0.0, 0.0),
(1.0, 1.0, 1.0))
}
# Now we will use this example to illustrate 3 ways of
# handling custom colormaps.
# First, the most direct and explicit:
purple1 = LinearSegmentedColormap('Purple1', pdict)
# set the figure dimensions
fig_width = 1.61 * 3
fig_height = 1 * 3
fig = plt.figure(figsize=(fig_width, fig_height))
# set the panel dimensions
heat_map_panel_width = fig_width * 0.5
heat_map_panel_height = heat_map_panel_width * 0.62
# find the margins to center the panel in figure
fig_left_margin = fig_bottom_margin = (1 / 6)
# lengthPanel
y_panel_width = (1 / 8)
# the color Bar parameters
legend_panel_width = (1 / 24)
# define padding
h_padding = 0.02
v_padding = 0.05
# Set whether to include y-axes in histograms
print(" - Setting panel options.", file = sys.stderr)
if kwargs["Y_AXES"]:
y_bottom_spine = True
y_bottom_tick = True
y_bottom_label = True
x_left_spine = True
x_left_tick = True
x_left_label = True
x_y_label = 'Count'
else:
y_bottom_spine = False
y_bottom_tick = False
y_bottom_label = False
x_left_spine = False
x_left_tick = False
x_left_label = False
x_y_label = None
panels = []
# Quality histogram panel
print(" - Generating the x-axis panel.", file = sys.stderr)
x_panel_left = fig_left_margin + y_panel_width + h_padding
x_panel_width = heat_map_panel_width / fig_width
x_panel_height = y_panel_width * fig_width / fig_height
x_panel = generate_panel(x_panel_left,
fig_bottom_margin,
x_panel_width,
x_panel_height,
left_tick_param=x_left_tick,
label_left_tick_param=x_left_label)
panels.append(x_panel)
# y histogram panel
print(" - Generating the y-axis panel.", file = sys.stderr)
y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
y_panel_height = heat_map_panel_height / fig_height
y_panel = generate_panel(fig_left_margin,
y_panel_bottom,
y_panel_width,
y_panel_height,
bottom_tick_param=y_bottom_tick,
label_bottom_tick_param=y_bottom_label)
panels.append(y_panel)
# Heat map panel
heat_map_panel_left = fig_left_margin + y_panel_width + h_padding
heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
print(" - Generating the heat map panel.", file = sys.stderr)
heat_map_panel = generate_panel(heat_map_panel_left,
heat_map_panel_bottom,
heat_map_panel_width / fig_width,
heat_map_panel_height / fig_height,
bottom_tick_param='off',
label_bottom_tick_param='off',
left_tick_param='off',
label_left_tick_param='off')
panels.append(heat_map_panel)
heat_map_panel.set_title(kwargs["title"])
# Legend panel
print(" - Generating the legend panel.", file = sys.stderr)
legend_panel_left = fig_left_margin + y_panel_width + \
heat_map_panel_width / fig_width + h_padding
legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding
legend_panel_height = heat_map_panel_height / fig_height
legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
legend_panel_width, legend_panel_height,
bottom_tick_param=False,
label_bottom_tick_param=False,
left_tick_param=False,
label_left_tick_param=False,
right_tick_param=True,
label_right_tick_param=True)
panels.append(legend_panel)
#
# Everything above this is just to set up the panels
#
##################################################################
# Set max and min viewing window for the xaxis
if kwargs["plot_max_x"]:
plot_max_x = kwargs["plot_max_x"]
else:
if kwargs["square"]:
plot_max_x = df[kwargs["xcol"]].max()
plot_max_x = max(np.ceil(df[kwargs["xcol"]]))
plot_min_x = kwargs["plot_min_x"]
# Set x bin sizes
if kwargs["xbin"]:
x_bin_interval = kwargs["xbin"]
else:
# again, this is just based on what looks good from experience
x_bin_interval = 1
# Generate x histogram
print(" - Generating the x-axis histogram.", file = sys.stderr)
generate_histogram(panel = x_panel,
data_list = df[kwargs['xcol']],
min_plot_val = plot_min_x,
max_plot_val = plot_max_x,
bin_interval = x_bin_interval,
hist_horizontal = True,
x_label=kwargs['xcol'],
y_label=x_y_label,
left_spine=x_left_spine)
# Set max and min viewing window for the y axis
if kwargs["plot_max_y"]:
plot_max_y = kwargs["plot_max_y"]
else:
if kwargs["square"]:
plot_max_y = df[kwargs["ycol"]].max()
else:
plot_max_y = max(np.ceil(df[kwargs["ycol"]]))
plot_min_y = kwargs["plot_min_y"]
# Set y bin sizes
if kwargs["ybin"]:
y_bin_interval = kwargs["ybin"]
else:
y_bin_interval = 1
# Generate y histogram
print(" - Generating the y-axis histogram.", file = sys.stderr)
generate_histogram(panel = y_panel,
data_list = df[kwargs['ycol']],
min_plot_val = plot_min_y,
max_plot_val = plot_max_y,
bin_interval = y_bin_interval,
hist_horizontal = False,
y_label = kwargs['ycol'],
bottom_spine = y_bottom_spine)
# Generate heat map
if kwargs["square"]:
print(" - Generating the square heatmap.", file = sys.stderr)
counts = generate_square_map(panel = heat_map_panel,
data_frame = df,
plot_min_y = plot_min_y,
plot_min_x = plot_min_x,
plot_max_y = plot_max_y,
plot_max_x = plot_max_x,
color = purple1,
xcol = kwargs["xcol"],
ycol = kwargs["ycol"])
else:
print(" - Generating the heatmap.", file = sys.stderr)
counts = generate_heat_map(panel = heat_map_panel,
data_frame = df,
plot_min_y = plot_min_y,
plot_min_x = plot_min_x,
plot_max_y = plot_max_y,
plot_max_x = plot_max_x,
color = purple1,
xcol = kwargs["xcol"],
ycol = kwargs["ycol"])
# Generate legend
print(" - Generating the legend.", file = sys.stderr)
generate_legend(legend_panel, counts, purple1)
# inform the user of the plotting window if not quiet mode
#if not kwargs["QUIET"]:
# print("""plotting in the following window:
# {0} <= Q-score (x-axis) <= {1}
# {2} <= length (y-axis) <= {3}""".format(
# plot_min_x, plot_max_x, min_plot_val, max_plot_val),
# file=stderr)
# Print image(s)
if kwargs["output_base_name"] is None:
file_base = "custommargin"
else:
file_base = kwargs["output_base_name"]
print(" - Saving your images", file = sys.stderr)
print_images(
base =file_base,
image_formats=kwargs["fileform"],
dpi=kwargs["dpi"],
no_timestamp = kwargs["no_timestamp"],
transparent= kwargs["no_transparent"])
def run(args):
print(args)
if not opath.exists(args.input_file):
raise IOError("The input file does not exist: {}".format(
args.input_file))
df = pd.read_csv(args.input_file, header='infer', sep='\t')
# make sure that the column names that were specified are actually
# in the dataframe
if args.xcol not in df.columns:
raise IOError("""The x-column name that you specified, {}, is not in the
dataframe column names: {}""".format(args.xcol, df.columns))
if args.ycol not in df.columns:
raise IOError("""The y-column name that you specified, {}, is not in the
dataframe column names: {}""".format(args.ycol, df.columns))
print(" - Successfully read csv file. Here are a few lines:",
file = sys.stderr)
print(df.head(), file = sys.stderr)
print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr)
print(df[args.xcol].head(), file = sys.stderr)
print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr)
print(df[args.ycol].head(), file = sys.stderr)
custommargin(df=df.dropna(), **vars(args))
pauvre-0.2.3/pauvre/functions.py 0000644 0026123 0000167 00000036521 13622037370 020067 0 ustar dschultz biolum 0000000 0000000 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# pauvre
# Copyright (c) 2016-2020 Darrin T. Schultz.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see .
from Bio import SeqIO
import copy
import gzip
import matplotlib.pyplot as plt
import numpy as np
import os
import pandas as pd
from sys import stderr
import time
# this makes opening files more robust for different platforms
# currently only used in GFFParse
import codecs
import warnings
def print_images(base, image_formats, dpi,
transparent=False, no_timestamp = False):
"""
Save the plot in multiple formats, with or without transparency
and with or without timestamps.
"""
for fmt in image_formats:
if no_timestamp:
out_name = "{0}.{1}".format(base, fmt)
else:
out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt)
try:
if fmt == 'png':
plt.savefig(out_name, dpi=dpi, transparent=transparent)
else:
plt.savefig(out_name, format=fmt, transparent=transparent)
except PermissionError:
# thanks to https://github.com/wdecoster for the suggestion
print("""You don't have permission to save pauvre plots to this
directory. Try changing the directory and running the script again!""")
class GFFParse():
def __init__(self, filename, stop_codons=None, species=None):
self.filename = filename
self.samplename = os.path.splitext(os.path.basename(filename))[0]
self.species = species
self.featureDict = {"name": [],
"featType": [],
"start": [],
"stop": [],
"strand": []}
gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1",
"strand", "dunno2", "tags"]
self.features = pd.read_csv(self.filename, comment='#',
sep='\t', names=gffnames)
self.features['name'] = self.features['tags'].apply(self._get_name)
self.features.drop('dunno1', 1, inplace=True)
self.features.drop('dunno2', 1, inplace=True)
self.features.reset_index(inplace=True, drop=True)
# warn the user if there are CDS or gene entries not divisible by three
self._check_triplets()
# sort the database by start
self.features.sort_values(by='start', ascending=True, inplace=True)
if stop_codons:
strip_codons = ['gene', 'CDS']
# if the direction is forward, subtract three from the stop to bring it closer to the start
self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\
self.features.loc[(self.features['featType'].isin(strip_codons))
& (self.features['strand'] == '+'), 'stop'] - 3
# if the direction is reverse, add three to the start (since the coords are flip-flopped)
self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\
self.features.loc[(self.features['featType'].isin(strip_codons))
& (self.features['strand'] == '-'), 'start'] + 3
self.features['center'] = self.features['start'] + \
((self.features['stop'] - self.features['start']) / 2)
# we need to add one since it doesn't account for the last base otherwise
self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1
self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1)
self.features['rmost'] = self.features.apply(self._determine_rmost, axis=1)
self.features['track'] = 0
if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1:
raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true"
with a region from 1 to the number of bases in the mitogenome
example:
Bf201311 Geneious region 1 13337 . + 0 Is_circular=true
""")
self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop'])
self.features.reset_index(inplace=True, drop=True)
#print("float", self.features.loc[self.features['name'] == 'COX1', 'center'])
#print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center']))
# print(self.features)
# print(self.seqlen)
def set_features(self, new_features):
"""all this does is reset the features pandas dataframe"""
self.features = new_features
def get_unique_genes(self):
"""This returns a series of gene names"""
plottable = self.features.query(
"featType != 'tRNA' and featType != 'region' and featType != 'source'")
return set(plottable['name'].unique())
def shuffle(self):
"""
this returns a list of all possible shuffles of features.
A shuffle is when the frontmost bit of coding + noncoding DNA up
until the next bit of coding DNA is removed and tagged on the
end of the sequence. In this case this process is represented by
shifting gff coordinates.
"""
shuffles = []
# get the index of the first element
# get the index of the next thing
# subtract the indices of everything, then reset the ones that are below
# zero
done = False
shuffle_features = self.features[self.features['featType'].isin(
['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True)
# we first add the shuffle features without reorganizing
# print("shuffle\n",shuffle_features)
add_first = copy.deepcopy(self)
add_first.set_features(shuffle_features)
shuffles.append(add_first)
# first gene is changed with every iteration
first_gene = list(shuffle_features['name'])[0]
# absolute first is the first gene in the original gff file, used to determine if we are done in this while loop
absolute_first = list(shuffle_features['name'])[0]
while not done:
# We need to prevent the case of shuffling in the middle of
# overlapped genes. Do this by ensuring that the the start of
# end of first gene is less than the start of the next gene.
first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop'])
next_gene = ""
for next_index in range(1, len(shuffle_features)):
# get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index
next_gene_df = list(
shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name'])
if len(next_gene_df) != 0:
next_gene = next_gene_df[next_index]
next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
#print("looking at {}, prev_stop is {}, start is {}".format(
# next_gene, first_stop, next_start))
#print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])])
# if the gene we're looking at and the next one don't overlap, move on
if first_stop < next_start:
break
#print("next_gene before checking for first is {}".format(next_gene))
if next_gene == absolute_first:
done = True
break
# now we can reset the first gene for the next iteration
first_gene = next_gene
shuffle_features = shuffle_features.copy(deep=True)
# figure out where the next start point is going to be
next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start'])
#print('next gene: {}'.format(next_gene))
shuffle_features['start'] = shuffle_features['start'] - next_start + 1
shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1
shuffle_features['center'] = shuffle_features['center'] - next_start + 1
# now correct the values that are less than 0
shuffle_features.loc[shuffle_features['start'] < 1,
'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen
shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop']
< 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width']
shuffle_features['center'] = shuffle_features['start'] + \
((shuffle_features['stop'] - shuffle_features['start']) / 2)
shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1)
shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1)
shuffle_features.sort_values(by='start', ascending=True, inplace=True)
shuffle_features.reset_index(inplace=True, drop=True)
new_copy = copy.deepcopy(self)
new_copy.set_features(shuffle_features)
shuffles.append(new_copy)
#print("len shuffles: {}".format(len(shuffles)))
return shuffles
def couple(self, other_GFF, this_y=0, other_y=1):
"""
Compares this set of features to another set and generates tuples of
(x,y) coordinate pairs to input into lsi
"""
other_features = other_GFF.features
coordinates = []
for thisname in self.features['name']:
othermatch = other_features.loc[other_features['name'] == thisname, 'center']
if len(othermatch) == 1:
this_x = float(self.features.loc[self.features['name']
== thisname, 'center']) # /self.seqlen
other_x = float(othermatch) # /other_GFF.seqlen
# lsi can't handle vertical or horizontal lines, and we don't
# need them either for our comparison. Don't add if equivalent.
if this_x != other_x:
these_coords = ((this_x, this_y), (other_x, other_y))
coordinates.append(these_coords)
return coordinates
def _check_triplets(self):
"""This method verifies that all entries of featType gene and CDS are
divisible by three"""
genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'")
not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ]
if len(not_trips) > 0:
print_string = ""
print_string += "There are CDS and gene entries that are not divisible by three\n"
print_string += str(not_trips)
warnings.warn(print_string, SyntaxWarning)
def _get_name(self, tag_value):
"""This extracts a name from a single row in 'tags' of the pandas
dataframe
"""
try:
if ";" in tag_value:
name = tag_value[5:].split(';')[0]
else:
name = tag_value[5:].split()[0]
except:
name = tag_value
print("Couldn't correctly parse {}".format(
tag_value))
return name
def _determine_lmost(self, row):
"""Booleans don't work well for pandas dataframes, so I need to use apply
"""
if row['start'] < row['stop']:
return row['start']
else:
return row['stop']
def _determine_rmost(self, row):
"""Booleans don't work well for pandas dataframes, so I need to use apply
"""
if row['start'] < row['stop']:
return row['stop']
else:
return row['start']
def parse_fastq_length_meanqual(fastq):
"""
arguments:
the fastq file path. Hopefully it has been verified to exist already
purpose:
This function parses a fastq and returns a pandas dataframe of read lengths
and read meanQuals.
"""
# First try to open the file with the gzip package. It will crash
# if the file is not gzipped, so this is an easy way to test if
# the fastq file is gzipped or not.
try:
handle = gzip.open(fastq, "rt")
length, meanQual = _fastq_parse_helper(handle)
except:
handle = open(fastq, "r")
length, meanQual = _fastq_parse_helper(handle)
handle.close()
df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual'])
return df
def filter_fastq_length_meanqual(df, min_len, max_len,
min_mqual, max_mqual):
querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual)
if max_len != None:
querystring += " and length <= {}".format(max_len)
if max_mqual != None:
querystring += " and meanQual <= {}".format(max_mqual)
print("Keeping reads that satisfy: {}".format(querystring), file=stderr)
filtdf = df.query(querystring)
#filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce')
#filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce')
return filtdf
def _fastq_parse_helper(handle):
length = []
meanQual = []
for record in SeqIO.parse(handle, "fastq"):
if len(record) > 0:
meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"]))
length.append(len(record))
return length, meanQual
def _geometric_mean(phred_values):
"""in case I want geometric mean in the future, can calculate it like this"""
# np.mean(record.letter_annotations["phred_quality"]))
pass
def _arithmetic_mean(phred_values):
"""
Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean,
log transform back to Phred.
"""
if not isinstance(phred_values, np.ndarray):
phred_values = np.array(phred_values)
return _erate_to_phred(np.mean(_phred_to_erate(phred_values)))
def _phred_to_erate(phred_values):
"""
converts a list or numpy array of phred values to a numpy array
of error rates
"""
if not isinstance(phred_values, np.ndarray):
phred_values = np.array(phred_values)
return np.power(10, (-1 * (phred_values / 10)))
def _erate_to_phred(erate_values):
"""
converts a list or numpy array of error rates to a numpy array
of phred values
"""
if not isinstance(erate_values, np.ndarray):
phred_values = np.array(erate_values)
return -10 * np.log10(erate_values)
def timestamp():
"""
Returns the current time in :samp:`YYYYMMDD_HHMMSS` format.
"""
return time.strftime("%Y%m%d_%H%M%S")
pauvre-0.2.3/pauvre/gfftools.py 0000644 0026123 0000167 00000063720 13322176054 017703 0 ustar dschultz biolum 0000000 0000000 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# pauvre - a pore plotting package
# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see .
"""This file contains things related to parsing and plotting GFF files"""
import copy
from matplotlib.path import Path
import matplotlib.patches as patches
global chevron_width
global arrow_width
global min_text
global text_cutoff
arrow_width = 80
chevron_width = 40
min_text = 550
text_cutoff = 150
import sys
global colorMap
colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white',
'ORF':'orange'}
def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None):
""" plots a left to right patch with introns when there are no intervening
sequences to consider. Uses a gene id and gffutils database as input.
b
a .-=^=-. c
1__________2---/ e `---1__________2
| #lff \f d| #lff \
| left to \3 | left to \3
| right / | right /
5___________/4 5___________/4
"""
#first we need to determine the number of exons
bar_thickness = 0.75
#now we can start plotting the exons
exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
for i in range(len(exonlist)):
cds_start = exonlist[i].start
cds_stop = exonlist[i].stop
verts = [(cds_start, y_pos + bar_thickness), #1
(cds_stop - chevron_width, y_pos + bar_thickness), #2
(cds_stop, y_pos + (bar_thickness/2)), #3
(cds_stop - chevron_width, y_pos), #4
(cds_start, y_pos), #5
(cds_start, y_pos + bar_thickness), #1
]
codes = [Path.MOVETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.CLOSEPOLY,
]
path = Path(verts, codes)
patch = patches.PathPatch(path, lw = 0,
fc=colorMap['CDS'] )
panel.add_patch(patch)
# we must draw the splice junction
if i < len(exonlist) - 1:
next_start = exonlist[i+1].start
next_stop = exonlist[i+1].stop
middle = cds_stop + ((next_start - cds_stop)/2)
verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a
(middle, y_pos + 0.95), #b
(next_start, y_pos + bar_thickness), #c
(next_start, y_pos + bar_thickness - 0.05), #d
(middle, y_pos + 0.95 - 0.05), #e
(cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f
(cds_stop - chevron_width, y_pos + bar_thickness), #2/a
]
codes = [Path.MOVETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.CLOSEPOLY,
]
path = Path(verts, codes)
patch = patches.PathPatch(path, lw = 0,
fc=colorMap['CDS'] )
panel.add_patch(patch)
return panel
def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None):
""" slightly different from the above version such thatsplice junctions
are more visually explicit.
plots a left to right patch with introns when there are no intervening
sequences to consider. Uses a gene id and gffutils database as input.
b
a .-=^=-. c
1_____________2---/ e `---1_____________2
| #lff /f d| #lff /
| left to / | left to /
| right / | right /
4_________/3 4_________/3
"""
#first we need to determine the number of exons
bar_thickness = 0.75
#now we can start plotting the exons
exonlist = list(db.children(geneid, featuretype='CDS', order_by="start"))
for i in range(len(exonlist)):
cds_start = exonlist[i].start
cds_stop = exonlist[i].stop
verts = [(cds_start, y_pos + bar_thickness), #1
(cds_stop, y_pos + bar_thickness), #2
(cds_stop - chevron_width, y_pos), #4
(cds_start, y_pos), #5
(cds_start, y_pos + bar_thickness), #1
]
codes = [Path.MOVETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.CLOSEPOLY,
]
path = Path(verts, codes)
patch = patches.PathPatch(path, lw = 0,
fc=colorMap['CDS'] )
panel.add_patch(patch)
# we must draw the splice junction
if i < len(exonlist) - 1:
next_start = exonlist[i+1].start
next_stop = exonlist[i+1].stop
middle = cds_stop + ((next_start - cds_stop)/2)
verts = [(cds_stop-5, y_pos + bar_thickness), #2/a
(middle, y_pos + 0.95), #b
(next_start, y_pos + bar_thickness), #c
(next_start, y_pos + bar_thickness - 0.05), #d
(middle, y_pos + 0.95 - 0.05), #e
(cds_stop-5, y_pos + bar_thickness -0.05), #f
(cds_stop-5, y_pos + bar_thickness), #2/a
]
codes = [Path.MOVETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.CLOSEPOLY,
]
path = Path(verts, codes)
patch = patches.PathPatch(path, lw = 0,
fc=colorMap['CDS'] )
panel.add_patch(patch)
return panel
def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
""" plots a lff patch
1__________2 ____________
| #lff \ \ #rff \
| left for \3 \ right for \
| forward / / forward /
5___________/4 /___________/
"""
#if there is only one feature to plot, then just plot it
print("plotting lff")
verts = [(left_df['start'], y_pos + bar_thickness), #1
(right_df['start'] - chevron_width, y_pos + bar_thickness), #2
(left_df['stop'], y_pos + (bar_thickness/2)), #3
(right_df['start'] - chevron_width, y_pos), #4
(left_df['start'], y_pos), #5
(left_df['start'], y_pos + bar_thickness), #1
]
codes = [Path.MOVETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.CLOSEPOLY,
]
path = Path(verts, codes)
patch = patches.PathPatch(path, lw = 0,
fc=colorMap[left_df['featType']] )
text_width = left_df['width']
if text and text_width >= min_text:
panel = _plot_label(panel, left_df, y_pos, bar_thickness)
elif text and text_width < min_text and text_width >= text_cutoff:
panel = _plot_label(panel, left_df,
y_pos, bar_thickness,
rotate = True, arrow = True)
return panel, patch
def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False):
# handles the case where a dataframe was passed
fontsize = 8
rotation = 0
if rotate:
fontsize = 5
rotation = 90
if len(df) == 1:
x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start']
y = y_pos + (bar_thickness/2)
# if we need to center somewhere other than the arrow, need to adjust
# for the direction of the arrow
# it doesn't look good if it shifts by the whole arrow width, so only
# shift by half the arrow width
if arrow:
if df.loc[0, 'strand'] == "+":
shift_start = df.loc[0, 'start']
else:
shift_start = df.loc[0, 'start'] + (arrow_width/2)
x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start
panel.text(x, y,
df.loc[0, 'name'], fontsize = fontsize,
ha='center', va='center',
color = 'white', family = 'monospace',
zorder = 100, rotation = rotation)
# and the case where a series was passed
else:
x = ((df['stop'] - df['start'])/2) + df['start']
y = y_pos + (bar_thickness/2)
if arrow:
if df['strand'] == "+":
shift_start = df['start']
else:
shift_start = df['start'] + (arrow_width/2)
x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start
panel.text(x, y,
df['name'], fontsize = fontsize,
ha='center', va='center',
color = 'white', family = 'monospace',
zorder = 100, rotation = rotation)
return panel
def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text):
""" plots a rff patch
____________ 1__________2
| #lff \ \ #rff \
| left for \ 6\ right for \3
| forward / / forward /
|___________/ /5__________/4
"""
#if there is only one feature to plot, then just plot it
print("plotting rff")
verts = [(right_df['start'], y_pos + bar_thickness), #1
(right_df['stop'] - arrow_width, y_pos + bar_thickness), #2
(right_df['stop'], y_pos + (bar_thickness/2)), #3
(right_df['stop'] - arrow_width, y_pos), #4
(right_df['start'], y_pos), #5
(left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6
(right_df['start'], y_pos + bar_thickness), #1
]
codes = [Path.MOVETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.CLOSEPOLY,
]
path = Path(verts, codes)
patch = patches.PathPatch(path, lw = 0,
fc=colorMap[right_df['featType']] )
text_width = right_df['width']
if text and text_width >= min_text:
panel = _plot_label(panel, right_df, y_pos, bar_thickness)
elif text and text_width < min_text and text_width >= text_cutoff:
panel = _plot_label(panel, right_df,
y_pos, bar_thickness, rotate = True)
return panel, patch
def x_offset_gff(GFFParseobj, x_offset):
"""Takes in a gff object (a gff file parsed as a pandas dataframe),
and an x_offset value and shifts the start, stop, center, lmost, and rmost.
Returns a GFFParse object with the shifted values in GFFParse.features.
"""
for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']:
GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset
return GFFParseobj
def gffplot_horizontal(figure, panel, args, gff_object,
track_width=0.2, start_y=0.1, **kwargs):
"""
this plots horizontal things from gff files. it was probably written for synplot,
as the browser does not use this at all.
"""
# Because this size should be relative to the circle that it is plotted next
# to, define the start_radius as the place to work from, and the width of
# each track.
colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red',
'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'}
augment = 0
bar_thickness = 0.9 * track_width
# return these at the end
myPatches=[]
plot_order = []
idone = False
# we need to filter out the tRNAs since those are plotted last
plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'")
plottable_features.reset_index(inplace=True, drop=True)
print(plottable_features)
len_plottable = len(plottable_features)
print('len plottable', len_plottable)
# - this for loop relies on the gff features to already be sorted
# - The algorithm for this loop works by starting at the 0th index of the
# plottable features (i).
# - It then looks to see if the next object (the jth) overlaps with the
# ith element.
i = 0
j = 1
while i < len(plottable_features):
if i + j == len(plottable_features):
#we have run off of the df and need to include everything from i to the end
these_features = plottable_features.loc[i::,].copy(deep=True)
these_features = these_features.reset_index()
print(these_features)
plot_order.append(these_features)
i = len(plottable_features)
break
print(" - i,j are currently: {},{}".format(i, j))
stop = plottable_features.loc[i]["stop"]
start = plottable_features.loc[i+j]["start"]
print("stop: {}. start: {}.".format(stop, start))
if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]:
print(" - putting elements {} through (including) {} together".format(i, i+j))
these_features = plottable_features.loc[i:i+j-1,].copy(deep=True)
these_features = these_features.reset_index()
print(these_features)
plot_order.append(these_features)
i += 1
j = 1
else:
j += 1
#while idone == False:
# print("im in the overlap-pairing while loop i={}".format(i))
# # look ahead at all of the elements that overlap with the ith element
# jdone = False
# j = 1
# this_set_minimum_index = i
# this_set_maximum_index = i
# while jdone == False:
# print("new i= {} j={} len={}".format(i, j, len_plottable))
# print("len plottable in jdone: {}".format(len_plottable))
# print("plottable features in jdone:\n {}".format(plottable_features))
# # first make sure that we haven't gone off the end of the dataframe
# # This is an edge case where i has a jth element that overlaps with it,
# # and j is the last element in the plottable features.
# if i+j == len_plottable:
# print("i+j == len_plottable")
# # this checks for the case that i is the last element of the
# # plottable features.
# # In both of the above cases, we are done with both the ith and
# # the jth features.
# if i == len_plottable-1:
# print("i == len_plottable-1")
# # this is the last analysis, so set idone to true
# # to finish after this
# idone = True
# # the last one can't be in its own group, so just add it solo
# these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
# plot_order.append(these_features.reset_index(drop=True))
# break
# jdone = True
# else:
# print("i+j != len_plottable")
# # if the lmost of the next gene overlaps with the rmost of
# # the current one, it overlaps and couple together
# if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']:
# print("lmost < rmost")
# # note that this feature overlaps with the current
# this_set_maximum_index = i+j
# # ... and we need to look at the next in line
# j += 1
# else:
# print("lmost !< rmost")
# i += 1 + (this_set_maximum_index - this_set_minimum_index)
# #add all of the things that grouped together once we don't find any more groups
# these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True)
# plot_order.append(these_features.reset_index(drop=True))
# jdone = True
# print("plot order is now: {}".format(plot_order))
# print("jdone: {}".format(str(jdone)))
for feature_set in plot_order:
# plot_feature_hori handles overlapping cases as well as normal cases
panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap,
start_y, bar_thickness, text = True)
for each in patches:
print("there are {} patches after gffplot_feature_hori".format(len(patches)))
print(each)
myPatches.append(each)
print("length of myPatches is: {}".format(len(myPatches)))
# Now we add all of the tRNAs to this to plot, do it last to overlay
# everything else
tRNAs = gff_object.features.query("featType == 'tRNA'")
tRNAs.reset_index(inplace=True, drop = True)
tRNA_bar_thickness = bar_thickness * (0.8)
tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2)
for i in range(0,len(tRNAs)):
this_feature = tRNAs[i:i+1].copy(deep=True)
this_feature.reset_index(inplace=True, drop = True)
panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap,
tRNA_start_y, tRNA_bar_thickness, text = True)
for patch in patches:
myPatches.append(patch)
print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches)))
return panel, myPatches
def gffplot_feature_hori(figure, panel, feature_df,
colorMap, y_pos, bar_thickness, text=True):
"""This plots the track for a feature, and if there is something for
'this_feature_overlaps_feature', then there is special processing to
add the white bar and the extra slope for the chevron
"""
myPatches = []
#if there is only one feature to plot, then just plot it
if len(feature_df) == 1:
#print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1],
# str(feature_df['featType']).split()[1] ))
#print(this_feature['name'], "is not overlapping")
# This plots this shape: 1_________2 2_________1
# | forward \3 3/ reverse |
# |5__________/4 \4________5|
if feature_df.loc[0,'strand'] == '+':
verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1
(feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2
(feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3
(feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4
(feature_df.loc[0, 'start'], y_pos), #5
(feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1
elif feature_df.loc[0,'strand'] == '-':
verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1
(feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2
(feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3
(feature_df.loc[0, 'start'] + arrow_width, y_pos), #4
(feature_df.loc[0, 'stop'], y_pos), #5
(feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1
feat_width = feature_df.loc[0,'width']
if text and feat_width >= min_text:
panel = _plot_label(panel, feature_df.loc[0,],
y_pos, bar_thickness)
elif text and feat_width < min_text and feat_width >= text_cutoff:
panel = _plot_label(panel, feature_df.loc[0,],
y_pos, bar_thickness,
rotate = True, arrow = True)
codes = [Path.MOVETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.LINETO,
Path.CLOSEPOLY]
path = Path(verts, codes)
print("normal path is: {}".format(path))
# If the feature itself is smaller than the arrow, we need to take special measures to
if feature_df.loc[0,'width'] <= arrow_width:
path = Path([verts[i] for i in [0,2,4,5]],
[codes[i] for i in [0,2,4,5]])
patch = patches.PathPatch(path, lw = 0,
fc=colorMap[feature_df.loc[0, 'featType']] )
myPatches.append(patch)
# there are four possible scenarios if there are two overlapping sequences:
# ___________ ____________ ____________ ___________
# | #1 \ \ #1 \ / #2 / / #2 |
# | both seqs \ \ both seqs \ / both seqs / / both seqs |
# | forward / / forward / \ reverse \ \ reverse |
# |__________/ /___________/ \___________\ \___________|
# ___________ _____________ ____________ _ _________
# | #3 \ \ #3 | / #2 _| #2 \
# | one seq \ \ one seq | / one seq |_ one seq \
# | forward \ \ reverse | \ reverse _| forward /
# |_____________\ \_________| \__________|_ ___________/
#
# These different scenarios can be thought of as different left/right
# flanking segment types.
# In the annotation #rff:
# - 'r' refers to the annotation type as being on the right
# - the first 'f' refers to the what element is to the left of this one.
# Since it is forward the 5' end of this annotation must be a chevron
# - the second 'f' refers to the right side of this element. Since it is
# forward it must be a normal arrow.
# being on the right
#
# *LEFT TYPES* *RIGHT TYPES*
# ____________ ____________
# | #lff \ \ #rff \
# | left for \ \ right for \
# | forward / / forward /
# |___________/ /___________/
# ___________ _____________
# | #lfr \ \ #rfr |
# | left for \ \ right for |
# | reverse \ \ reverse |
# |_____________\ \_________|
# ____________ ___________
# / #lrr / / #rrr |
# / left rev / / right rev |
# \ reverse \ \ reverse |
# \___________\ \___________|
# ____________ __________
# / #lrf _| _| #rrf \
# / left rev |_ | _ right rev \
# \ forward _| _| forward /
# \__________| |____________/
#
# To properly plot these elements, we must go through each element of the
# feature_df to determine which patch type it is.
elif len(feature_df) == 2:
print("im in here feat len=2")
for i in range(len(feature_df)):
# this tests for which left type we're dealing with
if i == 0:
# type could be lff or lfr
if feature_df.loc[i, 'strand'] == '+':
if feature_df.loc[i + 1, 'strand'] == '+':
# plot a lff type
panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,],
colorMap, y_pos, bar_thickness, text)
myPatches.append(patch)
elif feature_df.loc[i + 1, 'strand'] == '-':
#plot a lfr type
raise IOError("can't plot {} patches yet".format("lfr"))
# or type could be lrr or lrf
elif feature_df.loc[i, 'strand'] == '-':
if feature_df.loc[i + 1, 'strand'] == '+':
# plot a lrf type
raise IOError("can't plot {} patches yet".format("lrf"))
elif feature_df.loc[i + 1, 'strand'] == '-':
#plot a lrr type
raise IOError("can't plot {} patches yet".format("lrr"))
# in this case we're only dealing with 'right type' patches
elif i == len(feature_df) - 1:
# type could be rff or rfr
if feature_df.loc[i-1, 'strand'] == '+':
if feature_df.loc[i, 'strand'] == '+':
# plot a rff type
panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,],
colorMap, y_pos, bar_thickness, text)
myPatches.append(patch)
elif feature_df.loc[i, 'strand'] == '-':
#plot a rfr type
raise IOError("can't plot {} patches yet".format("rfr"))
# or type could be rrr or rrf
elif feature_df.loc[i-1, 'strand'] == '-':
if feature_df.loc[i, 'strand'] == '+':
# plot a rrf type
raise IOError("can't plot {} patches yet".format("rrf"))
elif feature_df.loc[i, 'strand'] == '-':
#plot a rrr type
raise IOError("can't plot {} patches yet".format("rrr"))
return panel, myPatches
pauvre-0.2.3/pauvre/lsi/ 0000755 0026123 0000167 00000000000 14044135227 016264 5 ustar dschultz biolum 0000000 0000000 pauvre-0.2.3/pauvre/lsi/Q.py 0000755 0026123 0000167 00000003243 13274475032 017050 0 ustar dschultz biolum 0000000 0000000 # Binary search tree that holds status of sweep line. Only leaves hold values.
# Operations for finding left and right neighbors of a query point p and finding which segments contain p.
# Author: Sam Lichtenberg
# Email: splichte@princeton.edu
# Date: 09/02/2013
from pauvre.lsi.helper import *
ev = 0.00000001
class Q:
def __init__(self, key, value):
self.key = key
self.value = value
self.left = None
self.right = None
def find(self, key):
if self.key is None:
return False
c = compare_by_y(key, self.key)
if c==0:
return True
elif c==-1:
if self.left:
self.left.find(key)
else:
return False
else:
if self.right:
self.right.find(key)
else:
return False
def insert(self, key, value):
if self.key is None:
self.key = key
self.value = value
c = compare_by_y(key, self.key)
if c==0:
self.value += value
elif c==-1:
if self.left is None:
self.left = Q(key, value)
else:
self.left.insert(key, value)
else:
if self.right is None:
self.right = Q(key, value)
else:
self.right.insert(key, value)
# must return key AND value
def get_and_del_min(self, parent=None):
if self.left is not None:
return self.left.get_and_del_min(self)
else:
k = self.key
v = self.value
if parent:
parent.left = self.right
# i.e. is root node
else:
if self.right:
self.key = self.right.key
self.value = self.right.value
self.left = self.right.left
self.right = self.right.right
else:
self.key = None
return k,v
def print_tree(self):
if self.left:
self.left.print_tree()
print(self.key)
print(self.value)
if self.right:
self.right.print_tree()
pauvre-0.2.3/pauvre/lsi/T.py 0000755 0026123 0000167 00000020760 13274475032 017056 0 ustar dschultz biolum 0000000 0000000 # Binary search tree that holds status of sweep line. Only leaves hold values.
# Operations for finding left and right neighbors of a query point p and finding which segments contain p.
# Author: Sam Lichtenberg
# Email: splichte@princeton.edu
# Date: 09/02/2013
from pauvre.lsi.helper import *
ev = 0.00000001
class T:
def __init__(self):
self.root = Node(None, None, None, None)
def contain_p(self, p):
if self.root.value is None:
return [[], []]
lists = [[], []]
self.root.contain_p(p, lists)
return (lists[0], lists[1])
def get_left_neighbor(self, p):
if self.root.value is None:
return None
return self.root.get_left_neighbor(p)
def get_right_neighbor(self, p):
if self.root.value is None:
return None
return self.root.get_right_neighbor(p)
def insert(self, key, s):
if self.root.value is None:
self.root.left = Node(s, None, None, self.root)
self.root.value = s
self.root.m = get_slope(s)
else:
(node, path) = self.root.find_insert_pt(key, s)
if path == 'r':
node.right = Node(s, None, None, node)
node.right.adjust()
elif path == 'l':
node.left = Node(s, None, None, node)
else:
# this means matching Node was a leaf
# need to make a new internal Node
if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1):
new_internal = Node(s, None, node, node.parent)
new_leaf = Node(s, None, None, new_internal)
new_internal.left = new_leaf
if node is node.parent.left:
node.parent.left = new_internal
node.adjust()
else:
node.parent.right = new_internal
else:
new_internal = Node(node.value, node, None, node.parent)
new_leaf = Node(s, None, None, new_internal)
new_internal.right = new_leaf
if node is node.parent.left:
node.parent.left = new_internal
new_leaf.adjust()
else:
node.parent.right = new_internal
node.parent = new_internal
def delete(self, p, s):
key = p
node = self.root.find_delete_pt(key, s)
val = node.value
if node is node.parent.left:
parent = node.parent.parent
if parent is None:
if self.root.right is not None:
if self.root.right.left or self.root.right.right:
self.root = self.root.right
self.root.parent = None
else:
self.root.left = self.root.right
self.root.value = self.root.right.value
self.root.m = self.root.right.m
self.root.right = None
else:
self.root.left = None
self.root.value = None
elif node.parent is parent.left:
parent.left = node.parent.right
node.parent.right.parent = parent
else:
parent.right = node.parent.right
node.parent.right.parent = parent
else:
parent = node.parent.parent
if parent is None:
if self.root.left:
# switch properties
if self.root.left.right or self.root.left.left:
self.root = self.root.left
self.root.parent = None
else:
self.root.right = None
else:
self.root.right = None
self.root.value = None
elif node.parent is parent.left:
parent.left = node.parent.left
node.parent.left.parent = parent
farright = node.parent.left
while farright.right is not None:
farright = farright.right
farright.adjust()
else:
parent.right = node.parent.left
node.parent.left.parent = parent
farright = node.parent.left
while farright.right is not None:
farright = farright.right
farright.adjust()
return val
def print_tree(self):
self.root.print_tree()
class Node:
def __init__(self, value, left, right, parent):
self.value = value # associated line segment
self.left = left
self.right = right
self.parent = parent
self.m = None
if value is not None:
self.m = get_slope(value)
# compares line segment at y-val of p to p
# TODO: remove this and replace with get_x_at
def compare_to_key(self, p):
x0 = self.value[0][0]
y0 = self.value[0][1]
y1 = p[1]
if self.m != 0 and self.m is not None:
x1 = x0 - float(y0-y1)/self.m
return compare_by_x(p, (x1, y1))
else:
x1 = p[0]
return 0
def get_left_neighbor(self, p):
neighbor = None
n = self
if n.left is None and n.right is None:
return neighbor
last_right = None
found = False
while not found:
c = n.compare_to_key(p)
if c < 1 and n.left:
n = n.left
elif c==1 and n.right:
n = n.right
last_right = n.parent
else:
found = True
c = n.compare_to_key(p)
if c==0:
if n is n.parent.right:
return n.parent
else:
goright = None
if last_right:
goright =last_right.left
return self.get_lr(None, goright)[0]
# n stores the highest-value in the left subtree
if c==-1:
goright = None
if last_right:
goright = last_right.left
return self.get_lr(None, goright)[0]
if c==1:
neighbor = n
return neighbor
def get_right_neighbor(self, p):
neighbor = None
n = self
if n.left is None and n.right is None:
return neighbor
last_left = None
found = False
while not found:
c = n.compare_to_key(p)
if c==0 and n.right:
n = n.right
elif c < 0 and n.left:
n = n.left
last_left = n.parent
elif c==1 and n.right:
n = n.right
else:
found = True
c = n.compare_to_key(p)
# can be c==0 and n.left if at root node
if c==0:
if n.parent is None:
return None
if n is n.parent.right:
goleft = None
if last_left:
goleft = last_left.right
return self.get_lr(goleft, None)[1]
else:
return self.get_lr(n.parent.right, None)[1]
if c==1:
goleft = None
if last_left:
goleft = last_left.right
return self.get_lr(goleft, None)[1]
if c==-1:
return n
return neighbor
# travels down a single direction to get neighbors
def get_lr(self, left, right):
lr = [None, None]
if left:
while left.left:
left = left.left
lr[1] = left
if right:
while right.right:
right = right.right
lr[0] = right
return lr
def contain_p(self, p, lists):
c = self.compare_to_key(p)
if c==0:
if self.left is None and self.right is None:
if compare_by_x(p, self.value[1])==0:
lists[1].append(self.value)
else:
lists[0].append(self.value)
if self.left:
self.left.contain_p(p, lists)
if self.right:
self.right.contain_p(p, lists)
elif c < 0:
if self.left:
self.left.contain_p(p, lists)
else:
if self.right:
self.right.contain_p(p, lists)
def find_insert_pt(self, key, seg):
if self.left and self.right:
if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
return self.right.find_insert_pt(key, seg)
elif self.compare_to_key(key) < 1:
return self.left.find_insert_pt(key, seg)
else:
return self.right.find_insert_pt(key, seg)
# this case only happens at root
elif self.left:
if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1:
return (self, 'r')
elif self.compare_to_key(key) < 1:
return self.left.find_insert_pt(key, seg)
else:
return (self, 'r')
else:
return (self, 'n')
# adjusts stored segments in inner nodes
def adjust(self):
value = self.value
m = self.m
parent = self.parent
node = self
# go up left as much as possible
while parent and node is parent.right:
node = parent
parent = node.parent
# parent to adjust will be on the immediate right
if parent and node is parent.left:
parent.value = value
parent.m = m
def compare_lower(self, p, s2):
y = p[1] - 10
key = get_x_at(s2, (p[0], y))
return self.compare_to_key(key)
# returns matching leaf node, or None if no match
# when deleting, you don't delete below--you delete above! so compare lower = -1.
def find_delete_pt(self, key, value):
if self.left and self.right:
# if equal at this pt, and this node's value is less than the seg's slightly above this pt
if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1:
return self.right.find_delete_pt(key, value)
if self.compare_to_key(key) < 1:
return self.left.find_delete_pt(key, value)
else:
return self.right.find_delete_pt(key, value)
elif self.left:
if self.compare_to_key(key) < 1:
return self.left.find_delete_pt(key, value)
else:
return None
# is leaf
else:
if self.compare_to_key(key)==0 and segs_equal(self.value, value):
return self
else:
return None
# also prints depth of each node
def print_tree(self, l=0):
l += 1
if self.left:
self.left.print_tree(l)
if self.left or self.right:
print('INTERNAL: {0}'.format(l))
else:
print('LEAF: {0}'.format(l))
print(self)
print(self.value)
if self.right:
self.right.print_tree(l)
pauvre-0.2.3/pauvre/lsi/__init__.py 0000644 0026123 0000167 00000000000 13622004260 020354 0 ustar dschultz biolum 0000000 0000000 pauvre-0.2.3/pauvre/lsi/helper.py 0000755 0026123 0000167 00000004450 13274475032 020130 0 ustar dschultz biolum 0000000 0000000 # Helper functions for use in the lsi implementation.
ev = 0.0000001
# floating-point comparison
def approx_equal(a, b, tol):
return abs(a - b) < tol
# compares x-values of two pts
# used for ordering in T
def compare_by_x(k1, k2):
if approx_equal(k1[0], k2[0], ev):
return 0
elif k1[0] < k2[0]:
return -1
else:
return 1
# higher y value is "less"; if y value equal, lower x value is "less"
# used for ordering in Q
def compare_by_y(k1, k2):
if approx_equal(k1[1], k2[1], ev):
if approx_equal(k1[0], k2[0], ev):
return 0
elif k1[0] < k2[0]:
return -1
else:
return 1
elif k1[1] > k2[1]:
return -1
else:
return 1
# tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders)
def segs_equal(s0, s1):
x00 = s0[0][0]
y00 = s0[0][1]
x01 = s0[1][0]
y01 = s0[1][1]
x10 = s1[0][0]
y10 = s1[0][1]
x11 = s1[1][0]
y11 = s1[1][1]
if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)):
if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)):
return True
if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)):
if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)):
return True
return False
# get m for a given seg in (p1, p2) form
def get_slope(s):
x0 = s[0][0]
y0 = s[0][1]
x1 = s[1][0]
y1 = s[1][1]
if (x1-x0)==0:
return None
else:
return float(y1-y0)/(x1-x0)
# given a point p, return the point on s that shares p's y-val
def get_x_at(s, p):
m = get_slope(s)
# TODO: this should check if p's x-val is octually on seg; we're assuming
# for now that it would have been deleted already if not
if m == 0: # horizontal segment
return p
# ditto; should check if y-val on seg
if m is None: # vertical segment
return (s[0][0], p[1])
x1 = s[0][0]-(s[0][1]-p[1])/m
return (x1, p[1])
# returns the point at which two line segments intersect, or None if no intersection.
def intersect(seg1, seg2):
p = seg1[0]
r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1])
q = seg2[0]
s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1])
denom = r[0]*s[1]-r[1]*s[0]
if denom == 0:
return None
numer = float(q[0]-p[0])*s[1]-(q[1]-p[1])*s[0]
t = numer/denom
numer = float(q[0]-p[0])*r[1]-(q[1]-p[1])*r[0]
u = numer/denom
if (t < 0 or t > 1) or (u < 0 or u > 1):
return None
x = p[0]+t*r[0]
y = p[1]+t*r[1]
return (x, y)
pauvre-0.2.3/pauvre/lsi/lsi.py 0000755 0026123 0000167 00000005553 13274475032 017445 0 ustar dschultz biolum 0000000 0000000 # Implementation of the Bentley-Ottmann algorithm, described in deBerg et al, ch. 2.
# See README for more information.
# Author: Sam Lichtenberg
# Email: splichte@princeton.edu
# Date: 09/02/2013
from pauvre.lsi.Q import Q
from pauvre.lsi.T import T
from pauvre.lsi.helper import *
# "close enough" for floating point
ev = 0.00000001
# how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left
lower_check = 100
# gets the point on a segment at a lower y value.
def getNextPoint(p, seg, y_lower):
p1 = seg[0]
p2 = seg[1]
if (p1[0]-p2[0])==0:
return (p[0]+10, p[1])
slope = float(p1[1]-p2[1])/(p1[0]-p2[0])
if slope==0:
return (p1[0], p[1]-y_lower)
y = p[1]-y_lower
x = p1[0]-(p1[1]-y)/slope
return (x, y)
"""
for each event point:
U_p = segments that have p as an upper endpoint
C_p = segments that contain p
L_p = segments that have p as a lower endpoint
"""
def handle_event_point(p, segs, q, t, intersections):
rightmost = (float("-inf"), 0)
rightmost_seg = None
leftmost = (float("inf"), 0)
leftmost_seg = None
U_p = segs
(C_p, L_p) = t.contain_p(p)
merge_all = U_p+C_p+L_p
if len(merge_all) > 1:
intersections[p] = []
for s in merge_all:
intersections[p].append(s)
merge_CL = C_p+L_p
merge_UC = U_p+C_p
for s in merge_CL:
# deletes at a point slightly above (to break ties) - where seg is located in tree
# above intersection point
t.delete(p, s)
# put segments into T based on where they are at y-val just below p[1]
for s in merge_UC:
n = getNextPoint(p, s, lower_check)
if n[0] > rightmost[0]:
rightmost = n
rightmost_seg = s
if n[0] < leftmost[0]:
leftmost = n
leftmost_seg = s
t.insert(p, s)
# means only L_p -> check newly-neighbored segments
if len(merge_UC) == 0:
neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p))
if neighbors[0] and neighbors[1]:
find_new_event(neighbors[0].value, neighbors[1].value, p, q)
# of newly inserted pts, find possible intersections to left and right
else:
left_neighbor = t.get_left_neighbor(p)
if left_neighbor:
find_new_event(left_neighbor.value, leftmost_seg, p, q)
right_neighbor = t.get_right_neighbor(p)
if right_neighbor:
find_new_event(right_neighbor.value, rightmost_seg, p, q)
def find_new_event(s1, s2, p, q):
i = intersect(s1, s2)
if i:
if compare_by_y(i, p) == 1:
if not q.find(i):
q.insert(i, [])
# segment is in ((x, y), (x, y)) form
# first pt in a segment should have higher y-val - this is handled in function
def intersection(S):
s0 = S[0]
if s0[1][1] > s0[0][1]:
s0 = (s0[1], s0[0])
q = Q(s0[0], [s0])
q.insert(s0[1], [])
intersections = {}
for s in S[1:]:
if s[1][1] > s[0][1]:
s = (s[1], s[0])
q.insert(s[0], [s])
q.insert(s[1], [])
t = T()
while q.key:
p, segs = q.get_and_del_min()
handle_event_point(p, segs, q, t, intersections)
return intersections
pauvre-0.2.3/pauvre/lsi/test.py 0000755 0026123 0000167 00000004633 13274475032 017633 0 ustar dschultz biolum 0000000 0000000 # Test file for lsi.
# Author: Sam Lichtenberg
# Email: splichte@princeton.edu
# Date: 09/02/2013
from lsi import intersection
import random
import time, sys
from helper import *
ev = 0.00000001
def scale(i):
return float(i)
use_file = None
try:
use_file = sys.argv[2]
except:
pass
if not use_file:
S = []
for i in range(int(sys.argv[1])):
p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000)))
s = (p1, p2)
S.append(s)
f = open('input', 'w')
f.write(str(S))
f.close()
else:
f = open(sys.argv[2], 'r')
S = eval(f.read())
intersections = []
seen = []
vs = False
hs = False
es = False
now = time.time()
for seg1 in S:
if approx_equal(seg1[0][0], seg1[1][0], ev):
print 'VERTICAL SEG'
print ''
print ''
vs = True
if approx_equal(seg1[0][1], seg1[1][1], ev):
print 'HORIZONTAL SEG'
print ''
print ''
hs = True
for seg2 in S:
if seg1 is not seg2 and segs_equal(seg1, seg2):
print 'EQUAL SEGS'
print ''
print ''
es = True
if seg1 is not seg2 and (seg2, seg1) not in seen:
i = intersect(seg1, seg2)
if i:
intersections.append((i, [seg1, seg2]))
# xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]]
# xpts = sorted(xpts)
# if (i[0] <= xpts[2] and i[0] >= xpts[1]:
# intersections.append((i, [seg1, seg2]))
seen.append((seg1, seg2))
later = time.time()
n2time = later-now
print "Line sweep results:"
now = time.time()
lsinters = intersection(S)
inters = []
for k, v in lsinters.iteritems():
#print '{0}: {1}'.format(k, v)
inters.append(k)
# inters.append(v)
later = time.time()
print 'TIME ELAPSED: {0}'.format(later-now)
print "N^2 comparison results:"
pts_seen = []
highestseen = 0
for i in intersections:
seen_already = False
seen = 0
for p in pts_seen:
if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev):
seen += 1
seen_already = True
if seen > highestseen:
highestseen = seen
if not seen_already:
pts_seen.append(i[0])
in_k = False
for k in inters:
if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev):
in_k = True
if in_k == False:
print 'Not in K: {0}: {1}'.format(i[0], i[1])
# print i
print highestseen
print 'TIME ELAPSED: {0}'.format(n2time)
#print 'Missing from line sweep but in N^2:'
#for i in seen:
# matched = False
print len(lsinters)
print len(pts_seen)
if len(lsinters) != len(pts_seen):
print 'uh oh!'
pauvre-0.2.3/pauvre/marginplot.py 0000644 0026123 0000167 00000037035 13622045777 020246 0 ustar dschultz biolum 0000000 0000000 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# pauvre
# Copyright (c) 2016-2020 Darrin T. Schultz.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see .
import ast
import matplotlib
matplotlib.use('Agg')
import matplotlib.pyplot as plt
import matplotlib.patches as mplpatches
from matplotlib.colors import LinearSegmentedColormap
import numpy as np
import pandas as pd
import os.path as opath
from sys import stderr
from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual
from pauvre.stats import stats
import pauvre.rcparams as rc
import logging
# logging
logger = logging.getLogger('pauvre')
def generate_panel(panel_left, panel_bottom, panel_width, panel_height,
axis_tick_param='both', which_tick_param='both',
bottom_tick_param=True, label_bottom_tick_param=True,
left_tick_param=True, label_left_tick_param=True,
right_tick_param=False, label_right_tick_param=False,
top_tick_param=False, label_top_tick_param=False):
"""
Setting default panel tick parameters. Some of these are the defaults
for matplotlib anyway, but specifying them for readability. Here are
options and defaults for the parameters used below:
axis : {'x', 'y', 'both'}; which axis to modify; default = 'both'
which : {'major', 'minor', 'both'}; which ticks to modify;
default = 'major'
bottom, top, left, right : bool or {True, False}; ticks on or off;
labelbottom, labeltop, labelleft, labelright : bool or {True, False}
"""
# create the panel
panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height]
panel = plt.axes(panel_rectangle)
# Set tick parameters
panel.tick_params(axis=axis_tick_param,
which=which_tick_param,
bottom=bottom_tick_param,
labelbottom=label_bottom_tick_param,
left=left_tick_param,
labelleft=label_left_tick_param,
right=right_tick_param,
labelright=label_right_tick_param,
top=top_tick_param,
labeltop=label_top_tick_param)
return panel
def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True):
"""This helper method generates the histogram that is added to the panel.
In this case, horizontal = True applies to the mean quality histogram.
So, horizontal = False only applies to the length histogram.
"""
l_width = 0.0
f_color = (0.5, 0.5, 0.5)
e_color = (0, 0, 0)
if horizontal:
for step in np.arange(0, len(bin_values), 1):
left = bins[step]
bottom = 0
width = bins[step + 1] - bins[step]
height = bin_values[step]
hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
linewidth=l_width,
facecolor=f_color,
edgecolor=e_color)
panel.add_patch(hist_rectangle)
else:
for step in np.arange(0, len(bin_values), 1):
left = 0
bottom = bins[step]
width = bin_values[step]
height = bins[step + 1] - bins[step]
hist_rectangle = mplpatches.Rectangle((left, bottom), width, height,
linewidth=l_width,
facecolor=f_color,
edgecolor=e_color)
panel.add_patch(hist_rectangle)
def generate_histogram(panel, data_list, max_plot_length, min_plot_length,
bin_interval, hist_horizontal=True,
left_spine=True, bottom_spine=True,
top_spine=False, right_spine=False, x_label=None,
y_label=None):
bins = np.arange(0, max_plot_length, bin_interval)
bin_values, bins2 = np.histogram(data_list, bins)
# hist_horizontal is used for quality
if hist_horizontal:
panel.set_xlim([min_plot_length, max_plot_length])
panel.set_ylim([0, max(bin_values * 1.1)])
# and hist_horizontal == Fale is for read length
else:
panel.set_xlim([0, max(bin_values * 1.1)])
panel.set_ylim([min_plot_length, max_plot_length])
# Generate histogram bin patches, depending on whether we're plotting
# vertically or horizontally
_generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal)
panel.spines['left'].set_visible(left_spine)
panel.spines['bottom'].set_visible(bottom_spine)
panel.spines['top'].set_visible(top_spine)
panel.spines['right'].set_visible(right_spine)
if y_label is not None:
panel.set_ylabel(y_label)
if x_label is not None:
panel.set_xlabel(x_label)
def generate_heat_map(panel, data_frame, min_plot_length, min_plot_qual,
max_plot_length, max_plot_qual, color, **kwargs):
panel.set_xlim([min_plot_qual, max_plot_qual])
panel.set_ylim([min_plot_length, max_plot_length])
if kwargs["kmerdf"]:
hex_this = data_frame.query('length<{} and numks<{}'.format(
max_plot_length, max_plot_qual))
# This single line controls plotting the hex bins in the panel
hex_vals = panel.hexbin(hex_this['numks'], hex_this['length'], gridsize=int(np.ceil(max_plot_qual/2)),
linewidths=0.0, cmap=color)
else:
hex_this = data_frame.query('length<{} and meanQual<{}'.format(
max_plot_length, max_plot_qual))
# This single line controls plotting the hex bins in the panel
hex_vals = panel.hexbin(hex_this['meanQual'], hex_this['length'], gridsize=49,
linewidths=0.0, cmap=color)
for each in panel.spines:
panel.spines[each].set_visible(False)
counts = hex_vals.get_array()
return counts
def generate_legend(panel, counts, color):
# completely custom for more control
panel.set_xlim([0, 1])
panel.set_ylim([0, 1000])
panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)])
panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)])
for i in np.arange(0, 1001, 1):
rgba = color(i / 1001)
alpha = rgba[-1]
facec = rgba[0:3]
hist_rectangle = mplpatches.Rectangle((0, i), 1, 1,
linewidth=0.0,
facecolor=facec,
edgecolor=(0, 0, 0),
alpha=alpha)
panel.add_patch(hist_rectangle)
panel.spines['top'].set_visible(False)
panel.spines['left'].set_visible(False)
panel.spines['bottom'].set_visible(False)
panel.yaxis.set_label_position("right")
panel.set_ylabel('Number of Reads')
def margin_plot(df, **kwargs):
rc.update_rcParams()
# 250, 231, 34 light yellow
# 67, 1, 85
# R=np.linspace(65/255,1,101)
# G=np.linspace(0/255, 231/255, 101)
# B=np.linspace(85/255, 34/255, 101)
# R=65/255, G=0/255, B=85/255
Rf = 65 / 255
Bf = 85 / 255
pdict = {'red': ((0.0, Rf, Rf),
(1.0, Rf, Rf)),
'green': ((0.0, 0.0, 0.0),
(1.0, 0.0, 0.0)),
'blue': ((0.0, Bf, Bf),
(1.0, Bf, Bf)),
'alpha': ((0.0, 0.0, 0.0),
(1.0, 1.0, 1.0))
}
# Now we will use this example to illustrate 3 ways of
# handling custom colormaps.
# First, the most direct and explicit:
purple1 = LinearSegmentedColormap('Purple1', pdict)
# set the figure dimensions
fig_width = 1.61 * 3
fig_height = 1 * 3
fig = plt.figure(figsize=(fig_width, fig_height))
# set the panel dimensions
heat_map_panel_width = fig_width * 0.5
heat_map_panel_height = heat_map_panel_width * 0.62
# find the margins to center the panel in figure
fig_left_margin = fig_bottom_margin = (1 / 6)
# lengthPanel
length_panel_width = (1 / 8)
# the color Bar parameters
legend_panel_width = (1 / 24)
# define padding
h_padding = 0.02
v_padding = 0.05
# Set whether to include y-axes in histograms
if kwargs["Y_AXES"]:
length_bottom_spine = True
length_bottom_tick = False
length_bottom_label = True
qual_left_spine = True
qual_left_tick = True
qual_left_label = True
qual_y_label = 'Count'
else:
length_bottom_spine = False
length_bottom_tick = False
length_bottom_label = False
qual_left_spine = False
qual_left_tick = False
qual_left_label = False
qual_y_label = None
panels = []
# Quality histogram panel
qual_panel_left = fig_left_margin + length_panel_width + h_padding
qual_panel_width = heat_map_panel_width / fig_width
qual_panel_height = length_panel_width * fig_width / fig_height
qual_panel = generate_panel(qual_panel_left,
fig_bottom_margin,
qual_panel_width,
qual_panel_height,
left_tick_param=qual_left_tick,
label_left_tick_param=qual_left_label)
panels.append(qual_panel)
# Length histogram panel
length_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
length_panel_height = heat_map_panel_height / fig_height
length_panel = generate_panel(fig_left_margin,
length_panel_bottom,
length_panel_width,
length_panel_height,
bottom_tick_param=length_bottom_tick,
label_bottom_tick_param=length_bottom_label)
panels.append(length_panel)
# Heat map panel
heat_map_panel_left = fig_left_margin + length_panel_width + h_padding
heat_map_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
heat_map_panel = generate_panel(heat_map_panel_left,
heat_map_panel_bottom,
heat_map_panel_width / fig_width,
heat_map_panel_height / fig_height,
bottom_tick_param=False,
label_bottom_tick_param=False,
left_tick_param=False,
label_left_tick_param=False)
panels.append(heat_map_panel)
heat_map_panel.set_title(kwargs["title"])
# Legend panel
legend_panel_left = fig_left_margin + length_panel_width + \
(heat_map_panel_width / fig_width) + (h_padding * 2)
legend_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding
legend_panel_height = heat_map_panel_height / fig_height
legend_panel = generate_panel(legend_panel_left, legend_panel_bottom,
legend_panel_width, legend_panel_height,
bottom_tick_param = False,
label_bottom_tick_param = False,
left_tick_param = False,
label_left_tick_param = False,
right_tick_param = True,
label_right_tick_param = True)
panels.append(legend_panel)
# Set min and max viewing window for length
if kwargs["plot_maxlen"]:
max_plot_length = kwargs["plot_maxlen"]
else:
max_plot_length = int(np.percentile(df['length'], 99))
min_plot_length = kwargs["plot_minlen"]
# Set length bin sizes
if kwargs["lengthbin"]:
length_bin_interval = kwargs["lengthbin"]
else:
# Dividing by 80 is based on what looks good from experience
length_bin_interval = int(max_plot_length / 80)
# length_bins = np.arange(0, max_plot_length, length_bin_interval)
# Set max and min viewing window for quality
if kwargs["plot_maxqual"]:
max_plot_qual = kwargs["plot_maxqual"]
elif kwargs["kmerdf"]:
max_plot_qual = np.ceil(df["numks"].median() * 2)
else:
max_plot_qual = max(np.ceil(df['meanQual']))
min_plot_qual = kwargs["plot_minqual"]
# Set qual bin sizes
if kwargs["qualbin"]:
qual_bin_interval = kwargs["qualbin"]
elif kwargs["kmerdf"]:
qual_bin_interval = 1
else:
# again, this is just based on what looks good from experience
qual_bin_interval = max_plot_qual / 85
qual_bins = np.arange(0, max_plot_qual, qual_bin_interval)
# Generate length histogram
generate_histogram(length_panel, df['length'], max_plot_length, min_plot_length,
length_bin_interval, hist_horizontal=False,
y_label='Read Length', bottom_spine=length_bottom_spine)
# Generate quality histogram
if kwargs["kmerdf"]:
generate_histogram(qual_panel, df['numks'], max_plot_qual, min_plot_qual,
qual_bin_interval, x_label='number of kmers',
y_label=qual_y_label, left_spine=qual_left_spine)
else:
generate_histogram(qual_panel, df['meanQual'], max_plot_qual, min_plot_qual,
qual_bin_interval, x_label='Phred Quality',
y_label=qual_y_label, left_spine=qual_left_spine)
# Generate heat map
counts = generate_heat_map(heat_map_panel, df, min_plot_length, min_plot_qual,
max_plot_length, max_plot_qual, purple1, kmerdf = kwargs["kmerdf"])
# Generate legend
generate_legend(legend_panel, counts, purple1)
# inform the user of the plotting window if not quiet mode
if not kwargs["QUIET"]:
print("""plotting in the following window:
{0} <= Q-score (x-axis) <= {1}
{2} <= length (y-axis) <= {3}""".format(
min_plot_qual, max_plot_qual, min_plot_length, max_plot_length),
file=stderr)
# Print image(s)
if kwargs["BASENAME"] is None:
file_base = opath.splitext(opath.basename(kwargs["fastq"]))[0]
else:
file_base = kwargs["BASENAME"]
print_images(
file_base,
image_formats=kwargs["fileform"],
dpi=kwargs["dpi"],
no_timestamp = kwargs["no_timestamp"],
transparent=kwargs["TRANSPARENT"])
def run(args):
if args.kmerdf:
df = pd.read_csv(args.kmerdf, header='infer', sep='\t')
df["kmers"] = df["kmers"].apply(ast.literal_eval)
else:
df = parse_fastq_length_meanqual(args.fastq)
df = filter_fastq_length_meanqual(df, args.filt_minlen, args.filt_maxlen,
args.filt_minqual, args.filt_maxqual)
stats(df, args.fastq, False)
margin_plot(df=df.dropna(), **vars(args))
pauvre-0.2.3/pauvre/pauvre_main.py 0000644 0026123 0000167 00000100363 13622037671 020365 0 ustar dschultz biolum 0000000 0000000 #!/usr/bin/env python
# -*- coding: utf-8 -*-
# pauvre - just a pore plotting package
# Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved.
#
# This file is part of pauvre.
#
# pauvre is free software: you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation, either version 3 of the License, or
# (at your option) any later version.
#
# pauvre is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with pauvre. If not, see .
# I modeled this code on https://github.com/arq5x/poretools/. Check it out. - DTS
import sys
import os.path
import argparse
# pauvre imports
import pauvre.version
# This class is used in argparse to expand the ~. This avoids errors caused on
# some systems.
class FullPaths(argparse.Action):
"""Expand user- and relative-paths"""
def __call__(self, parser, namespace, values, option_string=None):
setattr(namespace, self.dest,
os.path.abspath(os.path.expanduser(values)))
class FullPathsList(argparse.Action):
"""Expand user- and relative-paths when a list of paths is passed to the
program"""
def __call__(self, parser, namespace, values, option_string=None):
setattr(namespace, self.dest,
[os.path.abspath(os.path.expanduser(value)) for value in values])
def run_subtool(parser, args):
if args.command == 'browser':
import pauvre.browser as submodule
elif args.command == 'custommargin':
import pauvre.custommargin as submodule
elif args.command == 'marginplot':
import pauvre.marginplot as submodule
elif args.command == 'redwood':
import pauvre.redwood as submodule
elif args.command == 'stats':
import pauvre.stats as submodule
elif args.command == 'synplot':
import pauvre.synplot as submodule
# run the chosen submodule.
submodule.run(args)
class ArgumentParserWithDefaults(argparse.ArgumentParser):
def __init__(self, *args, **kwargs):
super(ArgumentParserWithDefaults, self).__init__(*args, **kwargs)
self.add_argument("-q", "--quiet", help="Do not output warnings to stderr",
action="store_true",
dest="QUIET")
def main():
#########################################
# create the top-level parser
#########################################
parser = argparse.ArgumentParser(
prog='pauvre', formatter_class=argparse.ArgumentDefaultsHelpFormatter)
parser.add_argument("-v", "--version", help="Installed pauvre version",
action="version",
version="%(prog)s " + str(pauvre.version.__version__))
subparsers = parser.add_subparsers(
title='[sub-commands]', dest='command', parser_class=ArgumentParserWithDefaults)
#########################################
# create the individual tool parsers
#########################################
#############
# browser
#############
parser_browser = subparsers.add_parser('browser',
help="""an adaptable genome
browser with various track types""")
parser_browser.add_argument('-c', '--chromosomeid',
metavar = "Chr",
dest = 'CHR',
type = str,
help = """The fasta sequence to observe.
Use the header name of the fasta file
without the '>' character""")
parser_browser.add_argument('--dpi',
metavar='dpi',
default=600,
type=int,
help="""Change the dpi from the default 600
if you need it higher""")
parser_browser.add_argument('--fileform',
dest='fileform',
metavar='STRING',
choices=['png', 'pdf', 'eps', 'jpeg', 'jpg',
'pdf', 'pgf', 'ps', 'raw', 'rgba',
'svg', 'svgz', 'tif', 'tiff'],
default=['png'],
nargs='+',
help='Which output format would you like? Def.=png')
parser_browser.add_argument("--no_timestamp",
action = 'store_true',
help="""Turn off time stamps in the filename
output.""")
parser_browser.add_argument('-o', '--output-base-name',
dest='BASENAME',
help="""Specify a base name for the output file(
s). The input file base name is the
default.""")
parser_browser.add_argument('--path',
type=str,
help="""Set an explicit filepath for the output.
Only do this if you have selected one output type.""")
parser_browser.add_argument('-p', '--plot_commands',
dest='CMD',
nargs = '+',
help="""Write strings here to select what
to plot. The format for each track is:
::