pauvre-0.2.3/
pauvre-0.2.3/MANIFEST.in
include scripts/test.sh
pauvre-0.2.3/PKG-INFO
Metadata-Version: 1.2
Name: pauvre
Version: 0.2.3
Summary: Tools for plotting Oxford Nanopore and other long-read data.
Home-page: https://github.com/conchoecia/pauvre
Author: Darrin Schultz
Author-email: dts@ucsc.edu
License: UNKNOWN
Description: 'pauvre' is a package for plotting Oxford Nanopore and other
        long-read data. The name means 'poor' in French, a play on words on
        the oft-used 'pore' prefix for similar packages. This package was
        designed for python 3, but it might work in python 2. You can visit
        the github page for more detailed information here:
        https://github.com/conchoecia/pauvre
Platform: UNKNOWN
Classifier: Development Status :: 2 - Pre-Alpha
Classifier: License :: OSI Approved :: GNU General Public License v3 (GPLv3)
Classifier: Programming Language :: Python :: 3
Classifier: Programming Language :: Python :: 3.6
Classifier: Programming Language :: Python :: 3.7
Classifier: Operating System :: POSIX :: Linux
Classifier: Topic :: Scientific/Engineering :: Bio-Informatics
Classifier: Intended Audience :: Science/Research
Requires-Python: >=3
pauvre-0.2.3/README.md
[![travis-ci](https://travis-ci.org/conchoecia/pauvre.svg?branch=master)](https://travis-ci.org/conchoecia/pauvre) [![DOI](https://zenodo.org/badge/112774670.svg)](https://zenodo.org/badge/latestdoi/112774670)

## Getting Started

```
pauvre custommargin -i custom.tsv --ycol length --xcol qual # Custom tsv input
```

## Table of Contents

- [Getting Started](#started)
- [Users' Guide](#uguide)
  - [Installation](#installation)
    - [Requirements](#reqs)
    - [Install Instructions](#install)
  - [Usage](#usage)
    - [pauvre stats](#stats)
    - [pauvre marginplot](#marginplot)
      - [Basic Usage](#marginbasic)
      - [Plot Adjustments](#marginadjustments)
      - [Specialized Options](#marginspecialized)
- [Contributors](#contributors)

## Users' Guide

Pauvre is a plotting package originally designed to help QC the length and quality distribution of Oxford Nanopore or PacBio reads. The main outputs are marginplots. Now, `pauvre` also hosts additional data plotting scripts.

This package currently hosts five scripts for plotting and/or printing stats.

- `pauvre marginplot`
  - takes a fastq file as input and outputs a marginal histogram with a heatmap.
- `pauvre custommargin`
  - takes a tsv as input and outputs a marginal histogram with custom columns of your choice.
- `pauvre stats`
  - Takes a fastq file as input and prints out a table of stats, including how many basepairs/reads there are for a length/mean quality cutoff.
  - This is also automagically called when using `pauvre marginplot`.
- `pauvre redwood`
  - I am happy to introduce the redwood plot to the world as a method of representing circular genomes. A redwood plot contains long reads as "rings" on the inside, a gene annotation "cambium/phloem", and an RNAseq "bark". The input is `.bam` files for the long reads and RNAseq data, and a `.gff` file for the annotation. More details to follow as we document this program better...
- `pauvre synteny`
  - Makes a synteny plot of circular genomes.
    Finds the most parsimonious rotation to display the synteny of all the input genomes with the fewest crossings-over. Input is one `.gff` file per circular genome and one directory of gene alignments.

## Installation

### Requirements

- You must have the following installed on your system to install this software:
  - python 3.x
  - matplotlib
  - biopython
  - pandas
  - pillow

### Install Instructions

- Instructions to install on your Mac or Linux system. Not sure on Windows!
  Make sure *python 3* is the active environment before installing.
  - `git clone https://github.com/conchoecia/pauvre.git`
  - `cd ./pauvre`
  - `pip3 install .`
- Or, install with pip
  - `pip3 install pauvre`

## Usage

### `stats`

- Generate basic statistics about the fastq file. For example, if I want to know the number of bases and reads with AT LEAST a PHRED score of 5 and AT LEAST a read length of 500, run the program as below and look at the cells highlighted with `<>`.
  - `pauvre stats --fastq miniDSMN15.fastq`

```
numReads: 1000
numBasepairs: 1029114
meanLen: 1029.114
medianLen: 875.5
minLen: 11
maxLen: 5337
N50: 1278
L50: 296

        Basepairs >= bin by mean PHRED and length
minLen       Q0        Q5     Q10     Q15   Q17.5    Q20  Q21.5   Q25  Q25.5  Q30
     0  1029114   1010681  935366  429279  143948  25139   3668  2938   2000    0
   500   984212  <968653>  904787  421307  142003  24417   3668  2938   2000    0
  1000   659842    649319  616788  300948  103122  17251   2000  2000   2000    0
et cetera...

        Number of reads >= bin by mean Phred+Len
minLen    Q0     Q5  Q10  Q15  Q17.5  Q20  Q21.5  Q25  Q25.5  Q30
     0  1000    969  865  366    118   22      3    2      1    0
   500   873  <859>  789  347    113   20      3    2      1    0
  1000   424    418  396  187     62   11      1    1      1    0
et cetera...
```

### `marginplot`

#### Basic Usage

- Automatically calls `pauvre stats` for each fastq file.
- Make the default plot showing the 99th percentile of longest reads.
  - `pauvre marginplot --fastq miniDSMN15.fastq`
  - ![default](files/default_miniDSMN15.png)
- Make a marginal histogram for ONT 2D or 1D^2 cDNA data with a lower maxlen and higher maxqual.
  - `pauvre marginplot --maxlen 4000 --maxqual 25 --lengthbin 50 --fileform pdf png --qualbin 0.5 --fastq miniDSMN15.fastq`
  - ![example1](files/miniDSMN15.png)

#### Plot Adjustments

- Filter out reads with a mean quality less than 5 and a length less than 800. Zoom in to plot only mean quality of at least 4 and read length of at least 500 bp.
  - `pauvre marginplot -f miniDSMN15.fastq --filt_minqual 5 --filt_minlen 800 -y --plot_minlen 500 --plot_minqual 4`
  - ![test4](files/test4.png)

#### Specialized Options

- Plot ONT 1D data with a large tail.
  - `pauvre marginplot --maxlen 100000 --maxqual 15 --lengthbin 500 <your_fastq>.fastq`
- Get more resolution on lengths.
  - `pauvre marginplot --maxlen 100000 --lengthbin 5 <your_fastq>.fastq`
- Turn off transparency if you just want a white background.
  - `pauvre marginplot --transparent False <your_fastq>.fastq`
  - Note: transparency is the default behavior.
  - ![transparency](files/transparency.001.jpeg)

## Contributors

- @conchoecia (Darrin Schultz)
- @mebbert (Mark Ebbert)
- @wdecoster (Wouter De Coster)

pauvre-0.2.3/pauvre/
pauvre-0.2.3/pauvre/__init__.py
from pauvre.version import __version__
pauvre-0.2.3/pauvre/bamparse.py
#!/usr/bin/env python
# -*- coding: utf-8 -*-

# pauvre - just a pore plotting package
# Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved.
# twitter @conchoecia
#
# This file is part of pauvre.
# # pauvre is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # pauvre is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with pauvre. If not, see . import pysam import pandas as pd import os class BAMParse(): """This class reads in a sam/bam file and constructs a pandas dataframe of all the relevant information for the reads to pass on and plot. """ def __init__(self, filename, chrid = None, start = None, stop = None, doubled = None): self.filename = filename self.doubled = doubled #determine if the file is bam or sam self.filetype = os.path.splitext(self.filename)[1] #throw an error if the file is not bam or sam if self.filetype not in ['.bam']: raise Exception("""You have provided a file with an extension other than '.bam', please check your command-line arguments""") #now make sure there is an index file for the bam file if not os.path.exists("{}.bai".format(self.filename)): raise Exception("""Your .bam file is there, but it isn't indexed and there isn't a .bai file to go with it. Use 'samtools index .bam' to fix it.""") #now open the file and just call it a sambam file filetype_dict = {'.sam': '', '.bam': 'b'} self.sambam = pysam.AlignmentFile(self.filename, "r{}".format(filetype_dict[self.filetype])) if chrid == None: self.chrid = self.sambam.references[0] else: self.chrid = chrid self.refindex = self.sambam.references.index(self.chrid) self.seqlength = self.sambam.lengths[self.refindex] self.true_seqlength = self.seqlength if not self.doubled else int(self.seqlength/2) if start == None or stop == None: self.start = 1 self.stop = self.true_seqlength self.features = self.parse() self.features.sort_values(by=['POS','MAPLEN'], ascending=[True, False] ,inplace=True) self.features.reset_index(inplace=True) self.features.drop('index', 1, inplace=True) self.raw_depthmap = self.get_depthmap() self.features_depthmap = self.get_features_depthmap() def get_depthmap(self): depthmap = [0] * (self.stop - self.start + 1) for p in self.sambam.pileup(self.chrid, self.start, self.stop): index = p.reference_pos if index >= self.true_seqlength: index -= self.true_seqlength depthmap[index] += p.nsegments return depthmap def get_features_depthmap(self): """this method builds a more accurate pileup that is based on if there is actually a mapped base at any given position or not. 
better for long reads and RNA""" depthmap = [0] * (self.stop - self.start + 1) print("depthmap is: {} long".format(len(depthmap))) for index, row in self.features.iterrows(): thisindex = row["POS"] - self.start for thistup in row["TUPS"]: b_type = thistup[1] b_len = thistup[0] if b_type == "M": for j in range(b_len): #this is necessary to reset the index if we wrap # around to the beginning if self.doubled and thisindex == len(depthmap): thisindex = 0 depthmap[thisindex] += 1 thisindex += 1 elif b_type in ["S", "H", "I"]: pass elif b_type in ["D", "N"]: thisindex += b_len #this is necessary to reset the index if we wrap # around to the beginning if self.doubled and thisindex >= len(depthmap): thisindex = thisindex - len(depthmap) return depthmap def parse(self): data = {'POS': [], 'MAPQ': [], 'TUPS': [] } for read in self.sambam.fetch(self.chrid, self.start, self.stop): data['POS'].append(read.reference_start + 1) data['TUPS'].append(self.cigar_parse(read.cigartuples)) data['MAPQ'].append(read.mapq) features = pd.DataFrame.from_dict(data, orient='columns') features['ALNLEN'] = features['TUPS'].apply(self.aln_len) features['TRULEN'] = features['TUPS'].apply(self.tru_len) features['MAPLEN'] = features['TUPS'].apply(self.map_len) features['POS'] = features['POS'].apply(self.fix_pos) return features def cigar_parse(self, tuples): """ arguments: a CIGAR string tuple list in pysam format purpose: This function uses the pysam cigarstring tuples format and returns a list of tuples in the internal format, [(20, 'M'), (5, "I")], et cetera. The zeroth element of each tuple is the number of bases for the CIGAR string feature. The first element of each tuple is the CIGAR string feature type. There are several feature types in SAM/BAM files. See below: 'M' - match 'I' - insertion relative to reference 'D' - deletion relative to reference 'N' - skipped region from the reference 'S' - soft clip, not aligned but still in sam file 'H' - hard clip, not aligned and not in sam file 'P' - padding (silent deletion from padded reference) '=' - sequence match 'X' - sequence mismatch 'B' - BAM_CBACK (I don't actually know what this is) """ # I used the map values from http://pysam.readthedocs.io/en/latest/api.html#pysam.AlignedSegment psam_to_char = {0: 'M', 1: 'I', 2: 'D', 3: 'N', 4: 'S', 5: 'H', 6: 'P', 7: '=', 8: 'X', 9: 'B'} return [(value, psam_to_char[feature]) for feature, value in tuples] def aln_len(self, TUPS): """ arguments: a list of tuples output from the cigar_parse() function. purpose: This returns the alignment length of the read to the reference. Specifically, it sums the length of all of the matches and deletions. In effect, this number is length of the region of the reference sequence to which the read maps. This number is probably the most useful for selecting reads to visualize in the mapped read plot. """ return sum([pair[0] for pair in TUPS if pair[1] not in ['S', 'H', 'I']]) def map_len(self, TUPS): """ arguments: a list of tuples output from the cigar_parse() function. purpose: This function returns the map length (all matches and deletions relative to the reference), plus the unmapped 5' and 3' hard/soft clipped sequences. This number is useful if you want to visualize how much 5' and 3' sequence of a read did not map to the reference. For example, poor quality 5' and 3' tails are common in Nanopore reads. """ return sum([pair[0] for pair in TUPS if pair[1] not in ['I']]) def tru_len(self, TUPS): """ arguments: a list of tuples output from the cigar_parse() function. 
purpose: This function returns the total length of the read, including insertions, deletions, matches, soft clips, and hard clips. This is useful for comparing to the map length or alignment length to see what percentage of the read aligned to the reference. """ return sum([pair[0] for pair in TUPS]) def fix_pos(self, start_index): """ arguments: an int purpose: When using a doubled SAMfile, any reads that start after the first copy of the reference risk running over the plotting window, causing the program to crash. This function corrects for this issue by changing the start site of the read. Note: this will probably break the program if not using a double alignment since no reads would map past half the length of the single reference """ if self.doubled: if start_index > int(self.seqlength/2): return start_index - int(self.seqlength/2) - 1 else: return start_index else: return start_index pauvre-0.2.3/pauvre/browser.py0000644002612300001670000003651313622004260017533 0ustar dschultzbiolum00000000000000#!/usr/bin/env python # -*- coding: utf-8 -*- # pauvre - a pore plotting package # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. # # This file is part of pauvre. # # pauvre is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # pauvre is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with pauvre. If not, see . 
# following this tutorial to install helvetica # https://github.com/olgabot/sciencemeetproductivity.tumblr.com/blob/master/posts/2012/11/how-to-set-helvetica-as-the-default-sans-serif-font-in.md global hfont hfont = {'fontname':'Helvetica'} import matplotlib matplotlib.use('Agg') import matplotlib.pyplot as plt from matplotlib.colors import LinearSegmentedColormap, Normalize import matplotlib.patches as patches import gffutils import pandas as pd pd.set_option('display.max_columns', 500) pd.set_option('display.width', 1000) import numpy as np import os import pauvre.rcparams as rc from pauvre.functions import GFFParse, print_images, timestamp from pauvre import gfftools from pauvre.lsi.lsi import intersection from pauvre.bamparse import BAMParse import progressbar import platform import sys import time # Biopython stuff from Bio import SeqIO import Bio.SubsMat.MatrixInfo as MI class PlotCommand: def __init__(self, plotcmd, REF): self.ref = REF self.style_choices = [] self.cmdtype = "" self.path = "" self.style = "" self.options = "" self._parse_cmd(plotcmd) def _parse_cmd(self, plotcmd): chunks = plotcmd.split(":") if chunks[0] == "ref": self.cmdtype = "ref" if len(chunks) < 2: self._len_error() self.path = self.ref self.style = chunks[1] self.style_choices = ["normal", "colorful"] self._check_style_choices() if len(chunks) > 2: self.options = chunks[2].split(",") elif chunks[0] in ["bam", "peptides"]: if len(chunks) < 3: self._len_error() self.cmdtype = chunks[0] self.path = os.path.abspath(os.path.expanduser(chunks[1])) self.style = chunks[2] if self.cmdtype == "bam": self.style_choices = ["depth", "reads"] else: self.style_choices = ["depth"] self._check_style_choices() if len(chunks) > 3: self.options = chunks[3].split(",") elif chunks[0] in ["gff3"]: if len(chunks) < 2: self._len_error() self.cmdtype = chunks[0] self.path = os.path.abspath(os.path.expanduser(chunks[1])) if len(chunks) > 2: self.options = chunks[2].split(",") def _len_error(self): raise IOError("""You selected {} to plot, but need to specify the style at least.""".format(self.cmdtype)) def _check_style_choices(self): if self.style not in self.style_choices: raise IOError("""You selected {} style for ref. You must select from {}. """.format( self.style, self.style_choices)) global dna_color dna_color = {"A": (81/255, 87/255, 251/255, 1), "T": (230/255, 228/255, 49/255, 1), "G": (28/255, 190/255, 32/255, 1), "C": (220/255, 10/255, 23/255, 1)} #these are the line width for the different cigar string flags. 
# usually, only M, I, D, S, and H appear in bwa mem output global widthDict widthDict = {'M':0.45, # match 'I':0.9, # insertion relative to reference 'D':0.05, # deletion relative to reference 'N':0.1, # skipped region from the reference 'S':0.1, # soft clip, not aligned but still in sam file 'H':0.1, # hard clip, not aligned and not in sam file 'P':0.1, # padding (silent deletion from padded reference) '=':0.1, # sequence match 'X':0.1} # sequence mismatch global richgrey richgrey = (60/255, 54/255, 69/255, 1) def plot_ref(panel, chrid, start, stop, thiscmd): panel.set_xlim([start, stop]) panel.set_ylim([-2.5, 2.5]) panel.set_xticks([int(val) for val in np.linspace(start, stop, 6)]) if thiscmd.style == "colorful": thisseq = "" for record in SeqIO.parse(thiscmd.ref, "fasta"): if record.id == chrid: thisseq = record.seq[start-1: stop] for i in range(len(thisseq)): left = start + i bottom = -0.5 width = 1 height = 1 rect = patches.Rectangle((left, bottom), width, height, linewidth = 0, facecolor = dna_color[thisseq[i]] ) panel.add_patch(rect) return panel def safe_log10(value): try: logval = np.log10(value) except: logval = 0 return logval def plot_bam(panel, chrid, start, stop, thiscmd): bam = BAMParse(thiscmd.path) panel.set_xlim([start, stop]) if thiscmd.style == "depth": maxdepth = max(bam.features_depthmap) maxdepthlog = safe_log10(maxdepth) if "log" in thiscmd.options: panel.set_ylim([-maxdepthlog, maxdepthlog]) panel.set_yticks([int(val) for val in np.linspace(0, maxdepthlog, 2)]) else: panel.set_yticks([int(val) for val in np.linspace(0, maxdepth, 2)]) if "c" in thiscmd.options: panel.set_ylim([-maxdepth, maxdepth]) else: panel.set_ylim([0, maxdepth]) for i in range(len(bam.features_depthmap)): left = start + i width = 1 if "c" in thiscmd.options and "log" in thiscmd.options: bottom = -1 * safe_log10(bam.features_depthmap[i]) height = safe_log10(bam.features_depthmap[i]) * 2 elif "c" in thiscmd.options and "log" not in thiscmd.options: bottom = -bam.features_depthmap[i] height = bam.features_depthmap[i] * 2 else: bottom = 0 height = bam.features_depthmap[i] if height > 0: rect = patches.Rectangle((left, bottom), width, height, linewidth = 0, facecolor = richgrey ) panel.add_patch(rect) if thiscmd.style == "reads": #If we're plotting reads, we don't need y-axis panel.tick_params(bottom="off", labelbottom="off", left ="off", labelleft = "off") reads = bam.features.copy() panel.set_xlim([start, stop]) direction = "for" if direction == 'for': bav = {"by":['POS','MAPLEN'], "asc": [True, False]} direction= 'rev' elif direction == 'rev': bav = {"by":['POS','MAPLEN'], "asc": [True, False]} direction = 'for' reads.sort_values(by=bav["by"], ascending=bav['asc'],inplace=True) reads.reset_index(drop=True, inplace=True) depth_count = -1 plotind = start while len(reads) > 0: #depth_count -= 1 #print("len of reads is {}".format(len(reads))) potential = reads.query("POS >= {}".format(plotind)) if len(potential) == 0: readsindex = 0 #print("resetting plot ind from {} to {}".format( # plotind, reads.loc[readsindex, "POS"])) depth_count -= 1 else: readsindex = int(potential.index.values[0]) #print("pos of potential is {}".format(reads.loc[readsindex, "POS"])) plotind = reads.loc[readsindex, "POS"] for TUP in reads.loc[readsindex, "TUPS"]: b_type = TUP[1] b_len = TUP[0] #plotting params # left same for all. 
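                # Added illustration (hypothetical read, not from the original source):
                # a read at POS 100 with TUPS [(10,'S'), (50,'M'), (200,'N'), (30,'M')]
                # skips the soft clip, draws a 50-wide match bar starting at x=100,
                # then a thin 200-wide splice line starting at x=150, and finally a
                # 30-wide match bar starting at x=350, all on the same depth row.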
left = plotind bottom = depth_count height = widthDict[b_type] width = b_len plot = True color = richgrey if b_type in ["H", "S"]: """We don't plot hard or sort clips - like IGV""" plot = False pass elif b_type == "M": """just plot matches normally""" plotind += b_len elif b_type in ["D", "P", "=", "X"]: """deletions get an especially thin line""" plotind += b_len elif b_type == "I": """insertions get a special purple bar""" left = plotind - (b_len/2) color = (200/255, 41/255, 226/255, 0.5) elif b_type == "N": """skips for splice junctions, line in middle""" bottom += (widthDict["M"]/2) - (widthDict["N"]/2) plotind += b_len if plot: rect = patches.Rectangle((left, bottom), width, height, linewidth = 0, facecolor = color ) panel.add_patch(rect) reads.drop([readsindex], inplace=True) reads.reset_index(drop = True, inplace=True) panel.set_ylim([depth_count, 0]) return panel def plot_gff3(panel, chrid, start, stop, thiscmd): db = gffutils.create_db(thiscmd.path, ":memory:") bottom = 0 genes_to_plot = [thing.id for thing in db.region( region=(chrid, start, stop), completely_within=False) if thing.featuretype == "gene" ] #print("genes to plot are: " genes_to_plot) panel.set_xlim([start, stop]) # we don't need labels on one of the axes #panel.tick_params(bottom="off", labelbottom="off", # left ="off", labelleft = "off") ticklabels = [] for geneid in genes_to_plot: plotnow = False if "id" in thiscmd.options and geneid in thiscmd.options: plotnow = True elif "id" not in thiscmd.options: plotnow = True if plotnow: ticklabels.append(geneid) if db[geneid].strand == "+": panel = gfftools._plot_left_to_right_introns_top(panel, geneid, db, bottom, text = None) bottom += 1 else: raise IOError("""Plotting things on the reverse strand is not yet implemented""") #print("tick labels are", ticklabels) panel.set_ylim([0, len(ticklabels)]) yticks_vals = [val for val in np.linspace(0.5, len(ticklabels) - 0.5, len(ticklabels))] panel.set_yticks(yticks_vals) print("bottom is: ", bottom) print("len tick labels is: ", len(ticklabels)) print("intervals are: ", yticks_vals) panel.set_yticklabels(ticklabels) return panel def browser(args): rc.update_rcParams() print(args) # if the user forgot to add a reference, they must add one if args.REF is None: raise IOError("You must specify the reference fasta file") # if the user forgot to add the start and stop, # Print the id and the start/stop if args.CHR is None or args.START is None or args.STOP is None: print("""\n You have forgotten to specify the chromosome, the start coordinate, or the stop coordinate to plot. Try something like '-c chr1 --start 20 --stop 2000'. Here is a list of chromosome ids and their lengths from the provided reference. 
The minimum start coordinate is one and the maximum stop coordinate is the length of the chromosome.\n\nID\tLength""") for record in SeqIO.parse(args.REF, "fasta"): print("{}\t{}".format(record.id, len(record.seq))) sys.exit(0) if args.CMD is None: raise IOError("You must specify a plotting command.") # now we parse each set of commands commands = [PlotCommand(thiscmd, args.REF) for thiscmd in reversed(args.CMD)] # set the figure dimensions if args.ratio: figWidth = args.ratio[0] + 1 figHeight = args.ratio[1] + 1 #set the panel dimensions panelWidth = args.ratio[0] panelHeight = args.ratio[1] else: figWidth = 7 figHeight = len(commands) + 2 #set the panel dimensions panelWidth = 5 # panel margin x 2 + panel height = total vertical height panelHeight = 0.8 panelMargin = 0.1 figure = plt.figure(figsize=(figWidth,figHeight)) #find the margins to center the panel in figure leftMargin = (figWidth - panelWidth)/2 bottomMargin = ((figHeight - panelHeight)/2) + panelMargin plot_dict = {"ref": plot_ref, "bam": plot_bam, "gff3": plot_gff3 #"peptides": plot_peptides } panels = [] for i in range(len(commands)): thiscmd = commands[i] if thiscmd.cmdtype in ["gff3", "ref", "peptides"] \ or thiscmd.style == "depth" \ or "narrow" in thiscmd.options: temp_panelHeight = 0.5 else: temp_panelHeight = panelHeight panels.append( plt.axes([leftMargin/figWidth, #left bottomMargin/figHeight, #bottom panelWidth/figWidth, #width temp_panelHeight/figHeight]) #height ) panels[i].tick_params(axis='both',which='both',\ bottom='off', labelbottom='off',\ left='on', labelleft='on', \ right='off', labelright='off',\ top='off', labeltop='off') if thiscmd.cmdtype == "ref": panels[i].tick_params(bottom='on', labelbottom='on') #turn off some of the axes panels[i].spines["top"].set_visible(False) panels[i].spines["bottom"].set_visible(False) panels[i].spines["right"].set_visible(False) panels[i].spines["left"].set_visible(False) panels[i] = plot_dict[thiscmd.cmdtype](panels[i], args.CHR, args.START, args.STOP, thiscmd) bottomMargin = bottomMargin + temp_panelHeight + (2 * panelMargin) # Print image(s) if args.BASENAME is None: file_base = 'browser_{}.png'.format(timestamp()) else: file_base = args.BASENAME path = None if args.path: path = args.path transparent = args.transparent print_images( base_output_name=file_base, image_formats=args.fileform, dpi=args.dpi, no_timestamp = kwargs["no_timestamp"], path = path, transparent=transparent) def run(args): browser(args) pauvre-0.2.3/pauvre/custommargin.py0000644002612300001670000004201313655434263020570 0ustar dschultzbiolum00000000000000#!/usr/bin/env python # -*- coding: utf-8 -*- # pauvre - just a pore PhD student's plotting package # Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved. # # This file is part of pauvre. # # pauvre is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # pauvre is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with pauvre. If not, see . 
import ast import matplotlib matplotlib.use('Agg') import matplotlib.pyplot as plt import matplotlib.patches as mplpatches from matplotlib.colors import LinearSegmentedColormap import numpy as np import pandas as pd import os.path as opath from sys import stderr from pauvre.functions import print_images from pauvre.marginplot import generate_panel from pauvre.stats import stats import pauvre.rcparams as rc import sys import logging # logging logger = logging.getLogger('pauvre') def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True): """This helper method generates the histogram that is added to the panel. In this case, horizontal = True applies to the mean quality histogram. So, horizontal = False only applies to the length histogram. """ l_width = 0.0 f_color = (0.5, 0.5, 0.5) e_color = (0, 0, 0) if horizontal: for step in np.arange(0, len(bin_values), 1): left = bins[step] bottom = 0 width = bins[step + 1] - bins[step] height = bin_values[step] hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, linewidth=l_width, facecolor=f_color, edgecolor=e_color) panel.add_patch(hist_rectangle) else: for step in np.arange(0, len(bin_values), 1): left = 0 bottom = bins[step] width = bin_values[step] height = bins[step + 1] - bins[step] hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, linewidth=l_width, facecolor=f_color, edgecolor=e_color) panel.add_patch(hist_rectangle) def generate_histogram(panel, data_list, min_plot_val, max_plot_val, bin_interval, hist_horizontal=True, left_spine=True, bottom_spine=True, top_spine=False, right_spine=False, x_label=None, y_label=None): bins = np.arange(0, max_plot_val, bin_interval) bin_values, bins2 = np.histogram(data_list, bins) # hist_horizontal is used for quality if hist_horizontal: panel.set_xlim([min_plot_val, max_plot_val]) panel.set_ylim([0, max(bin_values * 1.1)]) # and hist_horizontal == Fale is for read length else: panel.set_xlim([0, max(bin_values * 1.1)]) panel.set_ylim([min_plot_val, max_plot_val]) # Generate histogram bin patches, depending on whether we're plotting # vertically or horizontally _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal) panel.spines['left'].set_visible(left_spine) panel.spines['bottom'].set_visible(bottom_spine) panel.spines['top'].set_visible(top_spine) panel.spines['right'].set_visible(right_spine) if y_label is not None: panel.set_ylabel(y_label) if x_label is not None: panel.set_xlabel(x_label) def generate_square_map(panel, data_frame, plot_min_y, plot_min_x, plot_max_y, plot_max_x, color, xcol, ycol, **kwargs): """This generates the heatmap panels using squares. Everything is quantized by ints. 
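    Illustrative example (hypothetical values): a row with xcol=11.3 and
    ycol=1294.7 is truncated to (11, 1294), so it is counted in the 1x1
    square whose lower-left corner is at (11, 1294); each square is then
    shaded by its count relative to the most populated square.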
""" panel.set_xlim([plot_min_x, plot_max_x]) panel.set_ylim([plot_min_y, plot_max_y]) tempdf = data_frame[[xcol, ycol]] data_frame = tempdf.astype(int) querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol) print(" - Filtering squares with {}".format(querystring)) square_this = data_frame.query(querystring) querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x) print(" - Filtering squares with {}".format(querystring)) square_this = square_this.query(querystring) counts = square_this.groupby([xcol, ycol]).size().reset_index(name='counts') for index, row in counts.iterrows(): x_pos = row[xcol] y_pos = row[ycol] thiscolor = color(row["counts"]/(counts["counts"].max())) rectangle1=mplpatches.Rectangle((x_pos,y_pos),1,1, linewidth=0,\ facecolor=thiscolor) panel.add_patch(rectangle1) all_counts = counts["counts"] return all_counts def generate_heat_map(panel, data_frame, plot_min_y, plot_min_x, plot_max_y, plot_max_x, color, xcol, ycol, **kwargs): panel.set_xlim([plot_min_x, plot_max_x]) panel.set_ylim([plot_min_y, plot_max_y]) querystring = "{}<={} and {}<={}".format(plot_min_y, ycol, plot_min_x, xcol) print(" - Filtering hexmap with {}".format(querystring)) hex_this = data_frame.query(querystring) querystring = "{}<{} and {}<{}".format(ycol, plot_max_y, xcol, plot_max_x) print(" - Filtering hexmap with {}".format(querystring)) hex_this = hex_this.query(querystring) # This single line controls plotting the hex bins in the panel hex_vals = panel.hexbin(hex_this[xcol], hex_this[ycol], gridsize=49, linewidths=0.0, cmap=color) for each in panel.spines: panel.spines[each].set_visible(False) counts = hex_vals.get_array() return counts def generate_legend(panel, counts, color): # completely custom for more control panel.set_xlim([0, 1]) panel.set_ylim([0, 1000]) panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)]) panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)]) for i in np.arange(0, 1001, 1): rgba = color(i / 1001) alpha = rgba[-1] facec = rgba[0:3] hist_rectangle = mplpatches.Rectangle((0, i), 1, 1, linewidth=0.0, facecolor=facec, edgecolor=(0, 0, 0), alpha=alpha) panel.add_patch(hist_rectangle) panel.spines['top'].set_visible(False) panel.spines['left'].set_visible(False) panel.spines['bottom'].set_visible(False) panel.yaxis.set_label_position("right") panel.set_ylabel('count') def custommargin(df, **kwargs): rc.update_rcParams() # 250, 231, 34 light yellow # 67, 1, 85 # R=np.linspace(65/255,1,101) # G=np.linspace(0/255, 231/255, 101) # B=np.linspace(85/255, 34/255, 101) # R=65/255, G=0/255, B=85/255 Rf = 65 / 255 Bf = 85 / 255 pdict = {'red': ((0.0, Rf, Rf), (1.0, Rf, Rf)), 'green': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)), 'blue': ((0.0, Bf, Bf), (1.0, Bf, Bf)), 'alpha': ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)) } # Now we will use this example to illustrate 3 ways of # handling custom colormaps. 
# First, the most direct and explicit: purple1 = LinearSegmentedColormap('Purple1', pdict) # set the figure dimensions fig_width = 1.61 * 3 fig_height = 1 * 3 fig = plt.figure(figsize=(fig_width, fig_height)) # set the panel dimensions heat_map_panel_width = fig_width * 0.5 heat_map_panel_height = heat_map_panel_width * 0.62 # find the margins to center the panel in figure fig_left_margin = fig_bottom_margin = (1 / 6) # lengthPanel y_panel_width = (1 / 8) # the color Bar parameters legend_panel_width = (1 / 24) # define padding h_padding = 0.02 v_padding = 0.05 # Set whether to include y-axes in histograms print(" - Setting panel options.", file = sys.stderr) if kwargs["Y_AXES"]: y_bottom_spine = True y_bottom_tick = True y_bottom_label = True x_left_spine = True x_left_tick = True x_left_label = True x_y_label = 'Count' else: y_bottom_spine = False y_bottom_tick = False y_bottom_label = False x_left_spine = False x_left_tick = False x_left_label = False x_y_label = None panels = [] # Quality histogram panel print(" - Generating the x-axis panel.", file = sys.stderr) x_panel_left = fig_left_margin + y_panel_width + h_padding x_panel_width = heat_map_panel_width / fig_width x_panel_height = y_panel_width * fig_width / fig_height x_panel = generate_panel(x_panel_left, fig_bottom_margin, x_panel_width, x_panel_height, left_tick_param=x_left_tick, label_left_tick_param=x_left_label) panels.append(x_panel) # y histogram panel print(" - Generating the y-axis panel.", file = sys.stderr) y_panel_bottom = fig_bottom_margin + x_panel_height + v_padding y_panel_height = heat_map_panel_height / fig_height y_panel = generate_panel(fig_left_margin, y_panel_bottom, y_panel_width, y_panel_height, bottom_tick_param=y_bottom_tick, label_bottom_tick_param=y_bottom_label) panels.append(y_panel) # Heat map panel heat_map_panel_left = fig_left_margin + y_panel_width + h_padding heat_map_panel_bottom = fig_bottom_margin + x_panel_height + v_padding print(" - Generating the heat map panel.", file = sys.stderr) heat_map_panel = generate_panel(heat_map_panel_left, heat_map_panel_bottom, heat_map_panel_width / fig_width, heat_map_panel_height / fig_height, bottom_tick_param='off', label_bottom_tick_param='off', left_tick_param='off', label_left_tick_param='off') panels.append(heat_map_panel) heat_map_panel.set_title(kwargs["title"]) # Legend panel print(" - Generating the legend panel.", file = sys.stderr) legend_panel_left = fig_left_margin + y_panel_width + \ heat_map_panel_width / fig_width + h_padding legend_panel_bottom = fig_bottom_margin + x_panel_height + v_padding legend_panel_height = heat_map_panel_height / fig_height legend_panel = generate_panel(legend_panel_left, legend_panel_bottom, legend_panel_width, legend_panel_height, bottom_tick_param=False, label_bottom_tick_param=False, left_tick_param=False, label_left_tick_param=False, right_tick_param=True, label_right_tick_param=True) panels.append(legend_panel) # # Everything above this is just to set up the panels # ################################################################## # Set max and min viewing window for the xaxis if kwargs["plot_max_x"]: plot_max_x = kwargs["plot_max_x"] else: if kwargs["square"]: plot_max_x = df[kwargs["xcol"]].max() plot_max_x = max(np.ceil(df[kwargs["xcol"]])) plot_min_x = kwargs["plot_min_x"] # Set x bin sizes if kwargs["xbin"]: x_bin_interval = kwargs["xbin"] else: # again, this is just based on what looks good from experience x_bin_interval = 1 # Generate x histogram print(" - Generating the x-axis histogram.", file 
= sys.stderr) generate_histogram(panel = x_panel, data_list = df[kwargs['xcol']], min_plot_val = plot_min_x, max_plot_val = plot_max_x, bin_interval = x_bin_interval, hist_horizontal = True, x_label=kwargs['xcol'], y_label=x_y_label, left_spine=x_left_spine) # Set max and min viewing window for the y axis if kwargs["plot_max_y"]: plot_max_y = kwargs["plot_max_y"] else: if kwargs["square"]: plot_max_y = df[kwargs["ycol"]].max() else: plot_max_y = max(np.ceil(df[kwargs["ycol"]])) plot_min_y = kwargs["plot_min_y"] # Set y bin sizes if kwargs["ybin"]: y_bin_interval = kwargs["ybin"] else: y_bin_interval = 1 # Generate y histogram print(" - Generating the y-axis histogram.", file = sys.stderr) generate_histogram(panel = y_panel, data_list = df[kwargs['ycol']], min_plot_val = plot_min_y, max_plot_val = plot_max_y, bin_interval = y_bin_interval, hist_horizontal = False, y_label = kwargs['ycol'], bottom_spine = y_bottom_spine) # Generate heat map if kwargs["square"]: print(" - Generating the square heatmap.", file = sys.stderr) counts = generate_square_map(panel = heat_map_panel, data_frame = df, plot_min_y = plot_min_y, plot_min_x = plot_min_x, plot_max_y = plot_max_y, plot_max_x = plot_max_x, color = purple1, xcol = kwargs["xcol"], ycol = kwargs["ycol"]) else: print(" - Generating the heatmap.", file = sys.stderr) counts = generate_heat_map(panel = heat_map_panel, data_frame = df, plot_min_y = plot_min_y, plot_min_x = plot_min_x, plot_max_y = plot_max_y, plot_max_x = plot_max_x, color = purple1, xcol = kwargs["xcol"], ycol = kwargs["ycol"]) # Generate legend print(" - Generating the legend.", file = sys.stderr) generate_legend(legend_panel, counts, purple1) # inform the user of the plotting window if not quiet mode #if not kwargs["QUIET"]: # print("""plotting in the following window: # {0} <= Q-score (x-axis) <= {1} # {2} <= length (y-axis) <= {3}""".format( # plot_min_x, plot_max_x, min_plot_val, max_plot_val), # file=stderr) # Print image(s) if kwargs["output_base_name"] is None: file_base = "custommargin" else: file_base = kwargs["output_base_name"] print(" - Saving your images", file = sys.stderr) print_images( base =file_base, image_formats=kwargs["fileform"], dpi=kwargs["dpi"], no_timestamp = kwargs["no_timestamp"], transparent= kwargs["no_transparent"]) def run(args): print(args) if not opath.exists(args.input_file): raise IOError("The input file does not exist: {}".format( args.input_file)) df = pd.read_csv(args.input_file, header='infer', sep='\t') # make sure that the column names that were specified are actually # in the dataframe if args.xcol not in df.columns: raise IOError("""The x-column name that you specified, {}, is not in the dataframe column names: {}""".format(args.xcol, df.columns)) if args.ycol not in df.columns: raise IOError("""The y-column name that you specified, {}, is not in the dataframe column names: {}""".format(args.ycol, df.columns)) print(" - Successfully read csv file. Here are a few lines:", file = sys.stderr) print(df.head(), file = sys.stderr) print(" - Plotting {} on the x-axis".format(args.xcol),file=sys.stderr) print(df[args.xcol].head(), file = sys.stderr) print(" - Plotting {} on the y-axis".format(args.ycol),file=sys.stderr) print(df[args.ycol].head(), file = sys.stderr) custommargin(df=df.dropna(), **vars(args)) pauvre-0.2.3/pauvre/functions.py0000644002612300001670000003652113622037370020067 0ustar dschultzbiolum00000000000000#!/usr/bin/env python # -*- coding: utf-8 -*- # pauvre # Copyright (c) 2016-2020 Darrin T. Schultz. 
# # This file is part of pauvre. # # pauvre is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # pauvre is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with pauvre. If not, see . from Bio import SeqIO import copy import gzip import matplotlib.pyplot as plt import numpy as np import os import pandas as pd from sys import stderr import time # this makes opening files more robust for different platforms # currently only used in GFFParse import codecs import warnings def print_images(base, image_formats, dpi, transparent=False, no_timestamp = False): """ Save the plot in multiple formats, with or without transparency and with or without timestamps. """ for fmt in image_formats: if no_timestamp: out_name = "{0}.{1}".format(base, fmt) else: out_name = "{0}_{1}.{2}".format(base, timestamp(), fmt) try: if fmt == 'png': plt.savefig(out_name, dpi=dpi, transparent=transparent) else: plt.savefig(out_name, format=fmt, transparent=transparent) except PermissionError: # thanks to https://github.com/wdecoster for the suggestion print("""You don't have permission to save pauvre plots to this directory. Try changing the directory and running the script again!""") class GFFParse(): def __init__(self, filename, stop_codons=None, species=None): self.filename = filename self.samplename = os.path.splitext(os.path.basename(filename))[0] self.species = species self.featureDict = {"name": [], "featType": [], "start": [], "stop": [], "strand": []} gffnames = ["sequence", "source", "featType", "start", "stop", "dunno1", "strand", "dunno2", "tags"] self.features = pd.read_csv(self.filename, comment='#', sep='\t', names=gffnames) self.features['name'] = self.features['tags'].apply(self._get_name) self.features.drop('dunno1', 1, inplace=True) self.features.drop('dunno2', 1, inplace=True) self.features.reset_index(inplace=True, drop=True) # warn the user if there are CDS or gene entries not divisible by three self._check_triplets() # sort the database by start self.features.sort_values(by='start', ascending=True, inplace=True) if stop_codons: strip_codons = ['gene', 'CDS'] # if the direction is forward, subtract three from the stop to bring it closer to the start self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] =\ self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '+'), 'stop'] - 3 # if the direction is reverse, add three to the start (since the coords are flip-flopped) self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] =\ self.features.loc[(self.features['featType'].isin(strip_codons)) & (self.features['strand'] == '-'), 'start'] + 3 self.features['center'] = self.features['start'] + \ ((self.features['stop'] - self.features['start']) / 2) # we need to add one since it doesn't account for the last base otherwise self.features['width'] = abs(self.features['stop'] - self.features['start']) + 1 self.features['lmost'] = self.features.apply(self._determine_lmost, axis=1) self.features['rmost'] = 
self.features.apply(self._determine_rmost, axis=1) self.features['track'] = 0 if len(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) < 1: raise IOError("""The GFF file needs to have a tag ending in "Is_circular=true" with a region from 1 to the number of bases in the mitogenome example: Bf201311 Geneious region 1 13337 . + 0 Is_circular=true """) self.seqlen = int(self.features.loc[self.features['tags'] == "Is_circular=true", 'stop']) self.features.reset_index(inplace=True, drop=True) #print("float", self.features.loc[self.features['name'] == 'COX1', 'center']) #print("float cat", len(self.features.loc[self.features['name'] == 'CAT', 'center'])) # print(self.features) # print(self.seqlen) def set_features(self, new_features): """all this does is reset the features pandas dataframe""" self.features = new_features def get_unique_genes(self): """This returns a series of gene names""" plottable = self.features.query( "featType != 'tRNA' and featType != 'region' and featType != 'source'") return set(plottable['name'].unique()) def shuffle(self): """ this returns a list of all possible shuffles of features. A shuffle is when the frontmost bit of coding + noncoding DNA up until the next bit of coding DNA is removed and tagged on the end of the sequence. In this case this process is represented by shifting gff coordinates. """ shuffles = [] # get the index of the first element # get the index of the next thing # subtract the indices of everything, then reset the ones that are below # zero done = False shuffle_features = self.features[self.features['featType'].isin( ['gene', 'rRNA', 'CDS', 'tRNA'])].copy(deep=True) # we first add the shuffle features without reorganizing # print("shuffle\n",shuffle_features) add_first = copy.deepcopy(self) add_first.set_features(shuffle_features) shuffles.append(add_first) # first gene is changed with every iteration first_gene = list(shuffle_features['name'])[0] # absolute first is the first gene in the original gff file, used to determine if we are done in this while loop absolute_first = list(shuffle_features['name'])[0] while not done: # We need to prevent the case of shuffling in the middle of # overlapped genes. Do this by ensuring that the the start of # end of first gene is less than the start of the next gene. 
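            # Added illustration (hypothetical coordinates): if the current first
            # gene ends at 1500 but the next gene already starts at 1450, the two
            # overlap, so the loop below keeps advancing to a later gene whose
            # start lies past 1500 before rotating the coordinates.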
first_stop = int(shuffle_features.loc[shuffle_features['name'] == first_gene, 'stop']) next_gene = "" for next_index in range(1, len(shuffle_features)): # get the df of the next list, if len == 0, then it is a tRNA and we need to go to the next index next_gene_df = list( shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]['name']) if len(next_gene_df) != 0: next_gene = next_gene_df[next_index] next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start']) #print("looking at {}, prev_stop is {}, start is {}".format( # next_gene, first_stop, next_start)) #print(shuffle_features[shuffle_features['featType'].isin(['gene', 'rRNA', 'CDS'])]) # if the gene we're looking at and the next one don't overlap, move on if first_stop < next_start: break #print("next_gene before checking for first is {}".format(next_gene)) if next_gene == absolute_first: done = True break # now we can reset the first gene for the next iteration first_gene = next_gene shuffle_features = shuffle_features.copy(deep=True) # figure out where the next start point is going to be next_start = int(shuffle_features.loc[shuffle_features['name'] == next_gene, 'start']) #print('next gene: {}'.format(next_gene)) shuffle_features['start'] = shuffle_features['start'] - next_start + 1 shuffle_features['stop'] = shuffle_features['stop'] - next_start + 1 shuffle_features['center'] = shuffle_features['center'] - next_start + 1 # now correct the values that are less than 0 shuffle_features.loc[shuffle_features['start'] < 1, 'start'] = shuffle_features.loc[shuffle_features['start'] < 1, 'start'] + self.seqlen shuffle_features.loc[shuffle_features['stop'] < 1, 'stop'] = shuffle_features.loc[shuffle_features['stop'] < 1, 'start'] + shuffle_features.loc[shuffle_features['stop'] < 1, 'width'] shuffle_features['center'] = shuffle_features['start'] + \ ((shuffle_features['stop'] - shuffle_features['start']) / 2) shuffle_features['lmost'] = shuffle_features.apply(self._determine_lmost, axis=1) shuffle_features['rmost'] = shuffle_features.apply(self._determine_rmost, axis=1) shuffle_features.sort_values(by='start', ascending=True, inplace=True) shuffle_features.reset_index(inplace=True, drop=True) new_copy = copy.deepcopy(self) new_copy.set_features(shuffle_features) shuffles.append(new_copy) #print("len shuffles: {}".format(len(shuffles))) return shuffles def couple(self, other_GFF, this_y=0, other_y=1): """ Compares this set of features to another set and generates tuples of (x,y) coordinate pairs to input into lsi """ other_features = other_GFF.features coordinates = [] for thisname in self.features['name']: othermatch = other_features.loc[other_features['name'] == thisname, 'center'] if len(othermatch) == 1: this_x = float(self.features.loc[self.features['name'] == thisname, 'center']) # /self.seqlen other_x = float(othermatch) # /other_GFF.seqlen # lsi can't handle vertical or horizontal lines, and we don't # need them either for our comparison. Don't add if equivalent. 
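                    # Added illustration (hypothetical centers): if COX1 is centered
                    # at 3500 in this record and at 5200 in other_GFF, the pair
                    # ((3500, this_y), (5200, other_y)) is appended below; identical
                    # centers would form a vertical line, so they are skipped.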
if this_x != other_x: these_coords = ((this_x, this_y), (other_x, other_y)) coordinates.append(these_coords) return coordinates def _check_triplets(self): """This method verifies that all entries of featType gene and CDS are divisible by three""" genesCDSs = self.features.query("featType == 'CDS' or featType == 'gene'") not_trips = genesCDSs.loc[((abs(genesCDSs['stop'] - genesCDSs['start']) + 1) % 3) > 0, ] if len(not_trips) > 0: print_string = "" print_string += "There are CDS and gene entries that are not divisible by three\n" print_string += str(not_trips) warnings.warn(print_string, SyntaxWarning) def _get_name(self, tag_value): """This extracts a name from a single row in 'tags' of the pandas dataframe """ try: if ";" in tag_value: name = tag_value[5:].split(';')[0] else: name = tag_value[5:].split()[0] except: name = tag_value print("Couldn't correctly parse {}".format( tag_value)) return name def _determine_lmost(self, row): """Booleans don't work well for pandas dataframes, so I need to use apply """ if row['start'] < row['stop']: return row['start'] else: return row['stop'] def _determine_rmost(self, row): """Booleans don't work well for pandas dataframes, so I need to use apply """ if row['start'] < row['stop']: return row['stop'] else: return row['start'] def parse_fastq_length_meanqual(fastq): """ arguments: the fastq file path. Hopefully it has been verified to exist already purpose: This function parses a fastq and returns a pandas dataframe of read lengths and read meanQuals. """ # First try to open the file with the gzip package. It will crash # if the file is not gzipped, so this is an easy way to test if # the fastq file is gzipped or not. try: handle = gzip.open(fastq, "rt") length, meanQual = _fastq_parse_helper(handle) except: handle = open(fastq, "r") length, meanQual = _fastq_parse_helper(handle) handle.close() df = pd.DataFrame(list(zip(length, meanQual)), columns=['length', 'meanQual']) return df def filter_fastq_length_meanqual(df, min_len, max_len, min_mqual, max_mqual): querystring = "length >= {0} and meanQual >= {1}".format(min_len, min_mqual) if max_len != None: querystring += " and length <= {}".format(max_len) if max_mqual != None: querystring += " and meanQual <= {}".format(max_mqual) print("Keeping reads that satisfy: {}".format(querystring), file=stderr) filtdf = df.query(querystring) #filtdf["length"] = pd.to_numeric(filtdf["length"], errors='coerce') #filtdf["meanQual"] = pd.to_numeric(filtdf["meanQual"], errors='coerce') return filtdf def _fastq_parse_helper(handle): length = [] meanQual = [] for record in SeqIO.parse(handle, "fastq"): if len(record) > 0: meanQual.append(_arithmetic_mean(record.letter_annotations["phred_quality"])) length.append(len(record)) return length, meanQual def _geometric_mean(phred_values): """in case I want geometric mean in the future, can calculate it like this""" # np.mean(record.letter_annotations["phred_quality"])) pass def _arithmetic_mean(phred_values): """ Convert Phred to 1-accuracy (error probabilities), calculate the arithmetic mean, log transform back to Phred. 
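    Worked example (illustrative values): Phred scores [10, 20] correspond to
    error rates [0.1, 0.01]; their arithmetic mean is 0.055, which converts
    back to a mean Phred of about 12.6 rather than the naive average of 15.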
""" if not isinstance(phred_values, np.ndarray): phred_values = np.array(phred_values) return _erate_to_phred(np.mean(_phred_to_erate(phred_values))) def _phred_to_erate(phred_values): """ converts a list or numpy array of phred values to a numpy array of error rates """ if not isinstance(phred_values, np.ndarray): phred_values = np.array(phred_values) return np.power(10, (-1 * (phred_values / 10))) def _erate_to_phred(erate_values): """ converts a list or numpy array of error rates to a numpy array of phred values """ if not isinstance(erate_values, np.ndarray): phred_values = np.array(erate_values) return -10 * np.log10(erate_values) def timestamp(): """ Returns the current time in :samp:`YYYYMMDD_HHMMSS` format. """ return time.strftime("%Y%m%d_%H%M%S") pauvre-0.2.3/pauvre/gfftools.py0000644002612300001670000006372013322176054017703 0ustar dschultzbiolum00000000000000#!/usr/bin/env python # -*- coding: utf-8 -*- # pauvre - a pore plotting package # Copyright (c) 2016-2018 Darrin T. Schultz. All rights reserved. # # This file is part of pauvre. # # pauvre is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # pauvre is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with pauvre. If not, see . """This file contains things related to parsing and plotting GFF files""" import copy from matplotlib.path import Path import matplotlib.patches as patches global chevron_width global arrow_width global min_text global text_cutoff arrow_width = 80 chevron_width = 40 min_text = 550 text_cutoff = 150 import sys global colorMap colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red', 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white', 'ORF':'orange'} def _plot_left_to_right_introns(panel, geneid, db, y_pos, text = None): """ plots a left to right patch with introns when there are no intervening sequences to consider. Uses a gene id and gffutils database as input. b a .-=^=-. 
c 1__________2---/ e `---1__________2 | #lff \f d| #lff \ | left to \3 | left to \3 | right / | right / 5___________/4 5___________/4 """ #first we need to determine the number of exons bar_thickness = 0.75 #now we can start plotting the exons exonlist = list(db.children(geneid, featuretype='CDS', order_by="start")) for i in range(len(exonlist)): cds_start = exonlist[i].start cds_stop = exonlist[i].stop verts = [(cds_start, y_pos + bar_thickness), #1 (cds_stop - chevron_width, y_pos + bar_thickness), #2 (cds_stop, y_pos + (bar_thickness/2)), #3 (cds_stop - chevron_width, y_pos), #4 (cds_start, y_pos), #5 (cds_start, y_pos + bar_thickness), #1 ] codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY, ] path = Path(verts, codes) patch = patches.PathPatch(path, lw = 0, fc=colorMap['CDS'] ) panel.add_patch(patch) # we must draw the splice junction if i < len(exonlist) - 1: next_start = exonlist[i+1].start next_stop = exonlist[i+1].stop middle = cds_stop + ((next_start - cds_stop)/2) verts = [(cds_stop - chevron_width, y_pos + bar_thickness), #2/a (middle, y_pos + 0.95), #b (next_start, y_pos + bar_thickness), #c (next_start, y_pos + bar_thickness - 0.05), #d (middle, y_pos + 0.95 - 0.05), #e (cds_stop - chevron_width, y_pos + bar_thickness -0.05), #f (cds_stop - chevron_width, y_pos + bar_thickness), #2/a ] codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY, ] path = Path(verts, codes) patch = patches.PathPatch(path, lw = 0, fc=colorMap['CDS'] ) panel.add_patch(patch) return panel def _plot_left_to_right_introns_top(panel, geneid, db, y_pos, text = None): """ slightly different from the above version such thatsplice junctions are more visually explicit. plots a left to right patch with introns when there are no intervening sequences to consider. Uses a gene id and gffutils database as input. b a .-=^=-. 
c 1_____________2---/ e `---1_____________2 | #lff /f d| #lff / | left to / | left to / | right / | right / 4_________/3 4_________/3 """ #first we need to determine the number of exons bar_thickness = 0.75 #now we can start plotting the exons exonlist = list(db.children(geneid, featuretype='CDS', order_by="start")) for i in range(len(exonlist)): cds_start = exonlist[i].start cds_stop = exonlist[i].stop verts = [(cds_start, y_pos + bar_thickness), #1 (cds_stop, y_pos + bar_thickness), #2 (cds_stop - chevron_width, y_pos), #4 (cds_start, y_pos), #5 (cds_start, y_pos + bar_thickness), #1 ] codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY, ] path = Path(verts, codes) patch = patches.PathPatch(path, lw = 0, fc=colorMap['CDS'] ) panel.add_patch(patch) # we must draw the splice junction if i < len(exonlist) - 1: next_start = exonlist[i+1].start next_stop = exonlist[i+1].stop middle = cds_stop + ((next_start - cds_stop)/2) verts = [(cds_stop-5, y_pos + bar_thickness), #2/a (middle, y_pos + 0.95), #b (next_start, y_pos + bar_thickness), #c (next_start, y_pos + bar_thickness - 0.05), #d (middle, y_pos + 0.95 - 0.05), #e (cds_stop-5, y_pos + bar_thickness -0.05), #f (cds_stop-5, y_pos + bar_thickness), #2/a ] codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY, ] path = Path(verts, codes) patch = patches.PathPatch(path, lw = 0, fc=colorMap['CDS'] ) panel.add_patch(patch) return panel def _plot_lff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text): """ plots a lff patch 1__________2 ____________ | #lff \ \ #rff \ | left for \3 \ right for \ | forward / / forward / 5___________/4 /___________/ """ #if there is only one feature to plot, then just plot it print("plotting lff") verts = [(left_df['start'], y_pos + bar_thickness), #1 (right_df['start'] - chevron_width, y_pos + bar_thickness), #2 (left_df['stop'], y_pos + (bar_thickness/2)), #3 (right_df['start'] - chevron_width, y_pos), #4 (left_df['start'], y_pos), #5 (left_df['start'], y_pos + bar_thickness), #1 ] codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY, ] path = Path(verts, codes) patch = patches.PathPatch(path, lw = 0, fc=colorMap[left_df['featType']] ) text_width = left_df['width'] if text and text_width >= min_text: panel = _plot_label(panel, left_df, y_pos, bar_thickness) elif text and text_width < min_text and text_width >= text_cutoff: panel = _plot_label(panel, left_df, y_pos, bar_thickness, rotate = True, arrow = True) return panel, patch def _plot_label(panel, df, y_pos, bar_thickness, rotate = False, arrow = False): # handles the case where a dataframe was passed fontsize = 8 rotation = 0 if rotate: fontsize = 5 rotation = 90 if len(df) == 1: x =((df.loc[0, 'stop'] - df.loc[0, 'start'])/2) + df.loc[0, 'start'] y = y_pos + (bar_thickness/2) # if we need to center somewhere other than the arrow, need to adjust # for the direction of the arrow # it doesn't look good if it shifts by the whole arrow width, so only # shift by half the arrow width if arrow: if df.loc[0, 'strand'] == "+": shift_start = df.loc[0, 'start'] else: shift_start = df.loc[0, 'start'] + (arrow_width/2) x =((df.loc[0, 'stop'] - (arrow_width/2) - df.loc[0, 'start'])/2) + shift_start panel.text(x, y, df.loc[0, 'name'], fontsize = fontsize, ha='center', va='center', color = 'white', family = 'monospace', zorder = 100, rotation = rotation) # and the case where a series was passed else: x = ((df['stop'] - df['start'])/2) + df['start'] y = y_pos 
+ (bar_thickness/2) if arrow: if df['strand'] == "+": shift_start = df['start'] else: shift_start = df['start'] + (arrow_width/2) x =((df['stop'] - (arrow_width/2) - df['start'])/2) + shift_start panel.text(x, y, df['name'], fontsize = fontsize, ha='center', va='center', color = 'white', family = 'monospace', zorder = 100, rotation = rotation) return panel def _plot_rff(panel, left_df, right_df, colorMap, y_pos, bar_thickness, text): """ plots a rff patch ____________ 1__________2 | #lff \ \ #rff \ | left for \ 6\ right for \3 | forward / / forward / |___________/ /5__________/4 """ #if there is only one feature to plot, then just plot it print("plotting rff") verts = [(right_df['start'], y_pos + bar_thickness), #1 (right_df['stop'] - arrow_width, y_pos + bar_thickness), #2 (right_df['stop'], y_pos + (bar_thickness/2)), #3 (right_df['stop'] - arrow_width, y_pos), #4 (right_df['start'], y_pos), #5 (left_df['stop'] + chevron_width, y_pos + (bar_thickness/2)), #6 (right_df['start'], y_pos + bar_thickness), #1 ] codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY, ] path = Path(verts, codes) patch = patches.PathPatch(path, lw = 0, fc=colorMap[right_df['featType']] ) text_width = right_df['width'] if text and text_width >= min_text: panel = _plot_label(panel, right_df, y_pos, bar_thickness) elif text and text_width < min_text and text_width >= text_cutoff: panel = _plot_label(panel, right_df, y_pos, bar_thickness, rotate = True) return panel, patch def x_offset_gff(GFFParseobj, x_offset): """Takes in a gff object (a gff file parsed as a pandas dataframe), and an x_offset value and shifts the start, stop, center, lmost, and rmost. Returns a GFFParse object with the shifted values in GFFParse.features. """ for columnname in ['start', 'stop', 'center', 'lmost', 'rmost']: GFFParseobj.features[columnname] = GFFParseobj.features[columnname] + x_offset return GFFParseobj def gffplot_horizontal(figure, panel, args, gff_object, track_width=0.2, start_y=0.1, **kwargs): """ this plots horizontal things from gff files. it was probably written for synplot, as the browser does not use this at all. """ # Because this size should be relative to the circle that it is plotted next # to, define the start_radius as the place to work from, and the width of # each track. colorMap = {'gene': 'green', 'CDS': 'green', 'tRNA':'pink', 'rRNA':'red', 'misc_feature':'purple', 'rep_origin':'orange', 'spacebar':'white'} augment = 0 bar_thickness = 0.9 * track_width # return these at the end myPatches=[] plot_order = [] idone = False # we need to filter out the tRNAs since those are plotted last plottable_features = gff_object.features.query("featType != 'tRNA' and featType != 'region' and featType != 'source'") plottable_features.reset_index(inplace=True, drop=True) print(plottable_features) len_plottable = len(plottable_features) print('len plottable', len_plottable) # - this for loop relies on the gff features to already be sorted # - The algorithm for this loop works by starting at the 0th index of the # plottable features (i). # - It then looks to see if the next object (the jth) overlaps with the # ith element. 
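    # An illustrative sketch of the grouping step below, with hypothetical
    # coordinates (not from any real GFF): given sorted features
    # A(start=100, stop=250), B(start=200, stop=300) and C(start=400, stop=500),
    # A and B overlap, so they are collected into one group to be drawn
    # together, while C begins after B ends and starts a new group.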
i = 0 j = 1 while i < len(plottable_features): if i + j == len(plottable_features): #we have run off of the df and need to include everything from i to the end these_features = plottable_features.loc[i::,].copy(deep=True) these_features = these_features.reset_index() print(these_features) plot_order.append(these_features) i = len(plottable_features) break print(" - i,j are currently: {},{}".format(i, j)) stop = plottable_features.loc[i]["stop"] start = plottable_features.loc[i+j]["start"] print("stop: {}. start: {}.".format(stop, start)) if plottable_features.loc[i]["stop"] <= plottable_features.loc[i+j]["start"]: print(" - putting elements {} through (including) {} together".format(i, i+j)) these_features = plottable_features.loc[i:i+j-1,].copy(deep=True) these_features = these_features.reset_index() print(these_features) plot_order.append(these_features) i += 1 j = 1 else: j += 1 #while idone == False: # print("im in the overlap-pairing while loop i={}".format(i)) # # look ahead at all of the elements that overlap with the ith element # jdone = False # j = 1 # this_set_minimum_index = i # this_set_maximum_index = i # while jdone == False: # print("new i= {} j={} len={}".format(i, j, len_plottable)) # print("len plottable in jdone: {}".format(len_plottable)) # print("plottable features in jdone:\n {}".format(plottable_features)) # # first make sure that we haven't gone off the end of the dataframe # # This is an edge case where i has a jth element that overlaps with it, # # and j is the last element in the plottable features. # if i+j == len_plottable: # print("i+j == len_plottable") # # this checks for the case that i is the last element of the # # plottable features. # # In both of the above cases, we are done with both the ith and # # the jth features. # if i == len_plottable-1: # print("i == len_plottable-1") # # this is the last analysis, so set idone to true # # to finish after this # idone = True # # the last one can't be in its own group, so just add it solo # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True) # plot_order.append(these_features.reset_index(drop=True)) # break # jdone = True # else: # print("i+j != len_plottable") # # if the lmost of the next gene overlaps with the rmost of # # the current one, it overlaps and couple together # if plottable_features.loc[i+j, 'lmost'] < plottable_features.loc[i, 'rmost']: # print("lmost < rmost") # # note that this feature overlaps with the current # this_set_maximum_index = i+j # # ... 
and we need to look at the next in line # j += 1 # else: # print("lmost !< rmost") # i += 1 + (this_set_maximum_index - this_set_minimum_index) # #add all of the things that grouped together once we don't find any more groups # these_features = plottable_features.loc[this_set_minimum_index:this_set_maximum_index,].copy(deep=True) # plot_order.append(these_features.reset_index(drop=True)) # jdone = True # print("plot order is now: {}".format(plot_order)) # print("jdone: {}".format(str(jdone))) for feature_set in plot_order: # plot_feature_hori handles overlapping cases as well as normal cases panel, patches = gffplot_feature_hori(figure, panel, feature_set, colorMap, start_y, bar_thickness, text = True) for each in patches: print("there are {} patches after gffplot_feature_hori".format(len(patches))) print(each) myPatches.append(each) print("length of myPatches is: {}".format(len(myPatches))) # Now we add all of the tRNAs to this to plot, do it last to overlay # everything else tRNAs = gff_object.features.query("featType == 'tRNA'") tRNAs.reset_index(inplace=True, drop = True) tRNA_bar_thickness = bar_thickness * (0.8) tRNA_start_y = start_y + ((bar_thickness - tRNA_bar_thickness)/2) for i in range(0,len(tRNAs)): this_feature = tRNAs[i:i+1].copy(deep=True) this_feature.reset_index(inplace=True, drop = True) panel, patches = gffplot_feature_hori(figure, panel, this_feature, colorMap, tRNA_start_y, tRNA_bar_thickness, text = True) for patch in patches: myPatches.append(patch) print("There are {} patches at the end of gffplot_horizontal()".format(len(myPatches))) return panel, myPatches def gffplot_feature_hori(figure, panel, feature_df, colorMap, y_pos, bar_thickness, text=True): """This plots the track for a feature, and if there is something for 'this_feature_overlaps_feature', then there is special processing to add the white bar and the extra slope for the chevron """ myPatches = [] #if there is only one feature to plot, then just plot it if len(feature_df) == 1: #print("plotting a single thing: {} {}".format(str(feature_df['sequence']).split()[1], # str(feature_df['featType']).split()[1] )) #print(this_feature['name'], "is not overlapping") # This plots this shape: 1_________2 2_________1 # | forward \3 3/ reverse | # |5__________/4 \4________5| if feature_df.loc[0,'strand'] == '+': verts = [(feature_df.loc[0, 'start'], y_pos + bar_thickness), #1 (feature_df.loc[0, 'stop'] - arrow_width, y_pos + bar_thickness), #2 (feature_df.loc[0, 'stop'], y_pos + (bar_thickness/2)), #3 (feature_df.loc[0, 'stop'] - arrow_width, y_pos), #4 (feature_df.loc[0, 'start'], y_pos), #5 (feature_df.loc[0, 'start'], y_pos + bar_thickness)] #1 elif feature_df.loc[0,'strand'] == '-': verts = [(feature_df.loc[0, 'stop'], y_pos + bar_thickness), #1 (feature_df.loc[0, 'start'] + arrow_width, y_pos + bar_thickness), #2 (feature_df.loc[0, 'start'], y_pos + (bar_thickness/2)), #3 (feature_df.loc[0, 'start'] + arrow_width, y_pos), #4 (feature_df.loc[0, 'stop'], y_pos), #5 (feature_df.loc[0, 'stop'], y_pos + bar_thickness)] #1 feat_width = feature_df.loc[0,'width'] if text and feat_width >= min_text: panel = _plot_label(panel, feature_df.loc[0,], y_pos, bar_thickness) elif text and feat_width < min_text and feat_width >= text_cutoff: panel = _plot_label(panel, feature_df.loc[0,], y_pos, bar_thickness, rotate = True, arrow = True) codes = [Path.MOVETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.LINETO, Path.CLOSEPOLY] path = Path(verts, codes) print("normal path is: {}".format(path)) # If the feature itself is smaller 
than the arrow, we need to take special measures to if feature_df.loc[0,'width'] <= arrow_width: path = Path([verts[i] for i in [0,2,4,5]], [codes[i] for i in [0,2,4,5]]) patch = patches.PathPatch(path, lw = 0, fc=colorMap[feature_df.loc[0, 'featType']] ) myPatches.append(patch) # there are four possible scenarios if there are two overlapping sequences: # ___________ ____________ ____________ ___________ # | #1 \ \ #1 \ / #2 / / #2 | # | both seqs \ \ both seqs \ / both seqs / / both seqs | # | forward / / forward / \ reverse \ \ reverse | # |__________/ /___________/ \___________\ \___________| # ___________ _____________ ____________ _ _________ # | #3 \ \ #3 | / #2 _| #2 \ # | one seq \ \ one seq | / one seq |_ one seq \ # | forward \ \ reverse | \ reverse _| forward / # |_____________\ \_________| \__________|_ ___________/ # # These different scenarios can be thought of as different left/right # flanking segment types. # In the annotation #rff: # - 'r' refers to the annotation type as being on the right # - the first 'f' refers to the what element is to the left of this one. # Since it is forward the 5' end of this annotation must be a chevron # - the second 'f' refers to the right side of this element. Since it is # forward it must be a normal arrow. # being on the right # # *LEFT TYPES* *RIGHT TYPES* # ____________ ____________ # | #lff \ \ #rff \ # | left for \ \ right for \ # | forward / / forward / # |___________/ /___________/ # ___________ _____________ # | #lfr \ \ #rfr | # | left for \ \ right for | # | reverse \ \ reverse | # |_____________\ \_________| # ____________ ___________ # / #lrr / / #rrr | # / left rev / / right rev | # \ reverse \ \ reverse | # \___________\ \___________| # ____________ __________ # / #lrf _| _| #rrf \ # / left rev |_ | _ right rev \ # \ forward _| _| forward / # \__________| |____________/ # # To properly plot these elements, we must go through each element of the # feature_df to determine which patch type it is. 
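    # A condensed reference for the strand-based dispatch below (restating the
    # diagrams above, not new behavior): for a left/right pair of overlapping
    # features, strands (+,+) map to the lff/rff patches, (+,-) to lfr/rfr,
    # (-,+) to lrf/rrf, and (-,-) to lrr/rrr. Only the lff and rff cases are
    # currently drawn; the other combinations raise IOError.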
elif len(feature_df) == 2: print("im in here feat len=2") for i in range(len(feature_df)): # this tests for which left type we're dealing with if i == 0: # type could be lff or lfr if feature_df.loc[i, 'strand'] == '+': if feature_df.loc[i + 1, 'strand'] == '+': # plot a lff type panel, patch = _plot_lff(panel, feature_df.iloc[i,], feature_df.iloc[i+1,], colorMap, y_pos, bar_thickness, text) myPatches.append(patch) elif feature_df.loc[i + 1, 'strand'] == '-': #plot a lfr type raise IOError("can't plot {} patches yet".format("lfr")) # or type could be lrr or lrf elif feature_df.loc[i, 'strand'] == '-': if feature_df.loc[i + 1, 'strand'] == '+': # plot a lrf type raise IOError("can't plot {} patches yet".format("lrf")) elif feature_df.loc[i + 1, 'strand'] == '-': #plot a lrr type raise IOError("can't plot {} patches yet".format("lrr")) # in this case we're only dealing with 'right type' patches elif i == len(feature_df) - 1: # type could be rff or rfr if feature_df.loc[i-1, 'strand'] == '+': if feature_df.loc[i, 'strand'] == '+': # plot a rff type panel, patch = _plot_rff(panel, feature_df.iloc[i-1,], feature_df.iloc[i,], colorMap, y_pos, bar_thickness, text) myPatches.append(patch) elif feature_df.loc[i, 'strand'] == '-': #plot a rfr type raise IOError("can't plot {} patches yet".format("rfr")) # or type could be rrr or rrf elif feature_df.loc[i-1, 'strand'] == '-': if feature_df.loc[i, 'strand'] == '+': # plot a rrf type raise IOError("can't plot {} patches yet".format("rrf")) elif feature_df.loc[i, 'strand'] == '-': #plot a rrr type raise IOError("can't plot {} patches yet".format("rrr")) return panel, myPatches pauvre-0.2.3/pauvre/lsi/0000755002612300001670000000000014044135227016264 5ustar dschultzbiolum00000000000000pauvre-0.2.3/pauvre/lsi/Q.py0000755002612300001670000000324313274475032017050 0ustar dschultzbiolum00000000000000# Binary search tree that holds status of sweep line. Only leaves hold values. # Operations for finding left and right neighbors of a query point p and finding which segments contain p. # Author: Sam Lichtenberg # Email: splichte@princeton.edu # Date: 09/02/2013 from pauvre.lsi.helper import * ev = 0.00000001 class Q: def __init__(self, key, value): self.key = key self.value = value self.left = None self.right = None def find(self, key): if self.key is None: return False c = compare_by_y(key, self.key) if c==0: return True elif c==-1: if self.left: self.left.find(key) else: return False else: if self.right: self.right.find(key) else: return False def insert(self, key, value): if self.key is None: self.key = key self.value = value c = compare_by_y(key, self.key) if c==0: self.value += value elif c==-1: if self.left is None: self.left = Q(key, value) else: self.left.insert(key, value) else: if self.right is None: self.right = Q(key, value) else: self.right.insert(key, value) # must return key AND value def get_and_del_min(self, parent=None): if self.left is not None: return self.left.get_and_del_min(self) else: k = self.key v = self.value if parent: parent.left = self.right # i.e. is root node else: if self.right: self.key = self.right.key self.value = self.right.value self.left = self.right.left self.right = self.right.right else: self.key = None return k,v def print_tree(self): if self.left: self.left.print_tree() print(self.key) print(self.value) if self.right: self.right.print_tree() pauvre-0.2.3/pauvre/lsi/T.py0000755002612300001670000002076013274475032017056 0ustar dschultzbiolum00000000000000# Binary search tree that holds status of sweep line. 
Only leaves hold values. # Operations for finding left and right neighbors of a query point p and finding which segments contain p. # Author: Sam Lichtenberg # Email: splichte@princeton.edu # Date: 09/02/2013 from pauvre.lsi.helper import * ev = 0.00000001 class T: def __init__(self): self.root = Node(None, None, None, None) def contain_p(self, p): if self.root.value is None: return [[], []] lists = [[], []] self.root.contain_p(p, lists) return (lists[0], lists[1]) def get_left_neighbor(self, p): if self.root.value is None: return None return self.root.get_left_neighbor(p) def get_right_neighbor(self, p): if self.root.value is None: return None return self.root.get_right_neighbor(p) def insert(self, key, s): if self.root.value is None: self.root.left = Node(s, None, None, self.root) self.root.value = s self.root.m = get_slope(s) else: (node, path) = self.root.find_insert_pt(key, s) if path == 'r': node.right = Node(s, None, None, node) node.right.adjust() elif path == 'l': node.left = Node(s, None, None, node) else: # this means matching Node was a leaf # need to make a new internal Node if node.compare_to_key(key) < 0 or (node.compare_to_key(key)==0 and node.compare_lower(key, s) < 1): new_internal = Node(s, None, node, node.parent) new_leaf = Node(s, None, None, new_internal) new_internal.left = new_leaf if node is node.parent.left: node.parent.left = new_internal node.adjust() else: node.parent.right = new_internal else: new_internal = Node(node.value, node, None, node.parent) new_leaf = Node(s, None, None, new_internal) new_internal.right = new_leaf if node is node.parent.left: node.parent.left = new_internal new_leaf.adjust() else: node.parent.right = new_internal node.parent = new_internal def delete(self, p, s): key = p node = self.root.find_delete_pt(key, s) val = node.value if node is node.parent.left: parent = node.parent.parent if parent is None: if self.root.right is not None: if self.root.right.left or self.root.right.right: self.root = self.root.right self.root.parent = None else: self.root.left = self.root.right self.root.value = self.root.right.value self.root.m = self.root.right.m self.root.right = None else: self.root.left = None self.root.value = None elif node.parent is parent.left: parent.left = node.parent.right node.parent.right.parent = parent else: parent.right = node.parent.right node.parent.right.parent = parent else: parent = node.parent.parent if parent is None: if self.root.left: # switch properties if self.root.left.right or self.root.left.left: self.root = self.root.left self.root.parent = None else: self.root.right = None else: self.root.right = None self.root.value = None elif node.parent is parent.left: parent.left = node.parent.left node.parent.left.parent = parent farright = node.parent.left while farright.right is not None: farright = farright.right farright.adjust() else: parent.right = node.parent.left node.parent.left.parent = parent farright = node.parent.left while farright.right is not None: farright = farright.right farright.adjust() return val def print_tree(self): self.root.print_tree() class Node: def __init__(self, value, left, right, parent): self.value = value # associated line segment self.left = left self.right = right self.parent = parent self.m = None if value is not None: self.m = get_slope(value) # compares line segment at y-val of p to p # TODO: remove this and replace with get_x_at def compare_to_key(self, p): x0 = self.value[0][0] y0 = self.value[0][1] y1 = p[1] if self.m != 0 and self.m is not None: x1 = x0 - float(y0-y1)/self.m 
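            # Here x1 is the x-coordinate of this node's segment at p's
            # y-value, from the point-slope relation y0 - y1 = m * (x0 - x1),
            # i.e. x1 = x0 - (y0 - y1) / m.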
return compare_by_x(p, (x1, y1)) else: x1 = p[0] return 0 def get_left_neighbor(self, p): neighbor = None n = self if n.left is None and n.right is None: return neighbor last_right = None found = False while not found: c = n.compare_to_key(p) if c < 1 and n.left: n = n.left elif c==1 and n.right: n = n.right last_right = n.parent else: found = True c = n.compare_to_key(p) if c==0: if n is n.parent.right: return n.parent else: goright = None if last_right: goright =last_right.left return self.get_lr(None, goright)[0] # n stores the highest-value in the left subtree if c==-1: goright = None if last_right: goright = last_right.left return self.get_lr(None, goright)[0] if c==1: neighbor = n return neighbor def get_right_neighbor(self, p): neighbor = None n = self if n.left is None and n.right is None: return neighbor last_left = None found = False while not found: c = n.compare_to_key(p) if c==0 and n.right: n = n.right elif c < 0 and n.left: n = n.left last_left = n.parent elif c==1 and n.right: n = n.right else: found = True c = n.compare_to_key(p) # can be c==0 and n.left if at root node if c==0: if n.parent is None: return None if n is n.parent.right: goleft = None if last_left: goleft = last_left.right return self.get_lr(goleft, None)[1] else: return self.get_lr(n.parent.right, None)[1] if c==1: goleft = None if last_left: goleft = last_left.right return self.get_lr(goleft, None)[1] if c==-1: return n return neighbor # travels down a single direction to get neighbors def get_lr(self, left, right): lr = [None, None] if left: while left.left: left = left.left lr[1] = left if right: while right.right: right = right.right lr[0] = right return lr def contain_p(self, p, lists): c = self.compare_to_key(p) if c==0: if self.left is None and self.right is None: if compare_by_x(p, self.value[1])==0: lists[1].append(self.value) else: lists[0].append(self.value) if self.left: self.left.contain_p(p, lists) if self.right: self.right.contain_p(p, lists) elif c < 0: if self.left: self.left.contain_p(p, lists) else: if self.right: self.right.contain_p(p, lists) def find_insert_pt(self, key, seg): if self.left and self.right: if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1: return self.right.find_insert_pt(key, seg) elif self.compare_to_key(key) < 1: return self.left.find_insert_pt(key, seg) else: return self.right.find_insert_pt(key, seg) # this case only happens at root elif self.left: if self.compare_to_key(key) == 0 and self.compare_lower(key, seg)==1: return (self, 'r') elif self.compare_to_key(key) < 1: return self.left.find_insert_pt(key, seg) else: return (self, 'r') else: return (self, 'n') # adjusts stored segments in inner nodes def adjust(self): value = self.value m = self.m parent = self.parent node = self # go up left as much as possible while parent and node is parent.right: node = parent parent = node.parent # parent to adjust will be on the immediate right if parent and node is parent.left: parent.value = value parent.m = m def compare_lower(self, p, s2): y = p[1] - 10 key = get_x_at(s2, (p[0], y)) return self.compare_to_key(key) # returns matching leaf node, or None if no match # when deleting, you don't delete below--you delete above! so compare lower = -1. 
def find_delete_pt(self, key, value): if self.left and self.right: # if equal at this pt, and this node's value is less than the seg's slightly above this pt if self.compare_to_key(key) == 0 and self.compare_lower(key, value)==-1: return self.right.find_delete_pt(key, value) if self.compare_to_key(key) < 1: return self.left.find_delete_pt(key, value) else: return self.right.find_delete_pt(key, value) elif self.left: if self.compare_to_key(key) < 1: return self.left.find_delete_pt(key, value) else: return None # is leaf else: if self.compare_to_key(key)==0 and segs_equal(self.value, value): return self else: return None # also prints depth of each node def print_tree(self, l=0): l += 1 if self.left: self.left.print_tree(l) if self.left or self.right: print('INTERNAL: {0}'.format(l)) else: print('LEAF: {0}'.format(l)) print(self) print(self.value) if self.right: self.right.print_tree(l) pauvre-0.2.3/pauvre/lsi/__init__.py0000644002612300001670000000000013622004260020354 0ustar dschultzbiolum00000000000000pauvre-0.2.3/pauvre/lsi/helper.py0000755002612300001670000000445013274475032020130 0ustar dschultzbiolum00000000000000# Helper functions for use in the lsi implementation. ev = 0.0000001 # floating-point comparison def approx_equal(a, b, tol): return abs(a - b) < tol # compares x-values of two pts # used for ordering in T def compare_by_x(k1, k2): if approx_equal(k1[0], k2[0], ev): return 0 elif k1[0] < k2[0]: return -1 else: return 1 # higher y value is "less"; if y value equal, lower x value is "less" # used for ordering in Q def compare_by_y(k1, k2): if approx_equal(k1[1], k2[1], ev): if approx_equal(k1[0], k2[0], ev): return 0 elif k1[0] < k2[0]: return -1 else: return 1 elif k1[1] > k2[1]: return -1 else: return 1 # tests if s0 and s1 represent the same segment (i.e. pts can be in 2 different orders) def segs_equal(s0, s1): x00 = s0[0][0] y00 = s0[0][1] x01 = s0[1][0] y01 = s0[1][1] x10 = s1[0][0] y10 = s1[0][1] x11 = s1[1][0] y11 = s1[1][1] if (approx_equal(x00, x10, ev) and approx_equal(y00, y10, ev)): if (approx_equal(x01, x11, ev) and approx_equal(y01, y11, ev)): return True if (approx_equal(x00, x11, ev) and approx_equal(y00, y11, ev)): if (approx_equal(x01, x10, ev) and approx_equal(y01, y10, ev)): return True return False # get m for a given seg in (p1, p2) form def get_slope(s): x0 = s[0][0] y0 = s[0][1] x1 = s[1][0] y1 = s[1][1] if (x1-x0)==0: return None else: return float(y1-y0)/(x1-x0) # given a point p, return the point on s that shares p's y-val def get_x_at(s, p): m = get_slope(s) # TODO: this should check if p's x-val is octually on seg; we're assuming # for now that it would have been deleted already if not if m == 0: # horizontal segment return p # ditto; should check if y-val on seg if m is None: # vertical segment return (s[0][0], p[1]) x1 = s[0][0]-(s[0][1]-p[1])/m return (x1, p[1]) # returns the point at which two line segments intersect, or None if no intersection. 
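# A worked example of the parametric intersection below, with hypothetical
# inputs: for seg1 = ((0, 0), (2, 2)) and seg2 = ((0, 2), (2, 0)),
# denom = -8 and t = u = 0.5, so the function returns (1.0, 1.0).
# Parallel segments give denom == 0 and return None, as do segments whose
# infinite lines cross outside either segment (t or u not in [0, 1]).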
def intersect(seg1, seg2): p = seg1[0] r = (seg1[1][0]-seg1[0][0], seg1[1][1]-seg1[0][1]) q = seg2[0] s = (seg2[1][0]-seg2[0][0], seg2[1][1]-seg2[0][1]) denom = r[0]*s[1]-r[1]*s[0] if denom == 0: return None numer = float(q[0]-p[0])*s[1]-(q[1]-p[1])*s[0] t = numer/denom numer = float(q[0]-p[0])*r[1]-(q[1]-p[1])*r[0] u = numer/denom if (t < 0 or t > 1) or (u < 0 or u > 1): return None x = p[0]+t*r[0] y = p[1]+t*r[1] return (x, y) pauvre-0.2.3/pauvre/lsi/lsi.py0000755002612300001670000000555313274475032017445 0ustar dschultzbiolum00000000000000# Implementation of the Bentley-Ottmann algorithm, described in deBerg et al, ch. 2. # See README for more information. # Author: Sam Lichtenberg # Email: splichte@princeton.edu # Date: 09/02/2013 from pauvre.lsi.Q import Q from pauvre.lsi.T import T from pauvre.lsi.helper import * # "close enough" for floating point ev = 0.00000001 # how much lower to get the x of a segment, to determine which of a set of segments is the farthest right/left lower_check = 100 # gets the point on a segment at a lower y value. def getNextPoint(p, seg, y_lower): p1 = seg[0] p2 = seg[1] if (p1[0]-p2[0])==0: return (p[0]+10, p[1]) slope = float(p1[1]-p2[1])/(p1[0]-p2[0]) if slope==0: return (p1[0], p[1]-y_lower) y = p[1]-y_lower x = p1[0]-(p1[1]-y)/slope return (x, y) """ for each event point: U_p = segments that have p as an upper endpoint C_p = segments that contain p L_p = segments that have p as a lower endpoint """ def handle_event_point(p, segs, q, t, intersections): rightmost = (float("-inf"), 0) rightmost_seg = None leftmost = (float("inf"), 0) leftmost_seg = None U_p = segs (C_p, L_p) = t.contain_p(p) merge_all = U_p+C_p+L_p if len(merge_all) > 1: intersections[p] = [] for s in merge_all: intersections[p].append(s) merge_CL = C_p+L_p merge_UC = U_p+C_p for s in merge_CL: # deletes at a point slightly above (to break ties) - where seg is located in tree # above intersection point t.delete(p, s) # put segments into T based on where they are at y-val just below p[1] for s in merge_UC: n = getNextPoint(p, s, lower_check) if n[0] > rightmost[0]: rightmost = n rightmost_seg = s if n[0] < leftmost[0]: leftmost = n leftmost_seg = s t.insert(p, s) # means only L_p -> check newly-neighbored segments if len(merge_UC) == 0: neighbors = (t.get_left_neighbor(p), t.get_right_neighbor(p)) if neighbors[0] and neighbors[1]: find_new_event(neighbors[0].value, neighbors[1].value, p, q) # of newly inserted pts, find possible intersections to left and right else: left_neighbor = t.get_left_neighbor(p) if left_neighbor: find_new_event(left_neighbor.value, leftmost_seg, p, q) right_neighbor = t.get_right_neighbor(p) if right_neighbor: find_new_event(right_neighbor.value, rightmost_seg, p, q) def find_new_event(s1, s2, p, q): i = intersect(s1, s2) if i: if compare_by_y(i, p) == 1: if not q.find(i): q.insert(i, []) # segment is in ((x, y), (x, y)) form # first pt in a segment should have higher y-val - this is handled in function def intersection(S): s0 = S[0] if s0[1][1] > s0[0][1]: s0 = (s0[1], s0[0]) q = Q(s0[0], [s0]) q.insert(s0[1], []) intersections = {} for s in S[1:]: if s[1][1] > s[0][1]: s = (s[1], s[0]) q.insert(s[0], [s]) q.insert(s[1], []) t = T() while q.key: p, segs = q.get_and_del_min() handle_event_point(p, segs, q, t, intersections) return intersections pauvre-0.2.3/pauvre/lsi/test.py0000755002612300001670000000463313274475032017633 0ustar dschultzbiolum00000000000000# Test file for lsi. 
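# Usage sketch, inferred from the argument handling below: `python test.py 100`
# generates 100 random segments, saves them to ./input, and compares the
# sweep-line result against an O(n^2) pairwise check; `python test.py 100
# saved_input` re-runs both checks on a previously saved segment list instead.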
# Author: Sam Lichtenberg # Email: splichte@princeton.edu # Date: 09/02/2013 from lsi import intersection import random import time, sys from helper import * ev = 0.00000001 def scale(i): return float(i) use_file = None try: use_file = sys.argv[2] except: pass if not use_file: S = [] for i in range(int(sys.argv[1])): p1 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000))) p2 = (scale(random.randint(0, 1000)), scale(random.randint(0, 1000))) s = (p1, p2) S.append(s) f = open('input', 'w') f.write(str(S)) f.close() else: f = open(sys.argv[2], 'r') S = eval(f.read()) intersections = [] seen = [] vs = False hs = False es = False now = time.time() for seg1 in S: if approx_equal(seg1[0][0], seg1[1][0], ev): print 'VERTICAL SEG' print '' print '' vs = True if approx_equal(seg1[0][1], seg1[1][1], ev): print 'HORIZONTAL SEG' print '' print '' hs = True for seg2 in S: if seg1 is not seg2 and segs_equal(seg1, seg2): print 'EQUAL SEGS' print '' print '' es = True if seg1 is not seg2 and (seg2, seg1) not in seen: i = intersect(seg1, seg2) if i: intersections.append((i, [seg1, seg2])) # xpts = [seg1[0][0], seg1[1][0], seg2[0][0], seg2[1][0]] # xpts = sorted(xpts) # if (i[0] <= xpts[2] and i[0] >= xpts[1]: # intersections.append((i, [seg1, seg2])) seen.append((seg1, seg2)) later = time.time() n2time = later-now print "Line sweep results:" now = time.time() lsinters = intersection(S) inters = [] for k, v in lsinters.iteritems(): #print '{0}: {1}'.format(k, v) inters.append(k) # inters.append(v) later = time.time() print 'TIME ELAPSED: {0}'.format(later-now) print "N^2 comparison results:" pts_seen = [] highestseen = 0 for i in intersections: seen_already = False seen = 0 for p in pts_seen: if approx_equal(i[0][0], p[0], ev) and approx_equal(i[0][1], p[1], ev): seen += 1 seen_already = True if seen > highestseen: highestseen = seen if not seen_already: pts_seen.append(i[0]) in_k = False for k in inters: if approx_equal(k[0], i[0][0], ev) and approx_equal(k[1], i[0][1], ev): in_k = True if in_k == False: print 'Not in K: {0}: {1}'.format(i[0], i[1]) # print i print highestseen print 'TIME ELAPSED: {0}'.format(n2time) #print 'Missing from line sweep but in N^2:' #for i in seen: # matched = False print len(lsinters) print len(pts_seen) if len(lsinters) != len(pts_seen): print 'uh oh!' pauvre-0.2.3/pauvre/marginplot.py0000644002612300001670000003703513622045777020246 0ustar dschultzbiolum00000000000000#!/usr/bin/env python # -*- coding: utf-8 -*- # pauvre # Copyright (c) 2016-2020 Darrin T. Schultz. # # This file is part of pauvre. # # pauvre is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # pauvre is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with pauvre. If not, see . 
import ast import matplotlib matplotlib.use('Agg') import matplotlib.pyplot as plt import matplotlib.patches as mplpatches from matplotlib.colors import LinearSegmentedColormap import numpy as np import pandas as pd import os.path as opath from sys import stderr from pauvre.functions import parse_fastq_length_meanqual, print_images, filter_fastq_length_meanqual from pauvre.stats import stats import pauvre.rcparams as rc import logging # logging logger = logging.getLogger('pauvre') def generate_panel(panel_left, panel_bottom, panel_width, panel_height, axis_tick_param='both', which_tick_param='both', bottom_tick_param=True, label_bottom_tick_param=True, left_tick_param=True, label_left_tick_param=True, right_tick_param=False, label_right_tick_param=False, top_tick_param=False, label_top_tick_param=False): """ Setting default panel tick parameters. Some of these are the defaults for matplotlib anyway, but specifying them for readability. Here are options and defaults for the parameters used below: axis : {'x', 'y', 'both'}; which axis to modify; default = 'both' which : {'major', 'minor', 'both'}; which ticks to modify; default = 'major' bottom, top, left, right : bool or {True, False}; ticks on or off; labelbottom, labeltop, labelleft, labelright : bool or {True, False} """ # create the panel panel_rectangle = [panel_left, panel_bottom, panel_width, panel_height] panel = plt.axes(panel_rectangle) # Set tick parameters panel.tick_params(axis=axis_tick_param, which=which_tick_param, bottom=bottom_tick_param, labelbottom=label_bottom_tick_param, left=left_tick_param, labelleft=label_left_tick_param, right=right_tick_param, labelright=label_right_tick_param, top=top_tick_param, labeltop=label_top_tick_param) return panel def _generate_histogram_bin_patches(panel, bins, bin_values, horizontal=True): """This helper method generates the histogram that is added to the panel. In this case, horizontal = True applies to the mean quality histogram. So, horizontal = False only applies to the length histogram. 
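    For example, with horizontal=True each rectangle spans
    bins[step]..bins[step + 1] along the x-axis and rises to bin_values[step];
    with horizontal=False the roles are swapped, so the bar sits between
    bins[step] and bins[step + 1] on the y-axis and extends to
    bin_values[step] along the x-axis.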
""" l_width = 0.0 f_color = (0.5, 0.5, 0.5) e_color = (0, 0, 0) if horizontal: for step in np.arange(0, len(bin_values), 1): left = bins[step] bottom = 0 width = bins[step + 1] - bins[step] height = bin_values[step] hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, linewidth=l_width, facecolor=f_color, edgecolor=e_color) panel.add_patch(hist_rectangle) else: for step in np.arange(0, len(bin_values), 1): left = 0 bottom = bins[step] width = bin_values[step] height = bins[step + 1] - bins[step] hist_rectangle = mplpatches.Rectangle((left, bottom), width, height, linewidth=l_width, facecolor=f_color, edgecolor=e_color) panel.add_patch(hist_rectangle) def generate_histogram(panel, data_list, max_plot_length, min_plot_length, bin_interval, hist_horizontal=True, left_spine=True, bottom_spine=True, top_spine=False, right_spine=False, x_label=None, y_label=None): bins = np.arange(0, max_plot_length, bin_interval) bin_values, bins2 = np.histogram(data_list, bins) # hist_horizontal is used for quality if hist_horizontal: panel.set_xlim([min_plot_length, max_plot_length]) panel.set_ylim([0, max(bin_values * 1.1)]) # and hist_horizontal == Fale is for read length else: panel.set_xlim([0, max(bin_values * 1.1)]) panel.set_ylim([min_plot_length, max_plot_length]) # Generate histogram bin patches, depending on whether we're plotting # vertically or horizontally _generate_histogram_bin_patches(panel, bins, bin_values, hist_horizontal) panel.spines['left'].set_visible(left_spine) panel.spines['bottom'].set_visible(bottom_spine) panel.spines['top'].set_visible(top_spine) panel.spines['right'].set_visible(right_spine) if y_label is not None: panel.set_ylabel(y_label) if x_label is not None: panel.set_xlabel(x_label) def generate_heat_map(panel, data_frame, min_plot_length, min_plot_qual, max_plot_length, max_plot_qual, color, **kwargs): panel.set_xlim([min_plot_qual, max_plot_qual]) panel.set_ylim([min_plot_length, max_plot_length]) if kwargs["kmerdf"]: hex_this = data_frame.query('length<{} and numks<{}'.format( max_plot_length, max_plot_qual)) # This single line controls plotting the hex bins in the panel hex_vals = panel.hexbin(hex_this['numks'], hex_this['length'], gridsize=int(np.ceil(max_plot_qual/2)), linewidths=0.0, cmap=color) else: hex_this = data_frame.query('length<{} and meanQual<{}'.format( max_plot_length, max_plot_qual)) # This single line controls plotting the hex bins in the panel hex_vals = panel.hexbin(hex_this['meanQual'], hex_this['length'], gridsize=49, linewidths=0.0, cmap=color) for each in panel.spines: panel.spines[each].set_visible(False) counts = hex_vals.get_array() return counts def generate_legend(panel, counts, color): # completely custom for more control panel.set_xlim([0, 1]) panel.set_ylim([0, 1000]) panel.set_yticks([int(x) for x in np.linspace(0, 1000, 6)]) panel.set_yticklabels([int(x) for x in np.linspace(0, max(counts), 6)]) for i in np.arange(0, 1001, 1): rgba = color(i / 1001) alpha = rgba[-1] facec = rgba[0:3] hist_rectangle = mplpatches.Rectangle((0, i), 1, 1, linewidth=0.0, facecolor=facec, edgecolor=(0, 0, 0), alpha=alpha) panel.add_patch(hist_rectangle) panel.spines['top'].set_visible(False) panel.spines['left'].set_visible(False) panel.spines['bottom'].set_visible(False) panel.yaxis.set_label_position("right") panel.set_ylabel('Number of Reads') def margin_plot(df, **kwargs): rc.update_rcParams() # 250, 231, 34 light yellow # 67, 1, 85 # R=np.linspace(65/255,1,101) # G=np.linspace(0/255, 231/255, 101) # B=np.linspace(85/255, 34/255, 101) # 
R=65/255, G=0/255, B=85/255 Rf = 65 / 255 Bf = 85 / 255 pdict = {'red': ((0.0, Rf, Rf), (1.0, Rf, Rf)), 'green': ((0.0, 0.0, 0.0), (1.0, 0.0, 0.0)), 'blue': ((0.0, Bf, Bf), (1.0, Bf, Bf)), 'alpha': ((0.0, 0.0, 0.0), (1.0, 1.0, 1.0)) } # Now we will use this example to illustrate 3 ways of # handling custom colormaps. # First, the most direct and explicit: purple1 = LinearSegmentedColormap('Purple1', pdict) # set the figure dimensions fig_width = 1.61 * 3 fig_height = 1 * 3 fig = plt.figure(figsize=(fig_width, fig_height)) # set the panel dimensions heat_map_panel_width = fig_width * 0.5 heat_map_panel_height = heat_map_panel_width * 0.62 # find the margins to center the panel in figure fig_left_margin = fig_bottom_margin = (1 / 6) # lengthPanel length_panel_width = (1 / 8) # the color Bar parameters legend_panel_width = (1 / 24) # define padding h_padding = 0.02 v_padding = 0.05 # Set whether to include y-axes in histograms if kwargs["Y_AXES"]: length_bottom_spine = True length_bottom_tick = False length_bottom_label = True qual_left_spine = True qual_left_tick = True qual_left_label = True qual_y_label = 'Count' else: length_bottom_spine = False length_bottom_tick = False length_bottom_label = False qual_left_spine = False qual_left_tick = False qual_left_label = False qual_y_label = None panels = [] # Quality histogram panel qual_panel_left = fig_left_margin + length_panel_width + h_padding qual_panel_width = heat_map_panel_width / fig_width qual_panel_height = length_panel_width * fig_width / fig_height qual_panel = generate_panel(qual_panel_left, fig_bottom_margin, qual_panel_width, qual_panel_height, left_tick_param=qual_left_tick, label_left_tick_param=qual_left_label) panels.append(qual_panel) # Length histogram panel length_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding length_panel_height = heat_map_panel_height / fig_height length_panel = generate_panel(fig_left_margin, length_panel_bottom, length_panel_width, length_panel_height, bottom_tick_param=length_bottom_tick, label_bottom_tick_param=length_bottom_label) panels.append(length_panel) # Heat map panel heat_map_panel_left = fig_left_margin + length_panel_width + h_padding heat_map_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding heat_map_panel = generate_panel(heat_map_panel_left, heat_map_panel_bottom, heat_map_panel_width / fig_width, heat_map_panel_height / fig_height, bottom_tick_param=False, label_bottom_tick_param=False, left_tick_param=False, label_left_tick_param=False) panels.append(heat_map_panel) heat_map_panel.set_title(kwargs["title"]) # Legend panel legend_panel_left = fig_left_margin + length_panel_width + \ (heat_map_panel_width / fig_width) + (h_padding * 2) legend_panel_bottom = fig_bottom_margin + qual_panel_height + v_padding legend_panel_height = heat_map_panel_height / fig_height legend_panel = generate_panel(legend_panel_left, legend_panel_bottom, legend_panel_width, legend_panel_height, bottom_tick_param = False, label_bottom_tick_param = False, left_tick_param = False, label_left_tick_param = False, right_tick_param = True, label_right_tick_param = True) panels.append(legend_panel) # Set min and max viewing window for length if kwargs["plot_maxlen"]: max_plot_length = kwargs["plot_maxlen"] else: max_plot_length = int(np.percentile(df['length'], 99)) min_plot_length = kwargs["plot_minlen"] # Set length bin sizes if kwargs["lengthbin"]: length_bin_interval = kwargs["lengthbin"] else: # Dividing by 80 is based on what looks good from experience length_bin_interval = 
int(max_plot_length / 80) # length_bins = np.arange(0, max_plot_length, length_bin_interval) # Set max and min viewing window for quality if kwargs["plot_maxqual"]: max_plot_qual = kwargs["plot_maxqual"] elif kwargs["kmerdf"]: max_plot_qual = np.ceil(df["numks"].median() * 2) else: max_plot_qual = max(np.ceil(df['meanQual'])) min_plot_qual = kwargs["plot_minqual"] # Set qual bin sizes if kwargs["qualbin"]: qual_bin_interval = kwargs["qualbin"] elif kwargs["kmerdf"]: qual_bin_interval = 1 else: # again, this is just based on what looks good from experience qual_bin_interval = max_plot_qual / 85 qual_bins = np.arange(0, max_plot_qual, qual_bin_interval) # Generate length histogram generate_histogram(length_panel, df['length'], max_plot_length, min_plot_length, length_bin_interval, hist_horizontal=False, y_label='Read Length', bottom_spine=length_bottom_spine) # Generate quality histogram if kwargs["kmerdf"]: generate_histogram(qual_panel, df['numks'], max_plot_qual, min_plot_qual, qual_bin_interval, x_label='number of kmers', y_label=qual_y_label, left_spine=qual_left_spine) else: generate_histogram(qual_panel, df['meanQual'], max_plot_qual, min_plot_qual, qual_bin_interval, x_label='Phred Quality', y_label=qual_y_label, left_spine=qual_left_spine) # Generate heat map counts = generate_heat_map(heat_map_panel, df, min_plot_length, min_plot_qual, max_plot_length, max_plot_qual, purple1, kmerdf = kwargs["kmerdf"]) # Generate legend generate_legend(legend_panel, counts, purple1) # inform the user of the plotting window if not quiet mode if not kwargs["QUIET"]: print("""plotting in the following window: {0} <= Q-score (x-axis) <= {1} {2} <= length (y-axis) <= {3}""".format( min_plot_qual, max_plot_qual, min_plot_length, max_plot_length), file=stderr) # Print image(s) if kwargs["BASENAME"] is None: file_base = opath.splitext(opath.basename(kwargs["fastq"]))[0] else: file_base = kwargs["BASENAME"] print_images( file_base, image_formats=kwargs["fileform"], dpi=kwargs["dpi"], no_timestamp = kwargs["no_timestamp"], transparent=kwargs["TRANSPARENT"]) def run(args): if args.kmerdf: df = pd.read_csv(args.kmerdf, header='infer', sep='\t') df["kmers"] = df["kmers"].apply(ast.literal_eval) else: df = parse_fastq_length_meanqual(args.fastq) df = filter_fastq_length_meanqual(df, args.filt_minlen, args.filt_maxlen, args.filt_minqual, args.filt_maxqual) stats(df, args.fastq, False) margin_plot(df=df.dropna(), **vars(args)) pauvre-0.2.3/pauvre/pauvre_main.py0000644002612300001670000010036313622037671020365 0ustar dschultzbiolum00000000000000#!/usr/bin/env python # -*- coding: utf-8 -*- # pauvre - just a pore plotting package # Copyright (c) 2016-2017 Darrin T. Schultz. All rights reserved. # # This file is part of pauvre. # # pauvre is free software: you can redistribute it and/or modify # it under the terms of the GNU General Public License as published by # the Free Software Foundation, either version 3 of the License, or # (at your option) any later version. # # pauvre is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with pauvre. If not, see . # I modeled this code on https://github.com/arq5x/poretools/. Check it out. - DTS import sys import os.path import argparse # pauvre imports import pauvre.version # This class is used in argparse to expand the ~. 
This avoids errors caused on # some systems. class FullPaths(argparse.Action): """Expand user- and relative-paths""" def __call__(self, parser, namespace, values, option_string=None): setattr(namespace, self.dest, os.path.abspath(os.path.expanduser(values))) class FullPathsList(argparse.Action): """Expand user- and relative-paths when a list of paths is passed to the program""" def __call__(self, parser, namespace, values, option_string=None): setattr(namespace, self.dest, [os.path.abspath(os.path.expanduser(value)) for value in values]) def run_subtool(parser, args): if args.command == 'browser': import pauvre.browser as submodule elif args.command == 'custommargin': import pauvre.custommargin as submodule elif args.command == 'marginplot': import pauvre.marginplot as submodule elif args.command == 'redwood': import pauvre.redwood as submodule elif args.command == 'stats': import pauvre.stats as submodule elif args.command == 'synplot': import pauvre.synplot as submodule # run the chosen submodule. submodule.run(args) class ArgumentParserWithDefaults(argparse.ArgumentParser): def __init__(self, *args, **kwargs): super(ArgumentParserWithDefaults, self).__init__(*args, **kwargs) self.add_argument("-q", "--quiet", help="Do not output warnings to stderr", action="store_true", dest="QUIET") def main(): ######################################### # create the top-level parser ######################################### parser = argparse.ArgumentParser( prog='pauvre', formatter_class=argparse.ArgumentDefaultsHelpFormatter) parser.add_argument("-v", "--version", help="Installed pauvre version", action="version", version="%(prog)s " + str(pauvre.version.__version__)) subparsers = parser.add_subparsers( title='[sub-commands]', dest='command', parser_class=ArgumentParserWithDefaults) ######################################### # create the individual tool parsers ######################################### ############# # browser ############# parser_browser = subparsers.add_parser('browser', help="""an adaptable genome browser with various track types""") parser_browser.add_argument('-c', '--chromosomeid', metavar = "Chr", dest = 'CHR', type = str, help = """The fasta sequence to observe. Use the header name of the fasta file without the '>' character""") parser_browser.add_argument('--dpi', metavar='dpi', default=600, type=int, help="""Change the dpi from the default 600 if you need it higher""") parser_browser.add_argument('--fileform', dest='fileform', metavar='STRING', choices=['png', 'pdf', 'eps', 'jpeg', 'jpg', 'pdf', 'pgf', 'ps', 'raw', 'rgba', 'svg', 'svgz', 'tif', 'tiff'], default=['png'], nargs='+', help='Which output format would you like? Def.=png') parser_browser.add_argument("--no_timestamp", action = 'store_true', help="""Turn off time stamps in the filename output.""") parser_browser.add_argument('-o', '--output-base-name', dest='BASENAME', help="""Specify a base name for the output file( s). The input file base name is the default.""") parser_browser.add_argument('--path', type=str, help="""Set an explicit filepath for the output. Only do this if you have selected one output type.""") parser_browser.add_argument('-p', '--plot_commands', dest='CMD', nargs = '+', help="""Write strings here to select what to plot. The format for each track is: ::