pax_global_header00006660000000000000000000000064135226116710014516gustar00rootroot0000000000000052 comment=20702772939bebdb477d20c4feceb680474a3330 deeptools_intervals-0.1.9/000077500000000000000000000000001352261167100156125ustar00rootroot00000000000000deeptools_intervals-0.1.9/.gitignore000066400000000000000000000000701352261167100175770ustar00rootroot00000000000000build/ dist/ deeptoolsintervals.egg-info/ *.o *.a *.pyc deeptools_intervals-0.1.9/.travis.yml000066400000000000000000000021731352261167100177260ustar00rootroot00000000000000language: c env: - TRAVIS_PYTHON_VERSION=2.7 - TRAVIS_PYTHON_VERSION=3.6 - TRAVIS_PYTHON_VERSION=3.7 os: - linux - osx before_install: - if [[ "$TRAVIS_OS_NAME" == "linux" ]] ; then curl https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh ; fi - if [[ "$TRAVIS_OS_NAME" == "osx" ]] ; then curl https://repo.continuum.io/miniconda/Miniconda3-latest-MacOSX-x86_64.sh -o miniconda.sh ; fi - bash miniconda.sh -b -p $HOME/miniconda - export PATH="$HOME/miniconda/bin:$PATH" - hash -r - conda config --set always_yes yes --set changeps1 no - conda config --add channels conda-forge - conda update -q conda - conda info -a install: - conda create -n foo -c conda-forge PYTHON=$TRAVIS_PYTHON_VERSION flake8 nose - source activate foo - if [[ "$TRAVIS_OS_NAME" == "osx" ]] ; then conda install -c conda-forge --yes libcxx libcxxabi cctools clang clang_osx-64 compiler-rt ld64 llvm llvm-lto-tapi ; fi - python ./setup.py install script: - flake8 . 
--exclude=build,deeptoolsintervals/__init__.py --ignore=E501,F403,E40,E722 - cd ~/ && nosetests --with-doctest deeptoolsintervals deeptools_intervals-0.1.9/LICENSE000066400000000000000000000021061352261167100166160ustar00rootroot00000000000000Copyright 2019 Max Planck Institute for Immunobiology and Epigenetics Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. deeptools_intervals-0.1.9/MANIFEST.in000066400000000000000000000003271352261167100173520ustar00rootroot00000000000000include README.md setup.py include LICENSE include deeptoolsintervals/tree.* include deeptoolsintervals/tree/*.c include deeptoolsintervals/tree/*.h include deeptoolsintervals/*.py include deeptoolsintervals/test/* deeptools_intervals-0.1.9/README.md000066400000000000000000000246771352261167100171110ustar00rootroot00000000000000For those curious, deepTools needs a new interval tree backend that support metadata associated with each interval. I previously made such a thing, called libGTF. 
Consequently, I'm just working on (A) a Python front-end for that and (B) some modifications specific to deepTools (namely, every interval needs an associated `deepTools_group` tag and exon bounds will be a new attribute associated with transcripts).

Note that murmur3.c and murmur3.h are C implementations of MurmurHash. The C implementation is from [Peter Scott](https://github.com/PeterScott/murmur3) and MurmurHash itself is by [Austin Appleby](https://code.google.com/p/smhasher/wiki/MurmurHash3). Both of these are in the public domain. kstring.h and kseq.h are from [Heng Li](http://lh3lh3.users.sourceforge.net/) and are available under an MIT license.

Usage
=====

The main class here is `GTF` (the `Enrichment` class, described further below, builds on it), and it has only one method that should ever be used, `findOverlaps`. Note that, as is the case in deepTools, this package attempts to convert between chromosome naming systems. Because the conversion may not always be obvious, this can fail.

The GTF class
-------------

To read one or more files into an interval tree, one initializes a new `GTF` object:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF("some_file.gtf")

Multiple files can also be used:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF(["some_file.gtf", "some_other_file.bed.gz"])

Files may optionally be compressed; the compression magic number is used to detect this. For GTF and BED12 files, exons are not stored by default. This can be changed with the `keepExons` option:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF(["some_file.gtf", "some_other_file.bed.gz"], keepExons=True)

The utility of this will be seen later. GTF and BED files may contain comments or browser lines at the beginning; these are ignored.

### Labels

It's often useful to have multiple groups of intervals. This can be accomplished by assigning a label to each interval. If multiple files are used, then this package will default to assigning the file name as a label to intervals in each input file.
Alternatively, labels can be included inside of files. For BED files, this is accomplished by following each group of entries with a `#` line holding its label:

    chr1 1 100
    chr1 150 200
    #My group
    chr1 300 400
    #My other group

These labels **MUST** be unique in BED files. If they are not, then each subsequent instance will have a suffix appended to ensure that it is unique. For GTF files, labels are included in the attribute column, by the addition of a `deepTools_group` key:value pair:

    chr1 havana transcript 11869 14409 . + . gene_id "ENSG00000223972"; transcript_id "ENST00000456328"; deepTools_group "group 1";

These labels do **NOT** need to be unique across files.

Labels can be overridden with the `labels` option:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF(["some_file.gtf", "some_other_file.bed.gz"], keepExons=True, labels=["foo", "bar", "quux", "sniggly"])

The number of provided labels **MUST** match the number of groups encountered. These labels are applied in the order that groups are encountered in the input files. So if in the above example both files contain two groups, then the following would produce the same results but with the labels swapped across files:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF(["some_other_file.bed.gz", "some_file.gtf"], keepExons=True, labels=["foo", "bar", "quux", "sniggly"])

Labels can also be replaced after the fact:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF(["some_file.gtf", "some_other_file.bed.gz", "some_file.gtf"], keepExons=True)
    >>> gtf.labels = ["foo", "bar", "quux", "sniggly"]

### GTF-specific options

GTF files come with three options specific to them: `exonID`, `transcriptID`, and `transcript_id_designator`. The "feature" column (column 3) in a GTF file denotes the type of feature an entry describes. By default, this package only looks at entries with `transcript` or `exon` (with `keepExons=True`) in the feature column. For some use cases, one might instead want to store CDS as exonic intervals or replace transcripts with genes; the `exonID` and `transcriptID` options select which feature values are used.
Transcripts are the primary entry used by this package and, consequently, each needs to have an associated transcript ID. Duplicate IDs are always ignored, since such a thing would be biologically nonsensical. In GTF files, the transcript ID is stored as `transcript_id "some ID";`. If, however, one changes the `transcriptID` value to something else, such as `gene`, this key:value pair may no longer be present or may not be unique. In such cases, it's beneficial to change the key portion via `transcript_id_designator`, for example to `gene_id`.

### Finding overlaps

Finding overlaps requires a chromosome, a start position, and an end position. As with BED files, these coordinates are 0-based half-open. By default, strand and overlap type are completely ignored. This can be overridden:

    >>> o = gtf.findOverlaps("chr1", 0, 100, strand="+", matchType=1, strandType=3)

This would search for intervals on the `+` strand (ignoring those on `.`, which would have additionally been returned had `strandType=1` been used) that are exactly [0, 100) on chromosome 1. Anyone interested in these more advanced overlap searching methods should look at the gtf.h file and the "libGTF" repository for examples.

It's often the case that a function looking for intervals does so by first dividing the genome into chunks and then sending each chunk to a processor for subsequent analysis. In such cases, it's convenient NOT to have several processors handle the same interval just because it overlaps multiple genomic bins. In these circumstances, the `trimOverlap` option can be set to `True`.
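
The bin-assignment idea behind `trimOverlap` can be sketched without the library. This is only an illustration under an assumption: judging from the package's doctests, where a query starting at 12000 with `trimOverlap=True` drops the transcript starting at 11868, an interval seems to be reported only by the bin that contains its start. The helper `assign_to_bins` and the toy coordinates are hypothetical and not part of deeptoolsintervals.

```python
def assign_to_bins(intervals, chrom_length, bin_size):
    """Split [0, chrom_length) into bins and report each interval once.

    Hypothetical helper: intervals are 0-based half-open (start, end)
    tuples. An interval is kept for a bin only if it overlaps the bin
    AND starts inside it, which is the assumed trimOverlap behaviour.
    """
    results = []
    for bin_start in range(0, chrom_length, bin_size):
        bin_end = min(bin_start + bin_size, chrom_length)
        hits = [
            (start, end)
            for start, end in intervals
            if start < bin_end and end > bin_start  # half-open overlap test
            and start >= bin_start                  # drop if it began in an earlier bin
        ]
        results.append((bin_start, bin_end, hits))
    return results


# Toy transcript bounds, sorted by start position
intervals = [(11868, 14409), (12009, 13670), (14403, 29570), (17368, 17436)]
bins = assign_to_bins(intervals, chrom_length=30000, bin_size=12000)

for bin_start, bin_end, hits in bins:
    print(bin_start, bin_end, hits)
```

With this rule, the interval `(14403, 29570)` overlaps both the second and third bin but is reported only for the second, so each worker processes a disjoint set of intervals.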
The output of `findOverlaps()` is a list of tuples:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF("foo.gtf", keepExons=True)
    >>> gtf.findOverlaps("chr1", 1, 20000)
    [(11868, 14409, 'ENST00000456328', 'group 1', [(11868, 12227), (12612, 12721), (13220, 14409)], '.'), (12009, 13670, 'ENST00000450305', 'group 1', [(12009, 12057), (12178, 12227), (12612, 12697), (12974, 13052), (13220, 13374), (13452, 13670)], '.'), (14403, 29570, 'ENST00000488147', 'group 1', [(14403, 14501), (15004, 15038), (15795, 15947), (16606, 16765), (16857, 17055), (17232, 17368), (17605, 17742), (17914, 18061), (18267, 18366), (24737, 24891), (29533, 29570)], '.'), (17368, 17436, 'ENST00000619216', 'group 2', [(17368, 17436)], '.')]

Each tuple contains the following members (in order): 0-based starting position, 1-based end position, ID (the transcript ID for GTF files, column 4 for BED6/12 files and a string composed of the interval's coordinates for BED3 files), a group label, a sorted list of exonic bounds, and the score (column 6 in GTF files and column 5 in BED files). If the input file type does not provide exonic bounds, or `keepExons=True` was not used, the list of bounds holds a single entry identical to the start and end of the tuple itself:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF("foo.gtf")
    >>> gtf.findOverlaps("chr1", 1, 20000)
    [(11868, 14409, 'ENST00000456328', 'group 1', [(11868, 14409)], '.'), (12009, 13670, 'ENST00000450305', 'group 1', [(12009, 13670)], '.'), (14403, 29570, 'ENST00000488147', 'group 1', [(14403, 29570)], '.'), (17368, 17436, 'ENST00000619216', 'group 2', [(17368, 17436)], '.')]

In some cases, it's desirable to have the group labels be numeric, since the regions may be used for further processing and the results sorted or grouped accordingly.
The `numericGroups` argument can be used to facilitate this:

    >>> from deeptoolsintervals import GTF
    >>> gtf = GTF("foo.gtf")
    >>> gtf.findOverlaps("chr1", 1, 20000, numericGroups=True)
    [(11868, 14409, 'ENST00000456328', 0, [(11868, 14409)], '.'), (12009, 13670, 'ENST00000450305', 0, [(12009, 13670)], '.'), (14403, 29570, 'ENST00000488147', 0, [(14403, 29570)], '.'), (17368, 17436, 'ENST00000619216', 1, [(17368, 17436)], '.')]

The Enrichment class
--------------------

The `Enrichment` class is a modification of the base `GTF` class, aimed at querying feature types in a region. Creation of the class from one or more BED/GTF files is also similar to the `GTF` class:

    >>> from deeptoolsintervals import Enrichment
    >>> gtf = Enrichment(["foo.gtf", "bar.bed"])

For GTF files, the feature type is the 3rd column. For BED files, the feature type is the file name, though this can be changed with the `labels` option:

    >>> from deeptoolsintervals import Enrichment
    >>> gtf = Enrichment(["foo.gtf", "bar.bed"], labels=["this will be ignored", "peaks"])

For GTF files, the label is ignored, but for the sake of simplicity, if labels are specified there must be at least as many as there are files. Note that, unlike with `GTF` objects, you cannot change labels after creation of an `Enrichment` object.

All entries in BED and GTF files are stored. For BED12 files, only columns 2/3 are used as region bounds by default. This can be modified with the `keepExons` option:

    >>> from deeptoolsintervals import Enrichment
    >>> gtf = Enrichment("bar.bed12", keepExons=True)

All other file types will ignore the `keepExons` option, so this can still be specified with a mix of BED12 and other file types.

### Finding overlaps

Finding overlaps of an `Enrichment` object is similar to that with a `GTF` object. Once again, the `findOverlaps()` method is used, though the `trimOverlap`, `numericGroups`, and `includeStrand` options are not present.
Further, instead of a single `start` and `end` value, a list of tuples is used. This last difference facilitates finding overlaps of spliced genes, since the pysam `get_blocks()` method returns this type of data: >>> from deeptoolsintervals import Enrichment >>> gtf = Enrichment("GRCh38.84.gtf.gz") >>> gtf.findOverlaps("1", [(65500, 65600), (69900, 70000)]) frozenset(['start_codon', 'transcript', 'gene', 'exon', 'CDS']) The output is a set containing all overlapped feature types. This is convenient for quick summarization. ## Enrichment of custom attributes As of deeptoolsintervals 0.1.8, the `Enrichment` class is able to use a custom attribute key instead of the feature type. This allows you to find overlaps of things like the gene biotype: >>> from deeptoolsintervals import Enrichment >>> gtf = Enrichment("GRCh38.84.gtf.gz", keepExons=True, attributeKey="gene_biotype") >>> gtf.findOverlaps("1", [(0, 2000000)]) frozenset(['miRNA', 'group 1', 'group 2', 'transcribed_unprocessed_pseudogene', 'processed_pseudogene', 'lincRNA', 'unprocessed_pseudogene', 'protein_coding'])) deeptools_intervals-0.1.9/deeptoolsintervals/000077500000000000000000000000001352261167100215405ustar00rootroot00000000000000deeptools_intervals-0.1.9/deeptoolsintervals/__init__.py000066400000000000000000000001361352261167100236510ustar00rootroot00000000000000from deeptoolsintervals.parse import GTF from deeptoolsintervals.enrichment import Enrichment deeptools_intervals-0.1.9/deeptoolsintervals/enrichment.py000066400000000000000000000264651352261167100242630ustar00rootroot00000000000000#!/usr/bin/env python from deeptoolsintervals import tree from deeptoolsintervals.parse import GTF, openPossiblyCompressed import sys from os.path import basename import csv class Enrichment(GTF): """ This is like the GTF object, but has no groups or exons (but a "features" list). BED files are given a 'peaks' feature and GTF files use column 3. 
""" def parseBEDcore(self, line, ncols, feature): strand = 3 cols = line.split("\t") if int(cols[1]) < 0: cols[1] = 0 if int(cols[1]) >= int(cols[2]): sys.stderr.write("Warning: {0}:{1}-{2} is an invalid BED interval! Ignoring it.\n".format(cols[0], cols[1], cols[2])) return # BED6/BED12: set name and strand score = '.' if ncols > 3: if cols[5] == '+': strand = 0 elif cols[5] == '-': strand = 1 score = cols[4] if ncols != 12 or self.keepExons is False: self.tree.addEnrichmentEntry(self.mungeChromosome(cols[0]), int(cols[1]), int(cols[2]), strand, score, feature) else: starts = cols[11].strip(",").split(",") widths = cols[10].strip(",").split(",") starts = [int(x) + int(cols[1]) for x in starts] ends = [x + int(y) for x, y in zip(starts, widths)] for x, y in zip(starts, ends): self.tree.addEnrichmentEntry(self.mungeChromosome(cols[0]), x, y, strand, score, feature) def parseBED(self, fp, line, ncols=3, feature='peaks', labelColumn=None): """ parse a BED file. The default feature label is 'peaks' fp: A python file pointer line: The first line ncols: The number of columns to care about feature: The feature label labelColumn: If this isn't None, it overrides the 'feature' option >>> from deeptoolsintervals import enrichment >>> from os.path import dirname >>> gtf = enrichment.Enrichment("{0}/test/GRCh38.84.bed".format(dirname(enrichment.__file__)), keepExons=True) >>> o = gtf.findOverlaps("1", [(1, 3000000)]) >>> assert(o == frozenset(['GRCh38.84.bed'])) >>> o = gtf.findOverlaps("chr1", [(1, 3000000)]) >>> assert(o == frozenset(['GRCh38.84.bed'])) >>> gtf = enrichment.Enrichment("{0}/test/GRCh38.84.bed".format(dirname(enrichment.__file__)), keepExons=True, attributeKey="gene_biotype") >>> o = gtf.findOverlaps("1", [(1, 3000000)]) >>> assert(o == frozenset(['None'])) """ # Handle the first line if labelColumn is not None: cols = line.split("\t") feature = cols.pop(labelColumn) line = "\t".join(cols) self.parseBEDcore(line, ncols, feature) if feature not in 
self.features: self.features.append(feature) # iterate over the remaining lines for line in fp: if not isinstance(line, str): line = line.decode('ascii') line = line.strip() if len(line) == 0: # Apparently this happens, some people seem to like trying to break things continue if line.startswith("#"): continue else: if labelColumn is not None: cols = line.split("\t") feature = cols.pop(labelColumn) line = "\t".join(cols) self.parseBEDcore(line, ncols, feature) if feature not in self.features: self.features.append(feature) def parseGTF(self, fp, line): """ >>> from deeptoolsintervals import enrichment >>> from os.path import dirname >>> gtf = enrichment.Enrichment("{0}/test/GRCh38.84.gtf.gz".format(dirname(enrichment.__file__)), keepExons=True) >>> o = gtf.findOverlaps("1", [(0, 2000000)]) >>> assert(o == frozenset(['start_codon', 'exon', 'stop_codon', 'CDS', 'gene', 'transcript', 'group 1', 'group 2'])) >>> gtf = enrichment.Enrichment("{0}/test/GRCh38.84.gtf.gz".format(dirname(enrichment.__file__)), keepExons=True, attributeKey="gene_biotype") >>> o = gtf.findOverlaps("1", [(0, 2000000)]) >>> assert(o == frozenset(['miRNA', 'group 1', 'group 2', 'transcribed_unprocessed_pseudogene', 'processed_pseudogene', 'lincRNA', 'unprocessed_pseudogene', 'protein_coding'])) """ # Handle the first line cols = line.split("\t") strand = 3 if cols[6] == '+': strand = 0 elif cols[6] == '-': strand = 1 feature = cols[2] if self.attributeKey: feature = "None" if self.attributeKey in cols[8]: s = next(csv.reader([cols[8]], delimiter=' ')) if s[-1] != self.attributeKey: feature = s[s.index(self.attributeKey) + 1].rstrip(";") if "deepTools_group" in cols[8]: s = next(csv.reader([cols[8]], delimiter=' ')) if s[-1] != "deepTools_group": feature = s[s.index("deepTools_group") + 1].rstrip(";") self.tree.addEnrichmentEntry(self.mungeChromosome(cols[0]), int(cols[3]) - 1, int(cols[4]), strand, cols[5], feature) if feature not in self.features: self.features.append(feature) # Handle the 
remaining lines for line in fp: if not isinstance(line, str): line = line.decode('ascii') if not line.startswith('#'): cols = line.split("\t") if len(cols) == 0: continue strand = 3 if cols[6] == '+': strand = 0 elif cols[6] == '-': strand = 1 feature = cols[2] if self.attributeKey: feature = "None" if self.attributeKey in cols[8]: s = next(csv.reader([cols[8]], delimiter=' ')) if s[-1] != self.attributeKey: feature = s[s.index(self.attributeKey) + 1].rstrip(";") if "deepTools_group" in cols[8]: s = next(csv.reader([cols[8]], delimiter=" ")) if s[-1] != "deepTools_group": feature = s[s.index("deepTools_group") + 1].rstrip(";") self.tree.addEnrichmentEntry(self.mungeChromosome(cols[0]), int(cols[3]) - 1, int(cols[4]), strand, cols[5], feature) if feature not in self.features: self.features.append(feature) def __init__(self, fnames, keepExons=False, attributeKey=None, labels=None, verbose=False): """ Driver function to actually parse files. The steps are as follows: 1) skip to the first non-comment line 2) Infer the type from that 3) Call a type-specific processing function accordingly * These call the underlying C code for storage * These handle chromsome name conversions (python-level) Required inputs are as follows: fnames: A list of (possibly compressed with gzip or bzip2) GTF or BED files. Optional input is: keepExons: For BED12 files, exons are ignored by default. attributeKey: If specified, ignore the "feature" column and instead parse the value of the given attribute key. This can be used to allow computing overlaps of gene_biotype and other generic tags. If the tag is missing, "None" is used. Note that the presence of a deepTools_group tag will always override this! labels: Override the feature labels supplied in the file(s). Note that this might instead be replaced later in the .features attribute. 
verbose: Whether to print warnings (default: False) """ self.fname = [] self.filename = "" self.chroms = [] self.features = [] self.tree = tree.initTree() self.keepExons = keepExons self.verbose = verbose self.attributeKey = attributeKey if not isinstance(fnames, list): fnames = [fnames] # Load the files for labelIdx, fname in enumerate(fnames): self.filename = fname fp = openPossiblyCompressed(fname) line, labelColumn = self.firstNonComment(fp) if line is None: # This will only ever happen if a file is empty or just has a header/comment continue line = line.strip() ftype = self.inferType(fp, line, labelColumn) if ftype != 'GTF' and labels is not None: assert(len(labels) > labelIdx) bname = labels[labelIdx] else: bname = basename(fname) feature = "None" if attributeKey is not None else bname if ftype == 'GTF': self.parseGTF(fp, line) elif ftype == 'BED3': self.parseBED(fp, line, 3, feature=feature, labelColumn=labelColumn) elif ftype == 'BED6': self.parseBED(fp, line, 6, feature=feature, labelColumn=labelColumn) else: self.parseBED(fp, line, 12, feature=feature, labelColumn=labelColumn) fp.close() # Sanity check if self.tree.countEntries() == 0: raise RuntimeError("None of the input BED/GTF files had valid regions") if len(self.features) == 0: raise RuntimeError("There were no valid feature labels!") # vine -> tree self.tree.finish() # findOverlaps() def findOverlaps(self, chrom, blocks, strand=".", matchType=0, strandType=0): """ Given a chromosome and start/end coordinates with an optional strand, return a frozenset of the overlap features. If there are no overlaps, return None. This function allows stranded searching, though the default is to ignore strand! 
The non-obvious options are defined in gtf.h: matchType: 0, GTF_MATCH_ANY 1, GTF_MATCH_EXACT 2, GTF_MATCH_CONTAIN 3, GTF_MATCH_WITHIN 4, GTF_MATCH_START 5, GTF_MATCH_END strandType: 0, GTF_IGNORE_STRAND 1, GTF_SAME_STRAND 2, GTF_OPPOSITE_STRAND 3, GTF_EXACT_SAME_STRAND """ chrom = self.mungeChromosome(chrom, append=False) if not chrom: return None # Ensure that this is a tree and has entries if self.tree.countEntries() == 0: return None if not self.tree.isTree(): raise RuntimeError('The GTFtree is actually a vine! There must have been an error during creation (this shouldn\'t happen)...') # Convert the strand to a number if strand == '+': strand = 1 elif strand == '-': strand = 2 else: strand = 0 oset = frozenset() for block in blocks: overlaps = self.tree.findOverlappingFeatures(chrom, int(block[0]), int(block[1]), strand, matchType, strandType) if overlaps is not None: oset = oset.union(frozenset(overlaps)) return oset deeptools_intervals-0.1.9/deeptoolsintervals/parse.py000066400000000000000000001020031352261167100232200ustar00rootroot00000000000000#!/usr/bin/env python from deeptoolsintervals import tree import sys import gzip try: import bz2 supportsBZ2 = True except: supportsBZ2 = False import os.path import csv def getNext(fp): """ Sometimes we need to decode, sometimes not """ line = fp.readline() if isinstance(line, str): return line return line.decode('ascii') def seemsLikeGTF(cols): """ Does a line look like it could be from a GTF file? Column contents must be: 3: int 4: int 5: '.' or float 6: '+', '-', or '.' 7: 0, 1 or 2 8: matches the attribute regular expression """ try: int(cols[3]) int(cols[4]) if cols[5] != '.': float(cols[5]) cols[6] in ['+', '-', '.'] if cols[7] != '.': int(cols[7]) in [0, 1, 2] s = next(csv.reader([cols[8]], delimiter=' ')) assert("gene_id" in s) assert(s[-1] != "gene_id") return True except: return False def findRandomLabel(labels, name): """ Because some people are too clever by half, ensure that group labels are unique... 
""" if name not in labels: return name # This is what the heatmapper.py did to ensure unique names i = 0 while True: i += 1 nameTry = name + "_r" + str(i) if nameTry not in labels: return nameTry def parseExonBounds(start, end, n, sizes, offsets): """ Parse the last 2 columns of a BED12 file and return a list of tuples with (exon start, exon end) entries. If the line is malformed, issue a warning and return (start, end) """ offsets = offsets.strip(",").split(",") sizes = sizes.strip(",").split(",") offsets = offsets[0:n] sizes = sizes[0:n] try: starts = [start + int(x) for x in offsets] ends = [start + int(x) + int(y) for x, y in zip(offsets, sizes)] except: sys.stderr.write("Warning: Received an invalid exon offset ({0}) or size ({1}), using the entry bounds instead ({2}-{3})\n".format(offsets, sizes, start, end)) return [(start, end)] if len(offsets) < n or len(sizes) < n: sys.stderr.write("Warning: There were too few exon start/end offsets ({0}) or sizes ({1}), using the entry bounds instead ({2}-{3})\n".format(offsets, sizes, start, end)) return [(start, end)] return [(x, y) for x, y in zip(starts, ends)] def openPossiblyCompressed(fname): """ A wrapper to open gzip/bzip/uncompressed files """ mode = "rU" modeb = "rbU" if sys.version_info[0] >= 3: mode = "r" modeb = "rb" with open(fname, "rb") as f: first3 = bytes(f.read(3)) if first3 == b"\x1f\x8b\x08": return gzip.open(fname, modeb) elif first3 == b"\x42\x5a\x68" and supportsBZ2: return bz2.BZ2File(fname, modeb) else: return open(fname, mode) def getLabel(line): """ Split by tabs and return the index of "deepTools_group" (or None) """ cols = line.strip().split("\t") if "deepTools_group" in cols: return cols.index("deepTools_group") return None class GTF(object): """ A class to hold an interval tree and its associated functions >>> from deeptoolsintervals import parse >>> from os.path import dirname >>> gtf = parse.GTF("{0}/test/GRCh38.84.gtf.gz".format(dirname(parse.__file__)), keepExons=True) >>> 
gtf.findOverlaps("1", 1, 20000) [(11868, 14409, 'ENST00000456328', 'group 1', [(11868, 12227), (12612, 12721), (13220, 14409)], '.'), (12009, 13670, 'ENST00000450305', 'group 1', [(12009, 12057), (12178, 12227), (12612, 12697), (12974, 13052), (13220, 13374), (13452, 13670)], '.'), (14403, 29570, 'ENST00000488147', 'group 1', [(14403, 14501), (15004, 15038), (15795, 15947), (16606, 16765), (16857, 17055), (17232, 17368), (17605, 17742), (17914, 18061), (18267, 18366), (24737, 24891), (29533, 29570)], '.'), (17368, 17436, 'ENST00000619216', 'group 2', [(17368, 17436)], '.')] >>> gtf = parse.GTF("{0}/test/GRCh38.84.gtf.gz".format(dirname(parse.__file__))) >>> gtf.findOverlaps("1", 1, 20000) [(11868, 14409, 'ENST00000456328', 'group 1', [(11868, 14409)], '.'), (12009, 13670, 'ENST00000450305', 'group 1', [(12009, 13670)], '.'), (14403, 29570, 'ENST00000488147', 'group 1', [(14403, 29570)], '.'), (17368, 17436, 'ENST00000619216', 'group 2', [(17368, 17436)], '.')] >>> gtf.findOverlaps("1", 12000, 20000, trimOverlap=True) [(12009, 13670, 'ENST00000450305', 'group 1', [(12009, 13670)], '.'), (14403, 29570, 'ENST00000488147', 'group 1', [(14403, 29570)], '.'), (17368, 17436, 'ENST00000619216', 'group 2', [(17368, 17436)], '.')] >>> gtf.findOverlaps("1", 1, 20000, numericGroups=True, includeStrand=True) [(11868, 14409, 'ENST00000456328', 0, [(11868, 14409)], '+', '.'), (12009, 13670, 'ENST00000450305', 0, [(12009, 13670)], '+', '.'), (14403, 29570, 'ENST00000488147', 0, [(14403, 29570)], '-', '.'), (17368, 17436, 'ENST00000619216', 1, [(17368, 17436)], '-', '.')] """ def firstNonComment(self, fp): """ Skip lines at the beginning of a file starting with #, browser, or track. 
Returns a tuple of the first non-comment line and the column holding the group label (if it exists) """ line = getNext(fp) labelColumn = None try: while line.startswith("#") or line.startswith('track') or line.startswith('browser'): if labelColumn is None: labelColumn = getLabel(line) line = getNext(fp) except: sys.stderr.write("Warning, {0} was empty\n".format(self.filename)) return None return line, labelColumn def inferType(self, fp, line, labelColumn=None): """ Attempt to infer a file type from a single line. This is largely based on the number of columns plus looking for "gene_id". """ subtract = 0 if labelColumn is not None: subtract = 1 cols = line.split("\t") if len(cols) - subtract < 3: raise RuntimeError('{0} does not seem to be a recognized file type!'.format(self.filename)) elif len(cols) - subtract == 3: return 'BED3' elif len(cols) - subtract < 6: if self.verbose: sys.stderr.write("Warning, {0} has an abnormal number of fields. Assuming BED3 format.\n".format(self.filename)) return 'BED3' elif len(cols) - subtract == 6: return 'BED6' elif len(cols) and seemsLikeGTF(cols): return 'GTF' elif len(cols) - subtract == 12: return 'BED12' elif len(cols) - subtract < 12: if self.verbose: sys.stderr.write("Warning, {0} has an abnormal format. Assuming BED6 format.\n".format(self.filename)) return 'BED6' else: if self.verbose: sys.stderr.write("Warning, {0} has an abnormal format. 
Assuming BED12 format.\n".format(self.filename)) return 'BED12' def mungeChromosome(self, chrom, append=True): """ Return the chromosome name, possibly munged to match one already found in the chromosome dictionary """ if chrom in self.chroms: return chrom # chrM <-> MT and chr1 <-> 1 conversions if chrom == "MT" and "chrM" in self.chroms: chrom = "chrM" elif chrom == "chrM" and "MT" in self.chroms: chrom = "MT" elif chrom.startswith("chr") and len(chrom) > 3 and chrom[3:] in self.chroms: chrom = chrom[3:] elif "chr" + chrom in self.chroms: chrom = "chr" + chrom if append: self.chroms.append(chrom) return chrom def parseBEDcore(self, line, ncols): """ Returns True if the entry was added, otherwise False >>> from deeptoolsintervals import parse >>> from os.path import dirname >>> gtf = parse.GTF("{0}/test/GRCh38.84.bed12.bz2".format(dirname(parse.__file__)), keepExons=True, labels=["foo"]) >>> gtf.findOverlaps("1", 1, 20000) [(11868, 14409, 'ENST00000456328.2', 'foo', [(11868, 12227), (12612, 12721), (13220, 14409)], 0.0), (12009, 13670, 'ENST00000450305.2', 'foo', [(12009, 12057), (12178, 12227), (12612, 12697), (12974, 13052), (13220, 13374), (13452, 13670)], 0.0), (14403, 29570, 'ENST00000488147.1', 'foo', [(14403, 14501), (15004, 15038), (15795, 15947), (16606, 16765), (16857, 17055), (17232, 17368), (17605, 17742), (17914, 18061), (18267, 18366), (24737, 24891), (29533, 29570)], 0.0), (17368, 17436, 'ENST00000619216.1', 'foo', [(17368, 17436)], 0.0)] """ strand = 3 cols = line.split("\t") name = "{0}:{1}-{2}".format(cols[0], cols[1], cols[2]) if int(cols[1]) < 0: cols[1] = 0 if int(cols[1]) >= int(cols[2]): sys.stderr.write("Warning: {0}:{1}-{2} is an invalid BED interval! Ignoring it.\n".format(cols[0], cols[1], cols[2])) return # BED6/BED12: set name and strand score = '.' 
        if ncols > 3:
            name = cols[3]
            if cols[5] == '+':
                strand = 0
            elif cols[5] == '-':
                strand = 1
            score = cols[4]

        # Ensure that the name is unique
        name = findRandomLabel(self.exons[self.labelIdx], name)
        self.tree.addEntry(self.mungeChromosome(cols[0]), int(cols[1]), int(cols[2]), name, strand, self.labelIdx, score)
        if ncols != 12 or self.keepExons is False:
            self.exons[self.labelIdx][name] = [(int(cols[1]), int(cols[2]))]
        else:
            assert(len(cols) == 12)
            self.exons[self.labelIdx][name] = parseExonBounds(int(cols[1]), int(cols[2]), int(cols[9]), cols[10], cols[11])

    def parseBED(self, fp, line, ncols=3, labelColumn=None):
        """
        parse a BED file. The default group label is the file name.

        fp:    A python file pointer
        line:  The first line
        ncols: The number of columns to care about

        >>> from deeptoolsintervals import parse
        >>> from os.path import dirname, basename
        >>> gtf = parse.GTF("{0}/test/GRCh38.84.bed6".format(dirname(parse.__file__)), keepExons=True)
        >>> gtf.findOverlaps("1", 1, 20000)
        [(11868, 14409, 'ENST00000456328.2', 'group 1', [(11868, 14409)], 0.0), (12009, 13670, 'ENST00000450305.2', 'group 1', [(12009, 13670)], 0.0), (14403, 29570, 'ENST00000488147.1', 'group 1', [(14403, 29570)], 0.0), (17368, 17436, 'ENST00000619216.1', 'group 1', [(17368, 17436)], 0.0)]
        >>> gtf = parse.GTF("{0}/test/GRCh38.84.bed".format(dirname(parse.__file__)), keepExons=True, labels=["foo", "bar", "quux", "sniggly"])
        >>> gtf.findOverlaps("1", 1, 20000)
        [(11868, 14409, '1:11868-14409', 'foo', [(11868, 14409)], '.'), (12009, 13670, '1:12009-13670', 'foo', [(12009, 13670)], '.'), (14403, 29570, '1:14403-29570', 'foo', [(14403, 29570)], '.'), (17368, 17436, '1:17368-17436', 'foo', [(17368, 17436)], '.')]

        Test having a header in one file, but not another:
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.labels.bed".format(dirname(parse.__file__)), "{0}/test/GRCh38.84.bed2".format(dirname(parse.__file__))])
        >>> overlaps = gtf.findOverlaps("1", 1, 30000000)
        >>> labels = dict()
        >>> for o in overlaps:
        ...     if basename(o[3]) not in labels:
        ...         labels[basename(o[3])] = 0
        ...     labels[basename(o[3])] += 1
        >>> assert(labels['group 1'] == 4)
        >>> assert(labels['group 2'] == 9)
        >>> assert(labels['group 3'] == 7)
        >>> assert(labels['group 4'] == 1)
        >>> assert(labels['group 1_r1'] == 4)
        >>> assert(labels['group2'] == 9)
        >>> assert(labels['group 3'] == 7)
        >>> assert(labels['GRCh38.84.bed2'] == 1)
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.bed2".format(dirname(parse.__file__)), "{0}/test/GRCh38.84.labels.bed".format(dirname(parse.__file__))])
        >>> overlaps = gtf.findOverlaps("1", 1, 30000000)
        >>> labels = dict()
        >>> for o in overlaps:
        ...     if basename(o[3]) not in labels:
        ...         labels[basename(o[3])] = 0
        ...     labels[basename(o[3])] += 1
        >>> assert(labels['group 1'] == 8)
        >>> assert(labels['group 2'] == 9)
        >>> assert(labels['group 3'] == 14)
        >>> assert(labels['group 4'] == 1)
        >>> assert(labels['group2'] == 9)
        >>> assert(labels['GRCh38.84.bed2'] == 1)
        """
        groupLabelsFound = 0
        groupEntries = 0
        # Handle the first line
        if labelColumn is not None:
            cols = line.split("\t")
            label = cols.pop(labelColumn)
            line = "\t".join(cols)
            if label in self.labels:
                self.labelIdx = self.labels.index(label)
            else:
                self.labels.append(label)
                self.exons.append(dict())
                self.labelIdx = len(self.labels) - 1
        else:
            self.exons.append(dict())
        self.parseBEDcore(line, ncols)
        groupEntries = 1

        # iterate over the remaining lines
        for line in fp:
            if not isinstance(line, str):
                line = line.decode('ascii')
            line = line.strip()
            if len(line) == 0:
                # Apparently this happens, some people seem to like trying to break things
                continue
            if line.startswith("#") and labelColumn is None:
                # If there was a previous group AND it had no entries then remove it
                if groupLabelsFound > 0:
                    if groupEntries == 0:
                        sys.stderr.write("Warning, the '{0}' group had no valid entries! Removing it.\n".format(self.labels[self.labelIdx]))
                        del self.labels[-1]
                        groupLabelsFound -= 1
                        self.labelIdx -= 1

                label = line[1:].strip()
                if len(label):
                    # Guard against duplicate group labels
                    self.labels.append(findRandomLabel(self.labels, label))
                else:
                    # I'm sure someone will try an empty label...
                    self.labels.append(findRandomLabel(self.labels, os.path.basename(self.filename)))
                self.labelIdx += 1
                self.exons.append(dict())
                groupLabelsFound += 1
                groupEntries = 0
            elif line.startswith("#") and labelColumn is not None:
                continue
            else:
                if labelColumn is not None:
                    cols = line.split("\t")
                    label = cols.pop(labelColumn)
                    line = "\t".join(cols)
                    if label in self.labels:
                        self.labelIdx = self.labels.index(label)
                    else:
                        self.labels.append(label)
                        self.exons.append(dict())
                        self.labelIdx = len(self.labels) - 1
                self.parseBEDcore(line, ncols)
                if labelColumn is None:
                    groupEntries += 1

        if groupEntries > 0 and labelColumn is None:
            if self.defaultGroup is not None:
                self.labels.append(findRandomLabel(self.labels, self.defaultGroup))
            else:
                self.labels.append(findRandomLabel(self.labels, os.path.basename(self.filename)))

        # Reset self.labelIdx
        self.labelIdx = len(self.labels)

    def parseGTFtranscript(self, cols, label):
        """
        Parse and add a transcript entry
        """
        # Check the column count first so that the index lookups below cannot fail
        if len(cols) < 9:
            sys.stderr.write("Warning: non-GTF line encountered! {0}\n".format("\t".join(cols)))
            return

        if int(cols[3]) - 1 < 0:
            sys.stderr.write("Warning: Invalid start in '{0}', skipping\n".format("\t".join(cols)))
            return

        s = next(csv.reader([cols[8]], delimiter=' '))
        if "deepTools_group" in s and s[-1] != "deepTools_group":
            label = s[s.index("deepTools_group") + 1].rstrip(";")
        elif self.defaultGroup is not None:
            label = self.defaultGroup

        if self.transcript_id_designator not in s or s[-1] == self.transcript_id_designator:
            sys.stderr.write("Warning: {0} is malformed!\n".format("\t".join(cols)))
            return

        if int(cols[3]) > int(cols[4]) or int(cols[3]) < 1:
            sys.stderr.write("Warning: {0}:{1}-{2} is an invalid GTF interval! Ignoring it.\n".format(cols[0], cols[3], cols[4]))
            return

        strand = 3
        if cols[6] == '+':
            strand = 0
        elif cols[6] == '-':
            strand = 1

        score = cols[5]

        # Get the label index
        if label not in self.labels:
            self.labels.append(label)
            self.exons.append(dict())
        self.labelIdx = self.labels.index(label)

        # Ensure unique names within GTF files
        name = s[s.index(self.transcript_id_designator) + 1].rstrip(";")
        if name in self.exons[self.labelIdx]:
            sys.stderr.write("Warning: {0} occurs more than once! Only using the first instance.\n".format(name))
            self.transcriptIDduplicated.append(name)
            return

        chrom = self.mungeChromosome(cols[0])
        self.tree.addEntry(chrom, int(cols[3]) - 1, int(cols[4]), name, strand, self.labelIdx, score)

        # Exon bounds placeholder
        self.exons[self.labelIdx][name] = []

    def parseGTFexon(self, cols):
        """
        Parse an exon entry and add it to the transcript hash
        """
        if int(cols[3]) - 1 < 0:
            sys.stderr.write("Warning: Invalid start in '{0}', skipping\n".format("\t".join(cols)))
            return

        s = next(csv.reader([cols[8]], delimiter=' '))
        if self.transcript_id_designator not in s or s[-1] == self.transcript_id_designator:
            sys.stderr.write("Warning: {0} is malformed!\n".format("\t".join(cols)))
            return

        name = s[s.index(self.transcript_id_designator) + 1].rstrip(";")
        if name in self.transcriptIDduplicated:
            return
        if name not in self.exons[self.labelIdx]:
            self.exons[self.labelIdx][name] = []

        self.exons[self.labelIdx][name].append((int(cols[3]) - 1, int(cols[4])))

    def parseGTF(self, fp, line):
        """
        parse a GTF file.
        Note that a single label will be used for every entry in a file that isn't explicitly labeled with a deepTools_group key:value pair in the last column

        fp:   A python file pointer
        line: The first non-comment line

        >>> from deeptoolsintervals import parse
        >>> from os.path import dirname, basename
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.gtf.gz".format(dirname(parse.__file__)), "{0}/test/GRCh38.84.2.gtf.gz".format(dirname(parse.__file__))], keepExons=True)
        >>> overlaps = gtf.findOverlaps("1", 1, 20000000)
        >>> labels = dict()
        >>> for o in overlaps:
        ...     if basename(o[3]) not in labels:
        ...         labels[basename(o[3])] = 0
        ...     labels[basename(o[3])] += 1
        >>> assert(labels['GRCh38.84.gtf.gz'] == 17)
        >>> assert(labels['GRCh38.84.2.gtf.gz'] == 6)
        >>> assert(labels['group 1'] == 5)
        >>> assert(labels['group 2'] == 3)

        Test GTF and a BED file
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.gtf.gz".format(dirname(parse.__file__)), "{0}/test/GRCh38.84.bed".format(dirname(parse.__file__))])
        >>> overlaps = gtf.findOverlaps("1", 1, 20000000)
        >>> labels = dict()
        >>> for o in overlaps:
        ...     if basename(o[3]) not in labels:
        ...         labels[basename(o[3])] = 0
        ...     labels[basename(o[3])] += 1
        >>> assert(labels['GRCh38.84.gtf.gz'] == 17)
        >>> assert(labels['group 1'] == 3)
        >>> assert(labels['group 2'] == 1)
        >>> assert(labels['group 1_r1'] == 4)
        >>> assert(labels['group2'] == 9)
        >>> assert(labels['group 3'] == 7)
        >>> assert(labels['GRCh38.84.bed'] == 1)
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.bed".format(dirname(parse.__file__)), "{0}/test/GRCh38.84.gtf.gz".format(dirname(parse.__file__))])
        >>> overlaps = gtf.findOverlaps("1", 1, 20000000)
        >>> labels = dict()
        >>> for o in overlaps:
        ...     if basename(o[3]) not in labels:
        ...         labels[basename(o[3])] = 0
        ...     labels[basename(o[3])] += 1
        >>> assert(labels['GRCh38.84.gtf.gz'] == 17)
        >>> assert(labels['group 1'] == 7)
        >>> assert(labels['group 2'] == 1)
        >>> assert(labels['group2'] == 9)
        >>> assert(labels['group 3'] == 7)
        >>> assert(labels['GRCh38.84.bed'] == 1)
        """
        file_label = findRandomLabel(self.labels, os.path.basename(self.filename))

        # Handle the first line
        cols = line.split("\t")
        if cols[2].lower() == self.transcriptID.lower():
            self.parseGTFtranscript(cols, file_label)
        elif cols[2].lower() == self.exonID.lower():
            self.parseGTFexon(cols)

        # Handle the remaining lines
        for line in fp:
            if not isinstance(line, str):
                line = line.decode('ascii')
            if not line.startswith('#'):
                cols = line.split("\t")
                if len(cols) == 0:
                    continue

                if cols[2].lower() == self.transcriptID.lower():
                    self.parseGTFtranscript(cols, file_label)
                elif cols[2].lower() == self.exonID.lower() and self.keepExons is True:
                    self.parseGTFexon(cols)

        # Reset self.labelIdx
        self.labelIdx = len(self.labels)

    def __init__(self, fnames, exonID="exon", transcriptID="transcript", keepExons=False, labels=[], transcript_id_designator="transcript_id", defaultGroup=None, verbose=False):
        """
        Driver function to actually parse files. The steps are as follows:

        1) skip to the first non-comment line
        2) Infer the type from that
        3) Call a type-specific processing function accordingly
          * These call the underlying C code for storage
          * These handle chromosome name conversions (python-level)
          * These handle labels (python-level, with a C-level numeric attribute)
        4) Sanity checking (does the number of labels make sense?)

        Required inputs are as follows:

        fnames: A list of (possibly compressed with gzip or bzip2) GTF or BED files.

        Optional input is:

        exonID: For GTF files, the feature column (column 3) label for exons, or whatever else should be stored as exons. The default is 'exon', though one could use 'CDS' instead.
        transcriptID: As above, but for transcripts. The default is 'transcript'.
        keepExons: For BED12 and GTF files, exons are ignored by default.
        labels: A list of group labels.
        transcript_id_designator: For GTF files, this is the attribute key used when searching for the transcript ID. If one sets transcriptID to 'gene', then transcript_id_designator would need to be changed to 'gene_id' or 'gene_name' to extract the gene ID/name from the attributes.
        defaultGroup: The default group name. If None, the file name is used.
        verbose: Whether to produce warning messages (default: False)
        """
        self.fname = []
        self.filename = ""
        self.chroms = []
        self.exons = []
        self.labels = []
        self.transcriptIDduplicated = []
        self.tree = tree.initTree()
        self.labelIdx = 0
        self.transcript_id_designator = transcript_id_designator
        self.exonID = exonID
        self.transcriptID = transcriptID
        self.keepExons = keepExons
        self.defaultGroup = defaultGroup
        self.verbose = verbose

        if labels != []:
            self.already_input_labels = True

        if not isinstance(fnames, list):
            fnames = [fnames]

        # Load the files
        for fname in fnames:
            self.filename = fname
            fp = openPossiblyCompressed(fname)
            line, labelColumn = self.firstNonComment(fp)
            if line is None:
                # This will only ever happen if a file is empty or just has a header/comment
                continue
            line = line.strip()

            ftype = self.inferType(fp, line, labelColumn)
            if ftype == 'GTF':
                self.parseGTF(fp, line)
            elif ftype == 'BED3':
                self.parseBED(fp, line, 3, labelColumn)
            elif ftype == 'BED6':
                self.parseBED(fp, line, 6, labelColumn)
            else:
                self.parseBED(fp, line, 12, labelColumn)
            fp.close()

        # Sanity check
        if self.tree.countEntries() == 0:
            raise RuntimeError("None of the input BED/GTF files had valid regions")

        # Replace labels
        if len(labels) > 0:
            if len(labels) != len(self.labels):
                # Report the counts, as the message promises, rather than the label lists themselves
                raise RuntimeError("The number of labels found ({0}) does not match the number input ({1})!".format(len(self.labels), len(labels)))
            else:
                self.labels = labels

        # vine -> tree
        self.tree.finish()

    # findOverlaps()
    def findOverlaps(self, chrom, start, end, strand=".", matchType=0, strandType=0, trimOverlap=False,
                     numericGroups=False, includeStrand=False):
        """
        Given a chromosome and start/end coordinates with an optional strand,
        return a list of tuples, each containing:

         * start
         * end
         * name
         * label
         * [(exon start, exon end), ...]
         * strand (optional)
         * score

        If there are no overlaps, return None.

        This function allows stranded searching, though the default is to
        ignore strand!

        The non-obvious options are defined in gtf.h:
          matchType:  0, GTF_MATCH_ANY
                      1, GTF_MATCH_EXACT
                      2, GTF_MATCH_CONTAIN
                      3, GTF_MATCH_WITHIN
                      4, GTF_MATCH_START
                      5, GTF_MATCH_END

          strandType: 0, GTF_IGNORE_STRAND
                      1, GTF_SAME_STRAND
                      2, GTF_OPPOSITE_STRAND
                      3, GTF_EXACT_SAME_STRAND

        trimOverlap:   If true, this removes overlaps from the 5' end that
                       extend beyond the range requested. This is useful in
                       cases where the calling function first divides the
                       genome into large bins. In that case,
                       'trimOverlap=True' can be used to ensure that a given
                       interval is never seen more than once.

        numericGroups: Whether to return group labels or simply the numeric
                       index. The latter is more useful when these are passed
                       to a function whose output will be sorted according to
                       group.

        includeStrand: Whether to include the strand in the output. The
                       default is False

        >>> from deeptoolsintervals import parse
        >>> from os.path import dirname, basename
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.bed6".format(dirname(parse.__file__)), "{0}/test/GRCh38.84.bed2".format(dirname(parse.__file__))], keepExons=True)
        >>> overlaps = gtf.findOverlaps("1", 0, 3000000)
        >>> labels = dict()
        >>> for o in overlaps:
        ...     if basename(o[3]) not in labels:
        ...         labels[basename(o[3])] = 0
        ...     labels[basename(o[3])] += 1
        >>> assert(labels['GRCh38.84.bed2'] == 1)
        >>> assert(labels['GRCh38.84.bed6'] == 15)
        >>> assert(labels['group2'] == 9)
        >>> assert(labels['group 3'] == 7)
        >>> assert(labels['group 1'] == 6)
        >>> assert(labels['group 1_r1'] == 4)
        >>> gtf = parse.GTF("{0}/test/strands.bed".format(dirname(parse.__file__)))
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="+", includeStrand=True, strandType=0)  # ignore strand
        >>> assert(o == [(0, 1000, 'first', 'strands.bed', [(0, 1000)], '+', 0.0), (1000, 2000, 'second', 'strands.bed', [(1000, 2000)], '-', 0.0), (2000, 3000, 'third', 'strands.bed', [(2000, 3000)], '.', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="+", includeStrand=True, strandType=1)  # same strand
        >>> assert(o == [(0, 1000, 'first', 'strands.bed', [(0, 1000)], '+', 0.0), (2000, 3000, 'third', 'strands.bed', [(2000, 3000)], '.', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="+", includeStrand=True, strandType=2)  # opposite strand
        >>> assert(o == [(1000, 2000, 'second', 'strands.bed', [(1000, 2000)], '-', 0.0), (2000, 3000, 'third', 'strands.bed', [(2000, 3000)], '.', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="+", includeStrand=True, strandType=3)  # exact same strand
        >>> assert(o == [(0, 1000, 'first', 'strands.bed', [(0, 1000)], '+', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="-", includeStrand=True, strandType=0)  # ignore strand
        >>> assert(o == [(0, 1000, 'first', 'strands.bed', [(0, 1000)], '+', 0.0), (1000, 2000, 'second', 'strands.bed', [(1000, 2000)], '-', 0.0), (2000, 3000, 'third', 'strands.bed', [(2000, 3000)], '.', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="-", includeStrand=True, strandType=1)  # same strand
        >>> assert(o == [(1000, 2000, 'second', 'strands.bed', [(1000, 2000)], '-', 0.0), (2000, 3000, 'third', 'strands.bed', [(2000, 3000)], '.', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="-", includeStrand=True, strandType=2)  # opposite strand
        >>> assert(o == [(0, 1000, 'first', 'strands.bed', [(0, 1000)], '+', 0.0), (2000, 3000, 'third', 'strands.bed', [(2000, 3000)], '.', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand="-", includeStrand=True, strandType=3)  # exact same strand
        >>> assert(o == [(1000, 2000, 'second', 'strands.bed', [(1000, 2000)], '-', 0.0)])
        >>> o = gtf.findOverlaps("1", 0, 3000, strand=".", includeStrand=True, strandType=3)  # exact same strand
        >>> assert(o == [(2000, 3000, 'third', 'strands.bed', [(2000, 3000)], '.', 0.0)])
        """
        chrom = self.mungeChromosome(chrom, append=False)
        if not chrom:
            return None

        # Ensure that the tree has entries; otherwise there is nothing to search
        if self.tree.countEntries() == 0:
            return None

        if not self.tree.isTree():
            raise RuntimeError('The GTFtree is actually a vine! There must have been an error during creation (this shouldn\'t happen)...')

        # Convert the strand to a number
        if strand == '+':
            strand = 0
        elif strand == '-':
            strand = 1
        else:
            strand = 3

        overlaps = self.tree.findOverlaps(chrom, start, end, strand, matchType, strandType, "transcript_id", includeStrand)
        if overlaps is None:
            return None

        for i, o in enumerate(overlaps):
            if o[2] not in self.exons[o[3]] or len(self.exons[o[3]][o[2]]) == 0:
                exons = [(o[0], o[1])]
            else:
                exons = sorted(self.exons[o[3]][o[2]])

            if numericGroups:
                overlaps[i] = (o[0], o[1], o[2], o[3], exons)
            else:
                overlaps[i] = (o[0], o[1], o[2], self.labels[o[3]], exons)

            if includeStrand:
                overlaps[i] = overlaps[i] + (str(o[-2].decode("ascii")),)

            # Add the score
            overlaps[i] = overlaps[i] + (o[-1],)

        # Ensure that the intervals are sorted by their 5'-most bound. This enables trimming
        overlaps = sorted(overlaps)

        if trimOverlap:
            while True:
                if len(overlaps) > 0:
                    if overlaps[0][0] < start:
                        del overlaps[0]
                    else:
                        break
                else:
                    overlaps = []
                    break

        return overlaps

    def hasOverlaps(self, returnDistance=False):
        """
        By default, returns True if ANY intervals in the tree overlap each other, regardless of strand, and False otherwise. If returnDistance is True, then a tuple is returned instead.
        The first value is as described above, the second is the minimum distance between intervals (0 on an overlap).

        >>> from deeptoolsintervals import parse
        >>> from os.path import dirname
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.bed".format(dirname(parse.__file__))])
        >>> assert(gtf.hasOverlaps())
        >>> gtf = parse.GTF(["{0}/test/GRCh38.84.bed".format(dirname(parse.__file__))])
        >>> assert(gtf.hasOverlaps(returnDistance=True) == (True, 0))
        >>> gtf = parse.GTF(["{0}/test/noOverlaps.bed".format(dirname(parse.__file__))])
        >>> assert(not gtf.hasOverlaps())
        >>> gtf = parse.GTF(["{0}/test/noOverlaps.bed".format(dirname(parse.__file__))])
        >>> assert(gtf.hasOverlaps(returnDistance=True) == (False, 9))
        """
        rv = self.tree.hasOverlaps()
        if returnDistance:
            return rv
        return rv[0]
deeptools_intervals-0.1.9/deeptoolsintervals/test/000077500000000000000000000000001352261167100225175ustar00rootroot00000000000000
deeptools_intervals-0.1.9/deeptoolsintervals/test/GRCh38.84.2.gtf.gz000066400000000000000000000037311352261167100251740ustar00rootroot00000000000000
deeptools_intervals-0.1.9/deeptoolsintervals/test/GRCh38.84.bed000066400000000000000000000005171352261167100243660ustar00rootroot00000000000000
1	11868	14409
1	12009	13670
1	14403	29570
1	17368	17436
#group 1
1	29553	31097
1	30266	31109
1	30365	30503
1	34553	36081
1	35244	36073
1	52472	53312
1	62947	63887
1	69090	70008
1	89294	120932
#group2
1	92229	129217
1	110952	129173
1	120724	133723
1	129080	133566
1	89550	91105
1	131024	134836
1	135140	135895
# group 3
1	137681	137965
deeptools_intervals-0.1.9/deeptoolsintervals/test/GRCh38.84.bed12.bz2000066400000000000000000000012051352261167100252200ustar00rootroot00000000000000
deeptools_intervals-0.1.9/deeptoolsintervals/test/GRCh38.84.bed2000066400000000000000000000005711352261167100244500ustar00rootroot00000000000000
1	211868	214409
1	212009	213670
1	214403	229570
1	217368	217436
#group 1
1	229553	231097
1	230266	231109
1	230365	230503
1	234553	236081
1	235244	236073
1	252472	253312
1	262947	263887
1	269090	270008
1	289294	2120932
#group2
1	292229	2129217
1	2110952	2129173
1	2120724	2133723
1	2129080	2133566
1	289550	291105
1	2131024	2134836
1	2135140	2135895
# group 3
1	2137681	2137965
deeptools_intervals-0.1.9/deeptoolsintervals/test/GRCh38.84.bed6000066400000000000000000000014131352261167100244500ustar00rootroot00000000000000
1	11868	14409	ENST00000456328.2	0	+
1	12009	13670	ENST00000450305.2	0	+
1	14403	29570	ENST00000488147.1	0	-
1	17368	17436	ENST00000619216.1	0	-
1	29553	31097	ENST00000473358.1	0	+
1	30266	31109	ENST00000469289.1	0	+
#group 1
1	30365	30503	ENST00000607096.1	0	+
1	34553	36081	ENST00000417324.1	0	-
1	35244	36073	ENST00000461467.1	0	-
1	52472	53312	ENST00000606857.1	0	+
1	62947	63887	ENST00000492842.1	0	+
1	69090	70008	ENST00000335137.3	0	+
1	89294	120932	ENST00000466430.5	0	-
1	92229	129217	ENST00000477740.5	0	-
1	110952	129173	ENST00000471248.1	0	-
1	120724	133723	ENST00000610542.1	0	-
1	129080	133566	ENST00000453576.2	0	-
1	89550	91105	ENST00000495576.1	0	-
1	131024	134836	ENST00000442987.3	0	+
1	135140	135895	ENST00000494149.2	0	-
1	137681	137965	ENST00000595919.1	0	-
deeptools_intervals-0.1.9/deeptoolsintervals/test/GRCh38.84.gtf.gz000066400000000000000000000054661352261167100250430ustar00rootroot00000000000000
deeptools_intervals-0.1.9/deeptoolsintervals/test/GRCh38.84.labels.bed000066400000000000000000000010251352261167100256220ustar00rootroot00000000000000
#
# chrom	start	end	deepTools_group
#another random line
1	11868	14409	group 1
1	12009	13670	group 1
1	14403	29570	group 1
1	17368	17436	group 1
1	29553	31097	group 2
1	30266	31109	group 2
1	30365	30503	group 2
1	34553	36081	group 2
1	35244	36073	group 2
1	52472	53312	group 2
1	62947	63887	group 2
1	69090	70008	group 2
1	89294	120932	group 2
1	92229	129217	group 3
1	110952	129173	group 3
1	120724	133723	group 3
1	129080	133566	group 3
1	89550	91105	group 3
1	131024	134836	group 3
1	135140	135895	group 3
1	137681	137965	group 4
deeptools_intervals-0.1.9/deeptoolsintervals/test/noOverlaps.bed000066400000000000000000000003671352261167100253310ustar00rootroot00000000000000
1	11868	12000
1	12009	13670
1	14403	16570
1	17368	17436
1	29553	30097
1	30266	31109
1	34553	35081
1	35244	36073
1	52472	53312
1	62947	63887
1	69090	70008
1	89294	92132
1	92229	129217
1	129280	133566
1	134024	134836
1	135140	135895
1	137681	137965
deeptools_intervals-0.1.9/deeptoolsintervals/test/strands.bed000066400000000000000000000001001352261167100246400ustar00rootroot00000000000000
1	0	1000	first	0	+
1	1000	2000	second	0	-
1	2000	3000	third	0	.
deeptools_intervals-0.1.9/deeptoolsintervals/tree/000077500000000000000000000000001352261167100224775ustar00rootroot00000000000000
deeptools_intervals-0.1.9/deeptoolsintervals/tree/findOverlaps.c000066400000000000000000000433331352261167100253050ustar00rootroot00000000000000
#include <stdlib.h> /* calloc, realloc, free, qsort */
#include <assert.h> /* assert */
#include "gtf.h"

/*******************************************************************************
*
* Comparison functions
*
* These are according to Allen's Interval Algebra
*
*******************************************************************************/
static inline int rangeAny(uint32_t start, uint32_t end, GTFentry *e) {
    if(end <= e->start) return -1;
    if(start >= e->end) return 1;
    return 0;
}

static inline int rangeContains(uint32_t start, uint32_t end, GTFentry *e) {
    if(e->start >= start && e->end <= end) return 0;
    if(e->end < end) return -1;
    return 1;
}

static inline int rangeWithin(uint32_t start, uint32_t end, GTFentry *e) {
    if(start >= e->start && end <= e->end) return 0;
    if(start < e->start) return -1;
    return 1;
}

static inline int rangeExact(uint32_t start, uint32_t end, GTFentry *e) {
    if(start == e->start && end == e->end) return 0;
    if(start < e->start) return -1;
    if(end < e->end) return -1;
    return 1;
}

static inline int rangeStart(uint32_t start, uint32_t end, GTFentry *e) {
    if(start == e->start) return 0;
    if(start < e->start) return -1;
    return 1;
}

static inline int rangeEnd(uint32_t start, uint32_t end, GTFentry *e) {
    if(end == e->end) return 0;
    if(end < e->end) return -1;
    return 1;
}

static inline int exactSameStrand(int strand, GTFentry *e) {
    return strand == e->strand;
}

static inline int sameStrand(int strand, GTFentry *e) {
    if(strand == 3 || e->strand == 3) return 1;
    if(strand == e->strand) return 1;
    return 0;
}

static inline int oppositeStrand(int strand, GTFentry *e) {
    if(strand == 3 || e->strand == 3) return 1;
    if(strand != e->strand) return 1;
    return 0;
}

/*******************************************************************************
*
* OverlapSet functions
*
*******************************************************************************/
overlapSet *os_init(GTFtree *t) {
    overlapSet *os = calloc(1, sizeof(overlapSet));
    assert(os);
    os->tree = t;
    return os;
}

void os_reset(overlapSet *os) {
    int i;
    for(i=0; i<os->l; i++) os->overlaps[i] = NULL;
    os->l = 0;
}

void os_destroy(overlapSet *os) {
    if(os->overlaps) free(os->overlaps);
    free(os);
}

overlapSet *os_grow(overlapSet *os) {
    int i;
    os->m++;
    kroundup32(os->m);
    os->overlaps = realloc(os->overlaps, os->m * sizeof(GTFentry*));
    assert(os->overlaps);
    for(i=os->l; i<os->m; i++) os->overlaps[i] = NULL;
    return os;
}

static void os_push(overlapSet *os, GTFentry *e) {
    if(os->l+1 >= os->m) os = os_grow(os);
    os->overlaps[os->l++] = e;
}

overlapSet *os_dup(overlapSet *os) {
    int i;
    overlapSet *os2 = os_init(os->tree);
    for(i=0; i<os->l; i++) os_push(os2, os->overlaps[i]);
    return os2;
}

void os_exclude(overlapSet *os, int i) {
    int j;
    for(j=i; j<os->l-1; j++) os->overlaps[j] = os->overlaps[j+1];
    os->overlaps[--os->l] = NULL;
}

static int os_sortFunc(const void *a, const void *b) {
    GTFentry *pa = *(GTFentry**) a;
    GTFentry *pb = *(GTFentry**) b;

    if(pa->start < pb->start) return -1;
    if(pb->start < pa->start) return 1;
    if(pa->end < pb->end) return -1;
    if(pb->end < pa->end) return 1;
    return 0;
}

static void os_sort(overlapSet *os) {
    qsort((void *) os->overlaps, os->l, sizeof(GTFentry**), os_sortFunc);
}

//Non-existent keys/values will be ignored
void os_requireAttributes(overlapSet *os, char **key, char **val, int len) {
    int i, j, k, filter;
    int32_t keyHash, valHash;

    for(i=0; i<len; i++) {
        if(!os->l) break;
        keyHash = str2valHT(os->tree->htAttributes, key[i]);
        valHash = str2valHT(os->tree->htAttributes, val[i]);
        assert(keyHash>=0);
        assert(valHash>=0);
        for(j=0; j<os->l; j++) {
            filter = 1;
            for(k=0; k<os->overlaps[j]->nAttributes; k++) {
                if(os->overlaps[j]->attrib[k]->key == keyHash) {
                    if(os->overlaps[j]->attrib[k]->val == valHash) {
                        filter = 0;
                        break;
                    }
                }
            }
            if(filter) {
                os_exclude(os, j);
                j--; //os_exclude shifts everything
            }
        }
    }
}

//This is an inefficient implementation. It would be faster to sort according
//to COMPARE_FUNC and then do an O(n) merge.
overlapSet *os_intersect(overlapSet *os1, overlapSet *os2, COMPARE_FUNC f) {
    overlapSet *os = os_init(os1->tree);
    int i, j;

    for(i=0; i<os1->l; i++) {
        for(j=0; j<os2->l; j++) {
            if(f(os1->overlaps[i],os2->overlaps[j]) == 0) {
                os_push(os, os1->overlaps[i]);
                os_exclude(os2, j);
                break;
            }
        }
    }

    return os;
}

/*******************************************************************************
*
* OverlapSetList functions
*
*******************************************************************************/
overlapSetList *osl_init() {
    overlapSetList *osl = calloc(1, sizeof(overlapSetList));
    assert(osl);
    return osl;
}

void osl_reset(overlapSetList *osl) {
    int i;
    for(i=0; i<osl->l; i++) os_destroy(osl->os[i]);
    osl->l = 0;
}

void osl_destroy(overlapSetList *osl) {
    osl_reset(osl);
    if(osl->os) free(osl->os);
    free(osl);
}

void osl_grow(overlapSetList *osl) {
    int i;
    osl->m++;
    kroundup32(osl->m);
    osl->os = realloc(osl->os, osl->m * sizeof(overlapSet*));
    assert(osl->os);
    for(i=osl->l; i<osl->m; i++) osl->os[i] = NULL;
}

void osl_push(overlapSetList *osl, overlapSet *os) {
    if(osl->l+1 >= osl->m) osl_grow(osl);
    osl->os[osl->l++] = os;
}

//The output needs to be destroyed
overlapSet *osl_intersect(overlapSetList *osl, COMPARE_FUNC f) {
    int i;
    if(!osl->l) return NULL;

    overlapSet *osTmp, *os = os_dup(osl->os[0]);
    for(i=1; i<osl->l; i++) {
        osTmp = os_intersect(os, osl->os[i], f);
        os_destroy(os);
        os = osTmp;
        if(os->l == 0) break;
    }
    return os;
}

//Returns 1 if the node is in the overlapSet, otherwise 0.
int os_contains(overlapSet *os, GTFentry *e) {
    int i;
    for(i=0; i<os->l; i++) {
        if(os->overlaps[i] == e) return 1;
    }
    return 0;
}

//This could be made much more efficient
overlapSet *osl_union(overlapSetList *osl) {
    int i, j;
    if(!osl->l) return NULL;
    if(!osl->os) return NULL;
    if(!osl->os[0]) return NULL;

    overlapSet *os = os_dup(osl->os[0]);
    for(i=1; i<osl->l; i++) {
        for(j=0; j<osl->os[i]->l; j++) {
            if(!os_contains(os, osl->os[i]->overlaps[j])) {
                os_push(os, osl->os[i]->overlaps[j]);
            }
        }
    }

    return os;
}

/*******************************************************************************
*
* uniqueSet functions
*
*******************************************************************************/
static uniqueSet *us_init(hashTable *ht) {
    uniqueSet *us = calloc(1, sizeof(uniqueSet));
    assert(us);
    us->ht = ht;
    return us;
}

void us_destroy(uniqueSet *us) {
    if(!us) return;
    if(us->IDs) {
        free(us->IDs);
        free(us->cnts);
    }
    free(us);
}

static uniqueSet *us_grow(uniqueSet *us) {
    int i;
    us->m++;
    kroundup32(us->m);
    us->IDs = realloc(us->IDs, us->m * sizeof(int32_t));
    assert(us->IDs);
    us->cnts = realloc(us->cnts, us->m * sizeof(uint32_t));
    assert(us->cnts);
    for(i=us->l; i<us->m; i++) {
        us->IDs[i] = -1;
        us->cnts[i] = 0;
    }
    return us;
}

static void us_push(uniqueSet *us, int32_t ID) {
    if(us->l+1 >= us->m) us = us_grow(us);
    us->IDs[us->l] = ID;
    us->cnts[us->l++] = 1;
}

static void us_inc(uniqueSet *us) {
    assert(us->l<=us->m);
    us->cnts[us->l-1]++;
}

uint32_t us_cnt(uniqueSet *us, int32_t i) {
    assert(i<us->l);
    return us->cnts[i];
}

char *us_val(uniqueSet *us, int32_t i) {
    if(i>=us->l) return NULL;
    return val2strHT(us->ht, us->IDs[i]);
}

/*******************************************************************************
*
* Overlap set count/unique functions
*
*******************************************************************************/
static int int32_t_cmp(const void *a, const void *b) {
    int32_t ia = *((int32_t*) a);
    int32_t ib = *((int32_t*) b);
    return ia-ib;
}

int32_t cntAttributes(overlapSet *os, char *attributeName) {
int32_t IDs[os->l], i, j, key, last, n = 0; if(!strExistsHT(os->tree->htAttributes, attributeName)) return n; key = str2valHT(os->tree->htAttributes, attributeName); for(i=0; i<os->l; i++) { IDs[i] = -1; for(j=0; j<os->overlaps[i]->nAttributes; j++) { if(os->overlaps[i]->attrib[j]->key == key) { IDs[i] = os->overlaps[i]->attrib[j]->val; break; } } } qsort((void*) IDs, os->l, sizeof(int32_t), int32_t_cmp); last = IDs[0]; n = (last >= 0) ? 1 : 0; for(i = 1; i<os->l; i++) { if(IDs[i] != last) { n++; last = IDs[i]; } } return n; } uniqueSet *uniqueAttributes(overlapSet *os, char *attributeName) { if(!os) return NULL; if(os->l == 0) return NULL; int32_t IDs[os->l], i, j, key, last; if(!strExistsHT(os->tree->htAttributes, attributeName)) return NULL; uniqueSet *us = us_init(os->tree->htAttributes); key = str2valHT(os->tree->htAttributes, attributeName); for(i=0; i<os->l; i++) { IDs[i] = -1; for(j=0; j<os->overlaps[i]->nAttributes; j++) { if(os->overlaps[i]->attrib[j]->key == key) { IDs[i] = os->overlaps[i]->attrib[j]->val; break; } } } qsort((void*) IDs, os->l, sizeof(int32_t), int32_t_cmp); last = -1; for(i=0; i<os->l; i++) { if(IDs[i] != last || last < 0) { us_push(us, IDs[i]); last = IDs[i]; } else { us_inc(us); } } if(us->l) return us; us_destroy(us); return NULL; } /******************************************************************************* * * Node iterator functions * *******************************************************************************/ //bit 1: go left, bit 2: go right (a value of 3 is then "do both") static int centerDirection(uint32_t start, uint32_t end, GTFnode *n) { if(n->center >= start && n->center < end) return 3; if(n->center < start) return 2; return 1; } static int matchingStrand(GTFentry *e, int strand, int strandType) { if(strandType == GTF_IGNORE_STRAND) return 1; if(strandType == GTF_SAME_STRAND) { return sameStrand(strand, e); } else if(strandType == GTF_OPPOSITE_STRAND) { return oppositeStrand(strand, e); } else if(strandType == GTF_EXACT_SAME_STRAND) { return
exactSameStrand(strand, e); } fprintf(stderr, "[matchingStrand] Unknown strand type %i. Assuming a match.\n", strandType); return 1; } static void filterStrand(overlapSet *os, int strand, int strandType) { int i; if(strandType == GTF_IGNORE_STRAND) return; for(i=os->l-1; i>=0; i--) { if(strandType == GTF_SAME_STRAND) { if(!sameStrand(strand, os->overlaps[i])) os_exclude(os, i); } else if(strandType == GTF_OPPOSITE_STRAND) { if(!oppositeStrand(strand, os->overlaps[i])) os_exclude(os, i); } else if(strandType == GTF_EXACT_SAME_STRAND) { if(!exactSameStrand(strand, os->overlaps[i])) os_exclude(os, i); } } } static void pushOverlaps(overlapSet *os, GTFtree *t, GTFentry *e, uint32_t start, uint32_t end, int comparisonType, int direction, FILTER_ENTRY_FUNC ffunc) { int dir; int keep = 1; if(!e) return; if(ffunc) keep = ffunc(t, e); switch(comparisonType) { case GTF_MATCH_EXACT : if((dir = rangeExact(start, end, e)) == 0) { if(keep) os_push(os, e); } break; case GTF_MATCH_WITHIN : if((dir = rangeAny(start, end, e)) == 0) { if(keep) if(rangeWithin(start, end ,e) == 0) os_push(os, e); } break; case GTF_MATCH_CONTAIN : if((dir = rangeAny(start, end, e)) == 0) { if(keep) if(rangeContains(start, end, e) == 0) os_push(os, e); } break; case GTF_MATCH_START : if((dir = rangeStart(start, end, e)) == 0) { if(keep) os_push(os, e); } break; case GTF_MATCH_END : if((dir = rangeEnd(start, end, e)) == 0) { if(keep) os_push(os, e); } break; default : if((dir = rangeAny(start, end, e)) == 0) { if(keep) os_push(os, e); } break; } if(direction) { if(dir > 0) return; pushOverlaps(os, t, e->right, start, end, comparisonType, direction, ffunc); } else { if(dir < 0) return; pushOverlaps(os, t, e->left, start, end, comparisonType, direction, ffunc); } } static int32_t countOverlapsEntry(GTFtree *t, GTFentry *e, uint32_t start, uint32_t end, int strand, int matchType, int strandType, int direction, int32_t max, FILTER_ENTRY_FUNC ffunc) { int dir; int32_t cnt = 0; if(!e) return cnt; 
switch(matchType) { case GTF_MATCH_EXACT : if((dir = rangeExact(start, end, e)) == 0) { cnt = 1; } break; case GTF_MATCH_WITHIN : if((dir = rangeAny(start, end, e)) == 0) { if(rangeWithin(start, end, e) == 0) cnt = 1; } break; case GTF_MATCH_CONTAIN : if((dir = rangeAny(start, end, e)) == 0) { if(rangeContains(start, end, e) == 0) cnt = 1; } break; case GTF_MATCH_START : if((dir = rangeStart(start, end, e)) == 0) { cnt = 1; } break; case GTF_MATCH_END : if((dir = rangeEnd(start, end, e)) == 0) { cnt = 1; } break; default : if((dir = rangeAny(start, end, e)) == 0) { cnt = 1; } break; } if(cnt) { if(!matchingStrand(e, strand, strandType)) cnt = 0; } if(cnt && ffunc) { if(!ffunc(t, e)) cnt = 0; } if(max && cnt >= max) return max; if(direction) { if(dir > 0) return cnt; return cnt + countOverlapsEntry(t, e->right, start, end, strand, matchType, strandType, direction, max, ffunc); } else { if(dir < 0) return cnt; return cnt + countOverlapsEntry(t, e->left, start, end, strand, matchType, strandType, direction, max, ffunc); } } static void pushOverlapsNode(overlapSet *os, GTFtree *t, GTFnode *n, uint32_t start, uint32_t end, int matchType, FILTER_ENTRY_FUNC ffunc) { int dir; if(!n) return; dir = centerDirection(start, end, n); if(dir&1) { pushOverlaps(os, t, n->starts, start, end, matchType, 1, ffunc); pushOverlapsNode(os, t, n->left, start, end, matchType, ffunc); } if(dir&2) { if(dir!=3) pushOverlaps(os, t, n->ends, start, end, matchType, 0, ffunc); pushOverlapsNode(os, t, n->right, start, end, matchType, ffunc); } } static int32_t countOverlapsNode(GTFtree *t, GTFnode *n, uint32_t start, uint32_t end, int strand, int matchType, int strandType, int32_t max, FILTER_ENTRY_FUNC ffunc) { int32_t cnt = 0; int dir; if(!n) return cnt; dir = centerDirection(start, end, n); if(dir&1) { cnt += countOverlapsEntry(t, n->starts, start, end, strand, matchType, strandType, 1, max, ffunc); if(max && cnt >= max) return max; cnt += countOverlapsNode(t, n->left, start, end, strand, 
matchType, strandType, max, ffunc); if(max && cnt >= max) return max; } if(dir&2) { if(dir!=3) cnt += countOverlapsEntry(t, n->ends, start, end, strand, matchType, strandType, 0, max, ffunc); if(max && cnt >= max) return max; cnt += countOverlapsNode(t, n->right, start, end, strand, matchType, strandType, max, ffunc); if(max && cnt >= max) return max; } return cnt; } /******************************************************************************* * * Driver functions for end use. * *******************************************************************************/ overlapSet * findOverlaps(overlapSet *os, GTFtree *t, char *chrom, uint32_t start, uint32_t end, int strand, int matchType, int strandType, int keepOS, FILTER_ENTRY_FUNC ffunc) { int32_t tid = str2valHT(t->htChroms, chrom); overlapSet *out = os; if(out && !keepOS) os_reset(out); else if(!out) out = os_init(t); if(tid<0) return out; if(!t->balanced) { fprintf(stderr, "[findOverlaps] The tree has not been balanced! No overlaps will be returned.\n"); return out; } pushOverlapsNode(out, t, (GTFnode*) t->chroms[tid]->tree, start, end, matchType, ffunc); if(out->l) filterStrand(out, strand, strandType); if(out->l) os_sort(out); return out; } int32_t countOverlaps(GTFtree *t, char *chrom, uint32_t start, uint32_t end, int strand, int matchType, int strandType, FILTER_ENTRY_FUNC ffunc) { int32_t tid = str2valHT(t->htChroms, chrom); if(tid<0) return 0; if(!t->balanced) { fprintf(stderr, "[countOverlaps] The tree has not been balanced! No overlaps will be returned.\n"); return 0; } return countOverlapsNode(t, (GTFnode*) t->chroms[tid]->tree, start, end, strand, matchType, strandType, 0, ffunc); } int overlapsAny(GTFtree *t, char *chrom, uint32_t start, uint32_t end, int strand, int matchType, int strandType, FILTER_ENTRY_FUNC ffunc) { int32_t tid = str2valHT(t->htChroms, chrom); if(tid<0) return 0; if(!t->balanced) { fprintf(stderr, "[overlapsAny] The tree has not been balanced!
No overlaps will be returned.\n"); return 0; } return countOverlapsNode(t, (GTFnode*) t->chroms[tid]->tree, start, end, strand, matchType, strandType, 1, ffunc); } deeptools_intervals-0.1.9/deeptoolsintervals/tree/gtf.c000066400000000000000000000421651352261167100234330ustar00rootroot00000000000000#include <stdlib.h> #include <stdio.h> #include <assert.h> #include <inttypes.h> #include "gtf.h" //Nodes for the interval tree typedef struct { int32_t center; int32_t l; GTFentry *start; GTFentry *end; struct treeNode *left; struct treeNode *right; } treeNode; //The sizes shouldn't be preset... GTFtree * initGTFtree() { GTFtree *t = calloc(1, sizeof(GTFtree)); assert(t); //Initialize the hash tables t->htChroms = initHT(128); t->htSources = initHT(128); t->htFeatures = initHT(128); t->htAttributes = initHT(128); return t; } void destroyGTFentry(GTFentry *e) { int32_t i; if(!e) return; if(e->right) destroyGTFentry(e->right); for(i=0; i<e->nAttributes; i++) { if(e->attrib[i]) free(e->attrib[i]); } if(e->attrib) free(e->attrib); free(e); } void destroyGTFnode(GTFnode *n) { if(n->left) destroyGTFnode(n->left); if(n->starts) destroyGTFentry(n->starts); if(n->right) destroyGTFnode(n->right); free(n); } void destroyGTFchrom(GTFchrom *c, int balanced) { if(balanced) destroyGTFnode((GTFnode*) c->tree); else destroyGTFentry((GTFentry*) c->tree); free(c); } //This still needs to handle htTargets and htGenes void destroyGTFtree(GTFtree *t) { uint32_t i; for(i=0; i<t->n_targets; i++) { destroyGTFchrom(t->chroms[i], t->balanced); } destroyHT(t->htChroms); destroyHT(t->htSources); destroyHT(t->htFeatures); destroyHT(t->htAttributes); free(t->chroms); free(t); } void addChrom(GTFtree *t) { int i; t->n_targets++; //Grow if needed if(t->n_targets >= t->m) { t->m++; kroundup32(t->m); t->chroms = realloc(t->chroms, t->m * sizeof(GTFchrom *)); assert(t->chroms); for(i=t->n_targets-1; i<t->m; i++) t->chroms[i] = NULL; } assert(!t->chroms[t->n_targets-1]); //We shouldn't be adding over anything...
t->chroms[t->n_targets-1] = calloc(1,sizeof(GTFchrom)); assert(t->chroms[t->n_targets-1]); } //Returns NULL on error static Attribute *makeAttribute(GTFtree *t, char *value) { int32_t idx; Attribute *a = malloc(sizeof(Attribute)); if(!a) return NULL; if(!strExistsHT(t->htAttributes, "transcript_id")) { idx = addHTelement(t->htAttributes, "transcript_id"); } else { idx = str2valHT(t->htAttributes, "transcript_id"); } a->key = idx; if(!strExistsHT(t->htAttributes, value)) { idx = addHTelement(t->htAttributes, value); } else { idx = str2valHT(t->htAttributes, value); } a->val = idx; return a; } /* This currently hard-codes the following: feature source frame all attributes (most are skipped) returns 1 on error */ int addGTFentry(GTFtree *t, char *chrom, uint32_t start, uint32_t end, uint8_t strand, char *transcriptID, uint32_t labelIDX, double score) { int32_t IDchrom, IDfeature, IDsource; char feature[] = "transcript", source[] = "deepTools"; uint8_t frame = 3; GTFentry *e = NULL; Attribute *a = NULL; Attribute **attribs = calloc(1, sizeof(Attribute *)); if(!attribs) return 1; //Get the chromosome ID if(!strExistsHT(t->htChroms, chrom)) { addChrom(t); IDchrom = addHTelement(t->htChroms, chrom); } else { IDchrom = str2valHT(t->htChroms, chrom); } //Handle the hard-coded stuff, which in case they're ever requested if(!strExistsHT(t->htSources, source)) { IDsource = addHTelement(t->htSources, source); } else { IDsource = str2valHT(t->htSources, source); } if(!strExistsHT(t->htFeatures, feature)) { IDfeature = addHTelement(t->htFeatures, feature); } else { IDfeature = str2valHT(t->htFeatures, feature); } //Create the attribute a = makeAttribute(t, transcriptID); if(!a) goto error; attribs[0] = a; //Initialize the entry e = malloc(sizeof(GTFentry)); if(!e) goto error; e->right = NULL; e->chrom = IDchrom; e->feature = IDfeature; e->source = IDsource; e->start = start; e->end = end; e->strand = strand; e->frame = frame; e->score = score; e->nAttributes = 1; e->attrib = 
attribs; e->labelIdx = labelIDX; if(t->chroms[IDchrom]->tree) { e->left = ((GTFentry*) t->chroms[IDchrom]->tree)->left; e->left->right = e; ((GTFentry*) t->chroms[IDchrom]->tree)->left = e; } else { t->chroms[IDchrom]->tree = (void *) e; e->left = e; } t->chroms[IDchrom]->n_entries++; return 0; error: if(attribs) free(attribs); if(a) free(a); if(e) free(e); return 1; } /* This currently hard-codes the following: name source frame all attributes returns 1 on error */ int addEnrichmententry(GTFtree *t, char *chrom, uint32_t start, uint32_t end, uint8_t strand, double score, char *feature) { int32_t IDchrom, IDfeature, IDsource; char source[] = "deepTools"; uint8_t frame = 3; GTFentry *e = NULL; //Attribute **attribs = calloc(1, sizeof(Attribute *)); //if(!attribs) return 1; //Get the chromosome ID if(!strExistsHT(t->htChroms, chrom)) { addChrom(t); IDchrom = addHTelement(t->htChroms, chrom); } else { IDchrom = str2valHT(t->htChroms, chrom); } //Handle the hard-coded stuff, in case they're ever requested if(!strExistsHT(t->htSources, source)) { IDsource = addHTelement(t->htSources, source); } else { IDsource = str2valHT(t->htSources, source); } if(!strExistsHT(t->htFeatures, feature)) { IDfeature = addHTelement(t->htFeatures, feature); } else { IDfeature = str2valHT(t->htFeatures, feature); } //Initialize the entry e = malloc(sizeof(GTFentry)); if(!e) goto error; e->right = NULL; e->chrom = IDchrom; e->feature = IDfeature; e->source = IDsource; e->start = start; e->end = end; e->strand = strand; e->frame = frame; e->score = score; e->nAttributes = 0; e->attrib = NULL; if(t->chroms[IDchrom]->tree) { e->left = ((GTFentry*) t->chroms[IDchrom]->tree)->left; e->left->right = e; ((GTFentry*) t->chroms[IDchrom]->tree)->left = e; } else { t->chroms[IDchrom]->tree = (void *) e; e->left = e; } t->chroms[IDchrom]->n_entries++; return 0; error: //if(attribs) free(attribs); //if(a) free(a); if(e) free(e); return 1; } 
/******************************************************************************* * * Sorting functions * *******************************************************************************/ GTFentry *getMiddleR(GTFentry *e, uint32_t pos) { uint32_t i; GTFentry *tmp, *o = e; if(!o->right) return o; for(i=1; i<pos; i++) { assert(o->right); o = o->right; } tmp = o; assert(o->right); o = o->right; tmp->right = NULL; return o; } GTFentry *getMiddleL(GTFentry *e, uint32_t pos) { uint32_t i; GTFentry *tmp, *o = e; if(!o->left) { return o; } for(i=1; i<pos; i++) { assert(o->left); o = o->left; } tmp = o; assert(o->left); o = o->left; tmp->left = NULL; return o; } int cmpRangesStart(GTFentry *a, GTFentry *b) { if(!b && !a) return 0; if(!b) return -1; if(!a) return 1; if(a->start < b->start) return -1; if(b->start < a->start) return 1; if(b->end < a->end) return 1; return -1; } int cmpRangesEnd(GTFentry *a, GTFentry *b) { if(!b && !a) return 0; if(!a) return 1; if(!b) return -1; if(a->end > b->end) return -1; if(b->end > a->end) return 1; if(a->start > b->start) return -1; return 1; } GTFentry *mergeSortStart(GTFentry *a, GTFentry *b) { GTFentry *o = a, *last; int i = cmpRangesStart(a,b); if(i<0) { o = a; a = a->right; } else if(i>0) { o = b; b = b->right; } else{ return NULL; } last = o; last->right = NULL; while((i=cmpRangesStart(a,b))) { if(i>0) { last->right= b; last = b; b = b->right; } else { last->right= a; last = a; a = a->right; } } last->right = NULL; return o; } GTFentry *mergeSortEnd(GTFentry *a, GTFentry *b) { GTFentry *o = a, *last; int i = cmpRangesEnd(a,b); if(i<0) { o = a; a = a->left; } else if(i>0) { o = b; b = b->left; } else { return NULL; } last = o; last->left = NULL; while((i=cmpRangesEnd(a,b))) { if(i<0) { assert(a != last); last->left = a; last = a; a = a->left; } else { assert(b != last); last->left = b; last = b; b = b->left; } } last->left = NULL; return o; } GTFentry *sortTreeStart(GTFentry *e, uint32_t l) { if(l==1) return e; uint32_t half = l/2; GTFentry *middle = getMiddleR(e, half); return
mergeSortStart(sortTreeStart(e,half), sortTreeStart(middle,half+(l&1))); } GTFentry *sortTreeEnd(GTFentry *e, uint32_t l) { if(l==1) { e->left = NULL; //The list is circular, so... return e; } uint32_t half = l/2; assert(e->left); assert(e != e->left); GTFentry *middle = getMiddleL(e, half); assert(e != middle); assert(e != e->left); return mergeSortEnd(sortTreeEnd(e,half), sortTreeEnd(middle,half+(l&1))); } /******************************************************************************* * * Functions for interval tree construction * *******************************************************************************/ //Note the returned object is the rightmost interval sorted by end position GTFentry *sortChrom(GTFchrom *c) { GTFentry *e = ((GTFentry *)c->tree)->left; ((GTFentry*) c->tree)->left = NULL; c->tree = (void *) sortTreeStart((GTFentry *) c->tree, c->n_entries); e = sortTreeEnd(e, c->n_entries); return e; } uint32_t getCenter(GTFentry *ends) { GTFentry *slow = ends; GTFentry *fast = ends; while(fast->left && fast->left->left) { slow = slow->left; fast = fast->left->left; } return slow->end-1; } GTFentry *getMembers(GTFentry **members, GTFentry **rStarts, GTFentry *starts, uint32_t pos) { GTFentry *tmp, *newStarts = NULL; GTFentry *last = NULL, *lastMember = NULL; *members = NULL, *rStarts = NULL; while(starts && starts->start <= pos) { if(starts->end > pos) { tmp = starts->right; if(!*members) { lastMember = starts; *members = starts; } else { lastMember->right = starts; lastMember = starts; } starts->right = NULL; starts = tmp; } else { if(!newStarts) { newStarts = starts; last = starts; } else { last->right = starts; last = starts; } starts = starts->right; } } *rStarts = starts; if(lastMember) lastMember->right = NULL; if(last) last->right = NULL; assert(*members); return newStarts; } GTFentry *getRMembers(GTFentry **members, GTFentry **lEnds, GTFentry *ends, uint32_t pos) { GTFentry *tmp, *newEnds = NULL; GTFentry *last = NULL, *lastMember = NULL; 
*members = NULL, *lEnds = NULL; while(ends && ends->end > pos) { tmp = ends->left; if(ends->start <= pos) { if(!*members) { *members = ends; lastMember = ends; } else { lastMember->left = ends; lastMember = ends; } } else { if(!newEnds) { newEnds = ends; last = ends; } else { last->left = ends; last = ends; } } ends->left = NULL; ends = tmp; } *lEnds = ends; assert(*members); lastMember->left = NULL; if(newEnds) last->left = NULL; return newEnds; } GTFnode *makeIntervalTree(GTFentry *starts, GTFentry *ends) { uint32_t center = getCenter(ends);//, nMembers; GTFentry *rStarts = NULL; //getRStarts(starts, center); GTFentry *lEnds = NULL; //getLEnds(ends, center); GTFentry *memberStarts = NULL, *memberEnds = NULL; GTFnode *out = calloc(1, sizeof(GTFnode)); assert(out); starts = getMembers(&memberStarts, &rStarts, starts, center); ends = getRMembers(&memberEnds, &lEnds, ends, center); out->center = center; out->starts = memberStarts; out->ends = memberEnds; if(lEnds && starts) { out->left = makeIntervalTree(starts, lEnds); } else { out->left = NULL; } if(rStarts && ends) { out->right = makeIntervalTree(rStarts, ends); } else { out->right = NULL; } return out; } void sortGTF(GTFtree *t) { int32_t i; GTFentry *ends; for(i=0; i<t->n_targets; i++) { ends = sortChrom(t->chroms[i]); t->chroms[i]->tree = (void*) makeIntervalTree((GTFentry*) t->chroms[i]->tree, ends); } t->balanced = 1; } int nodeHasOverlaps(GTFnode *node, int firstNode, uint32_t *lpos, uint32_t *minDistance) { int rv = 0; GTFentry *e = node->starts; // Go down the left if(node->left) { rv = nodeHasOverlaps(node->left, firstNode, lpos, minDistance); if(rv) return rv; } else if(firstNode) { //This only has to be specially set on the left-most node *lpos = e->end; *minDistance = e->start; e = e->right; } // Test this node while(e) { if(e->start < *lpos) { *minDistance = 0; return 1; } if(e->start - *lpos < *minDistance) *minDistance = e->start - *lpos; *lpos = e->end; e = e->right; } if(node->right) return
nodeHasOverlaps(node->right, 0, lpos, minDistance); return rv; } int hasOverlapsChrom(GTFchrom *chrom, uint32_t *minDistance) { uint32_t lpos; if(chrom->n_entries < 2) return 0; return nodeHasOverlaps((GTFnode*) chrom->tree, 1, &lpos, minDistance); } // Given a GTF tree, return 1 if ANY of the entries overlap with each other, 0 otherwise // minDistance is updated to return the minimum distance between intervals. This will be 0 if there are overlaps. int hasOverlaps(GTFtree *t, uint32_t *minDistance) { int32_t i; int rv = 0; *minDistance = (uint32_t) -1; for(i=0; i<t->n_targets; i++) { rv = hasOverlapsChrom(t->chroms[i], minDistance); if(rv) return rv; } return rv; } /******************************************************************************* * * Misc. functions * *******************************************************************************/ void printBalancedGTF(GTFnode *n, const char *chrom) { kstring_t ks, ks2; ks.s = NULL; ks.l = ks.m = 0; ks2.s = NULL; ks2.l = ks2.m = 0; kputs(chrom, &ks); kputc(':', &ks); kputuw(n->center, &ks); if(n->left) { kputs(chrom, &ks2); kputc(':', &ks2); kputuw(n->left->center, &ks2); printf("\t\"%s\" -> \"%s\";\n", ks.s, ks2.s); printBalancedGTF(n->left, chrom); } printf("\t\"%s:%"PRIu32"\" [shape=box];\n", chrom, n->center); GTFentry *e = n->starts; if(e) printGTFvineStart(e, chrom, ks.s); if(n->ends) printGTFvineStartR(n->ends, chrom, ks.s); if(n->right) { ks2.l = 0; kputs(chrom, &ks2); kputc(':', &ks2); kputuw(n->right->center, &ks2); printf("\t\"%s\" -> \"%s\";\n", ks.s, ks2.s); printBalancedGTF(n->right, chrom); } free(ks.s); if(ks2.s) free(ks2.s); } void printGTFvineR(GTFentry *e, const char* chrom) { if(e->left == e) return; if(!e->left) return; printf("\t\"%s:%"PRIu32"-%"PRIu32"\" -> \"%s:%"PRIu32"-%"PRIu32"\" [color=red];\n", chrom, e->start, e->end, chrom, e->left->start, e->left->end); printGTFvineR(e->left, chrom); } void printGTFvineStartR(GTFentry *e, const char *chrom, const char *str) { printf("\t\"%s\" ->
\"%s:%"PRIu32"-%"PRIu32"\" [color=red];\n", str, chrom, e->start, e->end); if(e->left) printGTFvineR(e, chrom); } void printGTFvine(GTFentry *e, const char* chrom) { if(!e->right) return; printf("\t\"%s:%"PRIu32"-%"PRIu32"\" -> \"%s:%"PRIu32"-%"PRIu32"\";\n", chrom, e->start, e->end, chrom, e->right->start, e->right->end); printGTFvine(e->right, chrom); } void printGTFvineStart(GTFentry *e, const char *chrom, const char *str) { printf("\t\"%s\" -> \"%s:%"PRIu32"-%"PRIu32"\";\n", str, chrom, e->start, e->end); if(e->right) printGTFvine(e, chrom); } void printGTFtree(GTFtree *t) { int32_t i; const char *chromName; if(t->balanced) printf("digraph balancedTree {\n"); else printf("digraph unbalancedTree {\n"); for(i=0; i<t->n_targets; i++) { chromName = val2strHT(t->htChroms, i); if(t->balanced) { printBalancedGTF((GTFnode*) t->chroms[i]->tree, chromName); } else { printGTFvineStart((GTFentry*) t->chroms[i]->tree, chromName, chromName); } } printf("}\n"); } deeptools_intervals-0.1.9/deeptoolsintervals/tree/gtf.h000066400000000000000000000125101352261167100234270ustar00rootroot00000000000000#include <inttypes.h> #include "kstring.h" /***************** * Strand macros * *****************/ #define GTF_IGNORE_STRAND 0 #define GTF_SAME_STRAND 1 #define GTF_OPPOSITE_STRAND 2 #define GTF_EXACT_SAME_STRAND 3 /*********************** * Overlap type macros * ***********************/ #define GTF_MATCH_ANY 0 #define GTF_MATCH_EXACT 1 #define GTF_MATCH_CONTAIN 2 #define GTF_MATCH_WITHIN 3 #define GTF_MATCH_START 4 #define GTF_MATCH_END 5 typedef struct { int32_t key; int32_t val; } Attribute; /*! @typedef @abstract Structure for a single GTF line @field chrom Index into the chrom hash table @field source Index into the source hash table @field feature Index into the feature hash table @field start 0-based starting position @field end 1-based end position @field score The score field. A value of DBL_MAX indicates a "." @field strand 0: '+'; 1: '-'; 3: '.' @field frame 0: '0'; 1: '1'; 2: '2'; 3: '.'
@field gene_id Index into the gene_id hash table @field transcript_id Index into the transcript_id hash table @discussion Positions are 0-based half open ([start, end)), like BED files. */ typedef struct GTFentry { int32_t chrom; int32_t source; int32_t feature; uint32_t start; uint32_t end; double score; uint8_t strand:4, frame:4; int32_t gene_id; int32_t transcript_id; uint32_t labelIdx; int nAttributes; Attribute **attrib; struct GTFentry *left, *right; } GTFentry; typedef struct { kstring_t chrom; kstring_t feature; kstring_t source; uint32_t start; uint32_t end; double score; uint8_t strand:4, frame: 4; kstring_t gene; kstring_t transcript; int nAttributes; Attribute **attrib; } GTFline; typedef struct GTFnode { uint32_t center; GTFentry *starts, *ends; struct GTFnode *left, *right; } GTFnode; typedef struct { int32_t chrom; uint32_t n_entries; void **tree; } GTFchrom; typedef struct hashTableElement { int32_t val; struct hashTableElement *next; } hashTableElement; typedef struct { uint64_t l, m; hashTableElement **elements; char **str; } hashTable; typedef struct { int32_t n_targets, m; int balanced; hashTable *htChroms; hashTable *htSources; hashTable *htFeatures; hashTable *htAttributes; GTFchrom **chroms; } GTFtree; typedef struct { int32_t l, m; GTFentry **overlaps; GTFtree *tree; } overlapSet; typedef struct { int32_t l, m; overlapSet **os; } overlapSetList; typedef struct { int32_t l, m; int32_t *IDs; uint32_t *cnts; hashTable *ht; } uniqueSet; //A function that can be applied to all entries in a GTF/BED/etc. file as it's //being processed. The pointer as input is currently a GTFline *. The return //value is 0 (ignore entry) or 1 (keep entry). typedef int (*FILTER_FUNC)(void*); typedef int (*FILTER_ENTRY_FUNC)(GTFtree *, GTFentry *); //A function used to compare to GTFentry items to see if the intersect in some //way (e.g., due to sharing a gene_id). This is used to intersect overlapsets. 
typedef int (*COMPARE_FUNC)(GTFentry *, GTFentry *); //gtf.c GTFtree * initGTFtree(void); void destroyGTFtree(GTFtree *t); void sortGTF(GTFtree *o); void printGTFtree(GTFtree *t); void printGTFvineStart(GTFentry *e, const char *chrom, const char *str); void printGTFvineStartR(GTFentry *e, const char *chrom, const char *str); int addGTFentry(GTFtree *t, char *chrom, uint32_t start, uint32_t end, uint8_t strand, char *transcriptID, uint32_t labelIDX, double score); int addEnrichmententry(GTFtree *t, char *chrom, uint32_t start, uint32_t end, uint8_t strand, double score, char *feature); int hasOverlaps(GTFtree *t, uint32_t *minOverlap); //hashTable.c hashTable *initHT(uint64_t size); void destroyHTelement(hashTableElement *e); void destroyHT(hashTable *ht); int32_t addHTelement(hashTable *ht, char *s); uint64_t hashString(char *s); int strExistsHT(hashTable *ht, char *s); int32_t str2valHT(hashTable *ht, char *s); char *val2strHT(hashTable *ht, int32_t val); int hasAttribute(GTFtree *t, GTFentry *e, char *str); char *getAttribute(GTFtree *t, GTFentry *e, char *str); //NULL if the attribute isn't there //findOverlaps.c //overlapSet functions overlapSet *os_init(GTFtree *t); void os_reset(overlapSet *os); void os_destroy(overlapSet *os); overlapSet *os_grow(overlapSet *os); void os_exclude(overlapSet *os, int i); void os_requireAttributes(overlapSet *os, char **keys, char **vals, int len); void os_requireSource(overlapSet *os, char *val); void os_requireFeature(overlapSet *os, char *val); overlapSet *os_intersect(overlapSet *os1, overlapSet *os2, COMPARE_FUNC f); //overlapSetList functions overlapSetList *osl_init(void); void osl_reset(overlapSetList *osl); void osl_destroy(overlapSetList *osl); void osl_push(overlapSetList *osl, overlapSet *os); void osl_grow(overlapSetList *osl); overlapSet *osl_intersect(overlapSetList *osl, COMPARE_FUNC f); overlapSet *osl_union(overlapSetList *osl); //uniqueSet functions void us_destroy(uniqueSet *us); uint32_t us_cnt(uniqueSet 
*us, int32_t i); char *us_val(uniqueSet *us, int32_t i); //Driver functions overlapSet * findOverlaps(overlapSet *os, GTFtree *t, char *chrom, uint32_t start, uint32_t end, int strand, int matchType, int strandType, int keepOS, FILTER_ENTRY_FUNC ffunc); deeptools_intervals-0.1.9/deeptoolsintervals/tree/hashTable.c000066400000000000000000000075461352261167100245520ustar00rootroot00000000000000#include <stdlib.h> #include <string.h> #include <assert.h> #include <stdint.h> #include "murmur3.h" #include "gtf.h" uint64_t hashString(char *s) { int len = strlen(s); uint64_t hash_val[2]; uint32_t seed = 0xAAAAAAAA; #if UINTPTR_MAX == 0xffffffff MurmurHash3_x86_128((void *) s, len, seed, (void *) &hash_val); #else MurmurHash3_x64_128((void *) s, len, seed, (void *) &hash_val); #endif return hash_val[0]; } hashTable *initHT(uint64_t size) { hashTable *ht = calloc(1, sizeof(hashTable)); assert(ht); ht->elements = calloc(size, sizeof(hashTableElement*)); assert(ht->elements); ht->str = calloc(size, sizeof(char*)); assert(ht->str); ht->m = size; return ht; } void insertHTelement(hashTable *ht, hashTableElement *e, uint64_t hash) { uint64_t i = hash%ht->m; hashTableElement *curr = ht->elements[i]; if(!curr) ht->elements[i] = e; else { while(curr->next) curr = curr->next; curr->next = e; } } static void rehashElement(hashTable *ht, hashTableElement *e) { if(!e) return; hashTableElement *next = e->next; uint64_t hash = hashString(ht->str[e->val]); e->next = NULL; insertHTelement(ht, e, hash); if(next) rehashElement(ht, next); } static void rehashHT(hashTable *ht) { int32_t i; hashTableElement *e; for(i=0; i<ht->l; i++) { if(ht->elements[i]) { e = ht->elements[i]; ht->elements[i] = NULL; rehashElement(ht, e); } } } static void growHT(hashTable *ht) { int i; ht->m = ht->l+1; kroundup32(ht->m); ht->str = realloc(ht->str, ht->m*sizeof(char*)); assert(ht->str); ht->elements = realloc(ht->elements, ht->m*sizeof(hashTableElement*)); for(i=ht->l; i<ht->m; i++) { ht->str[i] = NULL; ht->elements[i] = NULL; } rehashHT(ht); } //don't do this if
the element's already in the table! int32_t addHTelement(hashTable *ht, char *s) { if(!s) return -1; uint64_t hash = hashString(s); int32_t val = ht->l++; if(ht->l >= ht->m) growHT(ht); ht->str[val] = strdup(s); hashTableElement *e = calloc(1, sizeof(hashTableElement)); assert(e); e->val = val; insertHTelement(ht, e, hash); return val; } void destroyHTelement(hashTableElement *e) { hashTableElement *next = e->next; free(e); if(next) destroyHTelement(next); } void destroyHT(hashTable *ht) { int i; for(i=0; i<ht->l; i++) free(ht->str[i]); for(i=0; i<ht->m; i++) { if(ht->elements[i]) destroyHTelement(ht->elements[i]); } free(ht->elements); free(ht->str); free(ht); } int strExistsHT(hashTable *ht, char *s) { if(!s) return 0; uint64_t h = hashString(s); hashTableElement *curr = ht->elements[h%ht->m]; while(curr) { if(strcmp(ht->str[curr->val], s) == 0) return 1; curr = curr->next; } return 0; } //Returns -1 if not present int32_t str2valHT(hashTable *ht, char *s) { if(!s) return -1; uint64_t h = hashString(s); hashTableElement *curr = ht->elements[h%ht->m]; while(curr) { if(strcmp(ht->str[curr->val], s) == 0) return curr->val; curr = curr->next; } return -1; } //Returns NULL on error char *val2strHT(hashTable *ht, int32_t val) { if(val<0) return NULL; if(val>=ht->l) return NULL; return ht->str[val]; } int hasAttribute(GTFtree *t, GTFentry *e, char *str) { int32_t i, key = str2valHT(t->htAttributes, str); for(i=0; i<e->nAttributes; i++) { if(e->attrib[i]->key == key) return 1; } return 0; } //Returns NULL if the entry lacks the attribute char *getAttribute(GTFtree *t, GTFentry *e, char *str) { int32_t i, key = str2valHT(t->htAttributes, str); for(i=0; i<e->nAttributes; i++) { if(e->attrib[i]->key == key) return val2strHT(t->htAttributes, e->attrib[i]->val); } return NULL; } deeptools_intervals-0.1.9/deeptoolsintervals/tree/kseq.h000066400000000000000000000213041352261167100236130ustar00rootroot00000000000000/* The MIT License Copyright (c) 2008, 2009, 2011 Attractive Chaos Permission is
hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ /* Last Modified: 05MAR2012 */ #ifndef AC_KSEQ_H #define AC_KSEQ_H #include <ctype.h> #include <string.h> #include <stdlib.h> #define KS_SEP_SPACE 0 // isspace(): \t, \n, \v, \f, \r #define KS_SEP_TAB 1 // isspace() && !'
' #define KS_SEP_LINE 2 // line separator: "\n" (Unix) or "\r\n" (Windows) #define KS_SEP_MAX 2 #define __KS_TYPE(type_t) \ typedef struct __kstream_t { \ int begin, end; \ int is_eof:2, bufsize:30; \ uint64_t seek_pos; \ type_t f; \ unsigned char *buf; \ } kstream_t; #define ks_eof(ks) ((ks)->is_eof && (ks)->begin >= (ks)->end) #define ks_rewind(ks) ((ks)->is_eof = (ks)->begin = (ks)->end = 0) #define __KS_BASIC(SCOPE, type_t, __bufsize) \ SCOPE kstream_t *ks_init(type_t f) \ { \ kstream_t *ks = (kstream_t*)calloc(1, sizeof(kstream_t)); \ ks->f = f; ks->bufsize = __bufsize; \ ks->buf = (unsigned char*)malloc(__bufsize); \ return ks; \ } \ SCOPE void ks_destroy(kstream_t *ks) \ { \ if (!ks) return; \ free(ks->buf); \ free(ks); \ } #define __KS_INLINED(__read) \ static inline int ks_getc(kstream_t *ks) \ { \ if (ks->is_eof && ks->begin >= ks->end) return -1; \ if (ks->begin >= ks->end) { \ ks->begin = 0; \ ks->end = __read(ks->f, ks->buf, ks->bufsize); \ if (ks->end == 0) { ks->is_eof = 1; return -1; } \ } \ ks->seek_pos++; \ return (int)ks->buf[ks->begin++]; \ } \ static inline int ks_getuntil(kstream_t *ks, int delimiter, kstring_t *str, int *dret) \ { return ks_getuntil2(ks, delimiter, str, dret, 0); } #ifndef KSTRING_T #define KSTRING_T kstring_t typedef struct __kstring_t { size_t l, m; char *s; } kstring_t; #endif #ifndef kroundup32 #define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x)) #endif #define __KS_GETUNTIL(SCOPE, __read) \ SCOPE int ks_getuntil2(kstream_t *ks, int delimiter, kstring_t *str, int *dret, int append) \ { \ int gotany = 0; \ if (dret) *dret = 0; \ str->l = append? 
str->l : 0; \ uint64_t seek_pos = str->l; \ for (;;) { \ int i; \ if (ks->begin >= ks->end) { \ if (!ks->is_eof) { \ ks->begin = 0; \ ks->end = __read(ks->f, ks->buf, ks->bufsize); \ if (ks->end == 0) { ks->is_eof = 1; break; } \ } else break; \ } \ if (delimiter == KS_SEP_LINE) { \ for (i = ks->begin; i < ks->end; ++i) \ if (ks->buf[i] == '\n') break; \ } else if (delimiter > KS_SEP_MAX) { \ for (i = ks->begin; i < ks->end; ++i) \ if (ks->buf[i] == delimiter) break; \ } else if (delimiter == KS_SEP_SPACE) { \ for (i = ks->begin; i < ks->end; ++i) \ if (isspace(ks->buf[i])) break; \ } else if (delimiter == KS_SEP_TAB) { \ for (i = ks->begin; i < ks->end; ++i) \ if (isspace(ks->buf[i]) && ks->buf[i] != ' ') break; \ } else i = 0; /* never come to here! */ \ if (str->m - str->l < (size_t)(i - ks->begin + 1)) { \ str->m = str->l + (i - ks->begin) + 1; \ kroundup32(str->m); \ str->s = (char*)realloc(str->s, str->m); \ } \ seek_pos += i - ks->begin; if ( i < ks->end ) seek_pos++; \ gotany = 1; \ memcpy(str->s + str->l, ks->buf + ks->begin, i - ks->begin); \ str->l = str->l + (i - ks->begin); \ ks->begin = i + 1; \ if (i < ks->end) { \ if (dret) *dret = ks->buf[i]; \ break; \ } \ } \ if (!gotany && ks_eof(ks)) return -1; \ ks->seek_pos += seek_pos; \ if (str->s == 0) { \ str->m = 1; \ str->s = (char*)calloc(1, 1); \ } else if (delimiter == KS_SEP_LINE && str->l > 1 && str->s[str->l-1] == '\r') --str->l; \ str->s[str->l] = '\0'; \ return str->l; \ } #define KSTREAM_INIT2(SCOPE, type_t, __read, __bufsize) \ __KS_TYPE(type_t) \ __KS_BASIC(SCOPE, type_t, __bufsize) \ __KS_GETUNTIL(SCOPE, __read) \ __KS_INLINED(__read) #define KSTREAM_INIT(type_t, __read, __bufsize) KSTREAM_INIT2(static, type_t, __read, __bufsize) #define KSTREAM_DECLARE(type_t, __read) \ __KS_TYPE(type_t) \ extern int ks_getuntil2(kstream_t *ks, int delimiter, kstring_t *str, int *dret, int append); \ extern kstream_t *ks_init(type_t f); \ extern void ks_destroy(kstream_t *ks); \ __KS_INLINED(__read) 
/****************** * FASTA/Q parser * ******************/ #define kseq_rewind(ks) ((ks)->last_char = (ks)->f->is_eof = (ks)->f->begin = (ks)->f->end = 0) #define __KSEQ_BASIC(SCOPE, type_t) \ SCOPE kseq_t *kseq_init(type_t fd) \ { \ kseq_t *s = (kseq_t*)calloc(1, sizeof(kseq_t)); \ s->f = ks_init(fd); \ return s; \ } \ SCOPE void kseq_destroy(kseq_t *ks) \ { \ if (!ks) return; \ free(ks->name.s); free(ks->comment.s); free(ks->seq.s); free(ks->qual.s); \ ks_destroy(ks->f); \ free(ks); \ } /* Return value: >=0 length of the sequence (normal) -1 end-of-file -2 truncated quality string */ #define __KSEQ_READ(SCOPE) \ SCOPE int kseq_read(kseq_t *seq) \ { \ int c; \ kstream_t *ks = seq->f; \ if (seq->last_char == 0) { /* then jump to the next header line */ \ while ((c = ks_getc(ks)) != -1 && c != '>' && c != '@'); \ if (c == -1) return -1; /* end of file */ \ seq->last_char = c; \ } /* else: the first header char has been read in the previous call */ \ seq->comment.l = seq->seq.l = seq->qual.l = 0; /* reset all members */ \ if (ks_getuntil(ks, 0, &seq->name, &c) < 0) return -1; /* normal exit: EOF */ \ if (c != '\n') ks_getuntil(ks, KS_SEP_LINE, &seq->comment, 0); /* read FASTA/Q comment */ \ if (seq->seq.s == 0) { /* we can do this in the loop below, but that is slower */ \ seq->seq.m = 256; \ seq->seq.s = (char*)malloc(seq->seq.m); \ } \ while ((c = ks_getc(ks)) != -1 && c != '>' && c != '+' && c != '@') { \ if (c == '\n') continue; /* skip empty lines */ \ seq->seq.s[seq->seq.l++] = c; /* this is safe: we always have enough space for 1 char */ \ ks_getuntil2(ks, KS_SEP_LINE, &seq->seq, 0, 1); /* read the rest of the line */ \ } \ if (c == '>' || c == '@') seq->last_char = c; /* the first header char has been read */ \ if (seq->seq.l + 1 >= seq->seq.m) { /* seq->seq.s[seq->seq.l] below may be out of boundary */ \ seq->seq.m = seq->seq.l + 2; \ kroundup32(seq->seq.m); /* rounded to the next closest 2^k */ \ seq->seq.s = (char*)realloc(seq->seq.s, seq->seq.m); \ } \ 
seq->seq.s[seq->seq.l] = 0; /* null terminated string */ \ if (c != '+') return seq->seq.l; /* FASTA */ \ if (seq->qual.m < seq->seq.m) { /* allocate memory for qual in case insufficient */ \ seq->qual.m = seq->seq.m; \ seq->qual.s = (char*)realloc(seq->qual.s, seq->qual.m); \ } \ while ((c = ks_getc(ks)) != -1 && c != '\n'); /* skip the rest of '+' line */ \ if (c == -1) return -2; /* error: no quality string */ \ while (ks_getuntil2(ks, KS_SEP_LINE, &seq->qual, 0, 1) >= 0 && seq->qual.l < seq->seq.l); \ seq->last_char = 0; /* we have not come to the next header line */ \ if (seq->seq.l != seq->qual.l) return -2; /* error: qual string is of a different length */ \ return seq->seq.l; \ } #define __KSEQ_TYPE(type_t) \ typedef struct { \ kstring_t name, comment, seq, qual; \ int last_char; \ kstream_t *f; \ } kseq_t; #define KSEQ_INIT2(SCOPE, type_t, __read) \ KSTREAM_INIT(type_t, __read, 16384) \ __KSEQ_TYPE(type_t) \ __KSEQ_BASIC(SCOPE, type_t) \ __KSEQ_READ(SCOPE) #define KSEQ_INIT(type_t, __read) KSEQ_INIT2(static, type_t, __read) #define KSEQ_DECLARE(type_t) \ __KS_TYPE(type_t) \ __KSEQ_TYPE(type_t) \ extern kseq_t *kseq_init(type_t fd); \ void kseq_destroy(kseq_t *ks); \ int kseq_read(kseq_t *seq); #endif deeptools_intervals-0.1.9/deeptoolsintervals/tree/kstring.h000066400000000000000000000156371352261167100243450ustar00rootroot00000000000000/* The MIT License Copyright (C) 2011 by Attractive Chaos Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. 
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. */ #ifndef KSTRING_H #define KSTRING_H #include <stdlib.h> #include <string.h> #include <stdarg.h> #include <stdint.h> #include <stdio.h> #ifndef kroundup32 #define kroundup32(x) (--(x), (x)|=(x)>>1, (x)|=(x)>>2, (x)|=(x)>>4, (x)|=(x)>>8, (x)|=(x)>>16, ++(x)) #endif #if defined __GNUC__ && (__GNUC__ > 2 || (__GNUC__ == 2 && __GNUC_MINOR__ > 4)) #define KS_ATTR_PRINTF(fmt, arg) __attribute__((__format__ (__printf__, fmt, arg))) #else #define KS_ATTR_PRINTF(fmt, arg) #endif /* kstring_t is a simple non-opaque type whose fields are likely to be * used directly by user code (but see also ks_str() and ks_len() below). * A kstring_t object is initialised by either of * kstring_t str = { 0, 0, NULL }; * kstring_t str; ...; str.l = str.m = 0; str.s = NULL; * and either ownership of the underlying buffer should be given away before * the object disappears (see ks_release() below) or the kstring_t should be * destroyed with free(str.s); */ #ifndef KSTRING_T #define KSTRING_T kstring_t typedef struct __kstring_t { size_t l, m; char *s; } kstring_t; #endif typedef struct { uint64_t tab[4]; int sep, finished; const char *p; // end of the current token } ks_tokaux_t; #ifdef __cplusplus extern "C" { #endif int kvsprintf(kstring_t *s, const char *fmt, va_list ap) KS_ATTR_PRINTF(2,0); int ksprintf(kstring_t *s, const char *fmt, ...) 
KS_ATTR_PRINTF(2,3); int ksplit_core(char *s, int delimiter, int *_max, int **_offsets); char *kstrstr(const char *str, const char *pat, int **_prep); char *kstrnstr(const char *str, const char *pat, int n, int **_prep); void *kmemmem(const void *_str, int n, const void *_pat, int m, int **_prep); /* kstrtok() is similar to strtok_r() except that str is not * modified and both str and sep can be NULL. For efficiency, it is * actually recommended to set both to NULL in the subsequent calls * if sep is not changed. */ char *kstrtok(const char *str, const char *sep, ks_tokaux_t *aux); /* kgetline() uses the supplied fgets()-like function to read a "\n"- * or "\r\n"-terminated line from fp. The line read is appended to the * kstring without its terminator and 0 is returned; EOF is returned at * EOF or on error (determined by querying fp, as per fgets()). */ typedef char *kgets_func(char *, int, void *); int kgetline(kstring_t *s, kgets_func *fgets, void *fp); #ifdef __cplusplus } #endif static inline int ks_resize(kstring_t *s, size_t size) { if (s->m < size) { char *tmp; s->m = size; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return -1; } return 0; } static inline char *ks_str(kstring_t *s) { return s->s; } static inline size_t ks_len(kstring_t *s) { return s->l; } // Give ownership of the underlying buffer away to something else (making // that something else responsible for freeing it), leaving the kstring_t // empty and ready to be used again, or ready to go out of scope without // needing free(str.s) to prevent a memory leak. 
static inline char *ks_release(kstring_t *s) { char *ss = s->s; s->l = s->m = 0; s->s = NULL; return ss; } static inline int kputsn(const char *p, int l, kstring_t *s) { if (s->l + l + 1 >= s->m) { char *tmp; s->m = s->l + l + 2; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return EOF; } memcpy(s->s + s->l, p, l); s->l += l; s->s[s->l] = 0; return l; } static inline int kputs(const char *p, kstring_t *s) { return kputsn(p, strlen(p), s); } static inline int kputc(int c, kstring_t *s) { if (s->l + 1 >= s->m) { char *tmp; s->m = s->l + 2; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return EOF; } s->s[s->l++] = c; s->s[s->l] = 0; return c; } static inline int kputc_(int c, kstring_t *s) { if (s->l + 1 > s->m) { char *tmp; s->m = s->l + 1; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return EOF; } s->s[s->l++] = c; return 1; } static inline int kputsn_(const void *p, int l, kstring_t *s) { if (s->l + l > s->m) { char *tmp; s->m = s->l + l; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return EOF; } memcpy(s->s + s->l, p, l); s->l += l; return l; } static inline int kputw(int c, kstring_t *s) { char buf[16]; int i, l = 0; unsigned int x = c; if (c < 0) x = -x; do { buf[l++] = x%10 + '0'; x /= 10; } while (x > 0); if (c < 0) buf[l++] = '-'; if (s->l + l + 1 >= s->m) { char *tmp; s->m = s->l + l + 2; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return EOF; } for (i = l - 1; i >= 0; --i) s->s[s->l++] = buf[i]; s->s[s->l] = 0; return 0; } static inline int kputuw(unsigned c, kstring_t *s) { char buf[16]; int l, i; unsigned x; if (c == 0) return kputc('0', s); for (l = 0, x = c; x > 0; x /= 10) buf[l++] = x%10 + '0'; if (s->l + l + 1 >= s->m) { char *tmp; s->m = s->l + l + 2; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return EOF; } for (i = l - 1; i >= 0; --i) s->s[s->l++] = buf[i]; s->s[s->l] = 
0; return 0; } static inline int kputl(long c, kstring_t *s) { char buf[32]; int i, l = 0; unsigned long x = c; if (c < 0) x = -x; do { buf[l++] = x%10 + '0'; x /= 10; } while (x > 0); if (c < 0) buf[l++] = '-'; if (s->l + l + 1 >= s->m) { char *tmp; s->m = s->l + l + 2; kroundup32(s->m); if ((tmp = (char*)realloc(s->s, s->m))) s->s = tmp; else return EOF; } for (i = l - 1; i >= 0; --i) s->s[s->l++] = buf[i]; s->s[s->l] = 0; return 0; } /* * Returns 's' split by delimiter, with *n being the number of components; * NULL on failure. */ static inline int *ksplit(kstring_t *s, int delimiter, int *n) { int max = 0, *offsets = 0; *n = ksplit_core(s->s, delimiter, &max, &offsets); return offsets; } #endif deeptools_intervals-0.1.9/deeptoolsintervals/tree/murmur3.c000066400000000000000000000165541352261167100242700ustar00rootroot00000000000000//----------------------------------------------------------------------------- // MurmurHash3 was written by Austin Appleby, and is placed in the public // domain. The author hereby disclaims copyright to this source code. // Note - The x86 and x64 versions do _not_ produce the same results, as the // algorithms are optimized for their respective platforms. You can still // compile and run any of them on any platform, but your performance with the // non-native version will be less than optimal. 
#include "murmur3.h" //----------------------------------------------------------------------------- // Platform-specific functions and macros #ifdef __GNUC__ #define FORCE_INLINE __attribute__((always_inline)) inline #else #define FORCE_INLINE inline #endif static FORCE_INLINE uint32_t rotl32 ( uint32_t x, int8_t r ) { return (x << r) | (x >> (32 - r)); } static FORCE_INLINE uint64_t rotl64 ( uint64_t x, int8_t r ) { return (x << r) | (x >> (64 - r)); } #define ROTL32(x,y) rotl32(x,y) #define ROTL64(x,y) rotl64(x,y) #define BIG_CONSTANT(x) (x##LLU) //----------------------------------------------------------------------------- // Block read - if your platform needs to do endian-swapping or can only // handle aligned reads, do the conversion here #define getblock(p, i) (p[i]) //----------------------------------------------------------------------------- // Finalization mix - force all bits of a hash block to avalanche static FORCE_INLINE uint32_t fmix32 ( uint32_t h ) { h ^= h >> 16; h *= 0x85ebca6b; h ^= h >> 13; h *= 0xc2b2ae35; h ^= h >> 16; return h; } //---------- static FORCE_INLINE uint64_t fmix64 ( uint64_t k ) { k ^= k >> 33; k *= BIG_CONSTANT(0xff51afd7ed558ccd); k ^= k >> 33; k *= BIG_CONSTANT(0xc4ceb9fe1a85ec53); k ^= k >> 33; return k; } //----------------------------------------------------------------------------- void MurmurHash3_x86_32 ( const void * key, int len, uint32_t seed, void * out ) { const uint8_t * data = (const uint8_t*)key; const int nblocks = len / 4; int i; uint32_t h1 = seed; uint32_t c1 = 0xcc9e2d51; uint32_t c2 = 0x1b873593; //---------- // body const uint32_t * blocks = (const uint32_t *)(data + nblocks*4); for(i = -nblocks; i; i++) { uint32_t k1 = getblock(blocks,i); k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1; h1 = ROTL32(h1,13); h1 = h1*5+0xe6546b64; } //---------- // tail const uint8_t * tail = (const uint8_t*)(data + nblocks*4); uint32_t k1 = 0; switch(len & 3) { case 3: k1 ^= tail[2] << 16; case 2: k1 ^= tail[1] << 8; 
case 1: k1 ^= tail[0]; k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1; }; //---------- // finalization h1 ^= len; h1 = fmix32(h1); *(uint32_t*)out = h1; } //----------------------------------------------------------------------------- void MurmurHash3_x86_128 ( const void * key, const int len, uint32_t seed, void * out ) { const uint8_t * data = (const uint8_t*)key; const int nblocks = len / 16; int i; uint32_t h1 = seed; uint32_t h2 = seed; uint32_t h3 = seed; uint32_t h4 = seed; uint32_t c1 = 0x239b961b; uint32_t c2 = 0xab0e9789; uint32_t c3 = 0x38b34ae5; uint32_t c4 = 0xa1e38b93; //---------- // body const uint32_t * blocks = (const uint32_t *)(data + nblocks*16); for(i = -nblocks; i; i++) { uint32_t k1 = getblock(blocks,i*4+0); uint32_t k2 = getblock(blocks,i*4+1); uint32_t k3 = getblock(blocks,i*4+2); uint32_t k4 = getblock(blocks,i*4+3); k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1; h1 = ROTL32(h1,19); h1 += h2; h1 = h1*5+0x561ccd1b; k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2; h2 = ROTL32(h2,17); h2 += h3; h2 = h2*5+0x0bcaa747; k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3; h3 = ROTL32(h3,15); h3 += h4; h3 = h3*5+0x96cd1c35; k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4; h4 = ROTL32(h4,13); h4 += h1; h4 = h4*5+0x32ac3b17; } //---------- // tail const uint8_t * tail = (const uint8_t*)(data + nblocks*16); uint32_t k1 = 0; uint32_t k2 = 0; uint32_t k3 = 0; uint32_t k4 = 0; switch(len & 15) { case 15: k4 ^= tail[14] << 16; case 14: k4 ^= tail[13] << 8; case 13: k4 ^= tail[12] << 0; k4 *= c4; k4 = ROTL32(k4,18); k4 *= c1; h4 ^= k4; case 12: k3 ^= tail[11] << 24; case 11: k3 ^= tail[10] << 16; case 10: k3 ^= tail[ 9] << 8; case 9: k3 ^= tail[ 8] << 0; k3 *= c3; k3 = ROTL32(k3,17); k3 *= c4; h3 ^= k3; case 8: k2 ^= tail[ 7] << 24; case 7: k2 ^= tail[ 6] << 16; case 6: k2 ^= tail[ 5] << 8; case 5: k2 ^= tail[ 4] << 0; k2 *= c2; k2 = ROTL32(k2,16); k2 *= c3; h2 ^= k2; case 4: k1 ^= tail[ 3] << 24; case 3: k1 ^= tail[ 2] << 16; case 2: k1 ^= tail[ 1] 
<< 8; case 1: k1 ^= tail[ 0] << 0; k1 *= c1; k1 = ROTL32(k1,15); k1 *= c2; h1 ^= k1; }; //---------- // finalization h1 ^= len; h2 ^= len; h3 ^= len; h4 ^= len; h1 += h2; h1 += h3; h1 += h4; h2 += h1; h3 += h1; h4 += h1; h1 = fmix32(h1); h2 = fmix32(h2); h3 = fmix32(h3); h4 = fmix32(h4); h1 += h2; h1 += h3; h1 += h4; h2 += h1; h3 += h1; h4 += h1; ((uint32_t*)out)[0] = h1; ((uint32_t*)out)[1] = h2; ((uint32_t*)out)[2] = h3; ((uint32_t*)out)[3] = h4; } //----------------------------------------------------------------------------- void MurmurHash3_x64_128 ( const void * key, const int len, const uint32_t seed, void * out ) { const uint8_t * data = (const uint8_t*)key; const int nblocks = len / 16; int i; uint64_t h1 = seed; uint64_t h2 = seed; uint64_t c1 = BIG_CONSTANT(0x87c37b91114253d5); uint64_t c2 = BIG_CONSTANT(0x4cf5ad432745937f); //---------- // body const uint64_t * blocks = (const uint64_t *)(data); for(i = 0; i < nblocks; i++) { uint64_t k1 = getblock(blocks,i*2+0); uint64_t k2 = getblock(blocks,i*2+1); k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1; h1 = ROTL64(h1,27); h1 += h2; h1 = h1*5+0x52dce729; k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2; h2 = ROTL64(h2,31); h2 += h1; h2 = h2*5+0x38495ab5; } //---------- // tail const uint8_t * tail = (const uint8_t*)(data + nblocks*16); uint64_t k1 = 0; uint64_t k2 = 0; switch(len & 15) { case 15: k2 ^= (uint64_t)(tail[14]) << 48; case 14: k2 ^= (uint64_t)(tail[13]) << 40; case 13: k2 ^= (uint64_t)(tail[12]) << 32; case 12: k2 ^= (uint64_t)(tail[11]) << 24; case 11: k2 ^= (uint64_t)(tail[10]) << 16; case 10: k2 ^= (uint64_t)(tail[ 9]) << 8; case 9: k2 ^= (uint64_t)(tail[ 8]) << 0; k2 *= c2; k2 = ROTL64(k2,33); k2 *= c1; h2 ^= k2; case 8: k1 ^= (uint64_t)(tail[ 7]) << 56; case 7: k1 ^= (uint64_t)(tail[ 6]) << 48; case 6: k1 ^= (uint64_t)(tail[ 5]) << 40; case 5: k1 ^= (uint64_t)(tail[ 4]) << 32; case 4: k1 ^= (uint64_t)(tail[ 3]) << 24; case 3: k1 ^= (uint64_t)(tail[ 2]) << 16; case 2: k1 ^= (uint64_t)(tail[ 
1]) << 8; case 1: k1 ^= (uint64_t)(tail[ 0]) << 0; k1 *= c1; k1 = ROTL64(k1,31); k1 *= c2; h1 ^= k1; }; //---------- // finalization h1 ^= len; h2 ^= len; h1 += h2; h2 += h1; h1 = fmix64(h1); h2 = fmix64(h2); h1 += h2; h2 += h1; ((uint64_t*)out)[0] = h1; ((uint64_t*)out)[1] = h2; } //----------------------------------------------------------------------------- deeptools_intervals-0.1.9/deeptoolsintervals/tree/murmur3.h000066400000000000000000000014301352261167100242600ustar00rootroot00000000000000//----------------------------------------------------------------------------- // MurmurHash3 was written by Austin Appleby, and is placed in the // public domain. The author hereby disclaims copyright to this source // code. #ifndef _MURMURHASH3_H_ #define _MURMURHASH3_H_ #include <stdint.h> #ifdef __cplusplus extern "C" { #endif //----------------------------------------------------------------------------- void MurmurHash3_x86_32 (const void *key, int len, uint32_t seed, void *out); void MurmurHash3_x86_128(const void *key, int len, uint32_t seed, void *out); void MurmurHash3_x64_128(const void *key, int len, uint32_t seed, void *out); //----------------------------------------------------------------------------- #ifdef __cplusplus } #endif #endif // _MURMURHASH3_H_ deeptools_intervals-0.1.9/deeptoolsintervals/tree/tree.c000066400000000000000000000267001352261167100236070ustar00rootroot00000000000000#include #include #include #include "tree.h" #include #include static void pyGTFDealloc(pyGTFtree_t *self) { if(self->t) destroyGTFtree(self->t); PyObject_DEL(self); } #if PY_MAJOR_VERSION >= 3 //Return 1 iff obj is a ready unicode type int PyString_Check(PyObject *obj) { if(PyUnicode_Check(obj)) { return PyUnicode_READY(obj)+1; } return 0; } char *PyString_AsString(PyObject *obj) { return PyBytes_AsString(PyUnicode_AsASCIIString(obj)); } PyObject *PyString_FromString(char *s) { return PyUnicode_FromString(s); } #endif //Will return 1 for long or int types currently int 
isNumeric(PyObject *obj) { #if PY_MAJOR_VERSION < 3 if(PyInt_Check(obj)) return 1; #endif return PyLong_Check(obj); } //On error, throws a runtime error, so use PyErr_Occurred() after this uint32_t Numeric2Uint(PyObject *obj) { long l; #if PY_MAJOR_VERSION < 3 if(PyInt_Check(obj)) { return (uint32_t) PyInt_AsLong(obj); } #endif l = PyLong_AsLong(obj); //Check bounds if(l > 0xFFFFFFFF) { PyErr_SetString(PyExc_RuntimeError, "Length out of bounds for a bigWig file!"); return (uint32_t) -1; } return (uint32_t) l; } static PyObject *pyGTFinit(PyObject *self, PyObject *args) { GTFtree *t = NULL; pyGTFtree_t *pt; t = initGTFtree(); if(!t) return NULL; pt = PyObject_New(pyGTFtree_t, &pyGTFtree); if(!pt) goto error; pt->t = t; return (PyObject*) pt; error: if(t) destroyGTFtree(t); PyErr_SetString(PyExc_RuntimeError, "Received an error during tree initialization!"); return NULL; } static PyObject *pyAddEntry(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; char *chrom = NULL, *name = NULL, *sscore = NULL; uint32_t start, end, labelIdx; double score; uint8_t strand; unsigned long lstrand, lstart, lend, llabelIdx; if(!(PyArg_ParseTuple(args, "skkskks", &chrom, &lstart, &lend, &name, &lstrand, &llabelIdx, &sscore))) { PyErr_SetString(PyExc_RuntimeError, "pyAddEntry received an invalid or missing argument!"); return NULL; } //Convert all of the longs if(lstart >= (uint32_t) -1 || lend >= (uint32_t) -1 || lend <= lstart) { PyErr_SetString(PyExc_RuntimeError, "pyAddEntry received invalid bounds!"); return NULL; } start = (uint32_t) lstart; end = (uint32_t) lend; if(lstrand != 0 && lstrand != 1 && lstrand != 3) { PyErr_SetString(PyExc_RuntimeError, "pyAddEntry received an invalid strand!"); return NULL; } strand = (uint8_t) lstrand; if(llabelIdx >= (uint32_t) -1) { PyErr_SetString(PyExc_RuntimeError, "pyAddEntry received an invalid label idx (too large)!"); return NULL; } labelIdx = (uint32_t) llabelIdx; //Handle the score if(strcmp(sscore, ".") == 0) { score = DBL_MAX; } 
else { score = strtod(sscore, NULL); } //Actually add the entry if(addGTFentry(t, chrom, start, end, strand, name, labelIdx, score)) { PyErr_SetString(PyExc_RuntimeError, "pyAddEntry received an error while inserting an entry!"); return NULL; } Py_INCREF(Py_None); return Py_None; } static PyObject *pyAddEnrichmentEntry(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; char *chrom = NULL, *sscore = NULL, *feature = NULL; uint32_t start, end; double score; uint8_t strand; unsigned long lstrand, lstart, lend; if(!(PyArg_ParseTuple(args, "skkkss", &chrom, &lstart, &lend, &lstrand, &sscore, &feature))) { PyErr_SetString(PyExc_RuntimeError, "pyAddEnrichmentEntry received an invalid or missing argument!"); return NULL; } //Convert all of the longs if(lstart >= (uint32_t) -1 || lend >= (uint32_t) -1 || lend <= lstart) { PyErr_SetString(PyExc_RuntimeError, "pyAddEnrichmentEntry received invalid bounds!"); return NULL; } start = (uint32_t) lstart; end = (uint32_t) lend; if(lstrand != 0 && lstrand != 1 && lstrand != 3) { PyErr_SetString(PyExc_RuntimeError, "pyAddEnrichmentEntry received an invalid strand!"); return NULL; } strand = (uint8_t) lstrand; //Handle the score if(strcmp(sscore, ".") == 0) { score = DBL_MAX; } else { score = strtod(sscore, NULL); } //Actually add the entry if(addEnrichmententry(t, chrom, start, end, strand, score, feature)) { PyErr_SetString(PyExc_RuntimeError, "pyAddEnrichmentEntry received an error while inserting an entry!"); return NULL; } Py_INCREF(Py_None); return Py_None; } static PyObject *pyVine2Tree(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; sortGTF(t); Py_INCREF(Py_None); return Py_None; } static PyObject *pyPrintGTFtree(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; printGTFtree(t); Py_INCREF(Py_None); return Py_None; } static PyObject *pyCountEntries(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; uint32_t nEntries = 0; unsigned long lnEntries = 0; int32_t i; PyObject *out = NULL; for(i=0; 
i<t->n_targets; i++) { nEntries += t->chroms[i]->n_entries; } lnEntries = (unsigned long) nEntries; out = PyLong_FromUnsignedLong(lnEntries); return out; } static PyObject *pyIsTree(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; if(t->balanced) Py_RETURN_TRUE; Py_RETURN_FALSE; } static PyObject *pyHasOverlaps(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; int rv; uint32_t minDistance = (uint32_t) -1; unsigned long long ominDistance; PyObject *otuple = NULL, *oval = NULL; rv = hasOverlaps(t, &minDistance); ominDistance = minDistance; // ominDistance should have at least as much space as minDistance otuple = PyTuple_New(2); if(!otuple) { PyErr_SetString(PyExc_RuntimeError, "Could not allocate space for a tuple!\n"); return NULL; } oval = PyLong_FromUnsignedLongLong(ominDistance); if(!oval) { PyErr_SetString(PyExc_RuntimeError, "Could not allocate space for a single integer!\n"); return NULL; } if(rv) { Py_INCREF(Py_True); PyTuple_SET_ITEM(otuple, 0, Py_True); } else { Py_INCREF(Py_False); PyTuple_SET_ITEM(otuple, 0, Py_False); } PyTuple_SetItem(otuple, 1, oval); return otuple; } static PyObject *pyFindOverlaps(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; char *chrom = NULL, *name = NULL, *transcript_id = NULL, strandChar; int32_t i; uint32_t start, end; int strand = 3, strandType = 0, matchType = 0; unsigned long lstrand, lstart, lend, lmatchType, lstrandType, llabelIdx; overlapSet *os = NULL; PyObject *olist = NULL, *otuple = NULL, *includeStrand = Py_False, *oscore = NULL; if(!(PyArg_ParseTuple(args, "skkkkksO", &chrom, &lstart, &lend, &lstrand, &lmatchType, &lstrandType, &transcript_id, &includeStrand))) { PyErr_SetString(PyExc_RuntimeError, "pyFindOverlaps received an invalid or missing argument!"); return NULL; } //I'm assuming that this is never called outside of the module strandType = (int) lstrandType; strand = (int) lstrand; matchType = (int) lmatchType; start = (uint32_t) lstart; end = (uint32_t) lend; os = 
findOverlaps(NULL, t, chrom, start, end, strand, matchType, strandType, 0, NULL); // Did we receive an error? if(!os) { PyErr_SetString(PyExc_RuntimeError, "findOverlaps returned NULL!"); return NULL; } // Convert the overlapSet to a list of tuples olist = PyList_New(os->l); if(!olist) goto error; for(i=0; i<os->l; i++) { // Make the tuple if(includeStrand == Py_True) { otuple = PyTuple_New(6); } else { otuple = PyTuple_New(5); } if(!otuple) goto error; lstart = (unsigned long) os->overlaps[i]->start; lend = (unsigned long) os->overlaps[i]->end; name = getAttribute(t, os->overlaps[i], transcript_id); llabelIdx = (unsigned long) os->overlaps[i]->labelIdx; strandChar = '.'; if(os->overlaps[i]->strand == 0) { strandChar = '+'; } else if(os->overlaps[i]->strand == 1) { strandChar = '-'; } if (os->overlaps[i]->score == DBL_MAX) { oscore = Py_BuildValue("s", "."); } else { oscore = Py_BuildValue("d", os->overlaps[i]->score); } if(!oscore) goto error; if(includeStrand == Py_True) { otuple = Py_BuildValue("(kkskcO)", lstart, lend, name, llabelIdx, strandChar, oscore); } else { otuple = Py_BuildValue("(kkskO)", lstart, lend, name, llabelIdx, oscore); } if(!otuple) goto error; // Add the tuple if(PyList_SetItem(olist, i, otuple)) goto error; otuple = NULL; } os_destroy(os); return olist; error: if(otuple) Py_DECREF(otuple); if(olist) Py_DECREF(olist); PyErr_SetString(PyExc_RuntimeError, "findOverlaps received an error!"); return NULL; } static PyObject *pyFindOverlappingFeatures(pyGTFtree_t *self, PyObject *args) { GTFtree *t = self->t; char *chrom = NULL; int32_t i; uint32_t start, end; int strand = 3, strandType = 0, matchType = 0; unsigned long lstrand, lstart, lend, lmatchType, lstrandType; overlapSet *os = NULL; PyObject *olist = NULL, *ostring = NULL; if(!(PyArg_ParseTuple(args, "skkkkk", &chrom, &lstart, &lend, &lstrand, &lmatchType, &lstrandType))) { PyErr_SetString(PyExc_RuntimeError, "pyFindOverlappingFeatures received an invalid or missing argument!"); return NULL; } //I'm 
assuming that this is never called outside of the module strandType = (int) lstrandType; strand = (int) lstrand; matchType = (int) lmatchType; start = (uint32_t) lstart; end = (uint32_t) lend; os = findOverlaps(NULL, t, chrom, start, end, strand, matchType, strandType, 0, NULL); // Did we receive an error? if(!os) { PyErr_SetString(PyExc_RuntimeError, "findOverlaps returned NULL!"); return NULL; } if(!os->l) { os_destroy(os); Py_INCREF(Py_None); return Py_None; } // Convert the overlapSet to a list of tuples olist = PyList_New(os->l); if(!olist) goto error; for(i=0; i<os->l; i++) { //Make the python string ostring = PyString_FromString(val2strHT(t->htFeatures, os->overlaps[i]->feature)); if(!ostring) goto error; // Add the item if(PyList_SetItem(olist, i, ostring)) goto error; ostring = NULL; } os_destroy(os); return olist; error: if(ostring) Py_DECREF(ostring); if(olist) Py_DECREF(olist); PyErr_SetString(PyExc_RuntimeError, "findOverlappingFeatures received an error!"); return NULL; } #if PY_MAJOR_VERSION >= 3 PyMODINIT_FUNC PyInit_tree(void) { PyObject *res; errno = 0; if(PyType_Ready(&pyGTFtree) < 0) return NULL; res = PyModule_Create(&treemodule); if(!res) return NULL; Py_INCREF(&pyGTFtree); PyModule_AddObject(res, "pyGTFtree", (PyObject *) &pyGTFtree); return res; } #else //Python2 initialization PyMODINIT_FUNC inittree(void) { errno = 0; //Sometimes libpython2.7.so is missing some links... if(PyType_Ready(&pyGTFtree) < 0) return; Py_InitModule3("tree", treeMethods, "A module for handling GTF files for deepTools"); } #endif deeptools_intervals-0.1.9/deeptoolsintervals/tree/tree.h000066400000000000000000000112201352261167100236030ustar00rootroot00000000000000#include #include #include "gtf.h" typedef struct { PyObject_HEAD GTFtree *t; } pyGTFtree_t; /* Remove all asserts and ensure that the new return values are honoured. Profile the code with a test to eliminate unneeded cruft. 
*/

static PyObject *pyGTFinit(PyObject *self, PyObject *args);
static PyObject *pyAddEntry(pyGTFtree_t *self, PyObject *args);
static PyObject *pyAddEnrichmentEntry(pyGTFtree_t *self, PyObject *args);
static PyObject *pyVine2Tree(pyGTFtree_t *self, PyObject *args);
static PyObject *pyPrintGTFtree(pyGTFtree_t *self, PyObject *args);
static PyObject *pyCountEntries(pyGTFtree_t *self, PyObject *args);
static PyObject *pyFindOverlaps(pyGTFtree_t *self, PyObject *args);
static PyObject *pyFindOverlappingFeatures(pyGTFtree_t *self, PyObject *args);
static PyObject *pyIsTree(pyGTFtree_t *self, PyObject *args);
static PyObject *pyHasOverlaps(pyGTFtree_t *self, PyObject *args);
static void pyGTFDealloc(pyGTFtree_t *self);

static PyMethodDef treeMethods[] = {
    {"initTree", (PyCFunction) pyGTFinit, METH_VARARGS,
     "Initialize the tree\n"},
    {"addEntry", (PyCFunction) pyAddEntry, METH_VARARGS,
     "Some documentation for pyAddEntry\n"},
    {"addEnrichmentEntry", (PyCFunction) pyAddEnrichmentEntry, METH_VARARGS,
     "Some documentation for pyAddEnrichmentEntry\n"},
    {"finish", (PyCFunction) pyVine2Tree, METH_VARARGS,
     "This must be called after ALL entries from ALL files have been added.\n"},
    {"printGTFtree", (PyCFunction) pyPrintGTFtree, METH_VARARGS,
     "Prints a text representation in dot format.\n"},
    {"countEntries", (PyCFunction) pyCountEntries, METH_VARARGS,
     "Count the number of entries in a GTFtree\n"},
    {"isTree", (PyCFunction) pyIsTree, METH_VARARGS,
     "Return True if the object is a tree\n"},
    {"hasOverlaps", (PyCFunction) pyHasOverlaps, METH_VARARGS,
     "Returns a tuple with the first value True if ANY of the entries in the tree overlap (ignoring strand) and False otherwise. "
     "The second value in the tuple is the minimum distance between intervals (0 on overlap).\n"},
    {"findOverlaps", (PyCFunction) pyFindOverlaps, METH_VARARGS,
     "Find overlapping intervals\n"},
    {"findOverlappingFeatures", (PyCFunction) pyFindOverlappingFeatures, METH_VARARGS,
     "Find overlapping intervals, returning a list of features\n"},
    {NULL, NULL, 0, NULL}
};

#if PY_MAJOR_VERSION >= 3
struct treemodule_state {
    PyObject *error;
};

#define GETSTATE(m) ((struct treemodule_state*)PyModule_GetState(m))

static PyModuleDef treemodule = {
    PyModuleDef_HEAD_INIT,
    "tree",
    "A python module creating/accessing GTF-based interval trees with associated meta-data",
    -1,
    treeMethods,
    NULL, NULL, NULL, NULL
};
#endif

//Should set tp_dealloc, tp_print, tp_repr, tp_str, tp_members
static PyTypeObject pyGTFtree = {
#if PY_MAJOR_VERSION >= 3
    PyVarObject_HEAD_INIT(NULL, 0)
#else
    PyObject_HEAD_INIT(NULL)
    0,                         /*ob_size*/
#endif
    "pyGTFtree",               /*tp_name*/
    sizeof(pyGTFtree_t),       /*tp_basicsize*/
    0,                         /*tp_itemsize*/
    (destructor)pyGTFDealloc,  /*tp_dealloc*/
    0,                         /*tp_print*/
    0,                         /*tp_getattr*/
    0,                         /*tp_setattr*/
    0,                         /*tp_compare*/
    0,                         /*tp_repr*/
    0,                         /*tp_as_number*/
    0,                         /*tp_as_sequence*/
    0,                         /*tp_as_mapping*/
    0,                         /*tp_hash*/
    0,                         /*tp_call*/
    0,                         /*tp_str*/
    PyObject_GenericGetAttr,   /*tp_getattro*/
    PyObject_GenericSetAttr,   /*tp_setattro*/
    0,                         /*tp_as_buffer*/
#if PY_MAJOR_VERSION >= 3
    Py_TPFLAGS_DEFAULT,        /*tp_flags*/
#else
    Py_TPFLAGS_HAVE_CLASS,     /*tp_flags*/
#endif
    "GTF tree",                /*tp_doc*/
    0,                         /*tp_traverse*/
    0,                         /*tp_clear*/
    0,                         /*tp_richcompare*/
    0,                         /*tp_weaklistoffset*/
    0,                         /*tp_iter*/
    0,                         /*tp_iternext*/
    treeMethods,               /*tp_methods*/
    0,                         /*tp_members*/
    0,                         /*tp_getset*/
    0,                         /*tp_base*/
    0,                         /*tp_dict*/
    0,                         /*tp_descr_get*/
    0,                         /*tp_descr_set*/
    0,                         /*tp_dictoffset*/
    0,                         /*tp_init*/
    0,                         /*tp_alloc*/
    0,                         /*tp_new*/
    0,0,0,0,0,0
};
deeptools_intervals-0.1.9/setup.py000077500000000000000000000046001352261167100173270ustar00rootroot00000000000000
#!/usr/bin/env python
from setuptools import setup, Extension, find_packages
from distutils import sysconfig
import glob
import sys

srcs = [x for x in glob.glob("deeptoolsintervals/tree/*.c")]

libs = ["z"]
# get_config_var() returns None when the variable is unset, whereas
# get_config_vars() always returns a list, so the fallbacks below were unreachable.
bldlibrary = sysconfig.get_config_var('BLDLIBRARY')
if bldlibrary is not None:
    # Note the "-l" prefix!
    for e in bldlibrary.split():
        if e[0:2] == "-l":
            libs.append(e[2:])
elif sys.version_info[0] >= 3 and sys.version_info[1] >= 3:
    libs.append("python%i.%im" % (sys.version_info[0], sys.version_info[1]))
else:
    libs.append("python%i.%i" % (sys.version_info[0], sys.version_info[1]))

additional_libs = [sysconfig.get_config_var("LIBDIR"), sysconfig.get_config_var("LIBPL")]

module1 = Extension('deeptoolsintervals.tree',
                    sources=srcs,
                    libraries=libs,
                    library_dirs=additional_libs,
                    include_dirs=[sysconfig.get_config_var("INCLUDEPY")])

setup(name='deeptoolsintervals',
      version='0.1.9',
      description='A python module creating/accessing GTF-based interval trees with associated meta-data',
      author="Devon P. Ryan",
      author_email="ryan@ie-freiburg.mpg.de",
      url="https://github.com/deeptools/deeptools_intervals",
      keywords=["bioinformatics", "GTF"],
      classifiers=["Development Status :: 5 - Production/Stable",
                   "Environment :: Console",
                   "License :: OSI Approved :: MIT License",
                   "Intended Audience :: Developers",
                   "Programming Language :: C",
                   "Programming Language :: Python",
                   "Programming Language :: Python :: 2",
                   "Programming Language :: Python :: 2.7",
                   "Programming Language :: Python :: 3",
                   "Programming Language :: Python :: 3.3",
                   "Programming Language :: Python :: 3.4",
                   "Programming Language :: Python :: 3.5",
                   "Programming Language :: Python :: 3.6",
                   "Programming Language :: Python :: Implementation :: CPython",
                   "Operating System :: POSIX",
                   "Operating System :: Unix",
                   "Operating System :: MacOS",
                   "Topic :: Scientific/Engineering"],
      packages=find_packages(),
      zip_safe=False,
      include_package_data=True,
      ext_modules=[module1])