pax_global_header00006660000000000000000000000064132405532130014510gustar00rootroot0000000000000052 comment=8237ddb871cf389907fc723d070a592061ed0316 py2bit-0.3.0/000077500000000000000000000000001324055321300127215ustar00rootroot00000000000000py2bit-0.3.0/.gitignore000066400000000000000000000013551324055321300147150ustar00rootroot00000000000000# Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib64/ parts/ sdist/ var/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *,cover # Translations *.mo *.pot # Django stuff: *.log # Sphinx documentation docs/_build/ # PyBuilder target/ *.o #./setup.py sdist creates this MANIFEST *.swp py2bit-0.3.0/.travis.yml000066400000000000000000000002161324055321300150310ustar00rootroot00000000000000language: python python: - "2.6" - "2.7" - "3.3" - "3.4" - "3.5" - "3.6" install: python ./setup.py install script: nosetests -sv py2bit-0.3.0/LICENSE.txt000066400000000000000000000020661324055321300145500ustar00rootroot00000000000000The MIT License (MIT) Copyright (c) 2015 Devon Ryan Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or 
substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. py2bit-0.3.0/MANIFEST.in000066400000000000000000000001721324055321300144570ustar00rootroot00000000000000include LICENSE.txt README.md *.c *.h setup.py setup.cfg include lib2bit/* include py2bitTest/* exclude py2bit.egg-info/* py2bit-0.3.0/README.md000066400000000000000000000112741324055321300142050ustar00rootroot00000000000000[![Build Status](https://travis-ci.org/dpryan79/py2bit.svg?branch=master)](https://travis-ci.org/dpryan79/py2bit) # py2bit A python extension, written in C, for quick access to [2bit](https://genome.ucsc.edu/FAQ/FAQformat.html#format7) files. The extension uses [lib2bit](https://github.com/dpryan79/lib2bit) for file access. Table of Contents ================= * [Installation](#installation) * [Usage](#usage) * [Load the extension](#load-the-extension) * [Open a 2bit file](#open-a-2bit-file) * [Access the list of chromosomes and their lengths](#access-the-list-of-chromosomes-and-their-lengths) * [Print file information](#print-file-information) * [Fetch a sequence](#fetch-a-sequence) * [Fetch per-base statistics](#fetch-per-base-statistics) * [A note on coordinates](#a-note-on-coordinates) # Installation You can install the extension directly from github with: pip install git+https://github.com/dpryan79/py2bit # Usage Basic usage is as follows: ## Load the extension >>> import py2bit ## Open a 2bit file This will work if your working directory is the py2bit source code directory. 
>>> tb = py2bit.open("test/foo.2bit") Note that if you would like to include information about soft-masked bases, you need to specify that explicitly: >>> tb = py2bit.open("test/foo.2bit", True) ## Access the list of chromosomes and their lengths `TwoBit` objects contain a dictionary holding the chromosome/contig lengths, which can be accessed with the `chroms()` method. >>> tb.chroms() {'chr1': 150L, 'chr2': 100L} You can directly access a particular chromosome by specifying its name. >>> tb.chroms('chr1') 150L Under Python 2, the lengths are returned as a "long" integer type, which is why there's an `L` suffix. If you specify a nonexistent chromosome then `None` is returned, so nothing is printed. >>> tb.chroms("foo") >>> ## Print file information The following information about a 2bit file and its contents can be accessed with the `info()` method: * file size, in bytes (`file size`) * number of chromosomes/contigs (`nChroms`) * total sequence length, in bases (`sequence length`) * total number of hard-masked (N) bases (`hard-masked length`) * total number of soft-masked (lower case) bases (`soft-masked length`). Note that `soft-masked length` will only be present if `open("file.2bit", True)` is used, since handling soft-masking increases memory requirements and decreases performance. >>> tb.info() {'file size': 161, 'nChroms': 2, 'sequence length': 250, 'hard-masked length': 150, 'soft-masked length': 8} ## Fetch a sequence The sequence of a full or partial chromosome/contig can be fetched with the `sequence()` method. >>> tb.sequence("chr1") 'NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN' By default, the whole chromosome/contig is returned. A specific range can also be requested. >>> tb.sequence("chr1", 24, 74) 'NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC' The first number is the (0-based) position on the chromosome/contig where the sequence should begin.
The second number is the (1-based) position on the chromosome where the sequence should end. If it was requested during file opening that soft-masking information be stored, then lower case bases may be present. If a nonexistent chromosome/contig is specified then a runtime error occurs. ## Fetch per-base statistics It's often required to compute the percentage of 1 or more bases in a chromosome. This can be done with the `bases()` method. >>> tb.bases("chr1") {'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667} This returns a dictionary with bases as keys and the fraction of the sequence composed of them as values. Note that this will not sum to 1 if there are any hard-masked bases (the chromosome is 2/3 `N` in this case). One can also request this information over a particular region. >>> tb.bases("chr1", 24, 74) {'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12} The start and end position are as with the `sequence()` method described above. If integer counts are preferred, then they can instead be returned. >>> tb.bases("chr1", 24, 74, False) {'A': 6, 'C': 6, 'T': 6, 'G': 6} ## Close a file A `TwoBit` object can be closed with the `close()` method. >>> tb.close() # A note on coordinates 0-based half-open coordinates are used by this python module. So to access the value for the first base on `chr1`, one would specify the starting position as `0` and the end position as `1`. Similarly, bases 100 to 115 would have a start of `99` and an end of `115`. This is simply for the sake of consistency with most other bioinformatics packages. py2bit-0.3.0/lib2bit/000077500000000000000000000000001324055321300142505ustar00rootroot00000000000000py2bit-0.3.0/lib2bit/2bit.c000066400000000000000000000541051324055321300152610ustar00rootroot00000000000000#include #include #include #include #include #include #include #include #include #include "2bit.h" uint64_t twobitTell(TwoBit *tb); /* Read nmemb elements, each of size sz from the current file offset into data. 
Return the number of elements read. On error, the return value is either 0 or less than nmemb */ size_t twobitRead(void *data, size_t sz, size_t nmemb, TwoBit *tb) { if(tb->data) { if(memcpy(data, tb->data + tb->offset, nmemb * sz) == NULL) return 0; tb->offset += nmemb * sz; return nmemb; } else { return fread(data, sz, nmemb, tb->fp); } } /* Seek to a specific position, which is essentially trivial for memmaped stuff Returns: 0 on success, -1 on error */ int twobitSeek(TwoBit *tb, uint64_t offset) { if(offset >= tb->sz) return -1; if(tb->data) { tb->offset = offset; return 0; } else { return fseek(tb->fp, (long) offset, SEEK_SET); } } /* Like ftell, but generalized to handle memmaped files Returns the offset */ uint64_t twobitTell(TwoBit *tb) { if(tb->data) return tb->offset; return (uint64_t) ftell(tb->fp); } /* Given a byte containing 4 bases, return the character representation of the offset'th base */ char byte2base(uint8_t byte, int offset) { int rev = 3 - offset; uint8_t mask = 3 << (2 * rev); int foo = (mask & byte) >> (2 * rev); char bases[4] = "TCAG"; return bases[foo]; } void bytes2bases(char *seq, uint8_t *byte, uint32_t sz, int offset) { uint32_t pos = 0, remainder = 0, i = 0; char bases[4] = "TCAG"; uint8_t foo = byte[0]; // Deal with the first partial byte if(offset != 0) { while(offset < 4) { seq[pos++] = byte2base(foo, offset++); } if(pos >= sz) return; foo = byte[++i]; } // Deal with everything else, with the possible exception of the last fractional byte remainder = (sz - pos) % 4; while(pos < sz - remainder) { foo = byte[i++]; seq[pos + 3] = bases[foo & 3]; foo >>= 2; seq[pos + 2] = bases[foo & 3]; foo >>= 2; seq[pos + 1] = bases[foo & 3]; foo >>= 2; seq[pos] = bases[foo & 3]; foo >>= 2; pos += 4; } // Deal with the last partial byte if(remainder > 0) foo = byte[i]; for(offset=0; offsetidx->nBlockCount[tid]; i++) { blockStart = tb->idx->nBlockStart[tid][i]; blockEnd = blockStart + tb->idx->nBlockSizes[tid][i]; if(blockEnd <= start) continue; 
if(blockStart >= end) break; if(blockStart < start) { blockEnd = (blockEnd < end) ? blockEnd : end; pos = 0; width = blockEnd - start; } else { blockEnd = (blockEnd < end) ? blockEnd : end; pos = blockStart - start; width = blockEnd - blockStart; } width += pos; for(; pos < width; pos++) seq[pos] = 'N'; } } /* Replace uppercase with lower-case letters, if required */ void softMask(char *seq, TwoBit *tb, uint32_t tid, uint32_t start, uint32_t end) { uint32_t i, width, pos = 0; uint32_t blockStart, blockEnd; if(!tb->idx->maskBlockStart) return; for(i=0; i<tb->idx->maskBlockCount[tid]; i++) { blockStart = tb->idx->maskBlockStart[tid][i]; blockEnd = blockStart + tb->idx->maskBlockSizes[tid][i]; if(blockEnd <= start) continue; if(blockStart >= end) break; if(blockStart < start) { blockEnd = (blockEnd < end) ? blockEnd : end; pos = 0; width = blockEnd - start; } else { blockEnd = (blockEnd < end) ? blockEnd : end; pos = blockStart - start; width = blockEnd - blockStart; } width += pos; for(; pos < width; pos++) { if(seq[pos] != 'N') seq[pos] = tolower(seq[pos]); } } } /* This is the worker function for twobitSequence, which mostly does error checking */ char *constructSequence(TwoBit *tb, uint32_t tid, uint32_t start, uint32_t end) { uint32_t sz = end - start + 1; uint32_t blockStart, blockEnd; char *seq = malloc(sz * sizeof(char)); uint8_t *bytes = NULL; int offset; if(!seq) return NULL; //There are 4 bases/byte blockStart = start/4; offset = start % 4; blockEnd = end/4 + ((end % 4) ? 
1 : 0); bytes = malloc(blockEnd - blockStart); if(!bytes) goto error; if(twobitSeek(tb, tb->idx->offset[tid] + blockStart) != 0) goto error; if(twobitRead(bytes, blockEnd - blockStart, 1, tb) != 1) goto error; bytes2bases(seq, bytes, sz - 1, offset); free(bytes); //Null terminate the output seq[sz - 1] = '\0'; //N-mask everything NMask(seq, tb, tid, start, end); //Soft-mask if requested softMask(seq, tb, tid, start, end); return seq; error: if(seq) free(seq); if(bytes) free(bytes); return NULL; } /* Given a chromosome name and an optional range, return the corresponding sequence. The start and end are 0-based half-open, so end-start is the number of bases. If both start and end are 0, then the whole chromosome is used. On error (e.g., a missing chromosome), NULL is returned. */ char *twobitSequence(TwoBit *tb, char *chrom, uint32_t start, uint32_t end) { uint32_t i, tid=0; //Get the chromosome ID for(i=0; i<tb->hdr->nChroms; i++) { if(strcmp(tb->cl->chrom[i], chrom) == 0) { tid = i; break; } } if(tid == 0 && strcmp(tb->cl->chrom[tid], chrom) != 0) return NULL; //Get the start/end if not specified if(start == end && end == 0) { end = tb->idx->size[tid]; } //Sanity check the bounds if(end > tb->idx->size[tid]) return NULL; if(start >= end) return NULL; return constructSequence(tb, tid, start, end); } /* Given a tid and a position, set the various mask variables to an appropriate block of Ns. * If maskIdx is -1, these are set to the first overlapping block (or maskIdx is set to the number of N blocks). * If maskIdx is not -1 then it's incremented and maskStart/maskEnd set appropriately. If the returned interval doesn't overlap the start/end range, then both values will be -1. 
*/ void getMask(TwoBit *tb, uint32_t tid, uint32_t start, uint32_t end, uint32_t *maskIdx, uint32_t *maskStart, uint32_t *maskEnd) { if(*maskIdx == (uint32_t) -1) { for((*maskIdx)=0; (*maskIdx)<tb->idx->nBlockCount[tid]; (*maskIdx)++) { *maskStart = tb->idx->nBlockStart[tid][*maskIdx]; *maskEnd = (*maskStart) + tb->idx->nBlockSizes[tid][*maskIdx]; if(*maskEnd < start) continue; if(*maskEnd >= start) break; } } else if(*maskIdx >= tb->idx->nBlockCount[tid]) { *maskStart = (uint32_t) -1; *maskEnd = (uint32_t) -1; } else { *maskIdx += 1; if(*maskIdx >= tb->idx->nBlockCount[tid]) { *maskStart = (uint32_t) -1; *maskEnd = (uint32_t) -1; } else { *maskStart = tb->idx->nBlockStart[tid][*maskIdx]; *maskEnd = (*maskStart) + tb->idx->nBlockSizes[tid][*maskIdx]; } } //maskStart = maskEnd = -1 if no overlap if(*maskIdx >= tb->idx->nBlockCount[tid] || *maskStart >= end) { *maskStart = (uint32_t) -1; *maskEnd = (uint32_t) -1; } } uint8_t getByteMaskFromOffset(int offset) { switch(offset) { case 0: return (uint8_t) 15; case 1: return (uint8_t) 7; case 2: return (uint8_t) 3; } return 1; } void *twobitBasesWorker(TwoBit *tb, uint32_t tid, uint32_t start, uint32_t end, int fraction) { void *out; uint32_t tmp[4] = {0, 0, 0, 0}, len = end - start + (start % 4), i = 0, j = 0; uint32_t blockStart, blockEnd, maskIdx = (uint32_t) -1, maskStart, maskEnd, foo; uint8_t *bytes = NULL, mask = 0, offset; if(fraction) { out = malloc(4 * sizeof(double)); } else { out = malloc(4 * sizeof(uint32_t)); } if(!out) return NULL; //There are 4 bases/byte blockStart = start/4; offset = start % 4; blockEnd = end/4 + ((end % 4) ? 
1 : 0); bytes = malloc(blockEnd - blockStart); if(!bytes) goto error; //Set the initial mask, reset start/offset so we always deal with full bytes mask = getByteMaskFromOffset(offset); start = 4 * blockStart; offset = 0; if(twobitSeek(tb, tb->idx->offset[tid] + blockStart) != 0) goto error; if(twobitRead(bytes, blockEnd - blockStart, 1, tb) != 1) goto error; //Get the index/start/end of the next N-mask block getMask(tb, tid, start, end, &maskIdx, &maskStart, &maskEnd); while(i < len) { // Check if we need to jump if(maskIdx != -1 && start + i + 4 >= maskStart) { if(start + i >= maskStart || start + i + 4 - offset > maskStart) { //Jump iff the whole byte is inside an N block if(start + i >= maskStart && start + i + 4 - offset < maskEnd) { //iff we're fully in an N block then jump i = maskEnd - start; getMask(tb, tid, i, end, &maskIdx, &maskStart, &maskEnd); offset = (start + i) % 4; j = i / 4; mask = getByteMaskFromOffset(offset); i = 4 * j; //Now that the mask has been set, reset i to byte offsets offset = 0; continue; } //Set the mask, if appropriate foo = 4*j + 4*blockStart; // The smallest position in the byte if(mask & 1 && (foo + 3 >= maskStart && foo + 3 < maskEnd)) mask -= 1; if(mask & 2 && (foo + 2 >= maskStart && foo + 2 < maskEnd)) mask -= 2; if(mask & 4 && (foo + 1 >= maskStart && foo + 1 < maskEnd)) mask -= 4; if(mask & 8 && (foo >= maskStart && foo < maskEnd)) mask -= 8; if(foo + 4 > maskEnd) { getMask(tb, tid, i, end, &maskIdx, &maskStart, &maskEnd); continue; } } } //Ensure that anything after then end is masked if(i+4>=len) { if((mask & 1) && i+3>=len) mask -=1; if((mask & 2) && i+2>=len) mask -=2; if((mask & 4) && i+1>=len) mask -=4; if((mask & 8) && i>=len) mask -=8; } foo = bytes[j++]; //Offset 3 if(mask & 1) { tmp[foo & 3]++; } foo >>= 2; mask >>= 1; //Offset 2 if(mask & 1) { tmp[foo & 3]++; } foo >>= 2; mask >>= 1; //Offset 1 if(mask & 1) { tmp[foo & 3]++; } foo >>= 2; mask >>= 1; //Offset 0 if(mask & 1) { tmp[foo & 3]++; // offset 0 } i += 4; 
mask = 15; } free(bytes); //tmp is in TCAG order, since that's how 2bit is stored. //However, for whatever reason I went with ACTG in the first release... if(fraction) { ((double*) out)[0] = ((double) tmp[2])/((double) len); ((double*) out)[1] = ((double) tmp[1])/((double) len); ((double*) out)[2] = ((double) tmp[0])/((double) len); ((double*) out)[3] = ((double) tmp[3])/((double) len); } else { ((uint32_t*) out)[0] = tmp[2]; ((uint32_t*) out)[1] = tmp[1]; ((uint32_t*) out)[2] = tmp[0]; ((uint32_t*) out)[3] = tmp[3]; } return out; error: if(out) free(out); if(bytes) free(bytes); return NULL; } void *twobitBases(TwoBit *tb, char *chrom, uint32_t start, uint32_t end, int fraction) { uint32_t tid = 0, i; //Get the chromosome ID for(i=0; i<tb->hdr->nChroms; i++) { if(strcmp(tb->cl->chrom[i], chrom) == 0) { tid = i; break; } } if(tid == 0 && strcmp(tb->cl->chrom[tid], chrom) != 0) return NULL; //Get the start/end if not specified if(start == end && end == 0) { end = tb->idx->size[tid]; } //Sanity check the bounds if(end > tb->idx->size[tid]) return NULL; if(start >= end) return NULL; return twobitBasesWorker(tb, tid, start, end, fraction); } /* Given a chromosome, chrom, return its length. 0 is returned if the chromosome isn't present. */ uint32_t twobitChromLen(TwoBit *tb, char *chrom) { uint32_t i; for(i=0; i<tb->hdr->nChroms; i++) { if(strcmp(tb->cl->chrom[i], chrom) == 0) return tb->idx->size[i]; } return 0; } /* Fill in tb->idx. Note that the masked stuff will only be stored if storeMasked == 1, since storing it uses gobs of memory. On error, tb->idx is left as NULL. 
*/ void twobitIndexRead(TwoBit *tb, int storeMasked) { uint32_t i, data[2]; TwoBitMaskedIdx *idx = calloc(1, sizeof(TwoBitMaskedIdx)); //Allocation and error checking if(!idx) return; idx->size = malloc(tb->hdr->nChroms * sizeof(uint32_t)); idx->nBlockCount = calloc(tb->hdr->nChroms, sizeof(uint32_t)); idx->nBlockStart = calloc(tb->hdr->nChroms, sizeof(uint32_t*)); idx->nBlockSizes = calloc(tb->hdr->nChroms, sizeof(uint32_t*)); if(!idx->size) goto error; if(!idx->nBlockCount) goto error; if(!idx->nBlockStart) goto error; if(!idx->nBlockSizes) goto error; idx->maskBlockCount = calloc(tb->hdr->nChroms, sizeof(uint32_t)); if(!idx->maskBlockCount) goto error; if(storeMasked) { idx->maskBlockStart = calloc(tb->hdr->nChroms, sizeof(uint32_t*)); idx->maskBlockSizes = calloc(tb->hdr->nChroms, sizeof(uint32_t*)); if(!idx->maskBlockStart) goto error; if(!idx->maskBlockSizes) goto error; } idx->offset = malloc(tb->hdr->nChroms * sizeof(uint64_t)); if(!idx->offset) goto error; //Read in each chromosome/contig for(i=0; i<tb->hdr->nChroms; i++) { if(twobitSeek(tb, tb->cl->offset[i]) != 0) goto error; if(twobitRead(data, sizeof(uint32_t), 2, tb) != 2) goto error; idx->size[i] = data[0]; idx->nBlockCount[i] = data[1]; //Allocate the nBlock starts/sizes and fill them in idx->nBlockStart[i] = malloc(idx->nBlockCount[i] * sizeof(uint32_t)); idx->nBlockSizes[i] = malloc(idx->nBlockCount[i] * sizeof(uint32_t)); if(!idx->nBlockStart[i]) goto error; if(!idx->nBlockSizes[i]) goto error; if(twobitRead(idx->nBlockStart[i], sizeof(uint32_t), idx->nBlockCount[i], tb) != idx->nBlockCount[i]) goto error; if(twobitRead(idx->nBlockSizes[i], sizeof(uint32_t), idx->nBlockCount[i], tb) != idx->nBlockCount[i]) goto error; //Get the masked block information if(twobitRead(idx->maskBlockCount + i, sizeof(uint32_t), 1, tb) != 1) goto error; //Allocate the maskBlock starts/sizes and fill them in if(storeMasked) { idx->maskBlockStart[i] = malloc(idx->maskBlockCount[i] * sizeof(uint32_t)); idx->maskBlockSizes[i] 
= malloc(idx->maskBlockCount[i] * sizeof(uint32_t)); if(!idx->maskBlockStart[i]) goto error; if(!idx->maskBlockSizes[i]) goto error; if(twobitRead(idx->maskBlockStart[i], sizeof(uint32_t), idx->maskBlockCount[i], tb) != idx->maskBlockCount[i]) goto error; if(twobitRead(idx->maskBlockSizes[i], sizeof(uint32_t), idx->maskBlockCount[i], tb) != idx->maskBlockCount[i]) goto error; } else { if(twobitSeek(tb, twobitTell(tb) + 8 * idx->maskBlockCount[i]) != 0) goto error; } //Reserved if(twobitRead(data, sizeof(uint32_t), 1, tb) != 1) goto error; idx->offset[i] = twobitTell(tb); } tb->idx = idx; return; error: if(idx) { if(idx->size) free(idx->size); if(idx->nBlockCount) free(idx->nBlockCount); if(idx->nBlockStart) { for(i=0; i<tb->hdr->nChroms; i++) { if(idx->nBlockStart[i]) free(idx->nBlockStart[i]); } free(idx->nBlockStart); } if(idx->nBlockSizes) { for(i=0; i<tb->hdr->nChroms; i++) { if(idx->nBlockSizes[i]) free(idx->nBlockSizes[i]); } free(idx->nBlockSizes); } if(idx->maskBlockCount) free(idx->maskBlockCount); if(idx->maskBlockStart) { for(i=0; i<tb->hdr->nChroms; i++) { if(idx->maskBlockStart[i]) free(idx->maskBlockStart[i]); } free(idx->maskBlockStart); } if(idx->maskBlockSizes) { for(i=0; i<tb->hdr->nChroms; i++) { if(idx->maskBlockSizes[i]) free(idx->maskBlockSizes[i]); } free(idx->maskBlockSizes); } if(idx->offset) free(idx->offset); free(idx); } } void twobitIndexDestroy(TwoBit *tb) { uint32_t i; if(tb->idx) { if(tb->idx->size) free(tb->idx->size); if(tb->idx->nBlockCount) free(tb->idx->nBlockCount); if(tb->idx->nBlockStart) { for(i=0; i<tb->hdr->nChroms; i++) { if(tb->idx->nBlockStart[i]) free(tb->idx->nBlockStart[i]); } free(tb->idx->nBlockStart); } if(tb->idx->nBlockSizes) { for(i=0; i<tb->hdr->nChroms; i++) { if(tb->idx->nBlockSizes[i]) free(tb->idx->nBlockSizes[i]); } free(tb->idx->nBlockSizes); } if(tb->idx->maskBlockCount) free(tb->idx->maskBlockCount); if(tb->idx->maskBlockStart) { for(i=0; i<tb->hdr->nChroms; i++) { if(tb->idx->maskBlockStart[i]) 
free(tb->idx->maskBlockStart[i]); } free(tb->idx->maskBlockStart); } if(tb->idx->maskBlockSizes) { for(i=0; i<tb->hdr->nChroms; i++) { if(tb->idx->maskBlockSizes[i]) free(tb->idx->maskBlockSizes[i]); } free(tb->idx->maskBlockSizes); } if(tb->idx->offset) free(tb->idx->offset); free(tb->idx); } } void twobitChromListRead(TwoBit *tb) { uint32_t i; uint8_t byte; char *str = NULL; TwoBitCL *cl = calloc(1, sizeof(TwoBitCL)); //Allocate cl and do error checking if(!cl) goto error; cl->chrom = calloc(tb->hdr->nChroms, sizeof(char*)); cl->offset = malloc(sizeof(uint32_t) * tb->hdr->nChroms); if(!cl->chrom) goto error; if(!cl->offset) goto error; for(i=0; i<tb->hdr->nChroms; i++) { //Get the string size (not null terminated!) if(twobitRead(&byte, 1, 1, tb) != 1) goto error; //Read in the string str = calloc(1 + byte, sizeof(char)); if(!str) goto error; if(twobitRead(str, 1, byte, tb) != byte) goto error; cl->chrom[i] = str; str = NULL; //Read in the offset if(twobitRead(cl->offset + i, sizeof(uint32_t), 1, tb) != 1) goto error; } tb->cl = cl; return; error: if(str) free(str); if(cl) { if(cl->offset) free(cl->offset); if(cl->chrom) { for(i=0; i<tb->hdr->nChroms; i++) { if(cl->chrom[i]) free(cl->chrom[i]); } free(cl->chrom); } free(cl); } } void twobitChromListDestroy(TwoBit *tb) { uint32_t i; if(tb->cl) { if(tb->cl->offset) free(tb->cl->offset); if(tb->cl->chrom) { for(i=0; i<tb->hdr->nChroms; i++) { if(tb->cl->chrom[i]) free(tb->cl->chrom[i]); } free(tb->cl->chrom); } free(tb->cl); } } void twobitHdrRead(TwoBit *tb) { //Read the first 16 bytes uint32_t data[4]; TwoBitHeader *hdr = calloc(1, sizeof(TwoBitHeader)); if(!hdr) return; if(twobitRead(data, 4, 4, tb) != 4) goto error; //Magic hdr->magic = data[0]; if(hdr->magic != 0x1A412743) { fprintf(stderr, "[twobitHdrRead] Received an invalid file magic number (0x%"PRIx32")!\n", hdr->magic); goto error; } //Version hdr->version = data[1]; if(hdr->version != 0) { fprintf(stderr, "[twobitHdrRead] The file version is %"PRIu32" while only version 0 is 
defined!\n", hdr->version); goto error; } //Sequence Count hdr->nChroms = data[2]; if(hdr->nChroms == 0) { fprintf(stderr, "[twobitHdrRead] There are apparently no chromosomes/contigs in this file!\n"); goto error; } tb->hdr = hdr; return; error: if(hdr) free(hdr); } void twobitHdrDestroy(TwoBit *tb) { if(tb->hdr) free(tb->hdr); } void twobitClose(TwoBit *tb) { if(tb) { if(tb->fp) fclose(tb->fp); if(tb->data) munmap(tb->data, tb->sz); twobitChromListDestroy(tb); twobitIndexDestroy(tb); //N.B., this needs to be called last twobitHdrDestroy(tb); free(tb); } } TwoBit* twobitOpen(char *fname, int storeMasked) { int fd; struct stat fs; TwoBit *tb = calloc(1, sizeof(TwoBit)); if(!tb) return NULL; tb->fp = fopen(fname, "rb"); if(!tb->fp) goto error; //Try to memory map the whole thing, since these aren't terribly large //Since we might be multithreading this in python, use shared memory fd = fileno(tb->fp); if(fstat(fd, &fs) == 0) { tb->sz = (uint64_t) fs.st_size; tb->data = mmap(NULL, fs.st_size, PROT_READ, MAP_SHARED, fd, 0); if(tb->data) { if(madvise(tb->data, fs.st_size, MADV_RANDOM) != 0) { munmap(tb->data, fs.st_size); tb->data = NULL; } } } //Attempt to read in the fixed header twobitHdrRead(tb); if(!tb->hdr) goto error; //Read in the chromosome list twobitChromListRead(tb); if(!tb->cl) goto error; //Read in the mask index twobitIndexRead(tb, storeMasked); if(!tb->idx) goto error; return tb; error: twobitClose(tb); return NULL; } py2bit-0.3.0/lib2bit/2bit.h000066400000000000000000000154471324055321300152740ustar00rootroot00000000000000#include #include /*! \mainpage libBigWig * * \section Introduction * * lib2bit is a C-based library for accessing [2bit files](https://genome.ucsc.edu/FAQ/FAQformat.html#format7). At the moment, only reading 2bit files is supported (there are no plans to change this, though if someone wants to submit a pull request...). 
Though it's unlikely to matter, * * This project was motivated by the need for fast access to 2bit files in [deepTools](https://github.com/fidelram/deepTools). Originally, we were using bx-python for this, which had the benefit of being easy to install and pretty quick. However, that wasn't compatible with python3, so we switched to [twobitreader](https://github.com/benjschiller/twobitreader). While it did everything we needed and worked under both python2 and python3, it turned out to have terrible performance (up to a 1000x slowdown in `computeGCBias`). Since we'd like to have our cake and eat it too, I wrote a C library for convenient 2bit access and then [a python wrapper](https://github.com/dpryan79/py2bit) around it to work in python2 and 3. * * \section Installation * * 2bit files are very simple and there are no dependencies. Simply typing `make` should suffice for compilation. To install into a specific path (the default is `/usr/local`): * * make install prefix=/some/where/else * * `lib2bit.so` and `lib2bit.a` will then be in `/some/where/else/lib` and `2bit.h` in `/some/where/else/include`. * * \section Example * * See the `test/` directory for an example of using the library. */ /*! \file 2bit.h * * These are all functions and structures exported in lib2bit. There are a few things that could be more efficiently implemented, but at the moment everything is "fast enough". */ #ifdef __cplusplus extern "C" { #endif /*! * @brief This structure holds the fixed-sized file header (16 bytes, of which 4 are blank). The version should always be 0. In theory, the endianness of the magic number can change (indicating that everything in the file should be swapped). As I've never actually seen this occur in the wild I've not bothered implementing it, though it'd be simple enough to do so. 
*/ typedef struct { uint32_t magic; /** #include #include "py2bit.h" static PyObject *py2bitOpen(PyObject *self, PyObject *args, PyObject *kwds) { char *fname = NULL; PyObject *storeMaskedO = Py_False; pyTwoBit_t *pytb; int storeMasked = 0; TwoBit *tb = NULL; static char *kwd_list[] = {"fname", "storeMasked", NULL}; if(!PyArg_ParseTupleAndKeywords(args, kwds, "s|O", kwd_list, &fname, &storeMaskedO)) goto error; if(storeMaskedO == Py_True) storeMasked = 1; //Open the file tb = twobitOpen(fname, storeMasked); if(!tb) goto error; pytb = PyObject_New(pyTwoBit_t, &pyTwoBit); if(!pytb) goto error; pytb->storeMasked = storeMasked; pytb->tb = tb; return (PyObject*) pytb; error: if(tb) twobitClose(tb); PyErr_SetString(PyExc_RuntimeError, "Received an error during file opening!"); return NULL; } PyObject *py2bitEnter(pyTwoBit_t *self, PyObject *args) { pyTwoBit_t *pytb = self->tb; if(!pytb) { PyErr_SetString(PyExc_RuntimeError, "The 2bit file handle is not open!"); return NULL; } Py_INCREF(self); return (PyObject*) self; } static void py2bitDealloc(pyTwoBit_t *self) { if(self->tb) twobitClose(self->tb); PyObject_DEL(self); } static PyObject *py2bitClose(pyTwoBit_t *self, PyObject *args) { if(self->tb) twobitClose(self->tb); self->tb = NULL; Py_INCREF(Py_None); return Py_None; } //Returns the file size, number of chromosomes/contigs, total sequence length and total masked length static PyObject *py2bitInfo(pyTwoBit_t *self, PyObject *args) { TwoBit *tb = self->tb; PyObject *ret = NULL, *val = NULL; uint32_t i, j, foo; if(!tb) { PyErr_SetString(PyExc_RuntimeError, "The 2bit file handle is not open!"); return NULL; } ret = PyDict_New(); //file size val = PyLong_FromUnsignedLongLong(tb->sz); if(!val) goto error; if(PyDict_SetItemString(ret, "file size", val) == -1) goto error; Py_DECREF(val); //nContigs val = PyLong_FromUnsignedLong(tb->hdr->nChroms); if(!val) goto error; if(PyDict_SetItemString(ret, "nChroms", val) == -1) goto error; Py_DECREF(val); //sequence length foo = 0; 
for(i=0; i<tb->hdr->nChroms; i++) foo += tb->idx->size[i]; val = PyLong_FromUnsignedLong(foo); if(!val) goto error; if(PyDict_SetItemString(ret, "sequence length", val) == -1) goto error; Py_DECREF(val); //hard-masked length foo = 0; for(i=0; i<tb->hdr->nChroms; i++) { for(j=0; j<tb->idx->nBlockCount[i]; j++) { foo += tb->idx->nBlockSizes[i][j]; } } val = PyLong_FromUnsignedLong(foo); if(!val) goto error; if(PyDict_SetItemString(ret, "hard-masked length", val) == -1) goto error; Py_DECREF(val); //soft-masked length if(tb->idx->maskBlockStart) { foo = 0; for(i=0; i<tb->hdr->nChroms; i++) { for(j=0; j<tb->idx->maskBlockCount[i]; j++) { foo += tb->idx->maskBlockSizes[i][j]; } } val = PyLong_FromUnsignedLong(foo); if(!val) goto error; if(PyDict_SetItemString(ret, "soft-masked length", val) == -1) goto error; Py_DECREF(val); } return ret; error: Py_XDECREF(val); Py_XDECREF(ret); PyErr_SetString(PyExc_RuntimeError, "Received an error while gathering information on the 2bit file!"); return NULL; } static PyObject *py2bitChroms(pyTwoBit_t *self, PyObject *args) { PyObject *ret = NULL, *val = NULL; TwoBit *tb = self->tb; char *chrom = NULL; uint32_t i; if(!tb) { PyErr_SetString(PyExc_RuntimeError, "The 2bit file handle is not open!"); return NULL; } if(!(PyArg_ParseTuple(args, "|s", &chrom)) || !chrom) { ret = PyDict_New(); if(!ret) goto error; for(i=0; i<tb->hdr->nChroms; i++) { val = PyLong_FromUnsignedLong(tb->idx->size[i]); if(!val) goto error; if(PyDict_SetItemString(ret, tb->cl->chrom[i], val) == -1) goto error; Py_DECREF(val); } } else { for(i=0; i<tb->hdr->nChroms; i++) { if(strcmp(tb->cl->chrom[i], chrom) == 0) { ret = PyLong_FromUnsignedLong(tb->idx->size[i]); if(!ret) goto error; break; } } } if(!ret) { Py_INCREF(Py_None); ret = Py_None; } return ret; error : Py_XDECREF(val); Py_XDECREF(ret); PyErr_SetString(PyExc_RuntimeError, "Received an error while adding an item to the output dictionary!"); return NULL; } #if PY_MAJOR_VERSION >= 3 PyObject *PyString_FromString(char *seq) { return 
PyUnicode_FromString(seq); } #endif static PyObject *py2bitSequence(pyTwoBit_t *self, PyObject *args, PyObject *kwds) { PyObject *ret = NULL; TwoBit *tb = self->tb; char *seq, *chrom; unsigned long startl = 0, endl = 0; uint32_t start, end, len; static char *kwd_list[] = {"chrom", "start", "end", NULL}; if(!tb) { PyErr_SetString(PyExc_RuntimeError, "The 2bit file handle is not open!"); return NULL; } if(!PyArg_ParseTupleAndKeywords(args, kwds, "s|kk", kwd_list, &chrom, &startl, &endl)) { PyErr_SetString(PyExc_RuntimeError, "You must supply at least a chromosome!"); return NULL; } len = twobitChromLen(tb, chrom); if(len == 0) { PyErr_SetString(PyExc_RuntimeError, "The specified chromosome doesn't exist in the 2bit file!"); return NULL; } if(endl > len) endl = len; end = (uint32_t) endl; if(startl >= endl && startl > 0) { PyErr_SetString(PyExc_RuntimeError, "The start value must be less then the end value (and the end of the chromosome"); return NULL; } start = (uint32_t) startl; seq = twobitSequence(tb, chrom, start, end); if(!seq) { PyErr_SetString(PyExc_RuntimeError, "There was an error while fetching the sequence!"); return NULL; } ret = PyString_FromString(seq); free(seq); if(!ret) { PyErr_SetString(PyExc_RuntimeError, "Received an error while converting the C-level char array to a python string!"); return NULL; } return ret; } static PyObject *py2bitBases(pyTwoBit_t *self, PyObject *args, PyObject *kwds) { PyObject *ret = NULL, *val = NULL; PyObject *fractionO = Py_True; TwoBit *tb = self->tb; char *chrom; void *o = NULL; unsigned long startl = 0, endl = 0; uint32_t start, end, len; static char *kwd_list[] = {"chrom", "start", "end", "fraction", NULL}; int fraction = 1; if(!tb) { PyErr_SetString(PyExc_RuntimeError, "The 2bit file handle is not open!"); return NULL; } if(!PyArg_ParseTupleAndKeywords(args, kwds, "s|kkO", kwd_list, &chrom, &startl, &endl, &fractionO)) { PyErr_SetString(PyExc_RuntimeError, "You must supply at least a chromosome!"); return NULL; } 

    len = twobitChromLen(tb, chrom);
    if(len == 0) {
        PyErr_SetString(PyExc_RuntimeError, "The specified chromosome doesn't exist in the 2bit file!");
        return NULL;
    }
    //Clamp the end to the chromosome length
    if(endl > len) endl = len;
    end = (uint32_t) endl;
    if(startl >= endl && startl > 0) {
        PyErr_SetString(PyExc_RuntimeError, "The start value must be less than the end value (and the end of the chromosome)!");
        return NULL;
    }
    start = (uint32_t) startl;

    if(fractionO == Py_False) fraction = 0;

    //twobitBases returns doubles when fraction is set, otherwise uint32_t counts
    o = twobitBases(tb, chrom, start, end, fraction);
    if(!o) {
        PyErr_SetString(PyExc_RuntimeError, "Received an error while determining the per-base metrics.");
        return NULL;
    }

    ret = PyDict_New();
    if(!ret) goto error;

    //A
    if(fraction) val = PyFloat_FromDouble(((double*)o)[0]);
    else val = PyLong_FromUnsignedLong(((uint32_t*)o)[0]);
    if(!val) goto error;
    if(PyDict_SetItemString(ret, "A", val) == -1) goto error;
    Py_DECREF(val);

    //C
    if(fraction) val = PyFloat_FromDouble(((double*)o)[1]);
    else val = PyLong_FromUnsignedLong(((uint32_t*)o)[1]);
    if(!val) goto error;
    if(PyDict_SetItemString(ret, "C", val) == -1) goto error;
    Py_DECREF(val);

    //T
    if(fraction) val = PyFloat_FromDouble(((double*)o)[2]);
    else val = PyLong_FromUnsignedLong(((uint32_t*)o)[2]);
    if(!val) goto error;
    if(PyDict_SetItemString(ret, "T", val) == -1) goto error;
    Py_DECREF(val);

    //G
    if(fraction) val = PyFloat_FromDouble(((double*)o)[3]);
    else val = PyLong_FromUnsignedLong(((uint32_t*)o)[3]);
    if(!val) goto error;
    if(PyDict_SetItemString(ret, "G", val) == -1) goto error;
    Py_DECREF(val);

    free(o);

    return ret;

error:
    if(o) free(o);
    Py_XDECREF(ret);
    Py_XDECREF(val);
    PyErr_SetString(PyExc_RuntimeError, "Received an error while constructing the output dictionary!");
    return NULL;
}

static PyObject *py2bitHardMaskedBlocks(pyTwoBit_t *self, PyObject *args, PyObject *kwds) {
    PyObject *ret = NULL, *tup = NULL;
    TwoBit *tb = self->tb;
    char *chrom;
    unsigned long startl = 0, endl = 0, totalBlocks = 0, tid;
    uint32_t start, end, len, blockStart, blockEnd, i, j;
    static char *kwd_list[] =
        {"chrom", "start", "end", NULL};

    if(!tb) {
        PyErr_SetString(PyExc_RuntimeError, "The 2bit file handle is not open!");
        return NULL;
    }

    if(!PyArg_ParseTupleAndKeywords(args, kwds, "s|kk", kwd_list, &chrom, &startl, &endl)) {
        PyErr_SetString(PyExc_RuntimeError, "You must supply at least a chromosome!");
        return NULL;
    }

    //Get the chromosome ID
    for(i=0; i<tb->hdr->nChroms; i++) {
        if(strcmp(tb->cl->chrom[i], chrom) == 0) {
            tid = i;
            break;
        }
    }

    len = twobitChromLen(tb, chrom);
    if(len == 0) {
        PyErr_SetString(PyExc_RuntimeError, "The specified chromosome doesn't exist in the 2bit file!");
        return NULL;
    }
    if(endl == 0) endl = len;
    if(endl > len) endl = len;
    end = (uint32_t) endl;
    if(startl >= endl && startl > 0) {
        PyErr_SetString(PyExc_RuntimeError, "The start value must be less than the end value (and the end of the chromosome)!");
        return NULL;
    }
    start = (uint32_t) startl;

    // Count the total number of overlapping N-masked blocks
    for(i=0; i<tb->idx->nBlockCount[tid]; i++) {
        blockStart = tb->idx->nBlockStart[tid][i];
        blockEnd = blockStart + tb->idx->nBlockSizes[tid][i];
        if(blockStart < end && blockEnd > start) totalBlocks++;
    }

    // Form the output
    ret = PyList_New(totalBlocks);
    if(!ret) goto error;
    if(totalBlocks == 0) return ret;
    for(i=0, j=0; i<tb->idx->nBlockCount[tid]; i++) {
        blockStart = tb->idx->nBlockStart[tid][i];
        blockEnd = blockStart + tb->idx->nBlockSizes[tid][i];
        if(blockStart < end && blockEnd > start) {
            tup = Py_BuildValue("(kk)", (unsigned long) blockStart, (unsigned long) blockEnd);
            if(!tup) goto error;
            if(PyList_SetItem(ret, j++, tup)) {
                tup = NULL; //PyList_SetItem steals the reference, even on failure
                goto error;
            }
        }
    }

    return ret;

error:
    Py_XDECREF(ret);
    Py_XDECREF(tup);
    PyErr_SetString(PyExc_RuntimeError, "Received an error while constructing the output list and tuples!");
    return NULL;
}

static PyObject *py2bitSoftMaskedBlocks(pyTwoBit_t *self, PyObject *args, PyObject *kwds) {
    PyObject *ret = NULL, *tup = NULL;
    TwoBit *tb = self->tb;
    char *chrom;
    unsigned long startl = 0, endl = 0, totalBlocks = 0, tid;
    uint32_t start, end, len,
             blockStart, blockEnd, i, j;
    static char *kwd_list[] = {"chrom", "start", "end", NULL};

    if(!tb) {
        PyErr_SetString(PyExc_RuntimeError, "The 2bit file handle is not open!");
        return NULL;
    }

    if(!PyArg_ParseTupleAndKeywords(args, kwds, "s|kk", kwd_list, &chrom, &startl, &endl)) {
        PyErr_SetString(PyExc_RuntimeError, "You must supply at least a chromosome!");
        return NULL;
    }

    //Get the chromosome ID
    for(i=0; i<tb->hdr->nChroms; i++) {
        if(strcmp(tb->cl->chrom[i], chrom) == 0) {
            tid = i;
            break;
        }
    }

    len = twobitChromLen(tb, chrom);
    if(len == 0) {
        PyErr_SetString(PyExc_RuntimeError, "The specified chromosome doesn't exist in the 2bit file!");
        return NULL;
    }
    if(endl == 0) endl = len;
    if(endl > len) endl = len;
    end = (uint32_t) endl;
    if(startl >= endl && startl > 0) {
        PyErr_SetString(PyExc_RuntimeError, "The start value must be less than the end value (and the end of the chromosome)!");
        return NULL;
    }
    start = (uint32_t) startl;

    if(!tb->idx->maskBlockStart) {
        PyErr_SetString(PyExc_RuntimeError, "The file was not opened with storeMasked=True! Consequently, there are no stored soft-masked regions.");
        return NULL;
    }

    // Count the total number of overlapping soft-masked blocks
    for(i=0; i<tb->idx->maskBlockCount[tid]; i++) {
        blockStart = tb->idx->maskBlockStart[tid][i];
        blockEnd = blockStart + tb->idx->maskBlockSizes[tid][i];
        if(blockStart < end && blockEnd > start) totalBlocks++;
    }

    // Form the output
    ret = PyList_New(totalBlocks);
    if(!ret) goto error;
    if(totalBlocks == 0) return ret;
    for(i=0, j=0; i<tb->idx->maskBlockCount[tid]; i++) {
        blockStart = tb->idx->maskBlockStart[tid][i];
        blockEnd = blockStart + tb->idx->maskBlockSizes[tid][i];
        if(blockStart < end && blockEnd > start) {
            tup = Py_BuildValue("(kk)", (unsigned long) blockStart, (unsigned long) blockEnd);
            if(!tup) goto error;
            if(PyList_SetItem(ret, j++, tup)) {
                tup = NULL; //PyList_SetItem steals the reference, even on failure
                goto error;
            }
        }
    }

    return ret;

error:
    Py_XDECREF(ret);
    Py_XDECREF(tup);
    PyErr_SetString(PyExc_RuntimeError, "Received an error while constructing the output list and tuples!");
    return NULL;
}

#if PY_MAJOR_VERSION >= 3
PyMODINIT_FUNC PyInit_py2bit(void) {
    PyObject *res;

    if(PyType_Ready(&pyTwoBit) < 0) return NULL;
    res = PyModule_Create(&py2bitmodule);
    if(!res) return NULL;

    Py_INCREF(&pyTwoBit);
    PyModule_AddObject(res, "py2bit", (PyObject *) &pyTwoBit);
    PyModule_AddStringConstant(res, "__version__", pyTwoBitVersion);

    return res;
}
#else
//Python2 initialization
PyMODINIT_FUNC initpy2bit(void) {
    PyObject *res;

    if(PyType_Ready(&pyTwoBit) < 0) return;
    res = Py_InitModule3("py2bit", tbMethods, "A module for handling 2bit files");

    Py_INCREF(&pyTwoBit);
    PyModule_AddObject(res, "py2bit", (PyObject *) &pyTwoBit);
    PyModule_AddStringConstant(res, "__version__", pyTwoBitVersion);
}
#endif
py2bit-0.3.0/py2bit.h000066400000000000000000000223171324055321300143100ustar00rootroot00000000000000
#include <Python.h>
#include "2bit.h"

#define pyTwoBitVersion "0.3.0"

typedef struct {
    PyObject_HEAD
    TwoBit *tb;
    int storeMasked; //Whether storeMasked was set.
                     //0 = False, 1 = True
} pyTwoBit_t;

static PyObject* py2bitOpen(PyObject *self, PyObject *args, PyObject *kwds);
static PyObject *py2bitEnter(pyTwoBit_t *pybw, PyObject *args);
static PyObject *py2bitInfo(pyTwoBit_t *pybw, PyObject *args);
static PyObject* py2bitClose(pyTwoBit_t *pybw, PyObject *args);
static PyObject* py2bitChroms(pyTwoBit_t *pybw, PyObject *args);
static PyObject *py2bitSequence(pyTwoBit_t *pybw, PyObject *args, PyObject *kwds);
static PyObject *py2bitBases(pyTwoBit_t *pybw, PyObject *args, PyObject *kwds);
static PyObject *py2bitHardMaskedBlocks(pyTwoBit_t *pybw, PyObject *args, PyObject *kwds);
static PyObject *py2bitSoftMaskedBlocks(pyTwoBit_t *pybw, PyObject *args, PyObject *kwds);
static void py2bitDealloc(pyTwoBit_t *pybw);

static PyMethodDef tbMethods[] = {
    {"open", (PyCFunction)py2bitOpen, METH_VARARGS|METH_KEYWORDS,
"Open a 2bit file.\n\
\n\
Returns:\n\
    A TwoBit object on success, otherwise None.\n\
\n\
Arguments:\n\
    file: The name of a 2bit file.\n\
\n\
Optional arguments:\n\
    storeMasked: Whether to store information about soft-masking (default False).\n\
\n\
Note that storing soft-masking information can be memory intensive and doing so\n\
will result in soft-masked bases being lower case if the sequence is fetched\n\
(see the sequence() function).\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"some_file.2bit\")\n\
\n\
To store soft-masking information:\n\
>>> tb = py2bit.open(\"some_file.2bit\", True)"},
    {NULL, NULL, 0, NULL}
};

static PyMethodDef tbObjMethods[] = {
    {"info", (PyCFunction)py2bitInfo, METH_VARARGS,
"Returns a dictionary containing the following key:value pairs:\n\
\n\
  * The file size, in bytes ('file size').\n\
  * The number of chromosomes/contigs ('nChroms').\n\
  * The total sequence length ('sequence length').\n\
  * The total hard-masked length ('hard-masked length').\n\
  * The total soft-masked length, if available ('soft-masked length').\n\
\n\
A base is hard-masked if it is an N and soft-masked if it's lower case. Note that soft-masking is ignored by default (you must specify 'storeMasked=True' when you open the file).\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"some_file.2bit\", True)\n\
>>> tb.info()\n\
{'file size': 160L, 'nChroms': 2L, 'sequence length': 250L, 'hard-masked length': 150L, 'soft-masked length': 8L}\n\
>>> tb.close()\n"},
    {"close", (PyCFunction)py2bitClose, METH_VARARGS,
"Close a 2bit file.\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"some_file.2bit\")\n\
>>> tb.close()\n"},
    {"chroms", (PyCFunction)py2bitChroms, METH_VARARGS,
"Return a chromosome: length dictionary. The order is typically not\n\
alphabetical and the lengths are long (thus the 'L' suffix).\n\
\n\
Optional arguments:\n\
    chrom: An optional chromosome name\n\
\n\
Returns:\n\
    A dictionary of chromosome lengths or, if a chromosome is specified, the\n\
    length of that chromosome (None if it doesn't exist).\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"test/test.2bit\")\n\
>>> tb.chroms()\n\
{'chr1': 150L, 'chr2': 100L}\n\
\n\
Note that you may optionally supply a specific chromosome:\n\
\n\
>>> tb.chroms(\"chr1\")\n\
150L\n\
\n\
If you specify a nonexistent chromosome then None is returned, so no output is\n\
produced:\n\
\n\
>>> tb.chroms(\"foo\")\n\
>>>\n"},
    {"sequence", (PyCFunction)py2bitSequence, METH_VARARGS|METH_KEYWORDS,
"Retrieve the sequence of a chromosome, or subset of it. On error, a runtime\n\
exception is thrown.\n\
\n\
Positional arguments:\n\
    chrom: Chromosome name\n\
\n\
Keyword arguments:\n\
    start: Starting position (0-based)\n\
    end:   Ending position (1-based)\n\
\n\
Returns:\n\
    A string containing the sequence.\n\
\n\
If start and end aren't specified, the entire chromosome is returned. If the\n\
end value is beyond the end of the chromosome then it is adjusted accordingly.\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"test/test.2bit\")\n\
>>> tb.sequence(\"chr1\")\n\
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN\n\
>>> tb.sequence(\"chr1\", 24, 74)\n\
NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC\n\
>>> tb.close()"},
    {"bases", (PyCFunction)py2bitBases, METH_VARARGS|METH_KEYWORDS,
"Retrieve the fraction or number of A, C, T, and Gs in a chromosome or subset\n\
thereof. On error, a runtime exception is thrown.\n\
\n\
Positional arguments:\n\
    chrom: Chromosome name\n\
\n\
Optional keyword arguments:\n\
    start: Starting position (0-based)\n\
    end:   Ending position (1-based)\n\
    fraction: Whether to return fractional or integer values (default 'True',\n\
              so fractional values are returned)\n\
\n\
Returns:\n\
    A dictionary with nucleotide as the key and fraction (or count) as the\n\
    value.\n\
\n\
If start and end aren't specified, the entire chromosome is used. If the\n\
end value is beyond the end of the chromosome then it is adjusted accordingly.\n\
\n\
Note that the fractions will sum to less than 1 if there are hard-masked\n\
bases. Counts may sum to less than the length of the region for the same reason.\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"test/test.2bit\")\n\
>>> tb.bases(\"chr1\")\n\
{'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667}\n\
>>> tb.bases(\"chr1\", 24, 74)\n\
{'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12}\n\
>>> tb.bases(\"chr1\", 24, 74, False)\n\
{'A': 6, 'C': 6, 'T': 6, 'G': 6}\n\
>>> tb.close()"},
    {"hardMaskedBlocks", (PyCFunction)py2bitHardMaskedBlocks, METH_VARARGS|METH_KEYWORDS,
"Retrieve a list of hard-masked blocks on a single chromosome (or range on it).\n\
\n\
Positional arguments:\n\
    chrom: Chromosome name\n\
\n\
Optional keyword arguments:\n\
    start: Starting position (0-based)\n\
    end:   Ending position (1-based)\n\
\n\
Returns:\n\
    A list of tuples, with items start and end.\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"test/test.2bit\")\n\
>>> print(tb.hardMaskedBlocks(\"chr1\"))\n\
[(0, 50), (100, 150)]\n\
>>> print(tb.hardMaskedBlocks(\"chr1\", 75, 100))\n\
[]\n\
>>> print(tb.hardMaskedBlocks(\"chr1\", 75, 101))\n\
[(100, 150)]\n\
>>> tb.close()"},
    {"softMaskedBlocks", (PyCFunction)py2bitSoftMaskedBlocks, METH_VARARGS|METH_KEYWORDS,
"Retrieve a list of soft-masked blocks on a single chromosome (or range on it).\n\
\n\
Positional arguments:\n\
    chrom: Chromosome name\n\
\n\
Optional keyword arguments:\n\
    start: Starting position (0-based)\n\
    end:   Ending position (1-based)\n\
\n\
Returns:\n\
    A list of tuples, with items start and end.\n\
\n\
>>> import py2bit\n\
>>> tb = py2bit.open(\"test/test.2bit\", storeMasked=True)\n\
>>> print(tb.softMaskedBlocks(\"chr1\"))\n\
[(62, 70)]\n\
>>> print(tb.softMaskedBlocks(\"chr1\", 0, 50))\n\
[]\n\
>>> tb.close()"},
    {"__enter__", (PyCFunction) py2bitEnter, METH_NOARGS, NULL},
    {"__exit__", (PyCFunction) py2bitClose, METH_VARARGS, NULL},
    {NULL, NULL, 0, NULL}
};

#if PY_MAJOR_VERSION >= 3
struct py2bitmodule_state {
    PyObject *error;
};

#define GETSTATE(m) ((struct \
                     py2bitmodule_state*)PyModule_GetState(m))

static PyModuleDef py2bitmodule = {
    PyModuleDef_HEAD_INIT,
    "py2bit",
    "A python module for accessing 2bit files",
    -1,
    tbMethods,
    NULL, NULL, NULL, NULL
};
#endif

static PyTypeObject pyTwoBit = {
#if PY_MAJOR_VERSION >= 3
    PyVarObject_HEAD_INIT(NULL, 0)
#else
    PyObject_HEAD_INIT(NULL)
    0,                         /*ob_size*/
#endif
    "py2bit.pyTwoBit",         /*tp_name*/
    sizeof(pyTwoBit_t),        /*tp_basicsize*/
    0,                         /*tp_itemsize*/
    (destructor)py2bitDealloc, /*tp_dealloc*/
    0,                         /*tp_print*/
    0,                         /*tp_getattr*/
    0,                         /*tp_setattr*/
    0,                         /*tp_compare*/
    0,                         /*tp_repr*/
    0,                         /*tp_as_number*/
    0,                         /*tp_as_sequence*/
    0,                         /*tp_as_mapping*/
    0,                         /*tp_hash*/
    0,                         /*tp_call*/
    0,                         /*tp_str*/
    PyObject_GenericGetAttr,   /*tp_getattro*/
    PyObject_GenericSetAttr,   /*tp_setattro*/
    0,                         /*tp_as_buffer*/
#if PY_MAJOR_VERSION >= 3
    Py_TPFLAGS_DEFAULT,        /*tp_flags*/
#else
    Py_TPFLAGS_HAVE_CLASS,     /*tp_flags*/
#endif
    "2bit File",               /*tp_doc*/
    0,                         /*tp_traverse*/
    0,                         /*tp_clear*/
    0,                         /*tp_richcompare*/
    0,                         /*tp_weaklistoffset*/
    0,                         /*tp_iter*/
    0,                         /*tp_iternext*/
    tbObjMethods,              /*tp_methods*/
    0,                         /*tp_members*/
    0,                         /*tp_getset*/
    0,                         /*tp_base*/
    0,                         /*tp_dict*/
    0,                         /*tp_descr_get*/
    0,                         /*tp_descr_set*/
    0,                         /*tp_dictoffset*/
    0,                         /*tp_init*/
    0,                         /*tp_alloc*/
    0,                         /*tp_new*/
    0,0,0,0,0,0
};
py2bit-0.3.0/py2bitTest/000077500000000000000000000000001324055321300147725ustar00rootroot00000000000000
py2bit-0.3.0/py2bitTest/__init__.py000066400000000000000000000000001324055321300170710ustar00rootroot00000000000000
py2bit-0.3.0/py2bitTest/foo.2bit000066400000000000000000000002411324055321300163340ustar00rootroot00000000000000
py2bit-0.3.0/py2bitTest/test.py000066400000000000000000000052051324055321300163250ustar00rootroot00000000000000
import os
import py2bit

class Test():
    fname = os.path.dirname(py2bit.__file__) + "/py2bitTest/foo.2bit"

    def testOpenClose(self):
        tb = py2bit.open(self.fname, True)
        assert(tb is not None)
        tb.close()

    def testChroms(self):
        tb = py2bit.open(self.fname, True)
        chroms = tb.chroms()
        correct = {'chr1': 150, 'chr2': 100}
        for k, v in chroms.items():
            assert(correct[k] == v)
        assert(tb.chroms("chr1") == 150)
        assert(tb.chroms("c") is None)
        tb.close()

    def testInfo(self):
        tb = py2bit.open(self.fname, True)
        correct = {'file size': 161, 'nChroms': 2, 'sequence length': 250, 'hard-masked length': 150, 'soft-masked length': 8}
        check = tb.info()
        assert(len(correct) == len(check))
        for k, v in check.items():
            assert(correct[k] == v)
        tb.close()

    def testSequence(self):
        tb = py2bit.open(self.fname, True)
        assert(tb.sequence("chr1") == "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN")
        assert(tb.sequence("chr1", 0, 1000) == "NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATCGATCGTAGCTAGCTAGCTAGCTGATCNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN")
        assert(tb.sequence("chr1", 24, 74) == "NNNNNNNNNNNNNNNNNNNNNNNNNNACGTACGTACGTagctagctGATC")
        tb.close()

    def testBases(self):
        tb = py2bit.open(self.fname, True)
        assert(tb.bases("chr1") == {'A': 0.08, 'C': 0.08, 'T': 0.08666666666666667, 'G': 0.08666666666666667})
        assert(tb.bases("chr1", 24, 74) == {'A': 0.12, 'C': 0.12, 'T': 0.12, 'G': 0.12})
        assert(tb.bases("chr1", 24, 74, False) == {'A': 6, 'C': 6, 'T': 6, 'G': 6})
        tb.close()

    def testSequenceShort(self):
        tb = py2bit.open(self.fname, True)
        assert(tb.sequence("chr1", 1, 3) == "NN")
        assert(tb.sequence("chr1", 1, 2) == "N")
        tb.close()

    def testHardMaskedBlocks(self):
        tb = py2bit.open(self.fname, True)
        assert(tb.hardMaskedBlocks("chr1") == [(0, 50), (100, 150)])
        assert(tb.hardMaskedBlocks("chr1", 25, 75) == [(0, 50)])
        assert(tb.hardMaskedBlocks("chr1", 75, 100) == [])
        assert(tb.hardMaskedBlocks("chr1", 75, 101) == [(100, 150)])
        assert(tb.hardMaskedBlocks("chr2") == [(50, 100)])
        tb.close()

    def testSoftMaskedBlocks(self):
        tb = py2bit.open(self.fname, storeMasked=True)
        assert(tb.softMaskedBlocks("chr1") == [(62, 70)])
        assert(tb.softMaskedBlocks("chr1", 0,
                                   50) == [])
        tb.close()
py2bit-0.3.0/setup.cfg000066400000000000000000000000501324055321300145350ustar00rootroot00000000000000
[metadata]
description-file = README.md
py2bit-0.3.0/setup.py000077500000000000000000000037551324055321300144500ustar00rootroot00000000000000
#!/usr/bin/env python
from setuptools import setup, Extension, find_packages
from distutils import sysconfig
import subprocess
import glob
import sys

# Compile the bundled lib2bit sources together with the wrapper
srcs = [x for x in glob.glob("lib2bit/*.c")]
srcs.append("py2bit.c")

additional_libs = [sysconfig.get_config_var("LIBDIR"), sysconfig.get_config_var("LIBPL")]

module1 = Extension('py2bit',
                    sources = srcs,
                    library_dirs = additional_libs,
                    include_dirs = ['lib2bit', sysconfig.get_config_var("INCLUDEPY")])

setup(name = 'py2bit',
      version = '0.3.0',
      description = 'A package for accessing 2bit files using lib2bit',
      author = "Devon P. Ryan",
      author_email = "ryan@ie-freiburg.mpg.de",
      url = "https://github.com/deeptools/py2bit",
      license = "MIT",
      download_url = "https://github.com/deeptools/py2bit/tarball/0.3.0",
      keywords = ["bioinformatics", "2bit"],
      classifiers = ["Development Status :: 5 - Production/Stable",
                     "Intended Audience :: Developers",
                     "License :: OSI Approved :: MIT License",
                     "Programming Language :: C",
                     "Programming Language :: Python",
                     "Programming Language :: Python :: 2",
                     "Programming Language :: Python :: 2.6",
                     "Programming Language :: Python :: 2.7",
                     "Programming Language :: Python :: 3",
                     "Programming Language :: Python :: 3.3",
                     "Programming Language :: Python :: 3.4",
                     "Programming Language :: Python :: 3.5",
                     "Programming Language :: Python :: 3.6",
                     "Programming Language :: Python :: Implementation :: CPython",
                     "Operating System :: POSIX",
                     "Operating System :: Unix",
                     "Operating System :: MacOS"],
      packages = find_packages(),
      include_package_data=True,
      ext_modules = [module1])
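
The block-selection rule shared by hardMaskedBlocks() and softMaskedBlocks() (`blockStart < end && blockEnd > start` in py2bit.c) can be checked without building the extension. What follows is only a minimal pure-Python sketch of that rule: `overlapping_blocks` is a hypothetical helper that is not part of py2bit, and the block list is copied from the expectations in py2bitTest/test.py rather than read from a 2bit file. Blocks are assumed to be (start, end) tuples with a 0-based start and an exclusive end.

```python
def overlapping_blocks(blocks, start, end):
    """Return the (block_start, block_end) tuples that overlap [start, end).

    Hypothetical helper for illustration only; it is not part of py2bit.
    """
    # Same test as the C code: the block begins before the query ends
    # and ends after the query begins.
    return [(bs, be) for bs, be in blocks if bs < end and be > start]


# Hard-masked blocks on chr1 of the bundled test file, as asserted in test.py
chr1_blocks = [(0, 50), (100, 150)]

print(overlapping_blocks(chr1_blocks, 75, 100))  # []
print(overlapping_blocks(chr1_blocks, 75, 101))  # [(100, 150)]
print(overlapping_blocks(chr1_blocks, 25, 75))   # [(0, 50)]
```

Because the end coordinate is exclusive, a query ending at 100 does not touch the block that starts at 100, while a query ending at 101 does. This matches the behaviour the test suite expects from the compiled module.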