mirtop-0.4.24/README.md

# mirtop
[Build Status](https://travis-ci.org/miRTop/mirtop#)
[Project Status: Active](http://www.repostatus.org/#active)
[Preprint](https://www.biorxiv.org/content/10.1101/505222v1)
Command line tool to annotate miRNAs and isomiRs with a standard naming convention.
This tool adapts the miRNA GFF3 format agreed on here: https://github.com/miRTop/mirGFF3
Chat
----
[Ask questions, share ideas](https://gitter.im/mirtop/Lobby#)
[Contributors to code](https://gitter.im/mirtop/devel)
Cite
----
http://mirtop.github.io
Contributing
------------
Everybody is welcome to contribute, fork the `devel` branch and start working!
If you are interested in miRNA or small RNA analysis, you can jump into the incubator issue pages to propose ideas, ask questions, or just say hi:
https://github.com/miRTop/incubator/issues
About
-----
Join the team: https://orgmanager.miguelpiedrafita.com/join/15463928
Read more: http://mirtop.github.io
Installation
------------
### Bioconda
`conda install mirtop -c bioconda`
### PIP
`pip install mirtop`
### develop version
The best solution is to install conda to get an independent environment.
```
wget http://repo.continuum.io/miniconda/Miniconda-latest-Linux-x86_64.sh
bash Miniconda-latest-Linux-x86_64.sh -b -p ~/mirtop_env
export PATH=$PATH:~/mirtop_env/bin
conda install -c bioconda pysam pybedtools pandas biopython samtools
git clone http://github.com/miRTop/mirtop
cd mirtop
python setup.py develop
```
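
To check that the installation worked (`mirtop` provides a `--version` flag):
```
mirtop --version
```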
Quick start
-----------
Read the complete command documentation at: https://mirtop.readthedocs.org
```
git clone http://github.com/miRTop/mirtop
cd mirtop/data
mirtop gff --sps hsa --hairpin examples/annotate/hairpin.fa --gtf examples/annotate/hsa.gff3 -o test_out sim_isomir.bam
```
Output
------
The `mirtop gff` command generates the adapted GFF3 format (mirGFF3) that captures miRNA variations. The output is explained [here](https://github.com/miRTop/mirGFF3).
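For illustration only, a mirGFF3 record looks roughly like this (header plus one isomiR line with made-up values; the mirGFF3 repository is the authoritative specification):
```
## mirGFF3. VERSION 1.2
## COLDATA: sample1
hsa-let-7a-1	miRBase21	isomiR	5	23	0	+	.	UID iso-19-I3WKYG; Read TGAGGTAGTAGGTTGTATA; Name hsa-let-7a-5p; Parent hsa-let-7a-1; Variant iso_3p:-3; Cigar 19M; Expression 10; Filter Pass; Hits 1
```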
Contributors
------------
* [Lorena Pantano](https://github.com/lpantano) (Bioinformatic Core, Harvard Chan School, Boston, USA)
* [Shruthi Bhat Bandyadka](https://github.com/sbb25) (Partners Personalized Medicine, Cambridge MA, USA)
* [Iñaki Martínez de Ilarduya](http://www.germanstrias.org/technology-services/high-performance-computing/contact/) (HPC core, IGTP, Badalona, Spain)
* Rafael Alis
* [Victor Barrera](https://github.com/vbarrera) (Bioinformatic Core, Harvard Chan School, Boston, USA)
* [Steffen Möller](https://github.com/smoe) (University of Rostock)
* [Kieran O'Neill](https://github.com/oneillkza)
* Roderic Espin (Universitat Oberta de Barcelona)
Citizens
--------
Here we acknowledge everyone who has contributed to the project in ways other than code development and/or bioinformatic concepts.
Gianvito Urgese,
Jan Oppelt (CEITEC Masaryk University, Brno, Czech Republic),
Thomas Desvignes,
Bastian,
Kieran O'Neill (BC Cancer),
Charles Reid (University of California Davis),
Radhika Khetani (Harvard Chan School of Public Health),
Shannan Ho Sui (Harvard Chan School of Public Health),
Simonas Juzenas (CAU),
Rafael Alis (Catholic University of Valencia),
Aida Arcas (Instituto de Neurociencias (CSIC-UMH)),
Yufei Lin (Harvard University),
Victor Barrera (Harvard Chan School of Public Health),
Marc Halushka (Johns Hopkins University)
mirtop-0.4.24/requirements.txt

pysam
pybedtools
pandas
biopython
pyyaml
six
mirtop-0.4.24/scripts/make_spikeins.py

from __future__ import print_function
import argparse
import os
import random
from collections import defaultdict
import pysam
import mirtop.libs.logger as mylog
import mirtop.libs.do as runner
def _read_fasta(fa, size):
source = dict()
with open(fa) as inh:
for line in inh:
if line.startswith(">"):
name = line.strip().split()[0].replace(">", "")
else:
if len(line.strip()) >= size:
source.update({name: line.strip()[0:size]})
return source
def _update_ends(source):
    """Add one nt at each end of every sequence, cycling through combinations of start/end nucleotides."""
nts = ["A", "T", "C", "G"]
start_idx = 0
end_idx = 0
for name in source:
source[name] = nts[start_idx] + source[name] + nts[end_idx]
if end_idx == 3 and start_idx == 3:
end_idx = -1
start_idx = 0
if end_idx == 3:
start_idx += 1
end_idx = 0
end_idx += 1
return source
def _write_fasta(sequences, filename):
with open(filename, 'w') as outh:
for name in sequences:
if sequences[name]:
print(">%s\n%s" % (name, sequences[name]), file=outh)
return filename
def _parse_hits(sam, source):
uniques = defaultdict(list)
# read sequences and score hits (ignore same sequence)
handle = pysam.Samfile(sam, "rb")
for line in handle:
reference = handle.getrname(line.reference_id)
name = line.query_name
# sequence = line.query_sequence if not line.is_reverse else reverse_complement(line.query_sequence)
if reference == name:
continue
# print([reference, name, line.get_tag("NM")])
distance = line.get_tag("NM")
uniques[name].append(distance)
uniques[reference].append(distance)
    # discard sequences whose closest non-self hit is fewer than 4 edits away
for name in uniques:
if min(uniques[name]) < 4:
if name in source:
source[name] = None
return source
parser = argparse.ArgumentParser()
parser.add_argument("--fa",
help="File with mature sequences.", required=True)
parser.add_argument("-s", "--size", default=22,
help="Size of spike-ins to generate.")
parser.add_argument("-n", "--number", default=16,
help="Number of spike-ins to generate.")
parser.add_argument("-o", "--out", default="spikeins.fa",
help="Name used for output files.")
parser.add_argument("--seed", help="set up seed for reproducibility.",
default=42)
parser.add_argument("--universe", help="Set up universe sequences to avoid duplication.",
default=None)
args = parser.parse_args()
random.seed(args.seed)
mylog.initialize_logger(os.path.dirname(os.path.abspath(args.out)))
logger = mylog.getLogger(__name__)
# Read file to get all sequences longer than size - 2
size = args.size - 2
source = _read_fasta(args.fa, size)
logger.info("%s was read: %s sequences were loaded" % (args.fa, len(source)))
source = _update_ends(source)
logger.info("source updated with extended nts: %s" % source)
# Map all vs all with razers3
modified = _write_fasta(source, os.path.join(os.path.dirname(args.out), "modified.fa"))
sam = os.path.join(os.path.dirname(args.out), "modified.bam")
runner.run(("razers3 -i 75 -rr 80 -f -so 1 -o {output} {target} {query}").format(output=sam, target=modified, query=modified))
uniques = _parse_hits(sam, source)
print(uniques)
if args.universe:
sam = os.path.join(os.path.dirname(args.out), "modified_vs_universe.sam")
runner.run(("razers3 -i 75 -rr 80 -f -o {output} {target} {query}").format(output=sam, target=args.universe, query=modified))
uniques = _parse_hits(sam, uniques)
print(uniques)
# Write uniques to fasta
_write_fasta(uniques, args.out)
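
# Example invocation (hypothetical file names; requires razers3 on the PATH):
#   python make_spikeins.py --fa mature.fa --size 22 --out spikeins.fa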
mirtop-0.4.24/scripts/pyPlotM.py

import matplotlib.pyplot as plt
def makePlots(tsvFileN, pdfFileN, show):
    """Draw one bar plot per miRNA comparing counts across tools and save to PDF."""
#Reading file
with open(tsvFileN, "r") as ins:
lines = []
for line in ins:
lines.append(line.split('\t'))
#Calculating maximum number of plots
maxPlots = 0
for x in range(1, len(lines)):
idc = int(lines[x][0]) + 1
if maxPlots < idc:
maxPlots = idc
# Set up the matplotlib figure
    cols = 3
    rows = (maxPlots + 2) // 3  # ceiling division so every plot fits in the grid
    plt.subplots(rows, cols, figsize=(8, 6), sharex=True)
#Creating array to store the values
array = []
for i in range(0, maxPlots):
arr2 = []
for j in range(0, 5):
arr2.append(0)
array.append(arr2)
#Filling the array with the values
for x in range(1, len(lines)):
idc = int(lines[x][0])
cnt = int(lines[x][3])
typ = lines[x][4].strip()
nam = lines[x][1].strip()
        pos = 3
if typ == 'synthetic':
pos = 2
if typ == 'bcbio':
pos = 0
if typ == 'mirge':
pos = 1
array[idc][pos] = cnt
array[idc][4] = nam
#Plotting the graphs
plt.figure(1)
#plt.xlabel('tool')
#plt.ylabel('Counts')
    for i in range(0, maxPlots):
        # counts for bcbio, mirge and synthetic, in x-axis order
        p = [array[i][0], array[i][1], array[i][2]]
        n = array[i][4]
pcd = rows * 100 + cols * 10 + 1 + i
plt.subplot(pcd)
ax = plt.gca()
ax.set_facecolor('lightgray')
plt.xticks([1,2,3], ('bcbio', 'mirge', 'synthetic'))
plt.yticks([0,10,20,30,40,50])
plt.tick_params(axis='both', which='major', labelsize=8)
plt.tick_params(axis='both', which='minor', labelsize=8)
plt.bar([1,2,3], p, color='gray')
plt.title(n)
for i, v in enumerate(p):
plt.text(i+0.9, 0, str(v), color='black', fontsize='8', fontweight='bold')
plt.subplots_adjust(top=0.92, bottom=0.10, left=0.10, right=0.95, hspace=0.50, wspace=0.35)
plt.savefig(pdfFileN, format="pdf")
if show == 1:
plt.show()
makePlots("../data/examples/plot/example_count.tsv", "kk.pdf", 1)
mirtop-0.4.24/scripts/import_gff3.py

# -*- coding: utf-8 -*-
"""
Function that loads a gff file into a pandas dataframe
"""
import pandas as pd
import numpy as np
def loadfile(filename, verbose=True):
try:
if verbose==True:
print 'Loading', filename
        # obtaining sample names and count from the '## COLDATA' header line
num_header_lines=0
with open(filename) as f:
rowfile=f.readline()
num_header_lines+=1
while True:
if rowfile.startswith('## COLDATA'):
sample_names=rowfile.split()[2].split(',')
break
else:
rowfile=f.readline()
num_header_lines+=1
sample_number = len(sample_names)
if verbose==True:
print '--------------------------------------'
print sample_number,' samples in the file'
print '--------------------------------------'
for elem in sample_names:
print elem
print '--------------------------------------'
# number of columns in gff file
gff_cols = pd.read_table(filename, sep='\t', skiprows=num_header_lines, header=None).columns
        # Acquiring non-attributes data
body_data=pd.read_table(filename, sep='\t', skiprows=num_header_lines, header=None, usecols=gff_cols[0:-1])
body_data.columns = ['SeqID', 'source', 'type', 'start', 'end', 'score', 'strand', 'phase']
        # Acquiring attributes data
atr_data = pd.read_table(filename, sep='\t', skiprows=num_header_lines, header=None, usecols=gff_cols[[-1]])
# Splitting the attributes column
list_atr = []
        # checking attributes present in first row
attr_names=[attr.split()[0] for attr in atr_data.values[0,0].split(';')]
#attributes in the column
#print attr_names
num_attr = len(attr_names) #number of attributes
#expression_colindex=attr_names.index ('Expression') #position of the expression column in the attr column
if verbose==True:
print num_attr,' attributes in the file '
print '--------------------------------------'
for attr in attr_names:
print attr
print '--------------------------------------'
# joining rows of attributes without the descriptor
for row in range(atr_data.shape[0]):
list_atr.append([attr.split()[1] for attr in atr_data.values[row,0].split(';')])
#list_atr.append(atr_data.values[row, 0].split()[1::2])
# appending observations
atr_data = pd.DataFrame(list_atr, columns=attr_names)
        # breaking down the expression column into one column per sample
list_expression=[]
for row in range(atr_data.shape[0]):
list_expression.append(atr_data.loc[row,'Expression'].split(','))
sample_names=['Expression_' +x for x in sample_names]
expression_data = pd.DataFrame(list_expression, columns=sample_names)
atr_data=atr_data.drop('Expression',axis=1) #Remove the expression column
atr_data=atr_data.join(expression_data)
# Joining the body and attributes dataframes
data = body_data.join(atr_data)
        # Expanding the Variant column into one indicator column per variant type
tempframe=data[data.type=='isomiR']
list_variants_present=[]
for row in tempframe.itertuples():
actual_variant=tempframe.loc[row.Index,'Variant'].split(',')
for i in range(len(actual_variant)):
if actual_variant[i] not in list_variants_present:
list_variants_present.append(actual_variant[i])
for var in list_variants_present:
for row in data.itertuples():
try:
index=data.loc[row.Index,'Variant'].split(',').index(var)
except:
index=-1
if index>=0:
#print var, data.loc[row.Index,'Variant']
data.at[row.Index,var]=1
else:
data.at[row.Index,var]=np.nan
return data
except:
print 'Error loading the file'
"""
Function that checks the header, then loads a gff file and validates the content.
Returns the dataframe if the format is ok, False if not.
"""
def load_check_gff3(filename):
try:
Error = False
coldata_found=False
# Checking the format file
# Header and 1st data row
with open(filename) as file:
rowfile=file.readline()
while True:
if rowfile.startswith('##'):
if rowfile.startswith("## COLDATA"):
coldata_found=True
rowfile=file.readline()
else:
data_1=rowfile.split('\t')
break
if coldata_found==False:
print 'No COLDATA, bad header'
return False
#Number of columns without breaking down attributes column
if len(data_1) > 9:
Error=True
print(len(data_1))
            print('Too many columns')
# Cheking the attributes column
attr_names=data_1[-1].split(';')
list_attr=[]
for atr in range(len(attr_names)-1):
list_attr.append(attr_names[atr].split()[0])
possible_attr = ['UID', 'Read', 'Name', 'Parent', 'Variant', 'Cigar', 'Hits', 'Alias', 'Genomic', 'Expression',
'Filter', 'Seed_fam']
for attr in list_attr:
if attr not in possible_attr:
Error=True
print attr,'is not a possible attribute'
break
if Error:
print 'File format error'
return False
# If not format error, loading content
try:
dataframe=loadfile(filename,True)
except:
print 'Error loading file'
return False
print 'Checking content'
for i in range(dataframe.shape[0]):
# Labels in type column
if dataframe.loc[i, 'type'] not in ['ref_miRNA', 'isomiR']:
Error = True
                print 'line', i, 'bad type error'
            # start must be smaller than end
            if dataframe.loc[i, 'start'] >= dataframe.loc[i, 'end']:
                Error = True
                print 'line', i, 'start >= end error'
# Strand + or -
if dataframe.loc[i, 'strand'] not in ['+', '-']:
Error = True
print 'line', i, 'bad strand error'
# Variant checking
possible_variant=['iso_5p','iso_3p','iso_add','iso_snp_seed','iso_snp_central_offset','iso_snp_central',
'iso_central_supp','iso_snp_central_supp','iso_snp']
variant_i=dataframe.loc[i,'Variant'].split(',')
if len(variant_i)==1 and variant_i[0]!='NA':
if variant_i[0].split(':')[0] not in possible_variant:
Error = True
print 'Variant error', variant_i[0].split(':')[0], 'line', i
elif variant_i[0]!='NA':
for var in range(len(variant_i)):
if variant_i[var].split(':')[0] not in possible_variant:
Error = True
print 'Variant error', variant_i[0].split(':')[0], 'line', i
#Checking expression data
expression_cols=[col for col in dataframe.columns if 'Expression_' in col]
for col in expression_cols:
for i in range(dataframe.shape[0]):
if not dataframe.loc[i,col].isdigit():
print dataframe.loc[i,col].isdigit()
print 'Expression count error line',i
Error= True
dataframe[col]=dataframe[col].astype(int) #setting the datatype of counts
dataframe[col]=dataframe[col].replace(0,np.nan) #Setting 0 reads to NaN
if 'Filter' in dataframe.columns:
for i in range(dataframe.shape[0]):
if dataframe.loc[i, 'Filter']!='Pass':
print 'Warning non-pass filter in line',i
if Error:
print 'File format error'
return False
print '--------------------------------------'
print dataframe.dtypes
print '--------------------------------------'
print 'Format ok'
return dataframe
except:
print 'Error checking the file'
return False
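
# Example usage (hypothetical file name):
#   df = load_check_gff3('annotation.gff')
#   if df is not False:
#       print df.shape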
mirtop-0.4.24/scripts/prepare.py

import os
import sys
from collections import defaultdict
from argparse import ArgumentParser
def _read_pri(fn):
pri = dict()
with open(fn) as inh:
for line in inh:
            if line.startswith(">") and line.strip().endswith("pri"):
name = line.strip()[1:-4]
else:
pri[name] = line.strip()
return pri
def _read_bed(fn):
bed = defaultdict(dict)
with open(fn) as inh:
for line in inh:
cols = line.strip().split("\t")
if cols[3].find("pri") > 0:
continue
if cols[3].find("loop") > 0:
continue
if cols[3].find("seed") > 0:
continue
if cols[3].find("motif") > 0:
continue
if cols[3].find("co") > 0:
continue
bed[cols[3].split("_")[0]].update({cols[3]: [int(cols[1]), int(cols[2]), cols[5]]})
return bed
def _download(url, outfn):
if os.path.isfile(outfn):
return outfn
os.system('wget -O %s %s' % (outfn, url))
return outfn
if __name__ == "__main__":
parser = ArgumentParser(description="Prepare files from mirGeneDB to be used with seqbuster")
parser.add_argument("--bed", help="bed file with position of all sequence", required=1)
parser.add_argument("--precursor30", help="file or url with fasta of precursor + 30 nt", required=1)
args = parser.parse_args()
sps = os.path.basename(args.precursor30).split("-")[0]
if os.path.isfile(args.bed):
fnbed = args.bed
else:
fnbed = _download(args.bed, "%s.bed" % sps)
if os.path.isfile(args.precursor30):
fnfa = args.precursor30
else:
fnfa = _download(args.precursor30, "%s.fa" % sps)
fa = _read_pri(fnfa)
bed = _read_bed(fnbed)
OUT = open("%s.miRNA.str" % sps, 'w')
OUTP = open("%s.hairpin.fa" % sps, 'w')
for mir in fa:
if mir in bed:
precursor = bed[mir][mir + "_pre"]
print precursor
mir5p = ""
mir3p = ""
for mature in bed[mir]:
info = bed[mir][mature]
# print info
if mature.endswith("pre"):
continue
if precursor[2] == "-":
start = int(precursor[1]) - int(info[1]) + 31
end = int(precursor[1]) - int(info[0]) + 30
else:
start = int(info[0]) - int(precursor[0]) + 31
end = int(info[1]) - int(precursor[0]) + 30
# print [mature, start, end, fa[mir][start:end]]
if mature.find("5p") > 0:
mir5p = "[%s:%s-%s]" % (mature, start, end)
if mature.find("3p") > 0:
mir3p = "[%s:%s-%s]" % (mature, start, end)
print >>OUT, ">%s (X) %s %s" % (mir, mir5p, mir3p)
print >>OUTP, ">%s\n%s" % (mir, fa[mir])
OUT.close()
OUTP.close()
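
# Example invocation (URLs as used by create_mirgenedb.sh):
#   python prepare.py --bed http://mirgenedb.org:81/static/data/hsa-all.bed \
#       --precursor30 http://mirgenedb.org:81/static/data/hsa-hg38-pri-30-30.fas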
mirtop-0.4.24/scripts/make_unique.py

from __future__ import print_function
import argparse
import os
import random
from collections import defaultdict
import pysam
import mirtop.libs.logger as mylog
import mirtop.libs.do as runner
parser = argparse.ArgumentParser()
parser.add_argument("--fa",
help="File with mature sequences.", required=True)
parser.add_argument("-o", "--out", default="spikeins.fa",
help="Name used for output files.")
parser.add_argument("--seed", help="set up seed for reproducibility.",
default=42)
parser.add_argument("--max_size", help="maximum size allowed in the final output.",
default=25)
args = parser.parse_args()
random.seed(args.seed)
def _sam_to_bam(bam_fn):
    bam_out = "%s.bam" % os.path.splitext(bam_fn)[0]
    cmd = "samtools view -Sbh {bam_fn} -o {bam_out}"
    runner.run(cmd.format(**locals()))
    return bam_out
def _bam_sort(bam_fn):
bam_sort_by_n = os.path.splitext(bam_fn)[0] + "_sort.bam"
runner.run(("samtools sort -n -o {bam_sort_by_n} {bam_fn}").format(
**locals()))
return bam_sort_by_n
def _read_fasta(fa):
source = dict()
with open(fa) as inh:
for line in inh:
if line.startswith(">"):
name = line.strip().split()[0].replace(">", "")
else:
source.update({name: line.strip()})
return source
def _write_fasta(sequences, filename, max_size=25):
    with open(filename, 'w') as outh:
        for name in sequences:
            if sequences[name]:
                if len(sequences[name]) < max_size:
print(">%s\n%s" % (name, sequences[name]), file=outh)
return filename
def _parse_hits(sam, source):
uniques = defaultdict(list)
# bam_fn = _sam_to_bam(sam)
# bam_fn = _bam_sort(bam_fn)
# read sequences and score hits (ignore same sequence)
handle = pysam.Samfile(sam, "rb")
for line in handle:
reference = handle.getrname(line.reference_id)
name = line.query_name
# sequence = line.query_sequence if not line.is_reverse else reverse_complement(line.query_sequence)
if reference == name:
continue
# print([reference, name, line.get_tag("NM")])
distance = line.get_tag("NM")
uniques[name].append(distance)
uniques[reference].append(distance)
    # discard sequences whose closest non-self hit is fewer than 5 edits away
for name in uniques:
if min(uniques[name]) < 5:
if name in source:
source[name] = None
return source
# Map all vs all with razers3
source = _read_fasta(args.fa)
sam = os.path.join(os.path.dirname(args.out), "modified.bam")
runner.run(("razers3 -dr 5 -i 75 -rr 80 -f -so 1 -o {output} {target} {query}").format(output=sam, target=args.fa, query=args.fa))
uniques = _parse_hits(sam, source)
# Write uniques to fasta
_write_fasta(uniques, args.out, args.max_size)
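
# Example invocation (hypothetical file name; requires razers3 on the PATH):
#   python make_unique.py --fa mature.fa -o unique.fa --max_size 25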
mirtop-0.4.24/scripts/create_mirgenedb.sh

#!/bin/bash
# Download mirGeneDB bed/fasta files and prepare seqbuster-style references.
SCRIPT="$1"
function run {
B="http://mirgenedb.org:81/static/data/$1-all.bed"
F="http://mirgenedb.org:81/static/data/$1-$2-pri-30-30.fas"
python $SCRIPT/prepare.py --bed ${B} --precursor30 ${F}
}
run hsa hg38
run mmu mm10
run rno rn6
run cpo cavPor3
run ocu oryCun2
run dno dasNov3
run gga galGal4
run dre danRer10
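
# Usage: bash create_mirgenedb.sh /path/to/mirtop/scripts
# (the argument is the directory containing prepare.py)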
mirtop-0.4.24/scripts/miRNA.simulator.py

from __future__ import print_function
from optparse import OptionParser
import os
import random
import numpy
from mirtop.mirna import fasta
from mirtop.mirna import mapper
from mirtop.mirna import realign
from mirtop.gff import body, header
import mirtop.libs.logger as mylog
logger = mylog.getLogger(__name__)
def write_collapse_fastq(reads, out_fn):
idx = 0
with open(out_fn, 'a') as outh:
for r in reads:
idx += 1
print(">name%s_x%s" % (idx, r[1]), file=outh)
print(r[0], file=outh)
def write_fastq(reads, out_fn):
idx = 0
with open(out_fn, 'a') as outh:
for r in reads:
idx += 1
print("@name_read:%s" % idx, file=outh)
print(r, file=outh)
print("+", file=outh)
print("I" * len(r), file=outh)
def create_read(read, count, adapter="TGGAATTCTCGGGTGCCAAGGAACTC", size=36):
reads = list()
for i in range(0, count):
rest = size - len(read)
part = adapter[:rest]
reads.append(read + part)
return reads
def variation(info, seq):
    """Generate a random isomiR: shifted 5p/3p ends plus optional mutation and 3p additions."""
randS = random.randint(info[0] - 2, info[0] + 2) + 1
randE = random.randint(info[1] - 1, info[1] + 2) + 1
if randS < 1:
randS = 1
if randE > len(seq):
randE = info[1] - 1
randSeq = seq[randS:randE]
t5Lab = ""
t5Lab = seq[randS:info[0]] if randS < info[0] else t5Lab
t5Lab = seq[info[0]:randS].lower() if randS > info[0] else t5Lab
t3Lab = ""
t3Lab = seq[randE:info[1] + 1].lower() if randE < info[1] + 1 else t3Lab
t3Lab = seq[info[1] + 1:randE] if randE > info[1] + 1 else t3Lab
# mutation
isMut = random.randint(0, 10)
mutLab = []
if isMut == 3:
ntMut = random.randint(0, 3)
posMut = random.randint(0, len(randSeq) - 1)
if not randSeq[posMut] == nt[ntMut]:
temp = list(randSeq)
mutLab = [[posMut, nt[ntMut], randSeq[posMut]]]
temp[posMut] = nt[ntMut]
randSeq = "".join(temp)
# addition
isAdd = random.randint(0, 3)
addTag = ""
if isAdd == 2:
posAdd = random.randint(1, 3)
for numadd in range(posAdd):
ntAdd = random.randint(0, 1)
print([randSeq, seq[randS + len(randSeq)]])
if nt[ntAdd] == seq[randS + len(randSeq)]:
ntAdd = 1 if ntAdd == 0 else 0
randSeq += nt[ntAdd]
addTag += nt[ntAdd]
print([randSeq, randE, info[1]])
return [randSeq, randS, t5Lab, t3Lab, mutLab, addTag]
def create_iso(name, mir, seq, numsim, exp):
reads = dict()
full_read = list()
clean_read = list()
seen = set()
for mirna in mir[name]:
info = mir[name][mirna]
for rand in range(int(numsim)):
e = 1
if exp:
trial = random.randint(1, 100)
p = random.randint(1, 50) / 50.0
e = numpy.random.negative_binomial(trial, p, 1)[0]
iso = realign.isomir()
randSeq, iso.start, iso.t5, iso.t3, iso.subs, iso.add = variation(info, seq)
if randSeq in seen:
continue
seen.add(randSeq)
iso.end = iso.start + len(randSeq)
aln = realign.align(randSeq, seq[iso.start:iso.end])
iso.cigar = realign.make_cigar(aln[0], aln[1])
iso.mirna = mirna
query_name = "%s.%s.%s" % (mirna, iso.format_id("."), randSeq)
reads[query_name] = realign.hits()
reads[query_name].set_sequence(randSeq)
reads[query_name].counts = e
reads[query_name].set_precursor(name, iso)
full_read.extend(create_read(randSeq, e))
clean_read.append([randSeq, e])
# print([randSeq, mutLab, addTag, t5Lab, t3Lab, mirSeq])
# data[randSeq] = [exp, iso] # create real object used in code to generate GFF
write_fastq(full_read, full_fq)
write_collapse_fastq(clean_read, clean_fq)
gff = body.create(reads, "miRBase21", "sim1")
return gff
def _write(lines, header, fn):
out_handle = open(fn, 'w')
print(header, file=out_handle)
for m in lines:
for s in sorted(lines[m].keys()):
for hit in lines[m][s]:
print(hit[4], file=out_handle)
out_handle.close()
usagetxt = "usage: %prog --fa precursor.fa --gtf miRNA.gtf -n 10"
parser = OptionParser(usage=usagetxt, version="%prog 1.0")
parser.add_option("--fa",
help="", metavar="FILE")
parser.add_option("--gtf",
help="", metavar="FILE")
parser.add_option("-n", "--num", dest="numsim",
help="")
parser.add_option("-e", "--exp", dest="exp", action="store_true",
help="give expression", default=False)
parser.add_option("-p", "--prefix", help="output name")
parser.add_option("--seed", help="set up seed for reproducibility.", default=None)
(options, args) = parser.parse_args()
if options.seed:
random.seed(options.seed)
full_fq = "%s_full.fq" % options.prefix
clean_fq = "%s_clean.fq" % options.prefix
out_gff = "%s.gff" % options.prefix
if os.path.exists(full_fq):
os.remove(full_fq)
if os.path.exists(clean_fq):
os.remove(clean_fq)
pre = fasta.read_precursor(options.fa, "")
mir = mapper.read_gtf_to_precursor(options.gtf)
nt = ['A', 'T', 'G', 'C']
gffs = dict()
h = header.create(["sampleX"], "miRBase1", "")
for precursor in pre:
seq = pre[precursor]
gffs.update(create_iso(precursor, mir, seq, options.numsim, options.exp))
_write(gffs, h, out_gff)
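
# Example invocation (matching the usage string above; -p sets the output prefix):
#   python miRNA.simulator.py --fa precursor.fa --gtf miRNA.gtf -n 10 -p sim --seed 42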
mirtop-0.4.24/HISTORY.md

0.4.24
* [Fix bad annotation](https://github.com/miRTop/mirtop/issues/64) when 5 or more T/A at the end of the sequence. Thanks @DrHogart.
* Add SQL database creation
0.4.23
* fix empty stats file [#61](https://github.com/miRTop/mirtop/issues/61) by @leontienvdbent
0.4.22
* fix when reads map halfway on to the edge
* fix edge case where limit==variant_size
0.4.21
* Fix trimming events missing since 0.4.19
0.4.20
* Support export isomiR rawData output
* Support genomic coordinates as output in the gff
* Make TOOLS mandatory in header
* Implement method to create gff line
* Improve docs
0.4.19
* Add --version option
* Fix bug that ignored sequences starting at 0 in bam files
0.4.18
* Cast map object to list to avoid errors in py3.
* Support Manatee output.
* Support chunk reading for genomic BAM files.
* Support chunk reading for seqbuster files.
* Support chunk reading for BAM files.
* Normalize functions to support different databases.
* Support miRgeneDB.
* Export to VCF. Thanks to Roderic Espin.
* Support isomiRs that go beyond 5p end
* Support genomic coordinates.
* Fix missing reads when using --keep-read in the final mirtop.gff file.
* Allow longer truncation and addition events.
* Accept seqbuster input without frequency column.
* Allow keep name of the sequence.
* Accept indels in snv category.
* Additions are only last nucleotides that are mismatches.
* Adapt mintplate license.
* Revert sign in iso_5p, replace snp by snv.
* Skip lines that contain malformed UID.
* Add FASTA as an exporter from GFF.
* Fix BAM parsing to new GFF rules.
* Add the possibility to work with spikeins to detect random variability.
* Fixing UID attribute for tools that don't use our cypher system
* Add class to parse GFF line as a first move toward isolation
* Add JSON log for stats command.
0.3.17
* Normalize the read of the tool outputs.
* Add docs with autodoc plugin.
* Validator by @Vbarrera.
* Improve example commands and test coverage.
* Only counts sequences with Filter == Pass during stats.
* Counts cmd add nucleotide information when --add-extra option is on.
* Fix error in stats that open the file in addition mode.
* Importer for sRNAbench just convert lines from input to GFF format.
* Skip lines with non-valid UID or miRNAs not in reference at counts cmd.
* Fix separators in counts cmd.
* Make --sps optional.
* Add synthetic data with known isomiRs to data set.
* Allow extra columns when converting to counts TSV file.
* Allow extra attributes for isomir-sea as well.
* Allow extra attributes to show the nts that change in each isomiR type.
* Fix Expression attribute when joining gff files. Thanks @AlisR.
* Print help when no files are giving to any subcommand.
* Fix bug for duplicated isomiRs tags. Thanks @AlisR.
* Fix bug in order of merged gff file. Thanks @AlisR.
* Add module to read GFF/GTF line in body.py
* Add version line to stats output
* Improve PROST! importer
* Fix output for isomiRs package
0.2.*
* Make GTF default output
* Add function to get SNPs from Variant attribute
* Improve PROST with last version output
* Add isomiR-SEA compatibility
* Fix sRNAbench exact match to NA in GFF
* Change stats to use only 1 level isomiR classification
* Add GFF to count matrix
* Add read_attributes function
* Improve isomiR reading from srnabench tool
* Add PROST to supported tools
0.1.7
* Remove deletion from addition isomiRs
* Support for srnabench output
* Fix bug mixing up source column
* Support Seqbuster output
* Function to guess the database used from the GTF file through the --mirna parameter
* Adapt output format to https://github.com/miRTop/incubator/blob/master/format/definition.md
0.1.5
* add function to check correct annotation
* add test data for SAM parsing
* add script to simulate isomiRs
* parse indels from bam file
0.1.4
* fix index BAM file command line
* add function to accept indels and test unit
* change header from subs -> mism to be compatible with isomiRs
mirtop-0.4.24/setup.py

"""small RNA-seq annotation"""
import os
from setuptools import setup, find_packages
version = '0.4.24'
url = 'http://github.com/mirtop/mirtop'
def readme():
with open('README.md') as f:
return f.read()
def write_version_py():
version_py = os.path.join(os.path.dirname(__file__), 'mirtop',
'version.py')
with open(version_py, "w") as out_handle:
out_handle.write("\n".join(['__version__ = "%s"' % version,
'__url__ = "%s"' % url]))
write_version_py()
setup(name='mirtop',
version=version,
description='Small RNA-seq annotation',
long_description=readme(),
long_description_content_type="text/markdown",
classifiers=[
'License :: OSI Approved :: MIT License',
'Programming Language :: Python :: 2.7',
"Programming Language :: Python :: 3",
'Topic :: Scientific/Engineering :: Bio-Informatics'
],
keywords='RNA-seq miRNA isomiRs annotation',
url=url,
author='Lorena Pantano',
author_email='lorena.pantano@gmail.com',
license='MIT',
packages=find_packages(),
test_suite='nose',
entry_points={
'console_scripts': ['mirtop=mirtop.command_line:main'],
},
include_package_data=True,
zip_safe=False)
mirtop-0.4.24/artwork/logo.png (binary PNG image data omitted)