estscan-3.0.3/COPYRIGHT

ESTScan license
---------------

Copyright (c) Swiss Institute of Bioinformatics, Ludwig Institute for Cancer Research (LICR), and Swiss Institute for Experimental Cancer Research (ISREC), 1999, 2004.

For the purposes of this copyright, the Swiss Institute of Bioinformatics acts on behalf of its partners, LICR and ISREC.

The ESTScan software is the exclusive property of the copyright owners, at UNIL - BEP, CH-1015 LAUSANNE, Switzerland.

The Swiss Institute of Bioinformatics provides the ESTScan program WITHOUT ANY WARRANTY OF MERCHANTABILITY OR FITNESS FOR ANY PARTICULAR PURPOSE, OR ANY OTHER WARRANTY, EXPRESS OR IMPLIED.

License Terms:

Use, Modification and Redistribution (including distribution of any modified or derived work) in source and binary forms is permitted only if each of the following conditions is met:

1. Redistributions qualify as "freeware" or "Open Source Software" under one of the following terms:

   (a) Redistributions are made at no charge beyond the reasonable cost of materials and delivery.

   (b) Redistributions are accompanied by a copy of the Source Code or by an irrevocable offer to provide a copy of the Source Code for up to three years at the cost of materials and delivery. Such redistributions must allow further use, modification, and redistribution of the Source Code under substantially the same terms as this license. For the purposes of redistribution "Source Code" means the complete source code of ESTScan including all modifications.

   Other forms of redistribution are allowed only under a separate royalty-free agreement permitting such redistribution subject to standard commercial terms and conditions. A copy of such agreement may be obtained from the Swiss Institute of Bioinformatics at the above address.

2. Redistributions of source code must retain the copyright notices as they appear in each source code file, these license terms, and the disclaimer/limitation of liability set forth in the introductory paragraph.

3. Redistributions in binary form must reproduce the Copyright Notice, these license terms, and the disclaimer/limitation of liability set forth as above, in the documentation and/or other materials provided with the distribution. For the purposes of binary distribution the "Copyright Notice" refers to the following language: "Copyright (c) 1999, 2004 Swiss Institute of Bioinformatics. All rights reserved."

4. Neither the name of the Swiss Institute of Bioinformatics nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

5. All redistributions must comply with the conditions imposed by the Swiss Institute of Bioinformatics on certain embedded code, whose copyright notice and conditions for redistribution are as follows:

   (a) Copyright (c) 1999, 2004 Swiss Institute of Bioinformatics. All rights reserved.

   (b) Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

       (i) Redistributions of source code must retain the above copyright notice, this list of conditions and the above disclaimer.

       (ii) Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

       (iii) All advertising materials mentioning features or use of this software must display the following acknowledgement: "This product includes software developed by the Swiss Institute of Bioinformatics and its contributors."
       (iv) Neither the name of the Institute nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

($Id: COPYRIGHT,v 1.1.1.1 2004/12/16 12:44:29 c4chris Exp $ Version 1.1, last updated 9 December 2004)

estscan-3.0.3/estscan.spec

# $Id: estscan.spec,v 1.10 2009/06/17 07:42:24 c4chris Exp $

Name: estscan
Version: 3.0.3
Release: 0
Summary: Detect coding regions in EST sequences
Group: Applications/Engineering
License: ESTScan
URL: http://estscan.sourceforge.net
Source0: http://dl.sf.net/estscan/%{name}-%{version}.tar.gz
BuildRoot: %{_tmppath}/%{name}-%{version}-%{release}-root-%(%{__id_u} -n)

%description
ESTScan is a program that can detect coding regions in DNA sequences, even if they are of low quality. ESTScan will also detect and correct sequencing errors that lead to frameshifts. ESTScan is not a gene prediction program, nor is it an open reading frame detector. In fact, its strength lies in the fact that it does not require an open reading frame to detect a coding region. As a result, the program may miss a few translated amino acids at either the N or the C terminus, but will detect coding regions with high selectivity and sensitivity.

%package devel
Summary: Development tools to create matrices for estscan
Group: Applications/Engineering
Requires: %{name} = %{version}-%{release}
Provides: perl(build_model_utils.pl)

%description devel
The estscan-devel package contains various tools to develop and evaluate your own score matrices for use with estscan.
%prep %setup -q sed -i 's+/usr/molbio/share/ESTScan+%{_sysconfdir}/%{name}+' estscan.c # Help RPM depsolver find the requirements sed -i 's+/usr/bin/env perl+%{_bindir}/perl+' build_model build_model_utils.pl evaluate_model extract_EST extract_mRNA extract_UG_EST prepare_data %build make CFLAGS="-std=gnu99 $RPM_OPT_FLAGS" %{?_smp_mflags} estscan maskred makesmat %install rm -rf $RPM_BUILD_ROOT mkdir -p ${RPM_BUILD_ROOT}%{_bindir} install -m755 estscan ${RPM_BUILD_ROOT}%{_bindir} install -m755 maskred ${RPM_BUILD_ROOT}%{_bindir} install -m755 makesmat ${RPM_BUILD_ROOT}%{_bindir} install -m755 build_model ${RPM_BUILD_ROOT}%{_bindir} install -m755 evaluate_model ${RPM_BUILD_ROOT}%{_bindir} install -m755 extract_EST ${RPM_BUILD_ROOT}%{_bindir} install -m755 extract_mRNA ${RPM_BUILD_ROOT}%{_bindir} install -m755 extract_UG_EST ${RPM_BUILD_ROOT}%{_bindir} install -m755 prepare_data ${RPM_BUILD_ROOT}%{_bindir} mkdir -p ${RPM_BUILD_ROOT}%{perl_vendorarch} install -m644 build_model_utils.pl ${RPM_BUILD_ROOT}%{perl_vendorarch} mkdir -p ${RPM_BUILD_ROOT}%{_sysconfdir}/%{name} %check %clean rm -rf $RPM_BUILD_ROOT %files %defattr(-,root,root,-) %doc COPYRIGHT %dir %{_sysconfdir}/%{name}/ %{_bindir}/estscan %files devel %defattr(-,root,root,-) %{_bindir}/maskred %{_bindir}/makesmat %{_bindir}/build_model %{_bindir}/evaluate_model %{_bindir}/extract_EST %{_bindir}/extract_mRNA %{_bindir}/extract_UG_EST %{_bindir}/prepare_data %{perl_vendorarch}/build_model_utils.pl %changelog * Wed Jun 17 2009 Christian Iseli - 3.0.3-0 - version 3.0.3 - 2009-06-17 09:40 c4chris * estscan.c, estscan.spec: Bump to version 3.0.3. - 2009-02-18 18:09 c4chris * makesmat.c: Add some sanity checks on FASTA header line format. - 2007-03-27 16:47 c4chris * estscan.spec: Update changelog. * Tue Mar 27 2007 Christian Iseli - 3.0.2-0 - version 3.0.2 - 2007-03-27 16:45 c4chris * estscan.c, estscan.spec: Bump to version 3.0.2. 
- 2007-03-26 19:38 c4chris * prepare_data: Show a bit less digits in the masked percent msg. - 2007-03-08 13:33 c4chris * prepare_data: Fix masked nucleotides report message. - 2007-02-01 16:18 c4chris * estscan.spec: Update changelog. * Thu Feb 1 2007 Christian Iseli - 3.0.1-0 - version 3.0.1 - 2007-02-01 16:15 c4chris * estscan.c, estscan.spec: Bump to version 3.0.1. - 2007-01-25 15:25 c4chris * extract_mRNA: Make use of new BTLib version 0.16 (can now parse general GenBank format hopefully). - 2007-01-25 14:39 c4chris * prepare_data: Properly count nt (skip newlines). * Tue Dec 19 2006 Christian Iseli - 3.0-0 - created estscan-3.0.3/extract_UG_EST0000755000551200011300000004026410541760167015055 0ustar chrisludwig#!/usr/bin/env perl # $Id: extract_UG_EST,v 1.2 2006/12/19 12:52:07 c4chris Exp $ ################################################################################ # # extract_UG_EST # -------------- # # Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch # Christian Iseli, LICR ITO, Christian.Iseli@licr.org # # Copyright (c) 2006 Swiss Institute of Bioinformatics. All rights reserved. # ################################################################################ use strict; use FASTAFile; use Symbol; # global variables my $verbose = 1; my $norna = 0; my $datadir = '.'; my $filestem = ''; require "build_model_utils.pl"; ################################################################################ # # Check command-line for switches # my $usage = "Usage: extract_UG_EST [options] \n" . " where options are:\n" . " -q don't log on terminal\n" . 
"More information is obtained using 'perldoc extract_UG_EST'\n"; while ($ARGV[0] =~ m/^-/) { if ($ARGV[0] eq '-q') { shift; $verbose = 0; next; } die "Unrecognized switch: $ARGV[0]\n$usage"; } if ($#ARGV < 0) { die "No configuration file specified\n$usage"; } ################################################################################ # # Main-loop through all specified config-files # my $parFile; while($parFile = shift) { my($organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir2, $filestem2, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders) = readConfig($parFile, undef, undef, undef, undef, undef, undef, undef, undef, $verbose); log_open("readconfig.log"); $datadir = $datadir2; $filestem = $filestem2; showConfig($parFile, $organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders); log_close(); # evaluate on mRNAs log_open("extract_UG_EST.log"); # extract ESTs from UG clusters die "No UG clusters data file provided" unless defined $ugdata; log_print("\nGenerating evaluation EST data...."); analyzeClusters($testfile, $ugdata); collect_testsets($estcdsfile, $estutrfile); log_close(); print "$parFile done.\n"; } exit 0; ################################################################################ # # Estimate false positive/negative rate on EST-data extracted through UniGene # sub analyzeClusters { # For each UniGene cluster with reference to refseq mRNA or mRNA # entry in embl containing complete CDS annotated, find all ESTs # and use megablast to align the ESTs with the mRNA. 
Store the # match positions. my($mrna, $refseqmrna, @emblmrnas, @ests, $eststring); my $fh = gensym; my $ugdata = "$datadir/Evaluate/ug.data"; my($testfile, $ugcf) = @_; if (-s $ugdata) { log_print(" $ugdata already exists, skip."); } else { my %ID; my $ua = BTFile->new("$ugcf.ac.idx" . "[18,10,10]", "", $ugcf); log_print(" find UniGene clusters for test mRNAs..."); my $fh = gensym; open $fh, $testfile; while ( <$fh> ) { if (m/^>tem.(\S+)/) { my $ac = $1; my $e = $ua->fetch($ac); unless (defined $e) { print STDERR "Couldn't find AC $ac in $ugcf\n"; next; } ($ac) = $e =~ /^ID\s+(\S+)/s; $ID{$ac} = 1; } } close $fh; undef $ua; open $fh, ">$ugdata"; $ua = BTFile->new("$ugcf.idx" . "[18,10,10]", "", $ugcf); foreach my $ac (sort(keys %ID)) { my $e = $ua->fetch($ac); print $fh $e; } close $fh; } log_print(" reading UniGene clusters..."); my @clusters; my $clusterfile = "$datadir/Evaluate/clusters.lst"; if (-s $clusterfile) { log_print(" $clusterfile already exists, read existing"); open $fh, $clusterfile; @clusters = <$fh>; close $fh; } else { if (system("which fetch") != 0) { log_print(" fatal: fetch not found"); die; } $mrna = undef; $refseqmrna = undef; undef @ests; undef @emblmrnas; open $fh, ">$clusterfile"; my $ugfh = gensym; open $ugfh, $ugdata; while ( <$ugfh> ) { if (m/^SEQUENCE ACC=([NX]M_[^.;]+)/) { $refseqmrna = "rs:$1"; } elsif (m/PID=/ && m/^SEQUENCE ACC=([^.;]+)/) { push @emblmrnas, "embl:$1"; } if (m/LID=/ && m/^SEQUENCE ACC=([^.;]+)/) { push @ests, "embl:$1"; } if (m+^//+) { # found the end of the current entry if ($#ests >= 0) { if (defined $refseqmrna) { $mrna = "$refseqmrna"; } else { foreach my $id (@emblmrnas) { my $e = `fetch $id`; if (($e =~ m/^ID\s+\S+; SV \d+; linear; mRNA;/s) && ($e =~ m/^FT CDS\s+\d+\.\.\d+$/m)) { $mrna = "$id"; last; } } } } if (defined($mrna)) { my $line = "$mrna : " . join(' ', @ests) . 
"\n"; print $fh $line; push @clusters, $line; } $mrna = undef; $refseqmrna = undef; undef @ests; undef @emblmrnas; } } close $ugfh; close $fh; } my $nclusters = $#clusters + 1; log_print(" found $nclusters clusters"); log_print(" - matching ESTs against mRNAs..."); my $matchfile = "$datadir/Evaluate/matches.lst"; if (-s $matchfile) { log_print(" $matchfile already exists, skipped"); return; } if (system("which formatdb") != 0) { log_print(" fatal: formatdb not found"); die; } if (system("which megablast") != 0) { log_print(" fatal: megablast not found"); die; } my $i = 0; my $tmprnafile = "$datadir/Evaluate/tmpmrna.seq"; my $tmpestdb = "$datadir/Evaluate/tmpestdb"; my $matchfh = gensym; system("touch $matchfile"); foreach (@clusters) { my $cdsBegin; my $cdsEnd; ($mrna, $eststring) = split / : /, $_, 2; chop $eststring; @ests = split / /, $eststring; $i++; my $nests = $#ests + 1; log_print(" analyzing cluster $i of $nclusters:" . " $mrna (has $nests ESTs)..."); # find full RNA and the position of its coding sequence my $e = `fetch $mrna`; if ($e =~ m/CDS\s+(\d+)\.\.(\d+)/) { $cdsBegin = $1; $cdsEnd = $2; } else { next; } system "fetch -f $mrna > $tmprnafile"; # collect ESTs into BLAST database open $matchfh, "|xargs fetch -f >$tmpestdb.seq"; print $matchfh join("\n", @ests), "\n"; close $matchfh; system "formatdb -p F -n $tmpestdb -i $tmpestdb.seq"; # match ESTs using megablast my $minLen = 300; # minimal length an EST matches the RNA nicely my $maxMissed = 0.05; # maximal number of unmatched nucleotides open $matchfh, ">>$matchfile"; open $fh, "megablast -D 0 -e 1e-20 -d $tmpestdb -i $tmprnafile |"; $_ = <$fh>; my($estid, $ori, $eStart, $rStart, $eStop, $rStop, $mismatch) = m/^\'emb\|([^\|]+)[^\']+\'==\'([+-])\S+ \((\d+) (\d+) (\d+) (\d+)\) (\d+)/; my $oldestid = $estid; my $oldori = $ori; my $offset = $eStart; my $matchBegin = $rStart; my $matchEnd = $rStop; my $matchLen = $eStop; my $matchMissed = $offset + $mismatch; while ( <$fh> ) { ($estid, $ori, $eStart, 
$rStart, $eStop, $rStop, $mismatch) = m/^\'emb\|([^\|]+)[^\']+\'==\'([+-])\S+ \((\d+) (\d+) (\d+) (\d+)\) (\d+)/; next if $ori eq '-'; if ($oldestid eq $estid) { my $tmpa = abs($eStart - $matchLen); my $tmpb = abs($rStart - $matchEnd); $matchMissed += (($tmpa < $tmpb) ? $tmpb : $tmpa) + $mismatch; } else { my $tmp = $matchEnd - $matchBegin + $offset; if ($matchLen < $tmp) { $matchLen = $tmp; } if (($oldori eq '+') && ($matchLen > $minLen) && ($matchMissed / $matchLen < $maxMissed)) { print $matchfh "$mrna $cdsBegin $cdsEnd embl:$oldestid ", "$matchBegin $matchEnd $offset $matchMissed\n"; } $oldestid = $estid; $oldori = $ori; $offset = $eStart; $matchBegin = $rStart; $matchMissed = (($eStart < $rStart) ? $eStart : $rStart) + $mismatch; } $matchEnd = $rStop; $matchLen = $eStop; } my $tmp = $matchEnd - $matchBegin + $offset; if ($matchLen < $tmp) { $matchLen = $tmp; } if (($oldori eq '+') && ($matchLen > $minLen) && ($matchMissed / $matchLen < $maxMissed)) { print $matchfh "$mrna $cdsBegin $cdsEnd embl:$oldestid ", "$matchBegin $matchEnd $offset $matchMissed\n"; } close $matchfh; close $fh; unlink <$datadir/Evaluate/tmp*>; } } sub collect_testsets { # reads the list generated by analyzeClusters and for each # full-length mRNA determines at most one piece of EST entirely # contained in coding sequence and one contained in untranslated # region. my $fh = gensym; my($estcdsfile, $estutrfile) = @_; log_print(" - collecting ESTs from coding sequences and" . 
" untranslated regions..."); if ((-s $estutrfile) && (-s $estcdsfile)) { log_print(" $estcdsfile already exists, skipped"); log_print(" $estutrfile already exists, skipped"); return; } my $i = 0; my($oldmrna, $utrdone, $cdsdone); my $minUTRLen = 100; # minimum length of a valid UTR match my $matchfile = "$datadir/Evaluate/matches.lst"; my $nbMatches = `wc -l $matchfile`; $nbMatches =~ m/^\s*(\S+)/; $nbMatches = $1; my $matchfh = gensym; open $matchfh, $matchfile; while ( <$matchfh> ) { my ($mrna, $cdsBegin, $cdsEnd, $est, $matchBegin, $matchEnd, $offset, $mismatches) = split; if ($mrna ne $oldmrna) { $utrdone = undef; $cdsdone = undef; $oldmrna = $mrna; } my $matchLen = $matchEnd - $matchBegin + $offset; next if $matchLen < $minUTRLen; if ($matchEnd < $cdsBegin) { # match is entirely in 5'UTR unless (defined $utrdone) { addESTBegin($est, $estutrfile, $matchLen); $utrdone = 1; } } else { if ($matchEnd < $cdsEnd) { # match end is in CDS if ($matchBegin < $cdsBegin) { # match overlaps CDS begin if (!defined($utrdone) && (($cdsBegin-$matchBegin) > $minUTRLen)) { addESTBegin($est, $estutrfile, $cdsBegin - $matchBegin + $offset - 1); $utrdone = 1; } unless (defined $cdsdone) { addCdsEST($est, $estcdsfile, $matchLen, $cdsBegin - $matchBegin + $offset, $matchLen); $cdsdone = 1; } } else { # match is included in CDS unless (defined $cdsdone) { addCdsEST($est, $estcdsfile, $matchLen, 1, $matchLen); $cdsdone = 1; } } } else { # match end is in 3'UTR if ($matchBegin < $cdsBegin) { # match includes CDS if (!defined($utrdone) && (($cdsBegin-$matchBegin) > $minUTRLen)) { addESTBegin($est, $estutrfile, $cdsBegin - $matchBegin + $offset - 1); $utrdone = 1; } if (!defined($utrdone) && (($matchEnd - $cdsEnd) > $minUTRLen)) { addESTEnd($est, $estutrfile, $matchLen, $matchEnd - $cdsEnd); $utrdone = 1; } unless (defined $cdsdone) { addCdsEST($est, $estcdsfile, $matchLen, $cdsBegin - $matchBegin + $offset, $cdsEnd - $matchBegin + $offset); $cdsdone = 1; } } else { if ($matchBegin < 
$cdsEnd) { # match overlaps CDS end if (!defined($utrdone) && (($matchEnd-$cdsEnd) > $minUTRLen)) { addESTEnd($est, $estutrfile, $matchLen, $matchEnd - $cdsEnd); $utrdone = 1; } unless (defined $cdsdone) { addCdsEST($est, $estcdsfile, $matchLen, 1, $cdsEnd - $matchBegin + $offset); $cdsdone = 1; } } else { # match is entirely in 3'UTR if (!defined($utrdone) && ($matchEnd - $matchBegin > $minUTRLen)) { addESTBegin($est, $estutrfile, $matchLen); $utrdone = 1; } } } } } $i += 1; if (($i % 100) == 0) { log_print(" $i of $nbMatches matches evaluated"); } } close $matchfh; } sub addCdsEST { my($est, $estfile, $len, $cdsFrom, $cdsTo) = @_; my $src = FASTAFile->new("fetch -f $est |"); $src->openStream; my $e = $src->getNext; close $src->{_BTFfile}; $e->{_seq} = substr $e->{_seq}, 0, $len; $e->{_seqHead} =~ s/(>\S+)/$1 CDS: $cdsFrom $cdsTo /; $e->{_seqHead} .= " (first $len nucleotides)"; my $fh = gensym; open $fh, ">> $estfile"; $e->printFASTA($fh); close $fh; } sub addESTBegin { my($est, $estfile, $len) = @_; my $src = FASTAFile->new("fetch -f $est |"); $src->openStream; my $e = $src->getNext; close $src->{_BTFfile}; $e->{_seq} = substr $e->{_seq}, 0, $len; $e->{_seqHead} .= " (first $len nucleotides)"; my $fh = gensym; open $fh, ">> $estfile"; $e->printFASTA($fh); close $fh; } sub addESTEnd { my($est, $estfile, $matchLen, $len) = @_; my $src = FASTAFile->new("fetch -f $est |"); $src->openStream; my $e = $src->getNext; close $src->{_BTFfile}; $e->{_seq} = substr $e->{_seq}, 0, $matchLen; $e->{_seq} = substr $e->{_seq}, -$len; $e->{_seqHead} .= " (last $len nucleotides)"; my $fh = gensym; open $fh, ">> $estfile"; $e->printFASTA($fh); close $fh; } ################################################################################ # # Documentation # =head1 NAME extract_UG_EST - extract ESTs from UG clusters as test sets for ESTScan =head1 SYNOPSIS extract_UG_EST [options] =head1 DESCRIPTION The data needed for evaluating ESTScan using ESTs is extracted from UniGene clusters. 
UniGene clusters are used to determine ESTs from untranslated regions and coding sequence respectively. This is done by matching the ESTs of a given cluster against its full-length mRNA with megablast and then determining where the match occurs relative to the annotated coding sequence. For each category, coding and non-coding, a single EST is chosen per cluster, in order to avoid redundancy. The matching location also makes it possible to determine where coding sequences start and end in partially coding ESTs. The same annotation in FASTA headers is used as for mRNAs. The sets of coding and non-coding ESTs can later be used to perform the same computational experiments as those done with mRNA data.

Files which already exist are reused. If an existing file is to be recomputed, it must be deleted before running the script again. For instance, if a particular collection of EST sequences should be used instead of data extracted by extract_UG_EST, providing these in FASTA format under the name of the EST file (where extract_UG_EST would store the extracted data) is enough. The same procedure can be applied to provide hand-picked test data. However, in mRNA and EST data used for test and training, evaluate_model expects annotations of coding sequence start and stop in the header as two integer values following the tag 'CDS:'. The first integer points to the first nucleotide of the CDS, the second to the last. Thus the length of the CDS is (second integer) - (first integer) + 1. The position counting starts with 1.

=head1 DIRECTORY STRUCTURE

extract_UG_EST uses the same directory structure as build_model, the root of which is given in the configuration file. From this root it adds the subdirectory 'Evaluate', which contains all result files. EST data as well as test and training data files are deposited in the data-root directory if not otherwise specified in the configuration file.

=head1 OPTIONS AND CONFIGURATION FILE

 -q Be quiet.
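
A configuration file is a fragment of Perl that readConfig evaluates, so settings are plain variable assignments. The following is only a minimal sketch: the variable names come from the scripts, but every value and path shown is a hypothetical placeholder. readConfig requires $organism and $datadir, extract_UG_EST additionally needs $ugdata, and the fragment should end with a true value because readConfig treats a false result of the evaluation as an error.

    # example.conf (hypothetical values, adapt to your own data)
    $organism = "Homo sapiens";
    $datadir  = "/path/to/estscan/data";
    $ugdata   = "/path/to/unigene/clusters.data";
    1;  # end with a true value so that readConfig accepts the file
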
Additional parameters defined in the configuration files for extract_UG_EST are listed here:

* $ugdata

Name of the file(s) containing data about UniGene clusters. If this is not defined, no evaluation is currently implemented.

$filestem is used to generate many filenames. It is generated automatically according to the tuplesize, the minmask and the pseudocounts applied to generate them.

=head1 REQUIREMENTS

During analysis of UniGene clusters and evaluation of the generated tables some external packages are used to collect and compare sequences. extract_UG_EST relies on 'megablast' to determine where ESTs match on full-length mRNA sequences. The 'fetch' utility is used to find the EST and mRNA entries. This tool needs a properly indexed version of EMBL and RefSeq flatfiles. Use 'indexer' for this. Both tools are part of the BTLib toolset.

=head1 AUTHOR

Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch

=cut

#
# End of file
#
################################################################################

estscan-3.0.3/maskred.c

/*
 * $Id: maskred.c,v 1.2 2005/03/10 15:08:04 c4chris Exp $
 *
 * maskred.c
 *
 * Reads from stdin in FASTA format expecting nucleotide data, masks
 * recurring tuples with 'N' characters if they overlap by a
 * specified number of nucleotides and writes in FASTA format on
 * stdout.
 *
 * Usage: maskred [-s ] [-o ]
 *                [-m ] < infile > outfile
 *
 * '-s' specifies the size of the tuples used to determine recurrence.
 * Tuples observed in different frames are not considered recurring.
 * Tuples observed in UTRs are also considered distinct from those in
 * CDS and vice versa. '-o' specifies how many nucleotides of subsequent
 * recurring tuples must overlap in order to consider them as one
 * region to be masked. Only regions longer than the value specified
 * with '-m' are actually masked.
 *
 * maskred expects CDS annotation in the FASTA-header of the
 * inputs.
Immediately after the tag 'CDS: ' the next two'integers * separated by a are considered first and last position of * the CDS. CDS is not recognized correctly if its specification does * not entirely occur in the first 1023 bites of the header. * * written by Claudio Lottaz (SIB-ISREC) in September/October 2001 */ #include #include #include #include #include #include #ifndef __GNUC__ #include #endif #include #include /******************************************************************************* * * Command line arguments */ static void usage(char *arg0) { fprintf(stderr, "Usage: %s [options] < infile > outfile\n" " where options are:\n" " -m minimum length of region to be filtered\n" " -o overlap required between following recurring tuples\n" " -s size of redundancy filter\n" " -d debug\n", arg0); exit(1); } typedef struct _options_t { int tupsize; int overlap; int minmask; int debug; } options_t; static options_t options; static void getOptions(int argc, char *argv[]) { /* default values */ options.tupsize = 12; options.overlap = 0; options.minmask = 30; options.debug = 0; /* read command line */ while (1) { int c = getopt(argc, argv, "dhm:o:s:"); if (c == -1) break; switch (c) { case 'd': options.debug = 1; break; case 'm': options.minmask = atoi(optarg); break; case 'o': options.overlap = atoi(optarg); break; case 's': options.tupsize = atoi(optarg); break; case 'h': usage(argv[0]); default: printf ("Option -%c is unknown\n", c); usage(argv[0]); } } /* check switches */ if (options.tupsize > 16) { fprintf(stderr, "maskred: redundancy filter more than 16 nucleotides wide (%d)\n", options.tupsize); exit(1); } if (options.overlap == 0) options.overlap = options.tupsize - 1; if (options.overlap >= options.tupsize) { fprintf(stderr, "maskred: too much overlap required (%d >= %d)\n", options.overlap, options.tupsize); exit(1); } if (optind != argc) usage(argv[0]); } /******************************************************************************* * * various data 
types and utilities */ /* set and test single bits an a large array if bits **************************/ /* Store which tuples have been encountered. One array per frame and one for UTRs */ static unsigned char *seenTuples[4]; static unsigned long tupleIndexMask; static unsigned char bitValue[] = {1, 2, 4, 8, 16, 32, 64, 128}; inline void setBit(unsigned long index, unsigned char *bitField) { bitField[index >> 3] |= bitValue[index & 7]; } inline int testBit(unsigned long index, unsigned char *bitField) { return(bitField[index >> 3] & bitValue[index & 7]); } /* Interpret nucleotides ******************************************************/ static char decode[] = {'A', 'C', 'G', 'T', 'N'}; const unsigned int N = 4; static unsigned long getCode(int c) { unsigned long i = c & 0x1f; /* Get lower bits, get rid of upper/lower info. */ switch (i) { case 1: return 0; /* This is A. */ case 3: return 1; /* This is C. */ case 7: return 2; /* This is G. */ case 20: return 3; /* This is T. */ case 2 : /* This is B. */ case 4: /* This is D. */ case 8: /* This is H. */ case 11: /* This is K. */ case 13: /* This is M. */ case 14: /* This is N. */ case 18: /* This is R. */ case 19: /* This is S. */ case 22: /* This is V. */ case 23: /* This is W. */ case 25: return N; /* This is Y. */ default: /* Everything else. */ fprintf(stderr, "Bad character in getCode: %c(%d)\n", (char)c, c); exit(1); } } /******************************************************************************** * * Data analysis */ static void printTuple(unsigned long index, int len) { int i; unsigned char *t = (unsigned char *)alloca(sizeof(unsigned char) * len); for (i = len-1; i >= 0; i--) { t[i] = index & 3; index >>= 2; } for (i = 0; i < len; i++) putchar(decode[t[i]]); } inline char get_nucleotide() { char c = getchar(); while ((c == 12) || (c == 10)) c = getchar(); /* skip and */ return c; } int maskrun(char *string, int *begin, int *end) /* in 'string' fills substring from 'begin' to 'end' with Ns. 
* perform this action only of the run is long enough. */ { int i, len = 0; if ((*end - *begin) > options.minmask) { for(i = *begin; i <= *end; i++) string[i] = 'N'; len = *end - *begin + 1; } *begin = *end = -1; return len; } void mask_file() { int i, pos, size, skip, masked = 0; int tupleIndex, maskStart, maskEnd; int cdsStart, cdsEnd; char c, buf[1024], *longbuf, *s; size = 1024; longbuf = malloc(sizeof(char)*size); c = getchar(); while(c == '>') { /* read and print the header, find CDS */ fgets(buf, 1024, stdin); fprintf(stdout, ">%s", buf); s = strstr(buf, "CDS: ") + 5; cdsStart = atoi(s); s = strchr(s, ' ') + 1; cdsEnd = atoi(s); while(buf[strlen(buf)-1] != '\n') { fgets(buf, 1024, stdin); fprintf(stdout, "%s", buf); } /* read sequence into the buffer */ pos = 0; c = get_nucleotide(); while((c != '>') && (c != EOF)) { if (pos == size) { size += 1024; longbuf = realloc(longbuf, sizeof(char) * size); } longbuf[pos] = c; pos++; c = get_nucleotide(); } /* mask redundant tuples */ tupleIndex = 0; maskStart = maskEnd = -1; /* -1 means current region unmasked */ skip = options.tupsize; /* tupleIndex is invalid for the next skip positions */ for (i = 0; i < pos; i++) { unsigned long code = getCode(longbuf[i]); if (code < N) tupleIndex = ((tupleIndex << 2) | code) & tupleIndexMask; else { skip = options.tupsize; maskrun(longbuf, &maskStart, &maskEnd); } if (skip) skip--; else { int f = ((cdsStart <= i) && (i <= cdsEnd)) ? 
(i - cdsStart) % 3 : 3; if (testBit(tupleIndex, seenTuples[f])) { if (maskStart == -1) maskStart = i - options.tupsize + 1; maskEnd = i; } else { if ((maskStart != -1) && (i - maskEnd) > (options.tupsize - options.overlap)) masked += maskrun(longbuf, &maskStart, &maskEnd); setBit(tupleIndex, seenTuples[f]); } } } masked += maskrun(longbuf, &maskStart, &maskEnd); /* write masked buffer to output */ for (i = 0; i < pos; i++) { putchar(longbuf[i]); if (((i+1) % 80) == 0) putchar('\n'); } putchar('\n'); } fprintf(stdout, ">masked nucleotides: %d\n", masked); free(longbuf); } /******************************************************************************** * * Main */ int main(int argc, char *argv[]) { int i; unsigned long storeSize; getOptions(argc, argv); /* reserve one bit for each nucleotide tuple of the size of the filter */ storeSize = 1<<(2*options.tupsize - 3); for (i = 0; i < 4; i++) { seenTuples[i] = (unsigned char *)malloc(sizeof(unsigned char) * storeSize); memset(seenTuples[i], 0, storeSize); } tupleIndexMask = ((storeSize - 1) << 3) | 7 ; mask_file(); /* clean up */ for (i = 0; i < 4; i++) { free(seenTuples[i]); } return 0; } /* * End of File * *******************************************************************************/ estscan-3.0.3/build_model_utils.pl0000644000551200011300000002101210542006365016363 0ustar chrisludwig# $Id: build_model_utils.pl,v 1.10 2006/12/19 16:01:57 c4chris Exp $ ################################################################################ # # build_model_utils # ----------------- # # Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch # Christian Iseli, LICR ITO, Christian.Iseli@licr.org # # Copyright (c) 1999-2002, 2006 Swiss Institute of Bioinformatics. # All rights reserved. 
#
################################################################################

use strict;
use Symbol;
use POSIX ();

my $reportfh = gensym;
my $local_datadir = '';
my $local_filestem = '';
my $local_verbose = 0;

return 1;

################################################################################
#
# Define and show parameters
#

sub readConfig {
  # read the configuration specified in the given file
  my($parFileName, $forcedtuplesize, $forcedminmask, $forcedpseudocounts,
     $forcedminscore, $forcedstartlength, $forcedstartpreroll,
     $forcedstoplength, $forcedstoppreroll, $verbose) = @_;
  my($organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir, $filestem,
     $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile,
     $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore,
     $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile,
     $estscanparams, $nb_isochores);

  # default values
  $tuplesize = 6;
  $minmask = 30;
  $pseudocounts = 1;
  $minscore = -100;
  $startpreroll = 2;
  $stoppreroll = 2;
  $nb_isochores = 0;
  $estscanparams = "-m -100 -d -50 -i -50 -N 0";
  my(@isochore_borders) = (0.0, 43.0, 47.0, 51.0, 100.0);

  # read parameter file
  unless (-s $parFileName) {
    warn "The parameter-file $parFileName does not exist, skipped";
    next;
  }
  if (!eval `cat $parFileName`) { die "Error in '$parFileName': $@"; }
  if (!defined($organism)) { die '$organism not specified in '.$parFileName; }
  if (!defined($datadir)) { die '$datadir not specified in '.$parFileName; }

  # set up directories
  if (!(-e $datadir)) { mkdir($datadir, 0775); }
  if (!(-e "$datadir/Report")) { mkdir("$datadir/Report", 0775); }
  if (!(-e "$datadir/Matrices")) { mkdir("$datadir/Matrices", 0775); }
  if (!(-e "$datadir/Isochores")) { mkdir("$datadir/Isochores", 0775); }
  if (!(-e "$datadir/Shuffled")) { mkdir("$datadir/Shuffled", 0775); }
  if (!(-e "$datadir/Evaluate")) { mkdir("$datadir/Evaluate", 0755); }

  # compute further default values and forced values
  if (defined($forcedtuplesize)) { $tuplesize
        = $forcedtuplesize; }
    if (defined($forcedminmask)) { $minmask = $forcedminmask; }
    if (defined($forcedpseudocounts)) { $pseudocounts = $forcedpseudocounts; }
    if (defined($forcedminscore)) { $minscore = $forcedminscore; }
    if (defined($forcedstartlength)) { $startlength = $forcedstartlength; }
    if (defined($forcedstartpreroll)) { $startpreroll = $forcedstartpreroll; }
    if (defined($forcedstoplength)) { $stoplength = $forcedstoplength; }
    if (defined($forcedstoppreroll)) { $stoppreroll = $forcedstoppreroll; }
    if (!defined($startlength)) {
        $startlength = $startpreroll + POSIX::ceil($tuplesize / 3);
    }
    if (!defined($stoplength)) {
        $stoplength = $stoppreroll + POSIX::ceil($tuplesize / 3);
    }
    $filestem = sprintf("%01d_%05d\_%07d\_%1d%1d%1d%1d", $tuplesize, $minmask,
                        $pseudocounts, $startlength, $startpreroll,
                        $stoplength, $stoppreroll);
    $local_datadir = $datadir;
    $local_filestem = $filestem;
    $local_verbose = $verbose;
    if (!defined($rnafile)) { $rnafile = "$datadir/mrna.seq"; }
    if (!defined($estfile)) { $estfile = "$datadir/ests.seq"; }
    if (!defined($estcdsfile)) { $estcdsfile = "$datadir/Evaluate/estcds.seq"; }
    if (!defined($estutrfile)) { $estutrfile = "$datadir/Evaluate/estutr.seq"; }
    if (!defined($trainingfile)) { $trainingfile = "$datadir/training.seq"; }
    if (!defined($testfile)) { $testfile = "$datadir/test.seq"; }
    if (!defined($utrfile)) { $utrfile = "$datadir/Evaluate/rnautr.seq"; }
    if (!defined($cdsfile)) { $cdsfile = "$datadir/Evaluate/rnacds.seq"; }
    if (!defined($smatfile)) { $smatfile = "$datadir/Matrices/$filestem.smat"; }

    # check if parameters are plausible
    if ($startpreroll >= $startlength) {
        die("build_model: preroll in start profile too large " .
            "($startpreroll >= $startlength)");
    }
    if ($stoppreroll > $stoplength) {
        die("build_model: preroll in stop profile too large " .
            "($stoppreroll > $stoplength)");
    }
    if ($tuplesize > 11) {
        die "build_model: analysed tuples too big ($tuplesize)";
    }

    return($organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir,
           $filestem, $rnafile, $estfile, $estcdsfile, $estutrfile,
           $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask,
           $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength,
           $stoppreroll, $smatfile, $estscanparams, $nb_isochores,
           \@isochore_borders);
}

sub showConfig {
    # Show chosen parameters
    my($parFileName, $organism, $hightaxo, $dbfiles, $ugdata, $estdata,
       $datadir, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile,
       $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts,
       $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll,
       $smatfile, $estscanparams, $nb_isochores, $isochore_borders_ref) = @_;
    my @isochore_borders = @{$isochore_borders_ref};
    my($i);

    log_print("Build ESTScan Tables for $parFileName");
    $parFileName =~ s/\S/\-/g;
    log_print("-------------------------" .
              $parFileName);
    log_print("\nCurrent parameters:");
    if ($hightaxo eq "") {
        log_print(" - organism: $organism");
    }
    else {
        log_print(" - organism: $organism");
        log_print(" - taxonomic level: $hightaxo");
    }
    log_print(" - database files are: $dbfiles");
    log_print(" - UniGene data is in: $ugdata");
    log_print(" - ESTs for testing: $estdata\n");
    log_print(" - data directory: $datadir");
    log_print(" - mRNA file is: $rnafile");
    log_print(" - EST file is: $estfile");
    log_print(" - ESTs with coding: $estcdsfile");
    log_print(" - ESTs without coding: $estutrfile");
    log_print(" - training file is: $trainingfile");
    log_print(" - test file is: $testfile");
    log_print(" - clean UTR file is: $utrfile");
    log_print(" - clean CDS file is: $cdsfile");
    log_print(" - HMM parameters file: $smatfile\n");
    log_print(" - tuple size: $tuplesize");
    log_print(" - min redundancy mask: $minmask");
    log_print(" - added pseudocounts: $pseudocounts");
    log_print(" - minimum score: $minscore");
    log_print(" - start profile length/preroll: $startlength/$startpreroll");
    log_print(" - stop profile length/preroll: $stoplength/$stoppreroll");
    if ($nb_isochores>0) {
        log_print(" - nb of isochores: $nb_isochores");
    }
    else {
        my @isos;
        for ($i = 0; $i < $#isochore_borders; $i++) {
            push @isos, $isochore_borders[$i] . "-" .
                $isochore_borders[$i+1];
        }
        wrapped_log_print(" - Isochores: ", 80, @isos);
    }
    log_print(" - options passed scan program: $estscanparams");
}

sub wrapped_log_print {
    # prints the array given comma separated after its name, wrapping
    # it nicely
    my ($name, $width, @array) = @_;
    my $filler = $name;
    $filler =~ s/./ /g;
    my $buffer = $name;
    my $linelen = length($name);
    foreach (@array) {
        $_ .= ', ';
        if (($linelen + length($_)) < $width) {
            $buffer .= $_;
            $linelen += length($_);
        }
        else {
            $buffer .= "\n$filler$_";
            $linelen = length("$filler$_");
        }
    }
    # drop the trailing ', '
    chop($buffer);chop($buffer);
    log_print($buffer);
}

################################################################################
#
#   Odds and Ends
#

sub log_open {
    my($fname) = "$local_datadir/Report/$local_filestem\_$_[0]";
    open($reportfh, ">$fname")
        or die "Cannot open report file '$fname': $!";
}

sub log_print {
    my(@stuff) = @_;
    if ($local_verbose) { print @stuff, "\n"; }
    print $reportfh @stuff, "\n";
}

sub log_close {
    close($reportfh);
}

#
# End of file
#
################################################################################
estscan-3.0.3/makesmat.c0000644000551200011300000005225011147040301014273 0ustar chrisludwig
/*
 * $Id: makesmat.c,v 1.5 2009/02/18 17:09:21 c4chris Exp $
 *
 * makesmat.c
 *
 * Reads from stdin in FASTA format expecting full-length messenger
 * RNA data, counts tuples which do not contain ambiguous codes and
 * deduces log-odd emission probabilities for untranslated region and
 * coding sequence as well as position-specific scoring matrices for
 * start and stop sites. Output is generated in GENSCAN format.
 *
 * Usage: makesmat [-t ] [-p pseudocounts]
 *                 [-T ] < infile > outfile
 *
 * '-t' specifies the tuples used to determine counts. Tuples observed
 * in different frames are counted separately. Tuples observed in UTRs
 * are also considered separate from those in CDS. '-p' specifies the
 * pseudocounts added in the end. Pseudocounts are added in proportion
 * to the product of single-nucleotide frequencies and sum up to the
 * number specified using '-p'.
 * The option '-T' selects whether counts (c), probabilities (p) or
 * log-odd scores (s) are computed.
 *
 * makesmat expects CDS annotation in the FASTA-header of the
 * inputs. Immediately after the tag 'CDS: ' the next two integers
 * separated by a space are interpreted as the first and last
 * position of the CDS. CDS is not recognized correctly if its
 * specification does not entirely occur in the first 1023 bytes of
 * the header.
 *
 * written by Claudio Lottaz (SIB-ISREC) in October 2001
 */

#include
#include
#include
#include
#include
#include
#ifndef __GNUC__
#include
#endif
#include
#include

typedef struct _options_t {
  double scoreFactor;
  int tupsize;
  int pseudocounts;
  int startFrames;
  int startOffset;
  int stopFrames;
  int stopOffset;
  char output_type; /* 's':scores (default), 'p':probabilities, 'c':counts */
  int minscore;
  int debug;
} options_t;

static options_t options;

/*******************************************************************************
 *
 * Command line arguments
 */

static void usage(char *arg0)
{
  fprintf(stderr,
          "Usage: %s [options] < infile > outfile\n"
          " where options are:\n"
          " -t tuple size [%d]\n"
          " -p pseudocounts to be added [%d]\n"
          " -f number of frames in start profiles [%d]\n"
          " -o site offset within start profiles [%d]\n"
          " -F number of frames in stop profiles [%d]\n"
          " -O site offset within stop profiles [%d]\n"
          " -T output type: scores, probabilities or counts [%c]\n"
          " -m minimum score [%d]\n"
          " -s score multiplication factor [%.1f]\n"
          " -h display usage info\n"
          " -d debug\n",
          arg0, options.tupsize, options.pseudocounts,
          options.startFrames, options.startOffset,
          options.stopFrames, options.stopOffset,
          options.output_type, options.minscore, options.scoreFactor);
  exit(1);
}

static void getOptions(int argc, char *argv[])
{
  /* default values */
  options.scoreFactor = 5.0;
  options.tupsize = 6;
  options.pseudocounts = 1;
  options.output_type = 's';
  options.startFrames = 18;
  options.startOffset = 7;
  options.stopFrames = 18;
  options.stopOffset
    = 6;
  options.minscore = -100;
  options.debug = 0;

  /* read command line */
  while (1) {
    int c = getopt(argc, argv, "t:p:f:F:o:O:T:m:s:dh");
    if (c == -1) break;
    switch (c) {
    case 't': options.tupsize = atoi(optarg); break;
    case 'p': options.pseudocounts = atoi(optarg); break;
    case 'f': options.startFrames = atoi(optarg); break;
    case 'o': options.startOffset = atoi(optarg); break;
    case 'F': options.stopFrames = atoi(optarg); break;
    case 'O': options.stopOffset = atoi(optarg); break;
    case 'T': options.output_type = optarg[0]; break;
    case 'm': options.minscore = atoi(optarg); break;
    case 's': options.scoreFactor = atof(optarg); break;
    case 'd': options.debug = 1; break;
    case 'h': usage(argv[0]);
    default:
      printf ("Option -%c is unknown\n", c);
      usage(argv[0]);
    }
  }

  /* check switches */
  if (options.tupsize > 16) {
    fprintf(stderr, "makesmat: tuplesize too large (%d > 16)\n",
            options.tupsize);
    exit(1);
  }
  if (options.tupsize < 2) {
    fprintf(stderr, "makesmat: tuplesize too small (%d < 2)\n",
            options.tupsize);
    exit(1);
  }
  if (options.pseudocounts < 0) {
    fprintf(stderr, "makesmat: negative pseudocounts (%d)\n",
            options.pseudocounts);
    exit(1);
  }
  if ((options.startFrames - options.startOffset + 1)< options.tupsize) {
    fprintf(stderr, "makesmat: start profile, tuple size too large (%d-%d>=%d)\n",
            options.startFrames, options.startOffset, options.tupsize);
    exit(1);
  }
  if ((options.stopFrames - options.stopOffset) < options.tupsize) {
    fprintf(stderr, "makesmat: stop profile, tuple size too large (%d-%d>%d)\n",
            options.stopFrames, options.stopOffset, options.tupsize);
    exit(1);
  }
  if ((options.output_type != 's') && (options.output_type != 'p') &&
      (options.output_type != 'c')) {
    fprintf(stderr, "makesmat: unrecognized output-type (%c)\n",
            options.output_type);
    exit(1);
  }
  if (optind != argc) usage(argv[0]);
}

/******************************************************************************
 *
 * various data types and utilities
 */

/* Interpret nucleotides */

static char decode[] = {'A', 'C', 'G',
                        'T', 'N'};
const int N = 4;

static char getCode(int c)
{
  unsigned long i = c & 0x1f; /* Get lower bits, get rid of upper/lower info. */
  switch (i) {
  case 1: return 0;   /* This is A. */
  case 3: return 1;   /* This is C. */
  case 7: return 2;   /* This is G. */
  case 20: return 3;  /* This is T. */
  case 2 :            /* This is B. */
  case 4:             /* This is D. */
  case 8:             /* This is H. */
  case 11:            /* This is K. */
  case 13:            /* This is M. */
  case 14:            /* This is N. */
  case 18:            /* This is R. */
  case 19:            /* This is S. */
  case 22:            /* This is V. */
  case 23:            /* This is W. */
  case 25: return N;  /* This is Y. */
  default:            /* Everything else. */
    fprintf(stderr, "Bad character in getCode: %c(%d)\n", (char)c, c);
    exit(1);
  }
}

/* Returns int rather than char so that EOF stays distinguishable
   from every valid character value. */
static inline int get_nucleotide(void)
{
  int c = getchar();
  while ((c == 12) || (c == 10)) c = getchar(); /* skip and */
  return c;
}

static void print_tuple(unsigned long index, int len)
{
  int i;
  unsigned char *t = (unsigned char *)alloca(sizeof(unsigned char) * len);
  for (i = len-1; i >= 0; i--) { t[i] = index & 3; index >>= 2; }
  for (i = 0; i < len; i++) putchar(decode[t[i]]);
}

/* tuple and nucleotide counters */

static unsigned long singleTotal;   /* number of nucleotides */
static unsigned long singleCtr[4];  /* occurrence of single nucleotides */
static unsigned long *startctr[4];  /* counters for start PSSM */
static unsigned long *stopctr[4];   /* counters for stop PSSM */

typedef struct _counters_t {
  unsigned long tupsize1Total;  /* number of (tupsize-1)-tuples */
  unsigned long tupsizeTotal;   /* number of (tupsize)-tuples */
  unsigned long *tupsize1Ctr;   /* occurrence of (tupsize-1)-tuples */
  unsigned long *tupsizeCtr;    /* occurrence of tupsize-tuples */
} counters_t, *counters_p_t;

static counters_t ctr[4]; /* counters for frames 0, 1 and 2 as well as UTR */

static void initCounters(void)
{
  int i;
  int nbTuples1 = (1 << (2*(options.tupsize - 1)));
  int nbTuples = (1 << (2*options.tupsize));
  for (i = 0; i < 4; i++) {
    ctr[i].tupsize1Total = 0;
    ctr[i].tupsizeTotal = 0;
    ctr[i].tupsize1Ctr =
      (unsigned long *)malloc(sizeof(unsigned long) * nbTuples1);
    ctr[i].tupsizeCtr = (unsigned long *)malloc(sizeof(unsigned long) * nbTuples);
    memset(ctr[i].tupsize1Ctr, 0, sizeof(unsigned long) * nbTuples1);
    memset(ctr[i].tupsizeCtr, 0, sizeof(unsigned long) * nbTuples);
    startctr[i] = (unsigned long *)malloc(sizeof(unsigned long) * options.startFrames);
    memset(startctr[i], 0, sizeof(unsigned long) * options.startFrames);
    stopctr[i] = (unsigned long *)malloc(sizeof(unsigned long) * options.stopFrames);
    memset(stopctr[i], 0, sizeof(unsigned long) * options.stopFrames);
  }
  singleTotal = 0;
  memset(singleCtr, 0, sizeof(unsigned long) * 4);
}

static void update_counters(unsigned long index, int frame, int skip)
/* skip indicates how many high order nts in index are not valid */
{
  if (skip <= 1) {
    ctr[frame].tupsize1Ctr[index >> 2]++;
    ctr[frame].tupsize1Total++;
  }
  if (skip == 0) {
    ctr[frame].tupsizeCtr[index]++;
    ctr[frame].tupsizeTotal++;
  }
}

static void free_counters(void)
{
  int i;
  for (i = 0; i <= 3; i++) {
    free(ctr[i].tupsize1Ctr);
    free(ctr[i].tupsizeCtr);
  }
}

/******************************************************************************
 *
 * Data analysis
 */

void count_tuples(void)
{
  int i, j, pos, size, skip;
  int tupleIndex, tupleIndexMask;
  int cdsStart, cdsEnd;
  int c; /* int, not char, so that the comparisons with EOF are reliable */
  char buf[1024], *s;
  unsigned char *longbuf;

  size = 1024;
  longbuf = malloc(sizeof(unsigned char)*size);
  tupleIndexMask = (1<<(2*options.tupsize)) - 1;
  c = getchar();
  while (c == '>') {

    /* read the header, find CDS */
    fgets(buf, 1024, stdin);
    s = strstr(buf, "CDS: ");
    if (s == NULL) {
      /* skip to next entry */
      c = getchar();
      while((c != EOF) && (c != '>')) c = getchar();
      continue;
    }
    s += 5;
    cdsStart = atoi(s) - 1; /* C index start at 0! */
    s = strchr(s, ' ');
    if (s == NULL) {
      fprintf(stderr, "Bad FASTA header line:\n%s\n", buf);
      fprintf(stderr, "Expected CDS: \n");
      exit(1);
    }
    s += 1;
    cdsEnd = atoi(s) - 1; /* C index start at 0!
                           */
    while (buf[strlen(buf)-1] != '\n') { fgets(buf, 1024, stdin); }

    /* read sequence into the buffer */
    pos = 0;
    c = get_nucleotide();
    while ((c != '>') && (c != EOF)) {
      if (pos == size) {
        size += 1024;
        longbuf = realloc(longbuf, sizeof(unsigned char) * size);
      }
      longbuf[pos] = getCode(c);
      pos++;
      c = get_nucleotide();
    }

    /* count single nucleotides */
    for (i = 0; i < pos; i++) {
      if (longbuf[i] < N) {
        singleCtr[longbuf[i]]++;
        singleTotal++;
      }
    }

    /* sanity check */
    if (cdsStart >= pos || cdsEnd >= pos) {
      fprintf(stderr, "Bad FASTA header line:\n%s\n", buf);
      fprintf(stderr, "Expected CDS: \n");
      fprintf(stderr, "CDS start (%d) or CDS end (%d) is out of range (%d)\n",
              cdsStart, cdsEnd, pos);
      exit(1);
    }

    /* count 5'UTR */
    tupleIndex = 0; /* tupleIndex is invalid for the next skip positions */
    skip = options.tupsize;
    for (i = 0; i <= cdsStart - options.startOffset; i++) {
      if (longbuf[i] < N)
        tupleIndex = ((tupleIndex << 2) | longbuf[i]) & tupleIndexMask;
      else
        skip = options.tupsize;
      if (skip) skip--;
      update_counters(tupleIndex, 3, skip); /* update UTR counters */
    }

    /* count start profile */
    j = cdsStart - options.startOffset + 1;
    i = (j < 0) ?
      0 : j;
    /* stop at the end of the sequence, as the stop profile loop does */
    while ((i < j + options.startFrames) && (i < pos)) {
      if (longbuf[i] < N) startctr[longbuf[i]][i - j]++;
      i++;
    }

    /* count CDS core */
    tupleIndex = 0; /* tupleIndex is invalid for the next skip positions */
    skip = options.tupsize;
    for (i = cdsStart + options.startFrames - options.startOffset + 1;
         i <= cdsEnd - options.stopOffset; i++) {
      if (longbuf[i] < N)
        tupleIndex = ((tupleIndex<<2) | longbuf[i]) & tupleIndexMask;
      else
        skip = options.tupsize;
      if (skip) skip--;
      /* update the CDS counters of the current frame */
      update_counters(tupleIndex, (i - cdsStart)%3, skip);
    }

    /* count stop profile */
    for (i = cdsEnd - options.stopOffset + 1;
         (i <= cdsEnd + options.stopFrames - options.stopOffset) && (i < pos);
         i++) {
      if (longbuf[i] < N)
        stopctr[longbuf[i]][i - cdsEnd + options.stopOffset - 1]++;
    }

    /* count 3'UTR */
    tupleIndex = 0; /* tupleIndex is invalid for the next skip positions */
    skip = options.tupsize;
    for (i = cdsEnd + options.stopFrames - options.stopOffset + 1;
         i < pos; i++) {
      if (longbuf[i] < N)
        tupleIndex = ((tupleIndex<<2) | longbuf[i]) & tupleIndexMask;
      else
        skip = options.tupsize;
      if (skip) skip--;
      update_counters(tupleIndex, 3, skip); /* update UTR counters */
    }
  }
  free(longbuf);
}

/****************************************************************************
 *
 * Print tables
 */

static double pseudo(int tuple, int tupsize)
{
  int i;
  double p = 1.0;
  for (i = 0; i < tupsize; i++) {
    p *= ((double) singleCtr[tuple & 3]) / ((double) singleTotal);
    tuple >>= 2;
  }
  return p * options.pseudocounts;
}

static void print_cdstable(double *singleProb)
{
  int f, i, *scores[3];
  double *tuple1Prob[3], *tupleProb[3];
  int nbTuples1 = (1 << 2*(options.tupsize-1));
  int nbTuples = (1 << 2*options.tupsize);

  /* compute probabilities */
  for (f = 0; f < 3; f++) {
    if (options.debug)
      printf("Probabilities for frame %d %d\n", f, options.tupsize - 1);
    tuple1Prob[f] = (double *) malloc(sizeof(double) * nbTuples1);
    for (i = 0; i < nbTuples1; i++) {
      tuple1Prob[f][i] =
        ((double) ctr[f].tupsize1Ctr[i] + pseudo(i, options.tupsize -
                                                 1))
        / (double) (ctr[f].tupsize1Total + options.pseudocounts);
      if (options.debug) {
        print_tuple(i, options.tupsize-1);
        printf(":%8g=(%8ld + %8g)/(%8ld + %8d)\n", tuple1Prob[f][i],
               ctr[f].tupsize1Ctr[i], pseudo(i, options.tupsize-1),
               ctr[f].tupsize1Total, options.pseudocounts);
      }
    }
    if (options.debug)
      printf("Probabilities for frame %d %d\n", f, options.tupsize);
    tupleProb[f] = (double *)malloc(sizeof(double) * nbTuples);
    for (i = 0; i < nbTuples; i++) {
      tupleProb[f][i] =
        ((double) ctr[f].tupsizeCtr[i] + pseudo(i, options.tupsize))
        / (double) (ctr[f].tupsizeTotal + options.pseudocounts);
      if (options.debug) {
        print_tuple(i, options.tupsize);
        printf(":%8g=(%8ld + %8g) / (%8ld + %8d)\n", tupleProb[f][i],
               ctr[f].tupsizeCtr[i], pseudo(i, options.tupsize),
               ctr[f].tupsizeTotal, options.pseudocounts);
      }
    }
  }

  /* compute log-odds and scores */
  for (f = 0; f < 3; f++) {
    if (options.debug) printf("Log-odds for frame %d\n", f);
    scores[f] = (int *)malloc(sizeof(int) * nbTuples);
    for (i = 0; i < nbTuples; i++) {
      if ((tupleProb[f][i] == 0.0) || (tuple1Prob[f][i >> 2] == 0.0))
        scores[f][i] = options.minscore;
      else {
        double score =
          log(tupleProb[f][i] / singleProb[i & 3] / tuple1Prob[f][i >> 2])
          / M_LN2 * options.scoreFactor;
        scores[f][i] = (score < options.minscore) ?
          options.minscore : round(score);
      }
      if (options.debug) {
        print_tuple(i, options.tupsize);
        printf(": %8g = %8g / %8g, %4d = 10*log(%8g / %8g)\n",
               tupleProb[f][i] / tuple1Prob[f][i >> 2],
               tupleProb[f][i], tuple1Prob[f][i >> 2], scores[f][i],
               tupleProb[f][i] / tuple1Prob[f][i >> 2], singleProb[i & 3]);
      }
    }
  }

  /* print table */
  if (options.debug) {
    if (options.output_type == 'p')
      printf("single: %.6f %.6f %.6f %.6f\n",
             singleProb[0], singleProb[1], singleProb[2], singleProb[3]);
    if (options.output_type == 'c')
      printf("single (total %ld): %-6ld %-6ld %-6ld %-6ld\n", singleTotal,
             singleCtr[0], singleCtr[1], singleCtr[2], singleCtr[3]);
  }
  for (f = 0; f < 3; f++) {
    for (i = 0; i < 1 << (2*options.tupsize); i += 4) {
      int rowIndex = i >> 2;
      if (options.debug) {
        print_tuple(rowIndex, options.tupsize-1);
        if (options.output_type == 'p')
          printf(" (total %.5f)", tuple1Prob[f][i >> 2]);
        if (options.output_type == 'c')
          printf(" (total %ld)",ctr[f].tupsize1Ctr[i>>2]);
        printf(": ");
      }
      switch (options.output_type) {
      case 's':
        printf("%-6d %-6d %-6d %-6d\n", scores[f][i], scores[f][i + 1],
               scores[f][i + 2], scores[f][i + 3]);
        break;
      case 'p':
        printf("%.6f %.6f %.6f %.6f\n", tupleProb[f][i], tupleProb[f][i + 1],
               tupleProb[f][i + 2], tupleProb[f][i + 3]);
        break;
      case 'c':
        printf("%-6ld %-6ld %-6ld %-6ld\n",
               ctr[f].tupsizeCtr[i], ctr[f].tupsizeCtr[i + 1],
               ctr[f].tupsizeCtr[i + 2], ctr[f].tupsizeCtr[i + 3]);
      }
    }
  }

  /* clean up */
  for (f = 0; f < 3; f++) {
    free(tuple1Prob[f]);
    free(tupleProb[f]);
    free(scores[f]);
  }
}

static void print_utrtable(double *singleProb)
{
  int i, *scores;
  double *tuple1Prob, *tupleProb, currTotal;
  int nbTuples1 = (1 << 2*(options.tupsize-1));
  int nbTuples = (1 << 2*options.tupsize);

  /* compute probabilities */
  currTotal = ctr[3].tupsize1Total;
  tuple1Prob = (double *) malloc(sizeof(double) * nbTuples1);
  for (i = 0; i < nbTuples1; i++)
    tuple1Prob[i] = ((double) ctr[3].tupsize1Ctr[i] + pseudo(i, options.tupsize-1))
      / (currTotal + options.pseudocounts);
  currTotal =
    ctr[3].tupsizeTotal;
  tupleProb = (double *) malloc(sizeof(double) * nbTuples);
  for (i = 0; i < nbTuples; i++)
    tupleProb[i] = ((double) ctr[3].tupsizeCtr[i] + pseudo(i, options.tupsize))
      / (currTotal + options.pseudocounts);

  /* compute log-odds and scores */
  scores = (int *) malloc(sizeof(int) * nbTuples);
  for (i = 0; i < nbTuples; i++) {
    if ((tupleProb[i] == 0.0) || (tuple1Prob[i>>2] == 0.0))
      scores[i] = options.minscore;
    else {
      double score = log(tupleProb[i] / singleProb[i&3] / tuple1Prob[i>>2])
        / M_LN2 * options.scoreFactor;
      scores[i] = (score < options.minscore) ? options.minscore : round(score);
    }
  }

  /* print table */
  if (options.debug) {
    if (options.output_type == 'p')
      printf("single: %.6f %.6f %.6f %.6f\n",
             singleProb[0], singleProb[1], singleProb[2], singleProb[3]);
    if (options.output_type == 'c')
      printf("single (total %ld): %-6ld %-6ld %-6ld %-6ld\n", singleTotal,
             singleCtr[0], singleCtr[1], singleCtr[2], singleCtr[3]);
  }
  for (i = 0; i < 1 << (2*options.tupsize); i += 4) {
    int rowIndex = i >> 2;
    if (options.debug) {
      print_tuple(rowIndex, options.tupsize-1);
      if (options.output_type == 'p')
        printf(" (total %.5f)", tuple1Prob[i >> 2]);
      if (options.output_type == 'c')
        printf(" (total %ld)", ctr[3].tupsize1Ctr[i>>2]);
      printf(": ");
    }
    switch (options.output_type) {
    case 's':
      printf("%-6d %-6d %-6d %-6d\n", scores[i], scores[i + 1],
             scores[i + 2], scores[i + 3]);
      break;
    case 'p':
      printf("%.6f %.6f %.6f %.6f\n", tupleProb[i], tupleProb[i + 1],
             tupleProb[i + 2], tupleProb[i + 3]);
      break;
    case 'c':
      printf("%-6ld %-6ld %-6ld %-6ld\n",
             ctr[3].tupsizeCtr[i], ctr[3].tupsizeCtr[i + 1],
             ctr[3].tupsizeCtr[i + 2], ctr[3].tupsizeCtr[i + 3]);
    }
  }

  /* clean up */
  free(tuple1Prob);
  free(tupleProb);
  free(scores);
}

static void print_pssm(double *singleProb, unsigned long **ctr, int frames)
{
  int i, j;
  for (i = 0; i < frames; i++) {
    int currTotal = 0;
    for (j = 0; j < 4; j++) { currTotal += ctr[j][i]; }
    for (j = 0; j < 4; j++) {
      double pseudo = options.pseudocounts * (double)
        singleCtr[j] / (double) singleTotal;
      double p = ((double) ctr[j][i] + pseudo)
        / ((double) currTotal + options.pseudocounts);
      switch(options.output_type) {
      case 'c': printf("%-6ld ", ctr[j][i]); break;
      case 'p': printf("%.6f ", p); break;
      case 's':
        if ((p == 0.0) || (singleProb[j] == 0.0))
          printf("%-6d ", options.minscore);
        else {
          double score = log(p / singleProb[j]) / M_LN2 * 10.0;
          printf("%-6d ",
                 (score < options.minscore) ? options.minscore : (int) round(score));
        }
      }
    }
    printf("\n");
  }
}

/*****************************************************************************
 *
 * Main
 */

int main(int argc, char *argv[])
{
  int i;
  double currTotal, singleProb[4];

  /* initialize and count tuples */
  getOptions(argc, argv);
  initCounters();
  count_tuples();

  /* compute probabilities of single nucleotides */
  currTotal = singleTotal;
  for (i = 0; i < 4; i++) singleProb[i] = (double) singleCtr[i] / currTotal;

  /* output tables */
  printf("FORMAT: CODING REGION %d 3 1 %c C+G: \n",
         options.tupsize, options.output_type);
  print_cdstable(singleProb);
  printf("FORMAT: UNTRANSLATED REGION %d 1 1 %c C+G: \n",
         options.tupsize, options.output_type);
  print_utrtable(singleProb);
  printf("FORMAT: START PROFILE 1 %d %d %c C+G: \n",
         options.startFrames, options.startOffset, options.output_type);
  print_pssm(singleProb, startctr, options.startFrames);
  printf("FORMAT: STOP PROFILE 1 %d %d %c C+G: \n",
         options.stopFrames, options.stopOffset, options.output_type);
  print_pssm(singleProb, stopctr, options.stopFrames);

  /* clean up */
  free_counters();
  return 0;
}

/*
 * End of File
 *
 ****************************************************************************/
estscan-3.0.3/Makefile0000644000551200011300000000124510425351562013777 0ustar chrisludwig
# $Id: Makefile,v 1.1 2006/05/01 09:22:58 c4chris Exp $

# Set the appropriate compilers and options for your system:
# Any system with GNU compilers:
CC = gcc
CFLAGS = -O2
F77 = g77
FFLAGS = -O2
LDFLAGS = -lm

# Linux with Intel compilers:
# CC = icc
# CFLAGS = -O3 -ipo -axP
# F77 = ifort
# FFLAGS = -O3 -ipo -axP

PROGS=maskred makesmat estscan winsegshuffle

all: $(PROGS)

clean:
	\rm -f *~ $(PROGS) *.o

# Libraries go after the object file so that linkers which resolve
# symbols strictly left to right still find libm.
maskred: maskred.o
	$(CC) -o $@ $< $(LDFLAGS)

makesmat: makesmat.o
	$(CC) -o $@ $< $(LDFLAGS)

estscan: estscan.o
	$(CC) -o $@ $< $(LDFLAGS)

winsegshuffle: winsegshuffle.o
	$(F77) -o $@ $< $(LDFLAGS)

.c.o:
	$(CC) $(CFLAGS) -c $<

.f.o:
	$(F77) $(FFLAGS) -c $<
estscan-3.0.3/estscan.c0000644000551200011300000012511711216116730014143 0ustar chrisludwig
/* $Id: estscan.c,v 1.7 2009/06/17 07:40:08 c4chris Exp $
 *
 * Christian Iseli, LICR ITO, Christian.Iseli@licr.org
 *
 * Copyright (c) 2004 Swiss Institute of Bioinformatics. All rights reserved.
 *
 * Compile with -std=gnu99
 */

#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#include
#ifdef DEBUG
#include
#endif

#if !defined(__GNUC__) && defined(sun)
#define inline
#endif

#define BUF_SIZE 4096
#define MT_UNKNOWN -1
#define MT_CODING 0
#define MT_UNTRANSLATED 1
#define MT_START 2
#define MT_STOP 3
#define MT_COUNT 4
#define min(x, y) (((x) > (y)) ? (y) : (x))
#define max(x, y) (((x) < (y)) ? (y) : (x))

typedef struct _matrix {
  signed char **m;
  char *name;
  char *kind;
  double CGmin;
  double CGmax;
  int matType;
  unsigned int order;
  unsigned int frames;
  int offset;
} matrix_t, *matrix_p_t;

typedef struct _read_buf_t {
  char *line;
  unsigned int lmax;
  unsigned int lc;
  unsigned int ic;
  char in[BUF_SIZE];
} read_buf_t, *read_buf_p_t;

typedef struct _seq_t {
  const char *fName;
  char *header;
  unsigned char *seq;
  double GC_pct;
  read_buf_t rb;
  int fd;
  unsigned int len;
  unsigned int maxHead;
  unsigned int max;
} seq_t, *seq_p_t;

typedef struct _result_t {
  unsigned char *s;
  int score;
  unsigned int start;
  unsigned int stop;
  int reverse;
} result_t, *result_p_t;

typedef union _col_elt_t {
  void **elt;
  matrix_p_t *m;
  result_p_t *r;
} col_elt_t;

typedef struct _col_t {
  col_elt_t e;
  unsigned int size;
  unsigned int nb;
} col_t, *col_p_t;

typedef struct _options_t {
  FILE *out;
  FILE *transl;
  char *matrix;
  double percent;
  double both;
  int min;
  int dPen;
  int iPen;
  int ts5uPen;
  int tscPen;
  int ts3uPen;
  int t5ucPen;
  int t5uePen;
  int tc3uPen;
  int tcePen;
  int t3uePen;
  int Nvalue;
  unsigned int sWidth;
  int all;
  int maxOnly;
  int skipLen;
  int minLen;
  int no_del;
  int single;
} options_t;

static const char Version[] =
  "This is ESTScan version 3.0.3.\n"
  "Copyright (c) 1999-2009 by the Swiss Institute of Bioinformatics.\n"
  "All rights reserved. See the file COPYRIGHT for details.\n";

static const char Usage[] =
  "%s [options] [ ...]\n\n"
#ifdef DEBUG
  "Debug version\n\n"
#endif
  "Available options (default value in braces[]):\n"
  " -a All in one sequence output\n"
  " -b only results are shown, which have scores higher than this \n"
  " fraction of the best score [%f].\n"
  " -d deletion penalty [%d]\n"
  " -h print this usage information\n"
  " -i insertion penalty [%d]\n"
  " -l only results longer than this length are shown [%d]\n"
  " -M score matrices file ($ESTSCANDIR/Hs.smat)\n"
  " [%s]\n"
  " -m min value in matrix [%d]\n"
  " -N how to compute the score of N [%d]\n"
  " -n remove deleted nucleotides from the output\n"
  " -O report header information for best match only\n"
  " -o send output to file. - means stdout. If both -t and -o specify\n"
  " stdout, only proteins will be written.\n"
  " -p GC select correction for score matrices [%f]\n"
  " -S only analyze positive strand\n"
  " -s Skip sequences shorter than length [%d]\n"
  " -T 8 integers used as log-probabilities for transitions,\n"
  " start->5'UTR, start->CDS, start->3'UTR, 5'UTR->CDS,\n"
  " 5'UTR->end, CDS->3'UTR, CDS->end, 3'UTR->end\n"
  " [%d, %d, %d, %d, %d, %d, %d, %d]\n"
  " -t Translate to protein. - means stdout.\n"
  " will go to the file and the nucleotides will still go to stdout.\n"
  " -v version information\n"
  " -w width of the FASTA sequence output [%d]\n";

static options_t options;
static char *argv0;

/* declaration of indexes also used in getFrame */
static int iBegin, i5utr, iStart, iCds, iStop, i3utr;
/* next tsize states implement insertion/deletion after nucleotide in
   frame index */
static int iInsAfter[3], iDelAfter[3];
/* last tsize states implemented insertion/deletion before nucleotide
   in frame index */
static int iInsNext[3], iDelNext[3];

static unsigned int maxSize = 0;
static int *V = NULL;
static int *tr = NULL;

static const unsigned char dna_complement[256] =
  "                                                                "
  " TVGH  CD  M KN   YSA BWXR       tvgh  cd  m kn   ysa bwxr      "
  "                                                                "
  "                                                                ";
/* ................................................................ */
/* @ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~. */
/* ................................................................ */
/* ................................................................ */

#ifdef __GNUC__
static void fatal(const char *fmt, ...)
     __attribute__ ((format (printf, 1, 2) , __noreturn__));
#endif

static void
fatal(const char *fmt, ...)
{
  va_list ap;
  va_start(ap, fmt);
  fflush(stdout);
  if (argv0) {
    char *p = strrchr(argv0, '/');
    fprintf(stderr, "%s: ", p ? p+1 : argv0);
  }
  vfprintf(stderr, fmt, ap);
  va_end(ap);
#ifdef DEBUG
  abort();
#else
  exit(1);
#endif
}

static int
intCompare(const void *a, const void *b)
{
  int ia = * (int *) a;
  int ib = * (int *) b;
  return ia < ib ? -1 : (ia > ib ?
                         1 : 0);
}

static void *
xmalloc(size_t size)
{
  void *res = malloc(size);
  if (res == NULL)
    fatal("malloc of %zu failed: %s (%d)\n", size, strerror(errno), errno);
  return res;
}

#if 0
static void *
xcalloc(size_t nmemb, size_t size)
{
  void *res = calloc(nmemb, size);
  if (res == NULL)
    fatal("calloc of %zu, %zu failed: %s (%d)\n",
          nmemb, size, strerror(errno), errno);
  return res;
}
#endif

static void *
xrealloc(void *ptr, size_t size)
{
  void *res = realloc(ptr, size);
  if (res == NULL)
    fatal("realloc of %p to %zu failed: %s (%d)\n",
          ptr, size, strerror(errno), errno);
  return res;
}

static void
grow_read_buf(read_buf_p_t b)
{
  b->lmax += BUF_SIZE;
  b->line = xrealloc(b->line, b->lmax * sizeof(char));
}

static char *
shuffle_line(read_buf_p_t b, size_t *cur)
{
  if (b->ic == 0 || *cur >= b->ic)
    return NULL;
  /* Make sure we have enough room in line. */
  if (b->lmax <= b->lc + (b->ic - *cur))
    grow_read_buf(b);
  while (*cur < b->ic && b->in[*cur] != '\n')
    b->line[b->lc++] = b->in[(*cur)++];
  if (*cur < b->ic) {
    /* Ok, we have our string. */
    /* Copy the newline. */
    b->line[b->lc++] = b->in[(*cur)++];
    /* We should be fine, since we read BUF_SIZE -1 at most... */
    b->line[b->lc] = 0;
    /* Adjust the input buffer. */
    if (*cur < b->ic) {
      memmove(b->in, b->in + *cur, (b->ic - *cur) * sizeof(char));
      b->ic -= *cur;
    } else
      b->ic = 0;
    *cur = 0;
    return b->line;
  }
  /* Go read some more. */
  b->ic = 0, *cur = 0;
  return NULL;
}

static char *
read_line_buf(read_buf_p_t b, int fd)
{
  char *s = NULL;
  ssize_t rc;
  size_t cur = 0;
  b->lc = 0;
  if ((s = shuffle_line(b, &cur)) != NULL)
    return s;
  do {
    if ((rc = read(fd, b->in + b->ic, BUF_SIZE - b->ic - 1)) == -1) {
      if (errno != EINTR)
        fatal("Could not read from %d: %s(%d)\n", fd, strerror(errno), errno);
    } else
      b->ic += rc;
    s = shuffle_line(b, &cur);
    if (s == NULL && rc == 0) {
      /* Got to the EOF...
                               */
      b->line[b->lc] = 0;
      s = b->line;
    }
  } while (s == NULL);
  return s;
}

static void
init_buf(read_buf_p_t b)
{
  b->line = xmalloc(BUF_SIZE * sizeof(char));
  b->lmax = BUF_SIZE;
  b->lc = 0;
  b->ic = 0;
}

static void
free_buf(read_buf_p_t b)
{
  free(b->line);
}

static void
init_seq(const char *fName, seq_p_t sp)
{
  sp->fName = fName;
  sp->header = NULL;
  sp->seq = NULL;
  init_buf(&sp->rb);
  if (fName != NULL) {
    sp->fd = open(fName, O_RDONLY);
    if (sp->fd == -1)
      fatal("Could not open file %s: %s(%d)\n", fName, strerror(errno), errno);
  } else
    sp->fd = 0;
  sp->len = 0;
  sp->maxHead = 0;
  sp->max = 0;
  read_line_buf(&sp->rb, sp->fd);
}

static int
get_next_seq(seq_p_t sp)
{
  const int lenStr = 24;
  unsigned int headerLen;
  char *buf = sp->rb.line;
  int res;
  unsigned int ctr[256], gc, atgc;
  ctr['A'] = ctr['C'] = ctr['G'] = ctr['T'] = 0;
  while (sp->rb.lc > 0 && buf[0] != '>')
    buf = read_line_buf(&sp->rb, sp->fd);
  if (sp->rb.lc == 0)
    return -1;
  /* We have the FASTA header. */
  if (sp->rb.lc + lenStr + 1 > sp->maxHead) {
    sp->maxHead = sp->rb.lc + lenStr + 1;
    sp->header = (char *) xrealloc(sp->header, sp->maxHead * sizeof(char));
  }
  headerLen = sp->rb.lc;
  memcpy(sp->header, buf, (sp->rb.lc + 1) * sizeof(char));
  sp->len = 0;
  buf = read_line_buf(&sp->rb, sp->fd);
  while (sp->rb.lc > 0 && buf[0] != '>') {
    unsigned char c;
    /* Make sure we have enough room for this additional line.
*/ if (sp->len + sp->rb.lc + 1 > sp->max) { sp->max = max(sp->len + sp->rb.lc + 1, sp->max + 0x40000); sp->seq = (unsigned char *) xrealloc(sp->seq, sp->max * sizeof(unsigned char)); } while ((c = *buf++) != 0) { if (isupper(c)) { ctr[c] += 1; sp->seq[sp->len++] = c; } else if (islower(c)) { c = toupper(c); ctr[c] += 1; sp->seq[sp->len++] = c; } } buf = read_line_buf(&sp->rb, sp->fd); } sp->seq[sp->len] = 0; buf = strstr(sp->header, " LEN="); if (buf) { char *s; if (*(buf - 1) == ';') { buf -= 1; s = buf + 6; headerLen -= 6; } else { s = buf + 5; headerLen -= 5; } while (isdigit(*s)) { s += 1; headerLen -= 1; } while (*s) *buf++ = *s++; } buf = sp->header + headerLen - 1; while (iscntrl(*buf) || isspace(*buf)) buf -= 1; res = snprintf(buf + 1, lenStr, "; LEN=%u\n", sp->len); if (res < 0 || res >= lenStr) fatal("Sequence too long: %u\n", sp->len); gc = ctr['G'] + ctr['C']; atgc = gc + ctr['A'] + ctr['T']; sp->GC_pct = (atgc == 0) ? 0.0 : 100.0 * (double) gc / (double) atgc; return 0; } static void free_seq(seq_p_t sp) { free(sp->seq); free(sp->header); free_buf(&sp->rb); if (sp->fName != NULL) close(sp->fd); } static void seq_revcomp_inplace(seq_p_t seq) { unsigned char *s = seq->seq; unsigned char *t = seq->seq + seq->len; unsigned char c; while (s < t) { c = dna_complement[*--t]; *t = dna_complement[*s]; *s++ = c; } } static void init_col(col_p_t c, unsigned int size) { c->size = size; c->nb = 0; if (size > 0) c->e.elt = (void **) xmalloc(size * sizeof(void *)); else c->e.elt = NULL; } static void add_col_elt(col_p_t c, void *elt, unsigned int grow) { if (c->size <= c->nb) { c->size += grow; c->e.elt = (void **) xrealloc(c->e.elt, c->size * sizeof(void *)); } c->e.elt[c->nb++] = elt; } #if 0 /* CI */ static void add_unique_col_elt(col_p_t c, void *elt, unsigned int grow) { unsigned int i; for (i = 0; i < c->nb; i++) if (c->e.elt[i] == elt) return; add_col_elt(c, elt, grow); } static void merge_col(col_p_t c1, col_p_t c2) { unsigned int i; for (i = 0; i < c2->nb; 
i++) add_col_elt(c1, c2->e.elt[i], COL_G_GROW); } #endif /* 0 CI */ static void free_col(col_p_t c) { #ifndef NDEBUG memset(c->e.elt, 0, c->size * sizeof(void *)); #endif free(c->e.elt); #ifndef NDEBUG memset(c, 0, sizeof(col_t)); #endif } static void CreateMatrix(matrix_p_t m, signed char *data, unsigned int nElt) { int i; unsigned int frame; unsigned int sSize = 1; unsigned int *step = (unsigned int *) xmalloc(sizeof(unsigned int) * m->order); unsigned int *sStep = (unsigned int *) xmalloc(sizeof(unsigned int) * m->order); if (m->order < 1) fatal("CreateMatrix: order should be >=1 (%d)", m->order); m->m = (signed char **) xmalloc(sizeof(signed char *) * m->frames); /* Compute some stepping info. */ step[m->order - 1] = 4; sStep[m->order - 1] = 5; for (i = m->order - 2; i >= 0; i--) { step[i] = step[i + 1] * 4; sStep[i] = sStep[i + 1] * 5; } /* Check the size of the array. */ if (step[0] * m->frames != nElt) fatal("CreateMatrix: bad array size (%d, should be %d)", nElt, step[0] * m->frames); /* Compute the score table size. */ for (frame = 0; frame < m->order; frame++) sSize *= 5; for (frame = 0; frame < m->frames; frame++) { signed char *ptr; /* Get space for the score tables. */ m->m[frame] = (signed char *) xmalloc(sizeof(signed char) * sSize); ptr = m->m[frame]; /* Process the array. */ for (i = 0; i < (int) step[0]; i++) { int val = *data++; int j; /* Do not go below min. */ val = (val < options.min) ? options.min : val; *ptr++ = val; for (j = m->order - 1; j >= 0; j--) { if ((i + 1) % step[j] == 0) { /* We have to fill in the next sStep[j]/5 score slots. */ int k; int stepping = sStep[j] / 5; if (options.Nvalue == 0) { /* Plain average thing. */ for (k = 0; k < stepping - 1; k++) { int avg = *(ptr - stepping) + *(ptr - stepping * 2) + *(ptr - stepping * 3) + *(ptr - stepping * 4); avg /= 4; *ptr++ = avg; } *ptr++ = 0; /* Null expectation to accept an N. */ } else { /* Something a bit more funky... 
*/ for (k = 0; k < stepping - 1; k++) { int avg; int sorted[4]; sorted[0] = *(ptr - stepping); sorted[1] = *(ptr - stepping * 2); sorted[2] = *(ptr - stepping * 3); sorted[3] = *(ptr - stepping * 4); /* Need to sort the darn thing... */ qsort(sorted, 4, sizeof(int), intCompare); switch(options.Nvalue) { case 1: avg = sorted[3]; break; case 2: avg = (sorted[3] + sorted[2]) / 2; break; case 3: avg = (sorted[3] + sorted[2] + sorted[1]) / 3; break; case -1: avg = sorted[0]; break; case -2: avg = (sorted[0] + sorted[1]) / 2; break; case -3: avg = (sorted[0] + sorted[1] + sorted[2]) / 3; break; default: fatal("Bad method (%d) to compute N score value.", options.Nvalue); } *ptr++ = avg; } *ptr++ = 0; /* Null expectation to accept an N. */ } } } } } free(step); free(sStep); } static inline unsigned int GetCode(unsigned char c) { switch (c) { case 'A': return 0; /* This is A. */ case 'C': return 1; /* This is C. */ case 'G': return 2; /* This is G. */ case 'T': return 3; /* This is T. */ } return 4; /* Everything else. 
                */
}

static inline void
findMax(int prev, int *prevV, int transit, int *bPrev, int *bScore)
{
  int score = prevV[prev] + transit;
  if (score > *bScore) {
    *bScore = score;
    *bPrev = prev;
  }
}

static inline void
findMax0(int prev, int *prevV, int *bPrev, int *bScore)
{
  int score = prevV[prev];
  if (score > *bScore) {
    *bScore = score;
    *bPrev = prev;
  }
}

static void
initIndices(int tsize, int startlen, int stoplen)
{
  unsigned int f;
  iBegin = -1;
  i5utr = 0;
  iStart = 1;
  iCds = iStart + startlen;
  iStop = iCds + 3;
  i3utr = iStop + stoplen;
  for (f = 0; f < 3; f++) {
    iInsAfter[f] = i3utr + f * tsize + 1;
    iDelAfter[f] = i3utr + (f + 3) * tsize - f + 1;
  }
  for (f = 0; f < 3; f++) {
    unsigned int f1;
    for (f1 = 0; f1 < 3; f1++) {
      if ((f1 + tsize) % 3 == f)
        iInsNext[f] = iInsAfter[f1] + tsize - 1;
      if ((f1 + tsize + 1) % 3 == f)
        iDelNext[f] = iDelAfter[f1] + tsize - 2;
    }
  }
}

#ifdef DEBUG
static void
printIndex(unsigned int index, unsigned int len)
{
  char *s = (char *) xmalloc(sizeof(char) * (len + 1));
  int i;
  s[len] = 0;
  for (i = len - 1; i >= 0; i--) {
    int c = index % 5;
    index /= 5;
    switch (c) {
    case 0: s[i] = 'A'; break;
    case 1: s[i] = 'C'; break;
    case 2: s[i] = 'G'; break;
    case 3: s[i] = 'T'; break;
    default: s[i] = 'N';
    }
  }
  fputs(s, stderr);
  free(s);
}

static void
printInitStatus(unsigned int states, unsigned int seqLen, unsigned int tsize,
                unsigned int tableSize, unsigned int tindex,
                unsigned int *insTindex, unsigned int *delTindex)
{
  unsigned int f, i;
  fprintf(stderr, "Begin: %d\n", iBegin);
  fprintf(stderr, "5'UTR: %d\n", i5utr);
  fprintf(stderr, "Start: %d\n", iStart);
  fprintf(stderr, "Stop: %d\n", iStop);
  fprintf(stderr, "3'UTR: %d\n", i3utr);
  for (f = 0; f < 3; f++)
    fprintf(stderr, "Frame %u: CDS %d, insert after/next %d/%d, delete after/next %d/%d\n",
            f, iCds + f, iInsAfter[f], iInsNext[f], iDelAfter[f], iDelNext[f]);
  fprintf(stderr, "states %u, seq length %u, tsize %u, tableSize %u, tindex ",
          states, seqLen, tsize, tableSize);
  printIndex(tindex, tsize);
  fprintf(stderr,
"\ninsertion indices: "); for (i = 0; i < tsize; i++) { printIndex(insTindex[i], tsize); fprintf(stderr, " "); } fprintf(stderr, "\ndeletion indices: "); for (i = 0; i < tsize - 1; i++) { printIndex(delTindex[i], tsize); fprintf(stderr, " "); } fprintf(stderr, "\n"); } static void printCurrentStatus(unsigned int p, unsigned char c, unsigned int code, unsigned int tindex, unsigned int tsize, unsigned int *insTindex, unsigned int *delTindex, unsigned int states, int *currV, int *currTr) { unsigned int i; fprintf(stderr, "%u:%c-%u: ", p, c, code); printIndex(tindex, tsize); fprintf (stderr, " /"); for (i = 0; i < tsize; i++) { fprintf(stderr, " "); printIndex(insTindex[i], tsize); } fprintf(stderr, " /"); for (i = 0; i < tsize - 1; i++) { fprintf(stderr, " "); printIndex(delTindex[i], tsize); } fprintf(stderr, "\n"); for (i = 0; i < states; i++) { if (currV[i] < INT_MIN / 3) fprintf(stderr, " -inf/%2d", currTr[i]); else fprintf(stderr, "%5d/%2d", currV[i], currTr[i]); if ((i % 10) == 9) fprintf(stderr, "\n"); } fprintf(stderr, "\n"); } #endif /* returns frame if is coding, relies on startoffset, startlength, stopoffset and stoplength to be multiples of 3, returns -1 if not coding */ static inline int getFrame(int state, int tsize, int startoffset, int stopoffset) { int f; int d = state - iStart - startoffset + 1; if (d < 0) return -1; if (state < (iStop + stopoffset)) return(d % 3); for (f = 1; f < tsize; f++) { if ((iInsAfter[(12 - f) % 3] + f) == state) return 0; if ((iInsAfter[(13 - f) % 3] + f) == state) return 1; if ((iInsAfter[(14 - f) % 3] + f) == state) return 2; if ((iDelAfter[(14 - f) % 3] + f - 1) == state) return 0; if ((iDelAfter[(15 - f) % 3] + f - 1) == state) return 1; if ((iDelAfter[(16 - f) % 3] + f - 1) == state) return 2; } return -1; } static int Compute(seq_p_t seq, col_p_t mc, col_p_t rc, int reverse, int maxScore) { matrix_p_t M[MT_COUNT]; unsigned int i, code, f, s; unsigned char *p; int iCurr, bPrev, bScore, tmpPrev, tmpScore; unsigned int 
tableSize = 1; unsigned int tindex; unsigned int *insTindex, *delTindex; unsigned int states, mSize; int *currV, *prevV, *currTr; /* Find the right matrices. */ memset(M, 0, sizeof(M)); for (i = 0; i < mc->nb; i ++) { if (seq->GC_pct >= mc->e.m[i]->CGmin && seq->GC_pct <= mc->e.m[i]->CGmax && M[mc->e.m[i]->matType] == NULL) M[mc->e.m[i]->matType] = mc->e.m[i]; } for (i = 0; i < MT_COUNT; i++) if (M[i] == NULL) fatal("We have no %d matrix for %.2f GC in:\n %s", i, seq->GC_pct, seq->header); /* initialize some more parameters */ insTindex = (unsigned int *) xmalloc(sizeof(unsigned int) * M[MT_CODING]->order); delTindex = (unsigned int *) xmalloc(sizeof(unsigned int) * (M[MT_CODING]->order - 1)); /* allocate tables and compute the state indices */ states = 2 + M[MT_START]->frames + M[MT_STOP]->frames + 6 * M[MT_CODING]->order; mSize = sizeof(int) * seq->len * states; if (maxSize < mSize) { maxSize = mSize; V = (int *) xrealloc(V, maxSize); tr = (int *) xrealloc(tr, maxSize); } currV = V; currTr = tr; /* size of score tables per frame */ for (i = 0; i < M[MT_CODING]->order; i++) tableSize *= 5; /* tindex will point to the position representing the last * M[MT_CODING]->order chars on seq */ /* tindex now represents all N's */ tindex = tableSize - 1; for (i = 0; i < M[MT_CODING]->order - 1; i++) insTindex[i] = delTindex[i] = tindex; insTindex[i] = tindex; initIndices(M[MT_CODING]->order, M[MT_START]->frames, M[MT_STOP]->frames); #ifdef DEBUG printInitStatus(states, seq->len, M[MT_CODING]->order, tableSize, tindex, insTindex, delTindex); #endif /* fill in the Viterbi and traceback tables, initialize for first char on seq */ code = GetCode(seq->seq[0]); tindex = (5 * tindex + code) % tableSize; currV[i5utr] = options.ts5uPen + M[MT_UNTRANSLATED]->m[0][tindex]; for (f = 0; f < M[MT_START]->frames; f++) currV[iStart+f] = options.min + M[MT_START]->m[f][code]; for (f = 0; f < 3; f++) currV[iCds + f] = options.tscPen + M[MT_CODING]->m[f][tindex]; for (f = 0; f < 
M[MT_STOP]->frames; f++) currV[iStop + f] = options.min + M[MT_STOP]->m[f][code]; currV[i3utr] = options.ts3uPen + M[MT_UNTRANSLATED]->m[0][tindex]; for (s = i3utr + 1; s < states; s++) currV[s] = INT_MIN / 2; for (s = 0; s < states; s++) currTr[s] = iBegin; #ifdef DEBUG printCurrentStatus(0, seq->seq[0], code, tindex, M[MT_CODING]->order, insTindex, delTindex, states, V, tr); #endif /* fill in the Viterbi and traceback tables, main part */ for (p = seq->seq + 1; *p; p++) { prevV = currV; currV += states; currTr += states; /* update index variables */ code = GetCode(*p); for (i = M[MT_CODING]->order - 1; i > 0; i--) insTindex[i] = (5 * insTindex[i - 1] + code) % tableSize; if (M[MT_CODING]->order > 2) for (i = M[MT_CODING]->order - 2; i > 0; i--) delTindex[i] = (5 * delTindex[i - 1] + code) % tableSize; insTindex[0] = tindex; delTindex[0] = (25 * tindex + 20 + code) % tableSize; tindex = (5 * tindex + code) % tableSize; /* consider current nucleotide in 5'UTR */ /* transitions UTR->UTR and CDS->CDS are presumed zero */ currV[i5utr] = prevV[i5utr] + M[MT_UNTRANSLATED]->m[0][tindex]; currTr[i5utr] = i5utr; /* consider current nucleotide in start profile */ currV[iStart] = prevV[i5utr] + options.t5ucPen + M[MT_START]->m[0][code]; currTr[iStart] = i5utr; iCurr = iStart; bPrev = iStart - 1; for (f = 1; f < M[MT_START]->frames; f++) { iCurr += 1; bPrev += 1; currV[iCurr] = prevV[bPrev] + M[MT_START]->m[f][code]; currTr[iCurr] = bPrev; } /* consider current nucleotide in CDS */ iCurr = iCds; bPrev = iCds - 1; bScore = prevV[bPrev]; findMax(i5utr, prevV, options.min, &bPrev, &bScore); findMax0(iCurr + 2, prevV, &bPrev, &bScore); findMax0(iInsNext[0], prevV, &bPrev, &bScore); findMax0(iDelNext[0], prevV, &bPrev, &bScore); currV[iCurr] = bScore + M[MT_CODING]->m[0][tindex]; currTr[iCurr] = bPrev; for (f = 1; f < 3; f++) { iCurr += 1; bPrev = iCurr - 1; bScore = prevV[bPrev]; findMax0(iInsNext[f], prevV, &bPrev, &bScore); findMax0(iDelNext[f], prevV, &bPrev, &bScore); 
currV[iCurr] = bScore + M[MT_CODING]->m[f][tindex]; currTr[iCurr] = bPrev; } /* consider current nucleotide in stop profile */ bPrev = INT_MIN; bScore = INT_MIN; for (f = M[MT_START]->offset + 2; f < M[MT_START]->frames; f += 3) findMax(iStart + f, prevV, options.min, &bPrev, &bScore); for (f = 0; f < M[MT_CODING]->order; f++) findMax(iInsAfter[(14 - f) % 3] + f, prevV, options.min, &bPrev, &bScore); for (f = 0; f < M[MT_CODING]->order - 1; f++) findMax(iDelAfter[(15 - f) % 3] + f, prevV, options.min, &bPrev, &bScore); tmpPrev = bPrev; tmpScore = bScore; findMax(iCds + 2, prevV, options.tc3uPen, &bPrev, &bScore); currV[iStop] = bScore + M[MT_STOP]->m[0][code]; currTr[iStop] = bPrev; iCurr = iStop; bPrev = iStop - 1; for (f = 1; f < M[MT_STOP]->frames; f++) { iCurr += 1; bPrev += 1; currV[iCurr] = prevV[bPrev] + M[MT_STOP]->m[f][code]; currTr[iCurr] = bPrev; } /* consider current nucleotide in 3' UTR */ bPrev = INT_MIN; bScore = INT_MIN; findMax0(i3utr - 1, prevV, &bPrev, &bScore); findMax0(i3utr, prevV, &bPrev, &bScore); findMax(iCds+2, prevV, options.min, &bPrev, &bScore); currV[i3utr] = bScore + M[MT_UNTRANSLATED]->m[0][tindex]; currTr[i3utr] = bPrev; /* consider current nucleotide in CDS after insertion */ for (f = 0; f < 3; f++) { iCurr = iInsAfter[f]; bPrev = iCds + f; currV[iCurr] = prevV[bPrev] + options.iPen; currTr[iCurr] = bPrev; iCurr = iInsAfter[f]; bPrev = iCurr - 1; for (i = 1; i < M[MT_CODING]->order; i++) { iCurr += 1; bPrev += 1; currV[iCurr] = prevV[bPrev] + M[MT_CODING]->m[(i + f) % 3][insTindex[i]]; currTr[iCurr] = bPrev; } } /* consider current nucleotide in CDS after deletion */ for (f = 0; f < 3; f++) { iCurr = iDelAfter[f]; bPrev = iCds + f; currV[iCurr] = prevV[bPrev] + options.dPen + M[MT_CODING]->m[(f + 2) % 3][delTindex[0]]; currTr[iCurr] = bPrev; iCurr = iDelAfter[f]; bPrev = iCurr - 1; for (i = 1; i < M[MT_CODING]->order - 1; i++) { iCurr += 1; bPrev += 1; currV[iCurr] = prevV[bPrev] + M[MT_CODING]->m[(i + f + 2) % 3][delTindex[i]]; 
currTr[iCurr] = bPrev; } } #ifdef DEBUG printCurrentStatus(p - seq->seq, *p, code, tindex, M[MT_CODING]->order, insTindex, delTindex, states, currV, currTr); #endif } /* fill in the Viterbi and traceback tables, terminate and find best */ bPrev = i5utr; bScore = currV[bPrev] + options.t5uePen; for (f = 0; f < M[MT_START]->frames; f++) findMax(iStart+f, currV, options.min, &bPrev, &bScore); for (f = 0; f < 3; f++) { findMax(iCds+f, currV, options.tcePen, &bPrev, &bScore); for (i = 0; i < M[MT_CODING]->order; i++) findMax(iInsAfter[f] + i, currV, options.tcePen, &bPrev, &bScore); for (i = 0; i < M[MT_CODING]->order - 1; i++) findMax(iDelAfter[f] + i, currV, options.tcePen, &bPrev, &bScore); } for (f = 0; f < M[MT_STOP]->frames; f++) findMax(iStop+f, currV, options.min, &bPrev, &bScore); findMax(i3utr, currV, options.t3uePen, &bPrev, &bScore); #ifdef DEBUG fprintf(stderr, "finished to fill Viterbi matrix, best score %d in state %d\n", bScore, bPrev); #endif /* traceback and generate coding sequences starting from bPrev (confidence bScore) */ iCurr = bPrev; p -= 1; while(iCurr != iBegin) { int iOld = -1, rStart, rStop; unsigned char *r, *q; /* skip non coding */ while((iBegin < iCurr && iCurr < iStart + M[MT_START]->offset - 1) || (iStop + M[MT_STOP]->offset - 1 < iCurr && iCurr < iInsAfter[0])) { #ifdef DEBUG fprintf(stderr, "trace back non-coding: state %d position %4d(%c)\n", iCurr, p - seq->seq, *p); #endif iCurr = currTr[iCurr]; currTr -= states; p -= 1; } /* handle coding */ if (iCurr != iBegin) { unsigned char *res = (unsigned char *) xmalloc(sizeof(unsigned char) * 2 * seq->len); int rScore = V[(p - seq->seq) * states + iCurr]; r = res; rStop = (p - seq->seq); if (getFrame(iCurr, M[MT_CODING]->order, M[MT_START]->offset, M[MT_STOP]->offset) == 0) { *r++ = 'X'; *r++ = 'X'; } if (getFrame(iCurr, M[MT_CODING]->order, M[MT_START]->offset, M[MT_STOP]->offset) == 1) *r++ = 'X'; while((iStart + M[MT_START]->offset - 1 <= iCurr && iCurr <= iStop + M[MT_STOP]->offset - 
1) || iInsAfter[0] <= iCurr) { int done = 0; for (f = 0; f < 3; f++) { if (iCurr == iInsAfter[f]) { *r++ = tolower(*p); done = 1; } if (iCurr == iDelAfter[f]) { *r++ = toupper(*p); *r++ = 'X'; done = 1; } } if (done == 0) *r++ = toupper(*p); /* remove stop-profile penalty from coding score */ if (iCurr == iCds + 2 && iOld == iStop) rScore -= options.tc3uPen; #ifdef DEBUG fprintf(stderr, "trace back coding: state %2d(%2d) position %4d(%c)\n", iCurr, getFrame(iCurr, M[MT_CODING]->order, M[MT_START]->offset, M[MT_STOP]->offset), p-seq->seq, *(r-1)); #endif iOld = iCurr; iCurr = currTr[iCurr]; currTr -= states; p -= 1; } rStart = p - seq->seq + 1; if (p >= seq->seq) rScore -= V[(p - seq->seq) * states + iCurr]; if (rScore > maxScore) maxScore = rScore; if (getFrame(iOld, M[MT_CODING]->order, M[MT_START]->offset, M[MT_STOP]->offset) == 1) *r++ = 'X'; if (getFrame(iOld, M[MT_CODING]->order, M[MT_START]->offset, M[MT_STOP]->offset) == 2) { *r++='X'; *r++='X'; } *r-- = 0; /* reverse the array and add to the result-array */ q = res; while (q < r) { unsigned char c = *r; *r-- = *q; *q++ = c; } #ifdef DEBUG fprintf(stderr, "found coding %s, add to results, state %d \n", res, iCurr); #endif if (rStop - rStart >= options.minLen) { result_p_t r = (result_p_t) xmalloc(sizeof(result_t)); r->score = rScore; r->start = rStart; r->stop = rStop; r->reverse = reverse; r->s = res; add_col_elt(rc, r, 8); } else free(res); } } /* clean up */ free(insTindex); free(delTindex); return maxScore; } static void LoadMatrix(const char *fName, col_p_t mc) { read_buf_t rb; char *buf; int fd = open(fName, O_RDONLY); if (fd == -1) fatal("Could not open file %s: %s(%d)\n", fName, strerror(errno), errno); init_col(mc, 16); init_buf(&rb); buf = read_line_buf(&rb, fd); while (rb.lc > 0) { if (strncmp(buf, "FORMAT: ", 8) == 0) { matrix_p_t m = (matrix_p_t) xmalloc(sizeof(matrix_t)); char name[256], fType[256], mType[256]; int res; unsigned int size = 256; signed char *data = (signed char *) 
xmalloc(sizeof(signed char) * size); unsigned int nElt = 0; res = sscanf(buf, "FORMAT: %255s %255s %255s %u %u %d s C+G: %lf %lf", name, fType, mType, &m->order, &m->frames, &m->offset, &m->CGmin, &m->CGmax); if (m->CGmin < 0.0) m->CGmin = 0.0; if (m->CGmin > 0.0) m->CGmin += options.percent; m->CGmax += options.percent; if (m->CGmax > 100.0) m->CGmax = 100.0; if (res != 8 || (buf = read_line_buf(&rb, fd)) == NULL || rb.lc == 0) fatal("Bad data header format in file %s, near %s (%d)\n", fName, name, res); m->name = strdup(name); m->kind = strdup(mType); m->matType = MT_UNKNOWN; if (strncmp(fType, "CODING", 6) == 0) m->matType = MT_CODING; if (strncmp(fType, "UNTRANSLATED", 12) == 0) m->matType = MT_UNTRANSLATED; if (strncmp(fType, "START", 5) == 0) m->matType = MT_START; if (strncmp(fType, "STOP", 4) == 0) m->matType = MT_STOP; while (buf[0] == '-' || isdigit(buf[0])) { int a, c, g, t; res = sscanf(buf, "%d %d %d %d", &a, &c, &g, &t); if (res != 4 || (buf = read_line_buf(&rb, fd)) == NULL) fatal("Bad data format in file %s, near %s (%d)\n", fName, name, res); if (nElt + 4 > size) { size += 256; data = (signed char *) xrealloc(data, sizeof(signed char) * size); } data[nElt++] = a; data[nElt++] = c; data[nElt++] = g; data[nElt++] = t; } CreateMatrix(m, data, nElt); add_col_elt(mc, m, 16); free(data); } else if ((buf = read_line_buf(&rb, fd)) == NULL) fatal("Probable bug in read_line_buf while reading %s\n", fName); } close(fd); free_buf(&rb); } static void remove_lc(unsigned char *s) { unsigned char *t = s; while (*s) { if (isupper(*s)) *t++ = *s; s += 1; } *t = 0; } static char * na2aa(unsigned char *s) { static char *CABC = "KNKNTTTTRSRSIIMI" /* AAA AAC ... ATT */ "QHQHPPPPRRRRLLLL" /* CAA CAC ... CTT */ "EDEDAAAAGGGGVVVV" /* GAA GAC ... GTT */ "OYOYSSSSOCWCLFLF"; /* TAA TAC ... TTT */ static char *CNBC = "XTXXXPRLXAGVXSXX"; /* AAN ACN ... 
TTN */ char *res = xmalloc(sizeof(char) * (strlen((char *) s) / 3 + 2)); char *cur = res; while (*s) { int idx = 0; /* Check first nt. */ switch (*s) { case 'A': break; case 'C': idx = 1; break; case 'G': idx = 2; break; case 'T': idx = 3; break; default: idx = -1; } s += 1; if (*s == 0) { *cur++ = 'X'; break; } if (idx == -1) { *cur++ = 'X'; s += 1; if (*s == 0) break; s += 1; continue; } idx <<= 2; /* Check second nt. */ switch (*s) { case 'A': break; case 'C': idx += 1; break; case 'G': idx += 2; break; case 'T': idx += 3; break; default: idx = -1; } s += 1; if (*s == 0) { if (idx == -1) *cur++ = 'X'; else *cur++ = CNBC[idx]; break; } if (idx == -1) { *cur++ = 'X'; s += 1; continue; } idx <<= 2; /* Check third nt. */ switch (*s) { case 'A': break; case 'C': idx += 1; break; case 'G': idx += 2; break; case 'T': idx += 3; break; default: *cur++ = CNBC[idx >> 2]; idx = -1; } if (idx != -1) *cur++ = CABC[idx]; s += 1; } *cur = 0; return res; } static void showResults(col_p_t rc, seq_p_t seq, unsigned char *rSeq, int maxScore) { unsigned int i; unsigned int cnt = 0; if (options.maxOnly != 0) { result_p_t r = NULL; if (options.out == NULL) return; for (i = 0; i < rc->nb; i++) if (rc->e.r[i]->score == maxScore) { r = rc->e.r[i]; break; } i = 0; while (!isspace(seq->header[i])) i += 1; if (r != NULL) fprintf(options.out, "%.*s %d %u %u %u %c\n", i, seq->header, r->score, r->start + 1, r->stop + 1, seq->len, r->reverse ? '-' : '+'); else fprintf(options.out, "%.*s %d\n", i, seq->header, maxScore); return; } if (options.all != 0) { fprintf(stderr, "Sorry, options.all is unimplemented yet...\n"); /* my $scores = ""; my $outSeq = ""; my $lastPos = 0; foreach my $r (@$rres) { my $curScore = $$r[0]; $scores .= $curScore . 
" "; if ($theRealMax * $main::both > $curScore) { next; } if ($$r[1] > $lastPos) { my $s = substr($seq->{_seq}, $lastPos, $$r[1] - $lastPos); $s =~ tr/A-Z/a-z/; $outSeq .= $s; } $outSeq .= $$r[3]; $lastPos = $$r[2] + 1; } if ($lastPos < $seq->seqLength) { my $s = substr($seq->{_seq}, $lastPos); $s =~ tr/A-Z/a-z/; $outSeq .= $s; } my $head = $seq->seqHead; $head =~ s/^(\S+)/$1 $scores/; print $main::out "$head\n"; $outSeq =~ s/(.{$main::sWidth})/$1\n/g; $outSeq =~ s/\s+$//; # remove a trailing newline, since we add one below. print $main::out "$outSeq\n"; */ return; } for (i = 0; i < rc->nb; i++) { result_p_t r = rc->e.r[i]; char *h = seq->header; size_t len = strlen(h); char *buf, *ptr; if ((double) maxScore * options.both > (double) r->score) continue; buf = (char *) xmalloc((len + 256) * sizeof(char)); ptr = buf; while (*h && *h != '|' && !isspace(*h)) *ptr++ = *h++; if (*h == '|') { while (*h && *h != '|' && !isspace(*h)) *ptr++ = *h++; } if (cnt > 0) *ptr++ = 'a' + cnt - 1; cnt += 1; while (*h && !isspace(*h)) *ptr++ = *h++; sprintf(ptr, " %d %d %d %s", r->score, r->start + 1, r->stop + 1, h); if (r->reverse != 0) { char *t = strstr(buf, "; minus strand"); if (t != NULL) { while (*(t + 14) != 0) { *t = *(t + 14); t += 1; } *t = 0; } else { len = strlen(buf); while (isspace(buf[len - 1])) len -= 1; strcpy(buf + len, "; minus strand\n"); } } if (options.transl != NULL) { char *ps; remove_lc(r->s); ps = na2aa(r->s); len = strlen(buf); while (isspace(buf[len - 1])) len -= 1; fprintf(options.transl, "%.*s; translated\n", (unsigned int) len, buf); len = strlen(ps); ptr = ps; /* remove trailing stop codon(s). */ while (len > 0 && ps[len - 1] == 'O') { len -= 1; ps[len] = 0; } /* translate intermediate stop codons by X. 
*/ while (*ptr) { if (*ptr == 'O') *ptr = 'X'; ptr += 1; } ptr = ps; while (len > options.sWidth) { fprintf(options.transl, "%.*s\n", options.sWidth, ptr); len -= options.sWidth; ptr += options.sWidth; } fprintf(options.transl, "%s\n", ptr); free(ps); } if (options.out != NULL) { fputs(buf, options.out); if (options.no_del != 0) remove_lc(r->s); len = strlen((char *) r->s); ptr = (char *) r->s; while (len > options.sWidth) { fprintf(options.out, "%.*s\n", options.sWidth, ptr); len -= options.sWidth; ptr += options.sWidth; } fprintf(options.out, "%s\n", ptr); } free(buf); } } static void process_file(const char *fName, col_p_t mc) { seq_t seq; col_t rc; init_col(&rc, 8); init_seq(fName, &seq); while (get_next_seq(&seq) == 0) { unsigned int i; int maxScore = INT_MIN; unsigned char *rSeq = NULL; if (seq.len == 0) continue; maxScore = Compute(&seq, mc, &rc, 0, maxScore); if (options.single == 0) { if (options.all) rSeq = (unsigned char *) strdup((char *) seq.seq); seq_revcomp_inplace(&seq); maxScore = Compute(&seq, mc, &rc, 1, maxScore); if (options.all) { unsigned char *tem = rSeq; rSeq = seq.seq; seq.seq = tem; } } showResults(&rc, &seq, rSeq, maxScore); for (i = 0; i < rc.nb; i++) { result_p_t r = rc.e.r[i]; free(r->s); free(r); } rc.nb = 0; free(rSeq); } free_seq(&seq); free_col(&rc); /* my $bigMax = ESTScan::Compute($seq->{_seq}, $main::iPen, $main::dPen, $main::min, $main::maxOnly == 0 ? \@res : undef, $matIndex, $utrMatIndex, $startMatIndex, $stopMatIndex, $main::minLen); my $result_nr = -1; if ($main::single != 0) { showResults($seq, \@res, $bigMax, \$result_nr, $bigMax); next; } my @resRev; my $seqRev = $seq->revComp; my $bigMaxRev = ESTScan::Compute($seqRev->{_seq},$main::iPen,$main::dPen,$main::min, $main::maxOnly == 0 ? \@resRev : undef, $matIndex, $utrMatIndex, $startMatIndex, $stopMatIndex, $main::minLen); my $theRealMax = $bigMax >= $bigMaxRev ? 
$bigMax : $bigMaxRev; showResults($seq, \@res, $bigMax, \$result_nr, $theRealMax); showResults($seqRev, \@resRev, $bigMaxRev, \$result_nr, $theRealMax); } */ } int main(int argc, char *argv[]) { const char *ESTScanDir; int getHelp = 0; col_t mc; #ifdef DEBUG unsigned int i; mcheck(NULL); mtrace(); #endif argv0 = argv[0]; if (setlocale(LC_ALL, "POSIX") == NULL) fprintf(stderr, "%s: Warning: could not set locale to POSIX\n", argv[0]); /* Default options. */ ESTScanDir = getenv("ESTSCANDIR"); if (ESTScanDir == NULL) ESTScanDir = "/usr/molbio/share/ESTScan"; options.matrix = xmalloc((strlen(ESTScanDir) + 9) * sizeof(char)); strcat(strcpy(options.matrix, ESTScanDir), "/Hs.smat"); options.min = -100; options.dPen = -50; options.iPen = -50; options.ts5uPen = -10; options.tscPen = -10; options.ts3uPen = -5; options.t5ucPen = -80; options.t5uePen = -40; options.tc3uPen = -80; options.tcePen = -40; options.t3uePen = -20; options.percent = 4.0; options.Nvalue = 0; options.sWidth = 60; options.all = 0; options.maxOnly = 0; options.skipLen = 1; options.minLen = 50; options.transl = NULL; options.out = stdout; options.both = 1.0; options.no_del = 0; options.single = 0; while (1) { int c = getopt(argc, argv, "ab:d:hi:l:M:m:N:nOo:p:Ss:T:t:vw:"); if (c == -1) break; switch (c) { case 'a': options.all = 1; break; case 'b': options.both = atof(optarg); break; case 'd': options.dPen = atoi(optarg); break; case 'h': getHelp = 1; break; case 'i': options.iPen = atoi(optarg); break; case 'l': options.minLen = atoi(optarg); break; case 'M': options.matrix = optarg; break; case 'm': options.min = atoi(optarg); break; case 'N': options.Nvalue = atoi(optarg); break; case 'n': options.no_del = 1; break; case 'O': options.maxOnly = 1; break; case 'o': if (strcmp(optarg, "-") == 0) { options.out = stdout; } else { options.out = fopen(optarg, "w"); if (options.out == NULL) fatal("Couldn't create file %s: %s (%d)\n", optarg, strerror(errno), errno); } break; case 'p': options.percent = 
atof(optarg); break; case 'S': options.single = 1; break; case 's': options.skipLen = atoi(optarg); break; case 'T': /* if (defined $opt{'T'}) { my $nbProbs = ($opt{'T'} =~ s/,/,/g) + 1; if (($nbProbs) != 8) { usage("Wrong number of transition probabilities(" . $nbProbs . ")"); } ($main::ts5uPen,$main::tscPen,$main::ts3uPen,$main::t5ucPen, $main::t5uePen,$main::tc3uPen,$main::tcePen,$main::t3uePen) = split(/,/,$opt{'T'}); } */ fputs("Option T is not yet implemented...\n", stderr); return 1; case 't': if (strcmp(optarg, "-") == 0) { options.transl = stdout; if (options.out == options.transl) options.out = NULL; } else { options.transl = fopen(optarg, "w"); if (options.transl == NULL) fatal("Couldn't create file %s: %s (%d)\n", optarg, strerror(errno), errno); } break; case 'v': fputs(Version, stderr); return 0; case 'w': options.sWidth = atoi(optarg); break; case '?': break; default: fprintf(stderr, "?? getopt returned character code 0%o ??\n", c); } } if (getHelp) { fprintf(stderr, Usage, argv[0], options.both, options.dPen, options.iPen, options.minLen, options.matrix, options.min, options.Nvalue, options.percent, options.skipLen, options.ts5uPen, options.tscPen, options.ts3uPen, options.t5ucPen, options.t5uePen, options.tc3uPen, options.tcePen, options.t3uePen, options.sWidth); return 1; } LoadMatrix(options.matrix, &mc); #ifdef DEBUG fprintf(stderr, "We have loaded %u matrices:\n", mc.nb); for (i = 0; i < mc.nb; i++) { matrix_p_t m = mc.e.m[i]; fprintf(stderr, "%u: %s %s %d %.2f %.2f %u %u %d\n", i, m->name, m->kind, m->matType, m->CGmin, m->CGmax, m->order, m->frames, m->offset); } #endif if (optind >= argc) process_file(NULL, &mc); else while (optind < argc) process_file(argv[optind++], &mc); #ifdef DEBUG for (i = 0; i < mc.nb; i++) { matrix_p_t m = mc.e.m[i]; unsigned int j; for (j = 0; j < m->frames; j++) free(m->m[j]); free(m->m); free(m->name); free(m->kind); free(m); } free_col(&mc); free(V); free(tr); free(options.matrix); #endif return 0; } 
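The context-table indexing used by Compute() and printIndex() above can be illustrated on its own. The following is a minimal sketch, not part of ESTScan: it assumes only the GetCode() mapping (A=0, C=1, G=2, T=3, anything else=4) and the update `tindex = (5 * tindex + code) % tableSize` seen in the source. The names base_code, table_size, roll_index, index_of and decode_index are hypothetical and introduced for illustration only.

```c
#include <assert.h>
#include <stddef.h>

/* Same mapping as GetCode() in the source: A, C, G, T, everything else. */
static unsigned int
base_code(unsigned char c)
{
  switch (c) {
  case 'A': return 0;
  case 'C': return 1;
  case 'G': return 2;
  case 'T': return 3;
  }
  return 4;
}

/* Number of slots in one score table: 5 to the power of order. */
static unsigned int
table_size(unsigned int order)
{
  unsigned int size = 1;
  while (order-- > 0)
    size *= 5;
  return size;
}

/* Shift one nucleotide into the window: the oldest base-5 digit drops
   out through the modulo, the new code becomes the lowest digit. */
static unsigned int
roll_index(unsigned int tindex, unsigned char c, unsigned int tsize)
{
  return (5 * tindex + base_code(c)) % tsize;
}

/* Hypothetical helper: index of the last `order` nucleotides of seq,
   starting from the all-N index (tsize - 1) as Compute() does. */
static unsigned int
index_of(const char *seq, unsigned int order)
{
  unsigned int tsize = table_size(order);
  unsigned int tindex = tsize - 1;
  assert(order > 0);
  for (; *seq; seq++)
    tindex = roll_index(tindex, (unsigned char) *seq, tsize);
  return tindex;
}

/* Decode an index back to letters, in the manner of printIndex().
   out must have room for order + 1 characters. */
static void
decode_index(unsigned int index, unsigned int order, char *out)
{
  unsigned int i;
  out[order] = 0;
  for (i = order; i > 0; i--) {
    out[i - 1] = "ACGTN"[index % 5];
    index /= 5;
  }
}
```

With an order of 3, index_of("ACG", 3) gives 0*25 + 1*5 + 2 = 7, and decode_index(7, 3, buf) gives back "ACG". Under the same reading, the deletion update `(25 * tindex + 20 + code) % tableSize` in Compute() appears to correspond to rolling in an N (code 4) followed by the current nucleotide.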
estscan-3.0.3/prepare_data0000755000551200011300000004340610602002376014712 0ustar chrisludwig
#!/usr/bin/env perl
# $Id: prepare_data,v 1.7 2007/03/26 17:38:06 c4chris Exp $
################################################################################
#
# prepare_data
# ------------
#
# Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch
# Christian Iseli, LICR ITO, Christian.Iseli@licr.org
#
# Copyright (c) 2006 Swiss Institute of Bioinformatics. All rights reserved.
#
################################################################################

use strict;
use FASTAFile;
use Symbol;

# global variables
my $verbose = 1;
my $dosplit = 0;
my $forcedminmask = undef;
my $datadir = '.';
my $filestem = '';

require "build_model_utils.pl";

################################################################################
#
# Check command-line for switches
#

my $usage = "Usage: prepare_data [options] \n" .
    " where options are:\n" .
    " -q don't log on terminal\n" .
    " -e split extracted data into training and test sets\n" .
    " -m force minimal mask, overwrites entry in config-files\n" .
"More information is obtained using 'perldoc prepare_data'\n"; while ($ARGV[0] =~ m/^-/) { if ($ARGV[0] eq '-q') { shift; $verbose = 0; next; } if ($ARGV[0] eq '-e') { shift; $dosplit = 1; next; } if ($ARGV[0] eq '-m') { shift; $forcedminmask = shift; next; } die "Unrecognized switch: $ARGV[0]\n$usage"; } if ($#ARGV < 0) { die "No configuration file specified\n$usage"; } ################################################################################ # # Main-loop through all specified config-files # my $parFile; while($parFile = shift) { my($organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir2, $filestem2, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders) = readConfig($parFile, undef, $forcedminmask, undef, undef, undef, undef, undef, undef, $verbose); log_open("readconfig.log"); $datadir = $datadir2; $filestem = $filestem2; showConfig($parFile, $organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders); log_close(); # prepare data log_open("prepare_data.log"); my $gc_histogram = analyzeGC($rnafile); my $isochores = computeIsochores($nb_isochores, $isochore_borders, $gc_histogram); if ($dosplit) { splitTraining($rnafile, $trainingfile, $testfile, $isochores); } else { symlink $rnafile, $trainingfile; symlink $rnafile, $testfile; splitIsochores($rnafile, $isochores); } maskRedundancy($minmask, $isochores); log_print("\nGenerating evaluation mRNA data...."); split_mRNAs($testfile, $utrfile, $cdsfile); log_close(); print "$parFile done.\n"; } exit(0); 
################################################################################ # # Analyze GC-content distribution # sub analyzeGC { # crawl through the RNAs collected mRNA and establish a GC-contents histogram my($rnafile) = @_; my $datfh = gensym; log_print("\nAnalyzing GC contents..."); my $nb_seqs = 0; my @gc_histogram; if (-s "$datadir/Report/gc.dat") { log_print(" - loading existing GC-content histogram..."); open($datfh, "$datadir/Report/gc.dat"); while(<$datfh>) { my($index, $count) = split; $gc_histogram[$index] = $count; $nb_seqs += $count; } close($datfh); log_print(" read $nb_seqs sequences"); } else { log_print(" - generating GC-content histogram..."); my $e; my $src = FASTAFile->new("$rnafile"); $src->openStream; while(defined ($e = $src->getNext)) { my $gc = int(gc_content($e->{_seq})); $gc_histogram[$gc] += 1; $nb_seqs += 1; } close($src->{_BTFfile}); log_print(" read $nb_seqs sequences"); # write datafile open($datfh, ">$datadir/Report/gc.dat"); for (my $i = 0; $i < 100; $i++) { if (!defined $gc_histogram[$i]) { print $datfh $i, " 0\n"; } else { print $datfh $i, " ", $gc_histogram[$i], "\n"; } } close($datfh); } # write the gnu-script for the gc-histogram my $scriptfh = gensym; open($scriptfh, ">$datadir/Report/gc.gplot"); print $scriptfh < $currentTop) { push(@isochore_borders, $i); $currentTop += $isoIncrement; } } push(@isochore_borders, 100); } my($buffer) = " isochores used: "; for ($i = 0; $i < $#isochore_borders; $i++) { $isochores[$i] = $isochore_borders[$i] . "-" . $isochore_borders[$i+1]; $buffer .= $isochores[$i] . ", "; } chop($buffer);chop($buffer); log_print("$buffer"); return \@isochores; } sub splitTraining { # Splits the mRNA data into training set and test set. Moreover, # generates the isochore partitionning for the training set. 
my($rnafile, $trainingfile, $testfile, $isochores_ref) = @_; my @isochores = @{$isochores_ref}; my $isodir = "$datadir/Isochores"; log_print("\nSplit mRNA data into isochores, test and training data..."); log_print(" - writing isochores..."); my($isochore, @isochorefhs, $isochorefh); # check whether isochores and testfile have already been computed my($isochoresReady) = 1; foreach $isochore (@isochores) { if (!(-s "$isodir/mrna$isochore.seq")) { $isochoresReady=0; next; } log_print(" $isodir/mrna$isochore.seq already exists"); } if (-s "$trainingfile") { log_print(" $trainingfile already exists"); } else { $isochoresReady = 0; } if (-s "$testfile") { log_print(" $testfile already exists"); } else { $isochoresReady = 0; } # write trainingfile, testfile and isochores (CDS) if ($isochoresReady == 1) { log_print(" all files exist, skipped"); } else{ # open all isochore files for writing foreach $isochore (@isochores) { $isochorefh = gensym; open($isochorefh, ">$isodir/mrna$isochore.seq"); push(@isochorefhs, $isochorefh); } my $trainingfh = gensym; open($trainingfh, ">$trainingfile"); my $testfh = gensym; open($testfh, ">$testfile"); # read mRNAs and write the isochores my $e; my $testSeqs = 0; my %isoSeqs; my $src = FASTAFile->new("$rnafile"); $src->openStream; while(defined ($e = $src->getNext)) { my $gc = gc_content($e->{_seq}); for (my $i = 0; $i <= $#isochores; $i++) { my($low, $high) = ($isochores[$i] =~ m/^([^\-]+)\-(.*)$/); if ($gc < $high) { my($id, $x, $begin, $end, $desc) = split(/ /, $e->{_seqHead}, 5); $isoSeqs{$isochores[$i]}++; if ($isoSeqs{$isochores[$i]}%2) { $e->printFASTA($testfh); $testSeqs++; } else { $e->printFASTA($trainingfh); my $fh = $isochorefhs[$i]; print $fh &genRNAEntry($e->ac, $desc, $begin, $end, $e->{_seq}); } last; } } } close($src->{_BTFfile}); # close the isochore files foreach $isochorefh (@isochorefhs) { close($isochorefh); } close($trainingfh); close($testfh); foreach (sort(keys(%isoSeqs))) { log_print(" ", $isoSeqs{$_}, " 
sequences found in isochore $_"); } log_print(" $testSeqs of these written into $testfile"); } } sub splitIsochores { # Split the mRNAs into isochores (CDS). my($rnafile, $isochores_ref) = @_; my @isochores = @{$isochores_ref}; my($isochore, @isochorefhs, $isochorefh); my $isodir = "$datadir/Isochores"; log_print("\nSplit mRNA data into isochores..."); # check whether isochores have already been computed my($isochoresReady) = 1; foreach $isochore (@isochores) { if (!(-s "$isodir/mrna$isochore.seq")) { $isochoresReady=0; next; } log_print(" isochore $isodir/mrna$isochore.seq already exists"); } # write isochores (CDS) if ($isochoresReady == 1) { log_print(" all files exist, skipped"); } else{ # open all isochore files for writing foreach $isochore (@isochores) { $isochorefh = gensym; open($isochorefh, ">$isodir/mrna$isochore.seq"); push(@isochorefhs, $isochorefh); } # read mRNA and write the isochores my $e; my %isoSeqs; my $src = FASTAFile->new("$rnafile"); $src->openStream; while(defined ($e = $src->getNext)) { my $gc = gc_content($e->{_seq}); for (my $i = 0; $i <= $#isochores; $i++) { my($low, $high) = ($isochores[$i] =~ m/^([^\-]+)\-(.*)$/); if ($gc < $high) { $isoSeqs{$isochores[$i]}++; my($id, $x, $begin, $end, $desc) = split(/ /, $e->{_seqHead}, 5); my $fh = $isochorefhs[$i]; print $fh &genRNAEntry($e->ac, $desc, $begin, $end, $e->{_seq}); last; } } } close($src->{_BTFfile}); # close the isochore files foreach $isochorefh (@isochorefhs) { close($isochorefh); } foreach (sort(keys(%isoSeqs))) { log_print(" ", $isoSeqs{$_}, " sequences found in isochore $_"); } } } sub genRNAEntry { my($id, $desc, $cdsBegin, $cdsEnd, $seq) = @_; $seq =~ s/(.{80})/$1\n/g; return ">tem|$id CDS: $cdsBegin $cdsEnd $desc\n$seq\n"; } sub maskRedundancy { # replace redundant regions by 'N' my($minmask, $isochores_ref) = @_; my @isochores = @{$isochores_ref}; log_print("\nMasking redundancy from isochores..."); my $isochore; my $isodir = "$datadir/Isochores"; foreach $isochore 
(@isochores) { my $maskedFile = "$isodir/mrna$isochore\_mr$minmask.seq"; if (-s $maskedFile) { log_print(" - $maskedFile already exists, skipped"); } else { my $infile = "$isodir/mrna$isochore.seq"; log_print(" - masking redundancy in isochore $isochore"); system("maskred -m $minmask < $infile > $maskedFile"); } my $masked = `tail -1 $maskedFile`; $masked =~ m/^>masked nucleotides: (\d+)/; my $m = $1; my $nts = `grep -v '^>' $maskedFile | wc -cl`; $nts =~ s/^\s*(\d+)\s+(\d+)\s*$/$2-$1/e; my $pct = int(10000*$m/$nts + 0.5) / 100; log_print(" masked $m of $nts nucleotides ($pct%)"); } } sub split_mRNAs { # reads testfile and splits the entries into untranslated and # coding sequences according to the annotation expected in the # FASTA headers. my($testfile, $utrfile, $cdsfile) = @_; log_print(" - splitting mRNAs into UTRs and CDSs..."); if ((-s $utrfile) && (-s $cdsfile)) { log_print(" UTR and CDS files already exist, skipped"); return; } my $e; my $utrfh = gensym; open($utrfh, ">$utrfile"); my $cdsfh = gensym; open($cdsfh, ">$cdsfile"); my $src = FASTAFile->new($testfile); $src->openStream; while(defined($e = $src->getNext)) { if ($e->{_seqHead} =~ m/CDS: (\S+) (\S+)/) { my $cdsStart = $1; my $cdsEnd = $2; my $cdsLen = $cdsEnd - $cdsStart + 1; my $head = $e->{_seqHead}; my $seq = $e->{_seq}; $e->{_seqHead} =~ s/CDS: (\S+) (\S+)/CDS: 1 $cdsLen/; $e->{_seq} = substr($e->seq, $cdsStart - 1, $cdsLen); $e->printFASTA($cdsfh); $e->{_seqHead} = "$head, 5'UTR"; $e->{_seq} = substr($seq, 0, $cdsStart - 1); $e->printFASTA($utrfh); $e->{_seqHead} = "$head, 3'UTR"; $e->{_seq} = substr($seq, $cdsEnd); $e->printFASTA($utrfh); } } close($src->{_BTFfile}); close($cdsfh); close($utrfh); my $nbUTRs = `grep -c '^>' $utrfile`; chop($nbUTRs); my $nbCDSs = `grep -c '^>' $cdsfile`; chop($nbCDSs); log_print(" found $nbUTRs untranslated regions and $nbCDSs coding sequences"); } ################################################################################ # # Documentation # =head1 NAME 

prepare_data - prepare training and test data for ESTScan

=head1 SYNOPSIS

prepare_data [options]

=head1 DESCRIPTION

prepare_data prepares training and test data to generate codon usage
tables for ESTScan. The script reads the configuration files given on the
command line and performs the following steps for each of them:

 - Split the data into isochores and test sets
 - Mask redundant pieces of sequences
 - Extract untranslated regions from test mRNA

Files which already exist are reused. If an existing file is to be
recomputed, it must be deleted before the script is run again. To use a
particular collection of mRNA instead of data extracted by extract_mRNA,
it is enough to provide it in FASTA format under the name of the mRNA file
(where extract_mRNA would store the extracted data).

If the '-e' switch is provided, prepare_data splits the data extracted
from the given databases or provided by the user into training and test
data.

=head1 DIRECTORY STRUCTURE

prepare_data uses the following directory structure, the root of which is
given in the configuration file. Below this root it uses the following
subdirectories:

 - Isochores: data split in isochores, some with redundancy masked
 - Report: contains all log files

The mRNA data and the test and training data files are deposited in the
data-root directory unless otherwise specified in the configuration file.

=head1 OPTIONS

 -q quiet

Do not log on terminal.

 -e split data into test and training

If this switch is provided, the extracted mRNA data is split into a
training set and a test set. As sequences are extracted, they are
alternately deposited in the training and the test set.

 -m

Minimum length of consecutive runs of nucleotides masked from the original
sequences in order to limit data redundancy. Pieces of sequence are only
masked if all of their nucleotides are part of recurring 12-tuples which
overlap by at least 4 nucleotides. This switch overrides the variable
$minmask from parameter files.

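The alternating split described for the '-e' switch can be sketched as follows. This is only an illustration under stated assumptions, not the script's own code: gc_percent and split_alternately are hypothetical helper names, the sequences are made up, and the borders are the documented default isochore borders.

```perl
use strict;
use warnings;

# Hypothetical helper: GC content of a sequence as a percentage (0..100).
sub gc_percent {
    my ($seq) = @_;
    return 0 unless length $seq;
    my $gc = ($seq =~ tr/GCgc//);    # tr with an empty replacement only counts
    return 100 * $gc / length $seq;
}

# Hypothetical helper: put each sequence into the first isochore whose upper
# border exceeds its GC percentage, then alternate within that isochore:
# the 1st, 3rd, ... sequence goes to the test set, the 2nd, 4th, ... to the
# training set. A sequence whose GC percentage is not below the last border
# matches no isochore and is left out.
sub split_alternately {
    my ($borders, @seqs) = @_;
    my (%count, @training, @test);
    foreach my $seq (@seqs) {
        my $gc = gc_percent($seq);
        for (my $i = 0; $i < $#$borders; $i++) {
            next unless $gc < $borders->[$i + 1];
            my $iso = $borders->[$i] . "-" . $borders->[$i + 1];
            if (++$count{$iso} % 2) { push @test, $seq; }
            else                    { push @training, $seq; }
            last;
        }
    }
    return (\@training, \@test);
}

my @borders = (0, 43, 47, 51, 100);
my @seqs = ("ATATATATGC", "ATATATATCG", "GCGCGCGCAT", "ATGCATGCAT");
my ($training, $test) = split_alternately(\@borders, @seqs);
print scalar(@$training), " training, ", scalar(@$test), " test sequences\n";
```

Because the counter is kept per isochore, each isochore is split roughly in half on its own, rather than the input as a whole.
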
=head1 CONFIGURATION FILE

The parameters defined in the configuration file have the following
meaning:

 * $organism (mandatory)
   The desired organism as it is given in EMBL "OS" or RefSeq "ORGANISM"
   lines.

 * $dbfiles
   Files from which full-length mRNA sequences are to be extracted. The
   scripts try to guess whether the files come from EMBL or RefSeq. If
   this is not specified, a collection of mRNA is expected in $rnafile.

 * $datadir (mandatory)
   Base directory where all of the above files are located and the
   temporary result files are stored.

 * $rnafile (default is "$datadir/mrna.seq")
   Name of the file with the extracted mRNA entries.

 * $smatfile (default is "$datadir/Matrices/$filestem.smat")
   Name of the file where the HMM-model is to be written.

 * $nb_isochores (default is 0)
   Number of isochores, when isochores are to be determined automatically
   from the GC-content distribution as equal-sized groups. 0 means no
   automatic detection.

 * @isochore_borders (default is (0, 43, 47, 51, 100))
   Array of GC percentages where isochores are split; the first entry is
   usually 0 and the last 100. This is overridden when $nb_isochores is
   not zero.

 * $tuplesize (default is 6)
   Size of tuples counted for codon statistics. This is overridden by the
   -t switch.

 * $minmask (default is 30)
   Minimal run of consecutive nucleotides masked as redundant. This is
   overridden by the -m switch.

 * $pseudocounts (default is 1)
   Pseudocount to be added when generating the codon usage tables,
   overridden by -p.

 * $minscore (default is -100)
   Minimum score attributed to log-odds and log-probabilities, overridden
   by -s.
* $startlength, $startpreroll (default 2+ceil(tuplesize/3) and 2) number of nucleotide triplets contained in the start profile and how many of these are contained in the 5' untranslated region, overwritten by -l (length) and -r (preroll) * $stoplength, $stoppreroll (default 2+ceil(tuplesize/3) and 2) number of nucleotide triplets contained in the stop profile and how many of these are contained in the coding sequence, overwritten by -L (length) and -R (preroll) * $estscanparams (default is "-m -50 -d -50 -i -50 -N 0") parameters passed to ESTScan during evaluation $filestem is used to generate many filenames. It is generated automatically according to the tuplesize, the minmask and the pseudocounts applied to generate them. =head1 AUTHOR Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch =cut # # End of file # ################################################################################ estscan-3.0.3/winsegshuffle.f0000644000551200011300000001040110214061324015337 0ustar chrisludwig* Program winsegshuffle * $Id: winsegshuffle.f,v 1.2 2005/03/10 15:08:04 c4chris Exp $ *----------------------------------------------------------------------* * Function: generates a window-segment shuffled database * from a source sequence database * Author: Philipp Bucher *----------------------------------------------------------------------* Parameter (NERR= 0) Character*01 CSEQ(11000000) Character*01 CSEG(100) Integer IWIN(100000) Character*132 RCIN Character*64 CHHA * Read command line NTOT=10000000 NLEN=200 NSEG=10 NWIN=20 LINL=70 IDUM=-100 Call Repar * (NTOT,NLEN,NSEG,NWIN,IDUM,IRC) If(IRC.NE.0) then Write(NERR,'( * ''Usage: winsegshuffle '', * ''[db_size [seq_length [seg_length '', * ''[window [iran-seed]]]]]'' * )') Stop End if C Print *,NTOT C Print *,NLEN C Print *,NSEG C Print *,NWIN C Print *,IDUM * Read input sequences K1=0 1 Read(5,'(A)',End=40) RCIN If(RCIN(1:1).EQ.'>') then If(K1.GE.NTOT) Go to 50 Else L=Lblnk(RCIN) Read(RCIN,'(132A)')(CSEQ(ii1),ii1=K1+1,K1+L) 
K1=K1+L End if Go to 1 40 NTOT = K1 - L * Do segment window shuffling 50 Continue Do I1=1,NTOT,NSEG*NWIN Call Permut(IWIN,NWIN,IDUM) * Write(6,'(20I3)')(IWIN(ii1),ii1=1,20) K3=0 Do I2=1,NWIN N3=I1+(IWIN(I2)-1)*NSEG Do I3=1,NSEG K3=K3+1 N3=N3+1 CSEG(K3)=CSEQ(N3) End do End do K2=0 Do I2=I1+1,I1+NSEG*NWIN K2=K2+1 CSEQ(I2)=CSEG(K2) End do End do * Print sequence: K1=0 Do I1=1,NTOT,NLEN K1=K1+1 Write(CHHA,'(''>'',2I5)') NLEN,K1 Do I2=2,11 If(CHHA(I2:I2).EQ.' ') CHHA(I2:I2)='0' End do CHHA=CHHA(1:11) // ' ..' Write(6,'(64A)')(CHHA(ii1:ii1),ii1=1,Lblnk(CHHA)) Write(6,'((70A))')(CSEQ(ii1),ii1=I1+1,I1+NLEN) End do 100 Stop 900 Go to 100 End *----------------------------------------------------------------------* FUNCTION RAN2(IDUM) PARAMETER (M=714025,IA=1366,IC=150889,RM=1.4005112E-6) DIMENSION IR(97) DATA IFF /0/ IF(IDUM.LT.0.OR.IFF.EQ.0)THEN IFF=1 IDUM=MOD(IC-IDUM,M) DO 11 J=1,97 IDUM=MOD(IA*IDUM+IC,M) IR(J)=IDUM 11 CONTINUE IDUM=MOD(IA*IDUM+IC,M) IY=IDUM ENDIF J=1+(97*IY)/M IF(J.GT.97.OR.J.LT.1) PAUSE IY=IR(J) RAN2=IY*RM IDUM=MOD(IA*IDUM+IC,M) IR(J)=IDUM RETURN END *----------------------------------------------------------------------* Subroutine Permut(IR,NR,IDUM) Integer IR(*) Do 10 I1=1,NR IR(I1)=I1 10 Continue Do 20 I1=NR,2,-1 J1=IR(I1) RS=RAN2(IDUM) K1=INT(RS*I1)+1 IR(I1)=IR(K1) IR(K1)=J1 20 Continue Return End *----------------------------------------------------------------------* Function Lblnk(string) Character*(*) string L=Len(string) Do 9 I1=L,1,-1 If(STRING(I1:I1).NE.' 
') go to 10 9 Continue 10 Lblnk=I1 Return End *----------------------------------------------------------------------* Subroutine Repar * (NTOT,NLEN,NSEG,NWIN,IDUM,IRC) Character*64 CARG IRC=0 N1=Iargc() Do I1=1,N1 Call GetArg(I1,CARG) If(CARG(1:1).NE.'-') then If(I1.EQ.1) Read(CARG,*,Err=900) NTOT If(I1.EQ.2) Read(CARG,*,Err=900) NLEN If(I1.EQ.3) Read(CARG,*,Err=900) NSEG If(I1.EQ.4) Read(CARG,*,Err=900) NWIN If(I1.EQ.5) Read(CARG,*,Err=900) IDUM End if End do If(IDUM.GT.0) IDUM=-1-IDUM 100 Return 900 IRC=1 Go to 100 End *----------------------------------------------------------------------* estscan-3.0.3/build_model0000755000551200011300000002755610541762732014565 0ustar chrisludwig#!/usr/bin/env perl # $Id: build_model,v 1.7 2006/12/19 13:15:06 c4chris Exp $ ################################################################################ # # build_model # ----------- # # Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch # Christian Iseli, LICR ITO, Christian.Iseli@licr.org # # Copyright (c) 1999, 2006 Swiss Institute of Bioinformatics. # All rights reserved. # ################################################################################ use strict; use Symbol; # global variables my $verbose = 1; my $forcedtuplesize = undef; my $forcedminmask = undef; my $forcedpseudocounts = undef; my $forcedminscore = undef; my $forcedstartlength = undef; my $forcedstartpreroll = undef; my $forcedstoplength = undef; my $forcedstoppreroll = undef; my $datadir = '.'; my $filestem = ''; require "build_model_utils.pl"; ################################################################################ # # Check command-line for switches # my $usage = "Usage: build_model [options] \n" . " where options are:\n" . " -q don't log on terminal\n" . " -t force tuple size, overwrites entry in config-files\n" . " -m force minimal mask, overwrites entry in config-files\n" . " -p force pseudocounts, overwrites entry in config-files\n" . 
" -s force minimal score, overwrites entry in config-files\n" . " -l force length of start profile (in codons/triplets)\n" . " -r force start profile's preroll in 5'UTR (in codons/triplets)\n" . " -L force length of stop profile (in codons/triplets)\n" . " -R force stop profile's preroll in 5'UTR (in codons/triplets)\n" . "More information is obtained using 'perldoc build_model'\n"; while ($ARGV[0] =~ m/^-/) { if ($ARGV[0] eq '-q') { shift; $verbose = 0; next; } if ($ARGV[0] eq '-t') { shift; $forcedtuplesize = shift; next; } if ($ARGV[0] eq '-m') { shift; $forcedminmask = shift; next; } if ($ARGV[0] eq '-p') { shift; $forcedpseudocounts = shift; next; } if ($ARGV[0] eq '-s') { shift; $forcedminscore = shift; next; } if ($ARGV[0] eq '-l') { shift; $forcedstartlength = shift; next; } if ($ARGV[0] eq '-r') { shift; $forcedstartpreroll = shift; next; } if ($ARGV[0] eq '-L') { shift; $forcedstoplength = shift; next; } if ($ARGV[0] eq '-R') { shift; $forcedstoppreroll = shift; next; } die "Unrecognized switch: $ARGV[0]\n$usage"; } if ($#ARGV < 0) { die "No configuration file specified\n$usage"; } ################################################################################ # # Main-loop through all specified config-files # my $parFile; while($parFile = shift) { my($organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir2, $filestem2, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders) = readConfig($parFile, $forcedtuplesize, $forcedminmask, $forcedpseudocounts, $forcedminscore, $forcedstartlength, $forcedstartpreroll, $forcedstoplength, $forcedstoppreroll, $verbose); log_open("readconfig.log"); $datadir = $datadir2; $filestem = $filestem2; showConfig($parFile, $organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir, $rnafile, $estfile, $estcdsfile, 
	     $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile,
	     $tuplesize, $minmask, $pseudocounts, $minscore,
	     $startlength, $startpreroll, $stoplength, $stoppreroll,
	     $smatfile, $estscanparams, $nb_isochores, $isochore_borders);
  log_close();

  my $isochores;
  if ($nb_isochores == 0) {
    # explicit borders: build the "low-high" isochore labels from them
    my $i;
    my @IC;
    for ($i = 0; $i < $#$isochore_borders; $i++) {
      $IC[$i] = $$isochore_borders[$i] . "-" . $$isochore_borders[$i + 1];
    }
    $isochores = \@IC;
  } else {
    # automatic isochores: read the labels back from the prepare_data log
    local *FH;
    open FH, "$datadir/Report/${filestem}_prepare_data.log"
      or die "prepare_data log file is missing : $!";
    while ( <FH> ) {
      next unless s/^\s+isochores used:\s+//;
      s/\s//g;
      my @F = split /,/;
      $isochores = \@F;
    }
    close FH;
  }
  if ($#$isochores < 0) {
    die "There must be at least one isochore...";
  }

  # generate codon usage and transition probability tables
  log_open("generate_tables.log");
  if (-s "$smatfile") {
    log_print(" - $smatfile already exists, skipped");
  } else {
    generateEmissionTables($tuplesize, $pseudocounts, $minscore,
			   $startlength, $startpreroll,
			   $stoplength, $stoppreroll,
			   $isochores, $smatfile, $minmask, $parFile);
  }
  log_close();
  print "$parFile done.\n";
}

exit(0);

################################################################################
#
# Generate codon usage tables for HMM-model
#

sub generateEmissionTables {
  # For each isochore, launches makesmat, modifies its output in
  # order to mark the isochore, and appends the result to the
  # smatfile.
  my($tuplesize, $pseudocounts, $minscore, $startlength, $startpreroll,
     $stoplength, $stoppreroll, $isochores, $smatfile, $minmask,
     $parFile) = @_;
  log_print("\nWriting codon usage tables...");

  # makesmat counts in frames, not in codons/triplets, therefore:
  my $startframes = 3 * $startlength;
  my $startoffset = 3 * $startpreroll + 1;
  my $stopframes = 3 * $stoplength;
  my $stopoffset = 3 * $stoppreroll;

  my $isodir = "$datadir/Isochores";
  my $smatfh = gensym;
  open($smatfh, ">>$smatfile");
  foreach my $iso (sort(@$isochores)) {
    log_print(" - computing for isochore $iso...");
    my($low, $high) = $iso =~ m/^([^\-]+)\-(.*)$/;
    $iso .= "_mr$minmask.seq";
    my $cmd = "makesmat -t $tuplesize -p $pseudocounts -m $minscore "
      . "-f $startframes -o $startoffset -F $stopframes -O $stopoffset"
      . " < $isodir/mrna$iso";
    my $out = `$cmd`;
    $out =~ s//$parFile/g;
    $out =~ s//$low $high/g;
    print($smatfh $out);
  }
  close($smatfh);
}

################################################################################
#
# Documentation
#

=head1 NAME

build_model - create a model for ESTScan

=head1 SYNOPSIS

build_model [options]

=head1 DESCRIPTION

build_model generates codon usage tables for ESTScan. Codon usage is
analyzed in mRNAs containing whole coding sequences. The script reads the
configuration files given on the command line and computes codon usage
tables for each of them.

Files which already exist are reused. If an existing file is to be
recomputed, it must be deleted before the script is run again. The mRNA
files can be prepared with the extract_mRNA and prepare_data scripts, or
simply provided in FASTA format in the Isochores subdirectory.

In mRNA data, build_model expects annotations of coding sequence start and
stop in the header as two integer values following the tag 'CDS:'. The
first integer points to the first nucleotide of the CDS, the second to the
last. Thus the length of the CDS is the second integer minus the first,
plus 1. The first nucleotide in the sequence has index 1.

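The 'CDS:' header convention can be illustrated with a small sketch. It is not part of ESTScan: split_cds is a hypothetical helper, and the header and sequence below are invented for the example. It only shows how the 1-based, inclusive coordinates translate into Perl's 0-based substr offsets.

```perl
use strict;
use warnings;

# Hypothetical helper: given a FASTA header carrying "CDS: first last"
# (1-based, inclusive) and the sequence, return the 5'UTR, the CDS and the
# 3'UTR. Returns an empty list when the header has no CDS annotation.
sub split_cds {
    my ($header, $seq) = @_;
    my ($first, $last) = $header =~ m/CDS: (\d+) (\d+)/
        or return;
    my $len = $last - $first + 1;    # inclusive coordinates, hence the + 1
    return (substr($seq, 0, $first - 1),       # everything before the CDS
            substr($seq, $first - 1, $len),    # the CDS itself
            substr($seq, $last));              # everything after the CDS
}

my ($utr5, $cds, $utr3) = split_cds(">tem|X1 CDS: 4 12 example", "GGGATGAAATAGCC");
print "5'UTR=$utr5 CDS=$cds 3'UTR=$utr3\n";
```
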
=head1 DIRECTORY STRUCTURE

build_model uses the following directory structure, the root of which is
given in the configuration file. Below this root it uses the following
subdirectories:

 - Isochores: data split in isochores, some with redundancy masked
 - Matrices: contains the generated tables
 - Report: contains all log files

The mRNA data and the test and training data files are deposited in the
data-root directory unless otherwise specified in the configuration file.

=head1 OPTIONS

 -q quiet

Do not log on terminal.

 -t

Size of tuples to be considered to generate the codon usage tables. The
tuple size minus 1 is the order of the corresponding Markov model used by
ESTScan. This switch overrides the variable $tuplesize from parameter
files.

 -m

Minimum length of consecutive runs of nucleotides masked from the original
sequences in order to limit data redundancy. Pieces of sequence are only
masked if all of their nucleotides are part of recurring 12-tuples which
overlap by at least 4 nucleotides. This switch overrides the variable
$minmask from parameter files.

 -p pseudocount

Pseudocounts to be added when generating codon usage tables. This switch
overrides the variable $pseudocounts from parameter files.

 -s

Minimal score in tables and transitions; lower scores are replaced with
this value. This switch overrides the variable $minscore from parameter
files.

 -l

The number of nucleotide triplets taken into account for the
start-profile. This switch overrides the variable $startlength from
parameter files.

 -r

The number of nucleotide triplets of the 5' untranslated region contained
in the start profile. This switch overrides the variable $startpreroll
from parameter files.

 -L

The number of nucleotide triplets taken into account for the stop-profile.
This switch overrides the variable $stoplength from parameter files.

 -R

The number of nucleotide triplets of the coding sequence contained in the
stop profile. This switch overrides the variable $stoppreroll from
parameter files.

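The profile options are given in nucleotide triplets, while the script's own comment notes that makesmat counts in frames. The conversion can be sketched as below; the arithmetic mirrors generateEmissionTables, but profile_args is a hypothetical name and the values passed in are the documented defaults for a tuple size of 6 (length 2+ceil(6/3) = 4, preroll 2).

```perl
use strict;
use warnings;

# Hypothetical helper: turn profile lengths and prerolls, given in nucleotide
# triplets, into the frame-based -f/-o/-F/-O arguments passed to makesmat.
sub profile_args {
    my ($startlength, $startpreroll, $stoplength, $stoppreroll) = @_;
    my $startframes = 3 * $startlength;
    my $startoffset = 3 * $startpreroll + 1;    # note the extra + 1 for start
    my $stopframes  = 3 * $stoplength;
    my $stopoffset  = 3 * $stoppreroll;
    return "-f $startframes -o $startoffset -F $stopframes -O $stopoffset";
}

print profile_args(4, 2, 4, 2), "\n";
```
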
=head1 CONFIGURATION FILE

The parameters defined in the configuration file have the following
meaning:

 * $organism (mandatory)
   The desired organism as it is given in EMBL "OS" or RefSeq "ORGANISM"
   lines.

 * $dbfiles
   Files from which full-length mRNA sequences are to be extracted. The
   scripts try to guess whether the files come from EMBL or RefSeq. If
   this is not specified, a collection of mRNA is expected in $rnafile.

 * $datadir (mandatory)
   Base directory where all of the above files are located and the
   temporary result files are stored.

 * $rnafile (default is "$datadir/mrna.seq")
   Name of the file with the extracted mRNA entries.

 * $smatfile (default is "$datadir/Matrices/$filestem.smat")
   Name of the file where the HMM-model is to be written.

 * $nb_isochores (default is 0)
   Number of isochores, when isochores are to be determined automatically
   from the GC-content distribution as equal-sized groups. 0 means no
   automatic detection.

 * @isochore_borders (default is (0, 43, 47, 51, 100))
   Array of GC percentages where isochores are split; the first entry is
   usually 0 and the last 100. This is overridden when $nb_isochores is
   not zero.

 * $tuplesize (default is 6)
   Size of tuples counted for codon statistics. This is overridden by the
   -t switch.

 * $minmask (default is 30)
   Minimal run of consecutive nucleotides masked as redundant. This is
   overridden by the -m switch.

 * $pseudocounts (default is 1)
   Pseudocount to be added when generating the codon usage tables,
   overridden by -p.

 * $minscore (default is -100)
   Minimum score attributed to log-odds and log-probabilities, overridden
   by -s.
* $startlength, $startpreroll (default 2+ceil(tuplesize/3) and 2) number of nucleotide triplets contained in the start profile and how many of these are contained in the 5' untranslated region, overwritten by -l (length) and -r (preroll) * $stoplength, $stoppreroll (default 2+ceil(tuplesize/3) and 2) number of nucleotide triplets contained in the stop profile and how many of these are contained in the coding sequence, overwritten by -L (length) and -R (preroll) * $estscanparams (default is "-m -50 -d -50 -i -50 -N 0") parameters passed to ESTScan during evaluation $filestem is used to generate many filenames. It is generated automatically according to the tuplesize, the minmask and the pseudocounts applied to generate them. =head1 AUTHOR Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch =cut # # End of file # ################################################################################ estscan-3.0.3/evaluate_model0000755000551200011300000006150710541762732015266 0ustar chrisludwig#!/usr/bin/env perl # $Id: evaluate_model,v 1.15 2006/12/19 13:15:06 c4chris Exp $ ################################################################################ # # evaluate_model # -------------- # # Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch # Christian Iseli, LICR ITO, Christian.Iseli@licr.org # # Copyright (c) 1999, 2006 Swiss Institute of Bioinformatics. # All rights reserved. 
#
################################################################################

use strict;
use FASTAFile;
use Symbol;

# global configuration
my $winsegshuffle_dbsize = 10000000; # the size for shuffled databases

# global variables
my $verbose = 1;
my $norna = 0;
my $optionsfile = '';
my $forcedtuplesize = undef;
my $forcedminmask = undef;
my $forcedpseudocounts = undef;
my $forcedminscore = undef;
my $forcedstartlength = undef;
my $forcedstartpreroll = undef;
my $forcedstoplength = undef;
my $forcedstoppreroll = undef;
my $forcedsmatfile = undef;
my $forceestscanparams = undef;
my $datadir = '.';
my $filestem = '';
my $program = "estscan";
require "build_model_utils.pl";

################################################################################
#
# Check command-line for switches
#

my $usage = "Usage: evaluate_model [options] \n" .
    " where options are:\n" .
    " -q don't log on terminal\n" .
    " -O options passed to program\n" .
    " -o options file\n" .
    " -n skip evaluation on mRNAs\n" .
    " -t force tuple size, overwrites entry in config-files\n" .
    " -M force score matrices file\n" .
    " -m force minimal mask, overwrites entry in config-files\n" .
    " -P force program name (or file)\n" .
    " -p force pseudocounts, overwrites entry in config-files\n" .
    " -s force minimal score, overwrites entry in config-files\n" .
    " -l force length of start profile (in codons/triplets)\n" .
    " -r force start profile's preroll in 5'UTR (in codons/triplets)\n" .
    " -L force length of stop profile (in codons/triplets)\n" .
    " -R force stop profile's preroll in 5'UTR (in codons/triplets)\n" .
"More information is obtained using 'perldoc evaluate_model'\n"; while ($ARGV[0] =~ m/^-/) { if ($ARGV[0] eq '-q') { shift; $verbose = 0; next; } if ($ARGV[0] eq '-n') { shift; $norna = 1; next; } if ($ARGV[0] eq '-O') { shift; $forceestscanparams = shift; next; } if ($ARGV[0] eq '-o') { shift; $optionsfile = shift; next; } if ($ARGV[0] eq '-t') { shift; $forcedtuplesize = shift; next; } if ($ARGV[0] eq '-M') { shift; $forcedsmatfile = shift; next; } if ($ARGV[0] eq '-m') { shift; $forcedminmask = shift; next; } if ($ARGV[0] eq '-P') { shift; $program = shift; next; } if ($ARGV[0] eq '-p') { shift; $forcedpseudocounts = shift; next; } if ($ARGV[0] eq '-s') { shift; $forcedminscore = shift; next; } if ($ARGV[0] eq '-l') { shift; $forcedstartlength = shift; next; } if ($ARGV[0] eq '-r') { shift; $forcedstartpreroll = shift; next; } if ($ARGV[0] eq '-L') { shift; $forcedstoplength = shift; next; } if ($ARGV[0] eq '-R') { shift; $forcedstoppreroll = shift; next; } die "Unrecognized switch: $ARGV[0]\n$usage"; } if ($#ARGV < 0) { die "No configuration file specified\n$usage"; } ################################################################################ # # Main-loop through all specified config-files # my $parFile; while($parFile = shift) { my($organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir2, $filestem2, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders) = readConfig($parFile, $forcedtuplesize, $forcedminmask, $forcedpseudocounts, $forcedminscore, $forcedstartlength, $forcedstartpreroll, $forcedstoplength, $forcedstoppreroll, $verbose); if (defined $forcedsmatfile) { $smatfile = $forcedsmatfile; } if (defined $forceestscanparams) { $estscanparams = $forceestscanparams; } log_open("readconfig.log"); $datadir = $datadir2; $filestem = $filestem2; 
showConfig($parFile, $organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders); if (system("which $program") != 0) { log_print(" fatal: $program not found"); die; } log_close(); # read options-file, each line contains a set of options my $o; my @options; if ($optionsfile eq "") { push @options, $estscanparams; } else { my $fh = gensym; open $fh, $optionsfile; @options = <$fh>; close $fh; } # evaluate log_open("evaluate_model.log"); log_print("\nUsing $program to scan EST/mRNA\n"); my %uf = ( 'rna' => $utrfile, 'est' => $estutrfile ); my %cf = ( 'rna' => $cdsfile, 'est' => $estcdsfile ); my %tf = ( 'rna' => $testfile, 'est' => $estcdsfile ); my %N = ( 'rna' => "mRNA", 'est' => "EST" ); foreach my $k ("rna", "est") { next if ($k eq "rna") && ($norna == 1); next unless -s $cf{$k}; foreach $o (@options) { $o =~ s/\n$//; log_print("\nEvaluating new model on " . $N{$k} . " data using $o...."); estimateFalse($k, 'utr', $uf{$k}, $smatfile, $o); estimateFalse($k, 'cds', $cf{$k}, $smatfile, $o); evaluateStartStop($smatfile, $tf{$k}, $k, $o); log_print("\nEvaluating new model on " . $N{$k} . " data using $o -S...."); estimateFalse($k, 'utr', $uf{$k}, $smatfile, "$o -S "); estimateFalse($k, 'cds', $cf{$k}, $smatfile, "$o -S "); evaluateStartStop($smatfile, $tf{$k}, $k, "$o -S "); if ($#options > 0) { unlink <$datadir/Evaluate/${k}prc*>; } } } log_close(); print "$parFile done.\n"; } exit 0; sub estimateFalse { # Runs ESTScan on ESTs from UTRs and CDSs. According to $cdsutr # determines the false positive or false negative rate. 
my ($prefix, $cdsutr, $estfile, $smatfile, $estscanparams) = @_; my $x = `grep -i '^[a-z]' $estfile | wc`; $x =~ s/^\s+//; my($l, $w, $c) = split(/\s+/, $x); my $nbTot = $c - $l; # compute predictions for estfile my $EP = $estscanparams; $EP =~ s/[- ]+//g; my $prcfile = "$datadir/Evaluate/$prefix" . "prc$filestem$EP" . "_$cdsutr.seq"; if (-s $prcfile) { log_print(" - $prcfile already exists, skipped"); } else { log_print(" - predicting CDS for $estfile..."); system "$program $estscanparams -M $smatfile $estfile > $prcfile"; } $x = `grep -i '^[a-z]' $prcfile | wc`; $x =~ s/^\s+//; ($l, $w, $c) = split /\s+/, $x; my $nbPrc = $c - $l; my $r = $nbPrc/$nbTot; my $nbSeq = `grep -c '^>' $prcfile`; chop $nbSeq; my $nbTotSeq = `grep -c '^>' $estfile`; chop $nbTotSeq; log_print(" found $nbPrc coding of $nbTot nucleotides " . "in $nbSeq of $nbTotSeq sequences"); if ($cdsutr eq 'utr') { my $pct = sprintf "%.2f", 100 * $r; my $pctseq = sprintf "%.2f", 100 * ($nbSeq / $nbTotSeq); log_print(" estimated false positive rate: $pct% (nt) $pctseq% (seq)"); } else { my $pct = sprintf "%.2f", 100 * (1.0 - $r); my $pctseq = sprintf "%.2f", 100 * (1.0 - ($nbSeq / $nbTotSeq)); log_print(" estimated false negative rate: " . "$pct% (nt) $pctseq% (seq)"); } # read scores from prcfile my @scores; my $fh = gensym; open $fh, $prcfile; while ( <$fh> ) { if (m/^>\S+ (\S+) (\S+) (\S+)/) { push @scores, $1; } # if (m/^>\S+ (\S+) (\S+) (\S+)/) { # push @scores, $1/($3 - $2 + 1); # } # length norm } close $fh; # write cumulative data my $score; my $cumulator = ($cdsutr eq 'utr') ? $#scores + 1 : 0; my $histofile = "$datadir/Evaluate/$prefix" . 
"prc$filestem"; if ($cdsutr eq 'utr') { open $fh, ">$histofile.dat"; } else { open $fh, ">>$histofile.dat"; print $fh "\n"; } foreach $score (sort {$a <=> $b} @scores) { my $y = 100 * $cumulator / ($#scores + 1); if (($y < 90) && ($y > 1)) { print $fh "$score $y\n"; } if ($cdsutr eq 'utr') { $cumulator--; } else { $cumulator++; } } close $fh; # write gnuplot script my($order, $minmask, $pseudo) = split /\_/, $filestem; my $title = "Selectivity on UTRs / Sensitivity on CDSs " . "(order $order, minmask $minmask, pseudocounts $pseudo)"; open $fh, ">$histofile.gplot"; print $fh < $resultfile"; } # count in testfile my(%seqLengths, %seqCDSlen, $currSeq, $e); my $origCtr = 0; my $src = FASTAFile->new($testfile); $src->openStream; while (defined( $e = $src->getNext )) { $seqLengths{$e->ac} = length $e->{_seq}; $e->{_seqHead} =~ m/CDS: (\S+) (\S+)/; $seqCDSlen{$e->ac} = $2 - $1 + 1; $origCtr++; } close $src->{_BTFfile}; if ($origCtr == 0) { log_print(" no sequences found in $testfile, skipped"); return; } # initialize variables my(%startHistogram, %stopHistogram); my $delta = 99; # distances larger than this are considered weak predictions my $averageStart = 0; my $averageStop = 0; my $weakStartCtr = 0; my $wrongFrameCtr = 0; my $weakStopCtr = 0; my $worstSeq = ""; my $ctr = 0; my $cdsNts = 0; my $utrNts = 0; my $fpNts = 0; my $fnNts = 0; # count utr for false positive/negative rate if ($prefix eq 'est') { my $utrfile = "$datadir/Evaluate/estutr.seq"; my $x = `grep -i '^[a-z]' $utrfile | wc`; $x =~ s/^\s+//; my($l, $w, $c) = split /\s+/, $x; $utrNts = $c - $l; $utrfile = "$datadir/Evaluate/$prefix" . "prc$filestem$EP" . 
"_utr.seq"; $x = `grep -i '^[a-z]' $utrfile | wc`; $x =~ s/^\s+//; ($l, $w, $c) = split /\s+/, $x; $fpNts = $c - $l; } # read and evaluate the result file log_print(" - computing histograms from $resultfile..."); $src = FASTAFile->new($resultfile); $src->openStream; while (defined( $e = $src->getNext)) { # read entry #if ($e->{_FASTAde} =~ m/minus strand/) { next } my ($predStart, $predStop, $annotatedStart, $annotatedStop) = $e->{_seqHead} =~ m/(\S+) (\S+) ?CDS: (\S+) (\S+)/; my $theId = $e->ac; if ($e->{_FASTAde} =~ m/minus strand/ && $predStart ne "") { my $tmp = $seqLengths{$e->ac} - $predStart + 1; $predStart = $seqLengths{$e->ac} - $predStop + 1; $predStop = $tmp; } # update histogram my $diffStart = $predStart - $annotatedStart; my $diffStop = $predStop - $annotatedStop; $averageStart += abs($diffStart); $averageStop += abs($diffStop); if (abs($diffStart) > $delta) { $weakStartCtr++; } if (abs($diffStop) > $delta) { $weakStopCtr++; } if ((($diffStart % 3) != 0) && ($annotatedStart != 1)) { $wrongFrameCtr++; } $startHistogram{sprintf("%05d", $diffStart)} += 1; $stopHistogram{sprintf("%05d", $diffStop)} += 1; $ctr++; # update false positive/negative data my $newCds = $annotatedStop - $annotatedStart + 1; $cdsNts += $newCds; $utrNts += $seqLengths{$e->ac} - $newCds; if ($predStart ne "" && (($annotatedStart > $predStop) || ($annotatedStop < $predStart))) { $fpNts += $predStop - $predStart + 1; $fnNts += $annotatedStop - $annotatedStart + 1; } else { if ($diffStart < 0) { $fpNts -= $diffStart; } else { $fnNts += $diffStart; } if ($diffStop < 0) { $fnNts -= $diffStop; } else { $fpNts += $diffStop; } } delete $seqCDSlen{$e->ac}; } close $src->{_BTFfile}; foreach (values(%seqCDSlen)) { $fnNts += $_; } log_print(" predicted $ctr coding regions"); my $pct = sprintf "%.2f", 100 * ($fpNts / $utrNts); log_print(" estimated false positive rate $pct% (nt)"); $pct = sprintf "%.2f", 100 * ($fnNts / $cdsNts); log_print(" estimated false negative rate $pct% (nt)"); 
$averageStart /= $ctr; $averageStop /= $ctr; log_print(" - writing data-files and gnuplot scripts..."); # write data for start histogram my $fh = gensym; my $histofile = "$datadir/Evaluate/$prefix" . "prc$filestem" . "_starthisto"; open $fh, ">$histofile.dat"; foreach (sort (keys %startHistogram)) { print $fh "$_ $startHistogram{$_}\n"; } close $fh; my $exactStart = $startHistogram{'00000'}; # write gnuplot script for start histogram $averageStart = sprintf("%.f", $averageStart); open $fh, ">$histofile.gplot"; print $fh <$histofile.dat"; foreach (sort (keys %startHistogram)) { $x += $startHistogram{$_}; $cumulus -= $startHistogram{$_}; print $fh "$_ ", $cumulus/$origCtr*100, "\n"; if (($_ >= $y) && ($#ticks > -1)) { $distBuffer .= sprintf "\t<=%4d: %8d\n", $y, $x; $pieLabels .= "'<=$y:" . sprintf("%4.1f\%", $x/$origCtr*100) . "', "; $pieValues .= "$x, "; $x = 0; $y = shift @ticks; } } close $fh; $pieLabels .= "'>$y: " . sprintf("%4.1f\%", $x/$origCtr*100) . "', 'missed: " . sprintf("%4.1f4\%", 100*$cumulus/$origCtr) . "')"; $pieLabels =~ s/<=0:/exact:/; $pieValues .= "$x, $cumulus)"; $distBuffer .= sprintf "\t >%4d: %8d\n\tmissed: %8d", $y, $x, $cumulus; log_print(" start predicted at distance\n$distBuffer"); # write R script for pie chart my $piescript = "$datadir/Evaluate/$filestem\_piecharts.R"; if ($prefix eq 'rna') { unlink $piescript; } open $fh, ">>$piescript"; print $fh <$histofile.gplot"; print $fh <$histofile.dat"; foreach (sort (keys %stopHistogram)) { print $fh "$_ $stopHistogram{$_}\n"; } close $fh; $averageStop = sprintf "%.1f", $averageStop; my $exactStop = $stopHistogram{'00000'}; # write gnuplot script for stop histogram open $fh, ">$histofile.gplot"; print $fh <$histofile.dat"; foreach (sort (keys %stopHistogram)) { $x += $stopHistogram{$_}; $cumulus -= $stopHistogram{$_}; print $fh "$_ ", $cumulus/$origCtr, "\n"; if (($_ >= $y) && ($#ticks > -1)) { $distBuffer .= sprintf "\t<=%4d: %8d\n", $y, $x; $pieLabels .= "'<=$y:" . 
sprintf("%4.1f\%", 100*$x/$origCtr) . "', "; $pieValues .= "$x, "; $x = 0; $y = shift @ticks; } } close $fh; $pieLabels .= "'>$y: " . sprintf("%4.1f\%", $x/$origCtr*100) . "', 'missed: " . sprintf("%4.1f4\%", 100*$cumulus/$origCtr) . "')"; $pieLabels =~ s/<=0:/exact:/; $pieValues .= "$x, $cumulus)"; $distBuffer .= sprintf "\t >%4d: %8d\n\tmissed: %8d", $y, $x, $cumulus; log_print(" stop predicted at distance\n$distBuffer"); # write R script for pie chart open $fh, ">>$piescript"; print $fh <$histofile.gplot"; print $fh <

=head1 DESCRIPTION

evaluate_model evaluates the performance of an ESTScan model. It uses the same directory structure as build_model, knows most of build_model's command-line switches and reads the same configuration files. ESTScan's performance is evaluated in terms of false positive and false negative rates on the nucleotide level, as well as prediction accuracy for start and stop sites. The script reads configuration files given on the command line and performs the following steps for each of them:

 - Extract untranslated regions from test mRNA
 - Evaluate false positive rate. evaluate false negative
 - Evaluate false negative rate as well as start and stop prediction accuracy on test mRNAs
 - Find entirely untranslated and partially coding ESTs using UniGene
 - Evaluate false positive rate on untranslated ESTs
 - Evaluate false negative rate as well as start and stop prediction accuracy on partially coding ESTs

The evaluation is carried out on two kinds of data: mRNA data and EST data. The first three steps work on mRNA, the latter three on ESTs.

When using mRNA, two computational experiments are made. On the one hand, all untranslated regions of the test set of mRNAs are extracted and then analyzed in order to find the false positive rate on the nucleotide level as well as on the sequence level. On the other hand, the complete test mRNAs are analyzed to estimate false positive and false negative rates on the nucleotide level.
Moreover, distance distributions between prediction and annotation of start and stop sites are computed, and histograms are generated as gnuplot scripts and data files.

The data needed for evaluation using ESTs is extracted using UniGene. UniGene clusters are used to determine ESTs from untranslated regions and coding sequence, respectively. This is done by matching the ESTs of a given cluster against its full-length mRNA with megablast and then determining where the match occurs relative to the annotated coding sequence. For each category, coding and non-coding, a single EST is chosen per cluster in order to avoid redundancy. The matching location also makes it possible to determine where coding sequences start and end in partially coding ESTs. The same annotation in FASTA headers is used as for mRNAs. The sets of coding and non-coding ESTs are used to perform the same computational experiments as those done with mRNA data.

Files which already exist are reused. If an existing file is to be recomputed, it must be deleted before running the script again. For instance, if a particular collection of mRNA or EST sequences should be used instead of data extracted by evaluate_model, providing these in FASTA format under the name of the mRNA file (where evaluate_model would store the extracted data) is enough. The same procedure can be applied to provide hand-picked test data. However, in mRNA and EST data used for test and training, evaluate_model expects annotations of coding sequence start and stop in the header as two integer values following the tag 'CDS:'. The first integer points to the first nucleotide of the CDS, the second to the last. Thus the length of the CDS is (second integer) - (first integer) + 1. Position counting starts at 1.

=head1 DIRECTORY STRUCTURE

evaluate_model uses the same directory structure as build_model, the root of which is given in the configuration file. From this root it adds the subdirectory 'Evaluate', which contains all result files.
mRNA and EST data as well as test and training data files are deposited in the data-root directory if not otherwise specified in the configuration file.

=head1 OPTIONS AND CONFIGURATION FILE

The same command-line options (except '-e') and the same configuration file variables as for build_model can be used. They are essentially used to find the proper model and to create different files for different models in a systematic manner. However, there is one additional option:

-o

An options file contains one set of ESTScan command-line switches per line. For each line of the file the evaluation is performed once. The script and data files generated at each run are removed in this mode; the results can only be collected from the Report file. If '-o' is not specified, evaluation is performed using ESTScan's default parameters (except for the matrix to be evaluated, of course).

Additional parameters defined in the configuration files for evaluate_model are listed here:

* $ugdata

Name of the file(s) containing data about UniGene clusters. If this is not defined, no evaluation is currently implemented.

$filestem is used to generate many filenames. It is generated automatically according to the tuplesize, the minmask and the pseudocounts applied to generate them.

=head1 REQUIREMENTS

During analysis of UniGene clusters and evaluation of the generated tables, some external packages are used to collect and compare sequences. evaluate_model relies on 'megablast' to determine where ESTs match on full-length mRNA sequences. The 'fetch' utility is used to find the EST and mRNA entries. This tool needs a properly indexed version of EMBL and RefSeq flatfiles. Use 'indexer' for this. Both tools are part of the BTLib toolset.
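
As an illustration of the 'CDS:' header annotation described under DESCRIPTION, the following minimal sketch shows how the two integers might be read from a FASTA header and turned into a CDS length. The helper name parse_cds_annotation and the example header are hypothetical and not part of evaluate_model; the pattern is also slightly stricter than the one the script uses, since it assumes that both values are plain integers.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of evaluate_model): extract the annotated
# CDS start and stop from a FASTA header and compute the CDS length.
sub parse_cds_annotation {
    my ($header) = @_;
    # Two integers following the tag 'CDS:'; positions are 1-based and
    # inclusive, so the length is stop - start + 1.
    if ($header =~ m/CDS:\s*(\d+)\s+(\d+)/) {
        my ($start, $stop) = ($1, $2);
        return ($start, $stop, $stop - $start + 1);
    }
    return;    # empty list: no annotation found
}

# Made-up header, for illustration only.
my ($start, $stop, $len) =
    parse_cds_annotation(">EX000001 CDS: 101 400 example mRNA");
print "start=$start stop=$stop length=$len\n";
```

With the made-up header above, the sketch reports a CDS of 300 nucleotides (400 - 101 + 1).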
=head1 AUTHOR Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch =cut # # End of file # ################################################################################ estscan-3.0.3/extract_EST0000755000551200011300000004462510541762732014467 0ustar chrisludwig#!/usr/bin/env perl # $Id: extract_EST,v 1.2 2006/12/19 13:15:06 c4chris Exp $ ################################################################################ # # evaluate_model # -------------- # # Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch # Christian Iseli, LICR ITO, Christian.Iseli@licr.org # # Copyright (c) 1999 Swiss Institute of Bioinformatics. All rights reserved. # ################################################################################ use strict; use FASTAFile; use Symbol; # global configuration my $winsegshuffle_dbsize = 10000000; # the size for shuffled databases # global variables my $verbose = 1; my $norna = 0; my $optionsfile = ''; my $forcedtuplesize = undef; my $forcedminmask = undef; my $forcedpseudocounts = undef; my $forcedminscore = undef; my $forcedstartlength = undef; my $forcedstartpreroll = undef; my $forcedstoplength = undef; my $forcedstoppreroll = undef; my $forcedsmatfile = undef; my $forceestscanparams = undef; my $datadir = '.'; my $filestem = ''; my $program = "estscan"; require "build_model_utils.pl"; ################################################################################ # # Check command-line for switches # my $usage = "Usage: evaluate_model [options] \n" . " where options are:\n" . " -q don't log on terminal\n" . " -O options passed to program\n" . " -o options file\n" . " -n skip evaluation on mRNAs\n" . " -t force tuple size, overwrites entry in config-files\n" . " -M force score matrices file\n" . " -m force minimal mask, overwrites entry in config-files\n" . " -P force program name (or file)\n" . " -p force pseudocounts, overwrites entry in config-files\n" . " -s force minimal score, overwrites entry in config-files\n" . 
" -l force length of start profile (in codons/triplets)\n" . " -r force start profile's preroll in 5'UTR (in codons/triplets)\n" . " -L force length of stop profile (in codons/triplets)\n" . " -R force sop profile's preroll in 5'UTR (in codons/triplets)\n" . "More information is obtained using 'perldoc evaluate_model'\n"; while ($ARGV[0] =~ m/^-/) { if ($ARGV[0] eq '-q') { shift; $verbose = 0; next; } if ($ARGV[0] eq '-n') { shift; $norna = 1; next; } if ($ARGV[0] eq '-O') { shift; $forceestscanparams = shift; next; } if ($ARGV[0] eq '-o') { shift; $optionsfile = shift; next; } if ($ARGV[0] eq '-t') { shift; $forcedtuplesize = shift; next; } if ($ARGV[0] eq '-M') { shift; $forcedsmatfile = shift; next; } if ($ARGV[0] eq '-m') { shift; $forcedminmask = shift; next; } if ($ARGV[0] eq '-P') { shift; $program = shift; next; } if ($ARGV[0] eq '-p') { shift; $forcedpseudocounts = shift; next; } if ($ARGV[0] eq '-s') { shift; $forcedminscore = shift; next; } if ($ARGV[0] eq '-l') { shift; $forcedstartlength = shift; next; } if ($ARGV[0] eq '-r') { shift; $forcedstartpreroll = shift; next; } if ($ARGV[0] eq '-L') { shift; $forcedstoplength = shift; next; } if ($ARGV[0] eq '-R') { shift; $forcedstoppreroll = shift; next; } die "Unrecognized switch: $ARGV[0]\n$usage"; } if ($#ARGV < 0) { die "No configuration file specified\n$usage"; } ################################################################################ # # Main-loop through all specified config-files # my $parFile; while($parFile = shift) { my($organism, $dbfiles, $ugdata, $estdata, $datadir2, $filestem2, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders) = readConfig($parFile, $forcedtuplesize, $forcedminmask, $forcedpseudocounts, $forcedminscore, $forcedstartlength, $forcedstartpreroll, $forcedstoplength, 
$forcedstoppreroll, $verbose); if (defined $forcedsmatfile) { $smatfile = $forcedsmatfile; } if (defined $forceestscanparams) { $estscanparams = $forceestscanparams; } log_open("readconfig.log"); $datadir = $datadir2; $filestem = $filestem2; showConfig($parFile, $organism, $dbfiles, $ugdata, $estdata, $datadir, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders); log_close(); die "Unsupported... Please fix me"; # read options-file, each line contains a set of options my $o; my @options; if ($optionsfile eq "") { push(@options, $estscanparams); } else { my $fh = gensym; open($fh, $optionsfile); @options = <$fh>; close($fh); } log_open("extract_EST.log"); log_print("\nUsing $program to scan EST/mRNA\n"); extract_ESTs($estdata, $organism, $estfile); estimateFalseNegative($smatfile, $estfile, $cdsfile, $estscanparams); estimateFalsePositive($organism, $smatfile, $estdata, $estscanparams); log_close(); print "$parFile done.\n"; } exit 0; ################################################################################ # # Estimate false negative rate compared to megablast # sub extract_ESTs { # From a large collection of ESTs in FASTA format ($estdata) # filters the entries for the given species ($organism) and # selects randomly entries dispersed over the whole collection # until the wanted size of the generated file is reached # ($dbsize). 
my($estdata, $organism, $estfile) = @_; # check if done log_print(" - Extracting EST entries..."); if (-s $estfile) { log_print(" - $estfile already exists"); return; } my($file, $fileSeqs); my $totSeqs = 0; my $totLen = 0; my(@infiles) = glob($estdata); my $estdbsize = 0; foreach $file (@infiles) { my @s = stat($file); $estdbsize += $s[7]; } my $acceptrate = $winsegshuffle_dbsize / $estdbsize; if ($acceptrate < 0.01) { $acceptrate = 0.01; } log_print(" accept rate is $acceptrate"); my $outfh = gensym; open($outfh, ">$estfile"); while($totLen < $winsegshuffle_dbsize) { foreach $file (@infiles) { my $src = FASTAFile->new("$file"); $src->openStream; log_print(" - reading in $file..."); my $e; while(defined ($e = $src->getNext)) { if (($e->{_seqHead} =~ m/$organism/) && (rand() < $acceptrate)) { $fileSeqs++; $totLen += length($e->{_seq}); $e->printFASTA($outfh); } if ($totLen > $winsegshuffle_dbsize) { last; } } close($src->{_BTFfile}); log_print(" selected $fileSeqs sequences so far"); if ($totLen > $winsegshuffle_dbsize) { last; } } } # close files close($outfh); } sub estimateFalseNegative { # Determine matches of coding sequences from test files # in ESTs using megablast. Find how many are not detected # by ESTScan as an estimate of false negative rate. my ($smatfile, $estfile, $cdsfile, $estscanparams) = @_; log_print("\nEstimate false negative rate..."); # generate blast library containing the est data if (-s "$datadir/Evaluate/estdb.nsq") { log_print(" - formatted blast database already exists, skipped"); } else { log_print(" - formatting blast database from $estfile..."); symlink($estfile, "$datadir/Evaluate/estdb.seq"); system("cd $datadir/Evaluate; rm formatdb.log; " . 
"formatdb -i estdb.seq -p F -n estdb"); } # match est-data against cds data using megablast my $cdsvsestfile = "$datadir/Evaluate/cdsvsest.mbl"; if (-s $cdsvsestfile) { log_print(" - $cdsvsestfile already exists, skipped"); } else { log_print(" - matching coding sequences ($cdsfile) against ESTs..."); system("cd $datadir/Evaluate; megablast -d estdb -i $cdsfile > $cdsvsestfile"); } my $e; my $nbMatched = 0; log_print(" found $nbMatched matches, now writing..."); my $matchedFile = "$datadir/Evaluate/mblmatched.seq"; if (-s $matchedFile) { log_print(" $matchedFile already exists, skipped."); $nbMatched = `grep -c '>' $matchedFile`; chop($nbMatched); } else { # read matches from megablast output log_print(" - collecting metched ESTs..."); my %storedEsts; my $mblfh = gensym; open($mblfh, "$cdsvsestfile"); while(<$mblfh>) { my($est,$direction,$cds,$estStart,$cdsStart,$estEnd,$cdsEnd,$mismatch) = m/^\'([^\']*)\'==\'([+|-])([^\']*)\' \((\S+) (\S+) (\S+) (\S+)\) (\S+)/; my $matchLength = ($cdsEnd - $cdsStart); ($est) = $est =~ m/\|([^\|]*)\|/; if ($matchLength < 100) { next; } if (($mismatch / $matchLength) > 0.05) { next; } if (exists($storedEsts{$est})) { next; } $storedEsts{$est} = 1; $nbMatched++; } close($mblfh); # generate FASTA-file with matched ESTs my $matchedfh = gensym; open($matchedfh, ">$matchedFile"); my $src = FASTAFile->new("$estfile"); $src->openStream; while(defined($e = $src->getNext)) { if (exists($storedEsts{$e->ac})) { $e->printFASTA($matchedfh); } delete($storedEsts{$e->ac}); } close($src->{_BTFfile}); } if ($nbMatched == 0) { log_print(" no matches found in EST data, skipped"); return; } # compute predictions for matched ESTs my $prcfile = "$datadir/Evaluate/prc$filestem.seq"; if (-s $prcfile) { log_print(" - $prcfile already exists, skipped"); } else { log_print(" - predicting CDS for matched ESTs..."); system("$program $estscanparams -M $smatfile $matchedFile > $prcfile"); } my $nbPrc = `grep -c '>' $prcfile`; chop($nbPrc); log_print(" found 
$nbPrc predicted coding sequences"); log_print(" estimated false negative rate: " . (1.0 - $nbPrc/$nbMatched)); } ################################################################################ # # Estimate false positive rate on window segment shuffled EST-data # sub estimateFalsePositive { # Estimates false positive rate on window segment shuffled ESTs. my ($organism, $smatfile, $lengths_ref, $estdata, $estscanparams) = @_; my @lengths = @{$lengths_ref}; log_print("\nEstimate false positive rate on shuffled est-data..."); # collect another set of ESTs my $estfile = "$datadir/Evaluate/ests.seq"; if (-s $estfile) { log_print(" - $estfile already exists, skipped"); } else { my $dbsize = 0; my $estfh = gensym; open($estfh, ">$estfile"); my($file, $fileSeqs); my $totSeqs = 0; my(@infiles) = glob($estdata); my $estdbsize = 0; foreach $file (@infiles) { my @s = stat($file); $estdbsize += $s[7]; } my $acceptrate = $winsegshuffle_dbsize / $estdbsize; log_print(" accept rate is $acceptrate"); while($dbsize < $winsegshuffle_dbsize) { foreach $file (@infiles) { $fileSeqs = 0; my $src = FASTAFile->new("$file"); $src->openStream; log_print(" - reading $file..."); my $e; while(defined ($e = $src->getNext)) { if (($e->{_seqHead} =~ m/$organism/) && (rand() < $acceptrate)) { $fileSeqs++; $dbsize += length($e->{_seq}); $e->printFASTA($estfh); } if ($dbsize > $winsegshuffle_dbsize) { last; } } close($src->{_BTFfile}); $totSeqs += $fileSeqs; log_print(" selected $fileSeqs sequences so far ($dbsize nucleotides)"); if ($dbsize > $winsegshuffle_dbsize) { last; } } } close($estfh); log_print(" - written $totSeqs in $estfile"); } my $length; my $totalPredicted = 0; my $totalPredictedStart = 0; my $totalShuffled = 0; foreach $length (@lengths) { log_print(" - computing for length $length..."); # shuffle data my $shuffledFile = "$datadir/Shuffled/testset$length.seq"; if (-s $shuffledFile) { log_print(" shuffled data already exists, skipped"); } else { log_print(" shuffling EST-data..."); 
system("winsegshuffle 500000 $length < $estfile > $shuffledFile"); } my $nbShuffled = `grep -c '>' $shuffledFile`; chop($nbShuffled); $totalShuffled += $nbShuffled; if ($nbShuffled == 0) { log_print(" no shuffled data written for $length from $estfile, skipped"); next; } # predict my $prcfile = "$datadir/Evaluate/prc$filestem\_shuffled$length.seq"; if (-s $prcfile) { log_print(" $prcfile already exists, skipped"); } else { log_print(" predicting coding on shuffled files..."); system("$program $estscanparams -M $smatfile $shuffledFile > $prcfile"); } my $nbPrc = `grep -c '>' $prcfile`; chop($nbPrc); log_print(" estimated false positive rate: " . sprintf("%6.4f", $nbPrc/$nbShuffled) . " ($nbPrc predicted in $nbShuffled shuffled seqs)"); $totalPredicted += $nbPrc; } if ($totalShuffled == 0) { log_print(" - no shuffled data written, skipped"); } else { log_print(" - average estimated false positive rate: " . sprintf("%6.4f", $totalPredicted/$totalShuffled) . " ($totalPredicted predicted in $totalShuffled shuffled seqs)"); } }
################################################################################
#
# Documentation
#

=head1 NAME

evaluate_model - evaluate an ESTScan model generated by build_model

=head1 SYNOPSIS

evaluate_model [options]

=head1 DESCRIPTION

evaluate_model evaluates the performance of an ESTScan model. It uses the same directory structure as build_model, knows most of build_model's command-line switches and reads the same configuration files. ESTScan's performance is evaluated in terms of false positive and false negative rates on the nucleotide level, as well as prediction accuracy for start and stop sites. The script reads configuration files given on the command line and performs the following steps for each of them:

 - Extract untranslated regions from test mRNA
 - Evaluate false positive rate.
evaluate false negative
 - Evaluate false negative rate as well as start and stop prediction accuracy on test mRNAs
 - Find entirely untranslated and partially coding ESTs using UniGene
 - Evaluate false positive rate on untranslated ESTs
 - Evaluate false negative rate as well as start and stop prediction accuracy on partially coding ESTs

The evaluation is carried out on two kinds of data: mRNA data and EST data. The first three steps work on mRNA, the latter three on ESTs.

When using mRNA, two computational experiments are made. On the one hand, all untranslated regions of the test set of mRNAs are extracted and then analyzed in order to find the false positive rate on the nucleotide level as well as on the sequence level. On the other hand, the complete test mRNAs are analyzed to estimate false positive and false negative rates on the nucleotide level. Moreover, distance distributions between prediction and annotation of start and stop sites are computed, and histograms are generated as gnuplot scripts and data files.

The data needed for evaluation using ESTs is extracted using UniGene. UniGene clusters are used to determine ESTs from untranslated regions and coding sequence, respectively. This is done by matching the ESTs of a given cluster against its full-length mRNA with megablast and then determining where the match occurs relative to the annotated coding sequence. For each category, coding and non-coding, a single EST is chosen per cluster in order to avoid redundancy. The matching location also makes it possible to determine where coding sequences start and end in partially coding ESTs. The same annotation in FASTA headers is used as for mRNAs. The sets of coding and non-coding ESTs are used to perform the same computational experiments as those done with mRNA data.

Files which already exist are reused. If an existing file is to be recomputed, it must be deleted before running the script again.
For instance, if a particular collection of mRNA or EST sequences should be used instead of data extracted by evaluate_model, providing these in FASTA format under the name of the mRNA file (where evaluate_model would store the extracted data) is enough. The same procedure can be applied to provide hand-picked test data. However, in mRNA and EST data used for test and training, evaluate_model expects annotations of coding sequence start and stop in the header as two integer values following the tag 'CDS:'. The first integer points to the first nucleotide of the CDS, the second to the last. Thus the length of the CDS is (second integer) - (first integer) + 1. Position counting starts at 1.

=head1 DIRECTORY STRUCTURE

evaluate_model uses the same directory structure as build_model, the root of which is given in the configuration file. From this root it adds the subdirectory 'Evaluate', which contains all result files. mRNA and EST data as well as test and training data files are deposited in the data-root directory if not otherwise specified in the configuration file.

=head1 OPTIONS AND CONFIGURATION FILE

The same command-line options (except '-e') and the same configuration file variables as for build_model can be used. They are essentially used to find the proper model and to create different files for different models in a systematic manner. However, there is one additional option:

-o

An options file contains one set of ESTScan command-line switches per line. For each line of the file the evaluation is performed once. The script and data files generated at each run are removed in this mode; the results can only be collected from the Report file. If '-o' is not specified, evaluation is performed using ESTScan's default parameters (except for the matrix to be evaluated, of course).

Additional parameters defined in the configuration files for evaluate_model are listed here:

* $ugdata

Name of the file(s) containing data about UniGene clusters. If this is not defined, no evaluation is currently implemented.
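
The handling of the options file given with '-o' can be sketched as follows: each line is treated as one complete set of switches, and one evaluation run is performed per line. This is only a minimal sketch under stated assumptions. The helper name read_option_sets is hypothetical, the switch strings are placeholders rather than real ESTScan switches, and skipping blank lines is an assumption that the script itself does not make.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical helper, not part of evaluate_model: returns one string of
# command-line switches per line of the options file.
sub read_option_sets {
    my ($path) = @_;
    open(my $fh, '<', $path) or die "Cannot open $path: $!\n";
    my @sets;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line =~ /^\s*$/;    # assumption: blank lines are ignored
        push @sets, $line;
    }
    close($fh);
    return @sets;
}

# Demonstration with a temporary file; the switch strings are placeholders.
my ($tmpfh, $tmpname) = tempfile(UNLINK => 1);
print $tmpfh "SWITCHES_FOR_RUN_1\n\nSWITCHES_FOR_RUN_2 WITH_ARGUMENT\n";
close($tmpfh);

foreach my $set (read_option_sets($tmpname)) {
    print "one evaluation run with: $set\n";
}
```

Opening the file with the three-argument form of open and checking its result makes a missing or unreadable options file fail early with a clear message.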
$filestem is used to generate many filenames. It is generated automatically according to the tuplesize, the minmask and the pseudocounts applied to generate them. =head1 REQUIREMENTS During analysis of UniGene clusters and evaluation of the generated tables some external packages are used to collect and compare sequences. evaluate_model relies on 'megablast' to determine where ESTs match on full-length mRNA sequences. The 'fetch' utility is used to find the EST and mRNA entries. This tool needs a properly indexed version of EMBL and RefSeq flatfiles. Use 'indexer' for this. Both tools are part of the BTLib toolset. =head1 AUTHOR Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch =cut # # End of file # ################################################################################ estscan-3.0.3/extract_mRNA0000755000551200011300000002132510556136711014617 0ustar chrisludwig#!/usr/bin/env perl # $Id: extract_mRNA,v 1.4 2007/01/25 14:25:13 c4chris Exp $ ################################################################################ # # extract_mRNA # ------------ # # Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch # Christian Iseli, LICR ITO, Christian.Iseli@licr.org # # Copyright (c) 2006 Swiss Institute of Bioinformatics. All rights reserved. # ################################################################################ use strict; use EMBLFile; use GBFile; use Symbol; # global variables my $verbose = 1; my $datadir = '.'; my $filestem = ''; require "build_model_utils.pl"; ################################################################################ # # Check command-line for switches # my $usage = "Usage: extract_mRNA [options] \n" . " where options are:\n" . " -q don't log on terminal\n" . 
"More information is obtained using 'perldoc extract_mRNA'\n"; while ($ARGV[0] =~ m/^-/) { if ($ARGV[0] eq '-q') { shift; $verbose = 0; next; } die "Unrecognized switch: $ARGV[0]\n$usage"; } if ($#ARGV < 0) { die "No configuration file specified\n$usage"; } ################################################################################ # # Main-loop through all specified config-files # my $parFile; while($parFile = shift) { my($organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir2, $filestem2, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders) = readConfig($parFile, undef, undef, undef, undef, undef, undef, undef, undef, $verbose); log_open("readconfig.log"); $datadir = $datadir2; $filestem = $filestem2; showConfig($parFile, $organism, $hightaxo, $dbfiles, $ugdata, $estdata, $datadir, $rnafile, $estfile, $estcdsfile, $estutrfile, $trainingfile, $testfile, $utrfile, $cdsfile, $tuplesize, $minmask, $pseudocounts, $minscore, $startlength, $startpreroll, $stoplength, $stoppreroll, $smatfile, $estscanparams, $nb_isochores, $isochore_borders); log_close(); # extract data log_open("extract_data.log"); extract_mRNAs($dbfiles, $organism, $hightaxo, $rnafile); log_close(); print "$parFile done.\n"; } exit(0); ################################################################################ # # Extract mRNA entries # sub extract_mRNAs { # Find all full-length messenger RNA entries of the given species # ($organism) in the given databse ($dbfiles) and writes them in # FASTA format with CDS annotation in the header ($rnafile). Tries # to guess whether RefSeq or EMBL data is provided. 
    my($dbfiles, $organism, $hightaxo, $rnafile) = @_;
    log_print("\nExtracting mRNA entries...");
    if (-s $rnafile) {
        log_print(" - $rnafile already exists, skipped");
        return;
    }
    my($file, $fileSeqs, $fileLen);
    my $totSeqs = 0;
    my $totLen = 0;
    my $outfh = gensym;
    open($outfh, ">$rnafile");
    my(@infiles) = glob($dbfiles);
    foreach $file (@infiles) {
        $fileSeqs = 0; $fileLen = 0;
        $_ = `head -1 $file`;
        if (m/^LOCUS/) {

            # we read a file in genbank format
            my $src = GBFile->new("$file");
            $src->openStream;
            log_print(" - processing RefSeq-file $file...");
            my $e;
            while(defined ($e = $src->getNext)) {
                my $ac = $ {$e->{_GBac}}[0];
                next if $e->{_GBtype} ne "mRNA";
                next unless $e->{_GBcdsst} =~ /^\d+$/;
                if ($hightaxo eq "") {
                    if (($e->{_GBscn} eq $organism)) {
                        $fileSeqs++;
                        $fileLen += $e->{_GBcdsen} - $e->{_GBcdsst} + 1;
                        print $outfh &genRNAEntry($ac, $e->{_GBdesc},
                                                  $e->{_GBcdsst},
                                                  $e->{_GBcdsen}, $e->{_seq});
                    }
                } else {
                    my $o = index($e->{_GBtaxo}, $hightaxo);
                    if ($o >= 0) {
                        $fileSeqs++;
                        $fileLen += $e->{_GBcdsen} - $e->{_GBcdsst} + 1;
                        print $outfh &genRNAEntry($ac, $e->{_GBdesc},
                                                  $e->{_GBcdsst},
                                                  $e->{_GBcdsen}, $e->{_seq});
                    }
                }
            }
            close($src->{_BTFfile});
        } else {

            # we read a file in EMBL format
            my $src = EMBLFile->new("$file");
            $src->openStream;
            log_print(" - processing EMBL-file $file...");
            my $e;
            while(defined ($e = $src->getNext)) {
                if ($hightaxo ne "") {
                    my $o = index($e->{_EMBLtaxo}, $hightaxo);
                    if (($o < 0) || ($e->{_EMBLtype} ne "mRNA")) {
                        next; # no mRNA entry
                    }
                } elsif (($e->{_EMBLscn}[0] ne $organism)
                         || ($e->{_EMBLtype} ne "mRNA")) {
                    next; # no mRNA entry
                }
                my($begin, $end);
                foreach(@{$e->{_EMBLft}}) {
                    # NOTE: this pattern was damaged in extraction; the
                    # "<?\d+)\.\.>" part is a reconstruction based on the
                    # EMBL location syntax (e.g. "<1..>200").
                    if (($begin, $end) = $_ =~ /^CDS\s+(<?\d+)\.\.>?(\d+)$/) {
                        last;
                    }
                }
                if (!defined($begin)) {
                    next; # no CDS found
                }
                # NOTE: the source text between "(m/" and "{_EMBLac}" was
                # lost in extraction. The skip and the two counter updates
                # below are reconstructed by analogy with the GenBank
                # branch above and are an assumption, not the original code.
                if ((m/>/) && (m/</)) {
                    next;
                }
                $fileSeqs++;
                $fileLen += $end - $begin + 1;
                print $outfh &genRNAEntry($ {$e->{_EMBLac}}[0],
                                          $e->{_EMBLdesc}, $begin, $end,
                                          $e->{_seq});
            }
            close($src->{_BTFfile});
        }
        $totSeqs += $fileSeqs;
        $totLen += $fileLen;
        log_print(" found $fileSeqs sequences, $fileLen coding nucleotides");
    }
    close($outfh);
    log_print(" - overall found $totSeqs
sequences, " . "$totLen coding nucleotides");
}

sub genRNAEntry {
    my($id, $desc, $cdsBegin, $cdsEnd, $seq) = @_;
    $seq =~ s/(.{80})/$1\n/g;
    return ">tem|$id CDS: $cdsBegin $cdsEnd $desc\n$seq\n";
}

################################################################################
#
# Documentation
#

=head1 NAME

extract_mRNA - extract mRNA data to build models for ESTScan

=head1 SYNOPSIS

extract_mRNA [options] <config-files...>

=head1 DESCRIPTION

mRNA data is extracted from files in EMBL or RefSeq format. The script
reads the configuration files given on the command line and performs
the extraction for each of them.

=head1 DIRECTORY STRUCTURE

build_model uses the directory structure which is given in the
configuration file. mRNA data, as well as test and training data
files, are deposited in the data-root directory if not otherwise
specified in the configuration file.

=head1 OPTIONS

 -q quiet
    Do not log on terminal.

=head1 CONFIGURATION FILE

The parameters defined in the configuration file have the following
meaning:

 * $organism (mandatory)
    The desired organism as it is given in EMBL "OS" or RefSeq
    "ORGANISM" lines.

 * $dbfiles
    Files from which full-length mRNA sequences are to be extracted.
    The script tries to guess whether the files come from EMBL or
    RefSeq. If this is not specified, a collection of mRNA is expected
    in $rnafile.

 * $datadir (mandatory)
    Base directory where all of the above files are located and the
    temporary result files are stored.

 * $rnafile (default is "$datadir/mrna.seq")
    Name of the file with the extracted mRNA entries.

 * $smatfile (default is "$datadir/Matrices/$filestem.smat")
    Name of the file where the HMM-model is to be written.

 * $nb_isochores (default is 0)
    Number of isochores, when isochores are to be determined
    automatically from the GC-content distribution as equal-sized
    groups. 0 means no automatic detection.

 * @isochore_borders (default is (0, 43, 47, 51, 100))
    Array of GC percentages where isochores are split, first entry is
    usually 0 and last 100. This is overwritten when $nb_isochores is
    not zero.
 * $tuplesize (default is 6)
    Size of tuples counted for codon statistics. This is overwritten
    by the -t switch.

 * $minmask (default is 30)
    Minimal run of consecutive nucleotides masked as redundant. This
    is overwritten by the -m switch.

 * $pseudocounts (default is 1)
    pseudocount to be added when generating the codon usage tables,
    overwritten by -m

 * $minscore (default is -100)
    minimum score attributed to log-odds and log-probabilities,
    overwritten by -s.

 * $startlength, $startpreroll (default 2+ceil(tuplesize/3) and 2)
    number of nucleotide triplets contained in the start profile and
    how many of these are contained in the 5' untranslated region,
    overwritten by -l (length) and -r (preroll)

 * $stoplength, $stoppreroll (default 2+ceil(tuplesize/3) and 2)
    number of nucleotide triplets contained in the stop profile and
    how many of these are contained in the coding sequence,
    overwritten by -L (length) and -R (preroll)

 * $estscanparams (default is "-m -50 -d -50 -i -50 -N 0")
    parameters passed to ESTScan during evaluation

$filestem is used to generate many filenames. It is generated
automatically according to the tuplesize, the minmask and the
pseudocounts applied to generate them.

=head1 AUTHOR

Claudio Lottaz, SIB-ISREC, Claudio.Lottaz@isb-sib.ch

=cut

#
# End of file
#
################################################################################