debian/0000755000000000000000000000000012064117035007165 5ustar
debian/control0000644000000000000000000000162712064076401010577 0ustar
Source: cdbfasta
Section: science
Priority: optional
Maintainer: Debian Med Packaging Team
DM-Upload-Allowed: yes
Uploaders: Steffen Moeller ,
 Andreas Tille ,
 Tim Booth
Build-Depends: debhelper (>= 9), zlib1g-dev, help2man
Standards-Version: 3.9.4
Homepage: http://cdbfasta.sourceforge.net/
Vcs-Browser: http://svn.debian.org/wsvn/debian-med/trunk/packages/cdbfasta/trunk/
Vcs-Svn: svn://svn.debian.org/debian-med/trunk/packages/cdbfasta/trunk/

Package: cdbfasta
Architecture: any
Depends: ${shlibs:Depends}, ${misc:Depends}
Description: Constant DataBase indexing and retrieval tools for multi-FASTA files
 CDB (Constant DataBase) can be used for creating indices for quick retrieval
 of any particular sequences from large multi-FASTA files. It has the option to
 compress data records in order to save space.
debian/install0000644000000000000000000000004211551654120010553 0ustar
cdbfasta usr/bin
cdbyank usr/bin
debian/rules0000755000000000000000000000151512064111245010244 0ustar
#!/usr/bin/make -f
# -*- makefile -*-
# debian/rules for cdbfasta
# Author: Tim Booth

# Uncomment this to turn on verbose mode.
#export DH_VERBOSE=1

%:
	dh $@

override_dh_auto_build:
	dh_auto_build -- LINKER="g++ $(LDFLAGS)"

override_dh_installman:
	help2man --no-info --no-discard-stderr --version-option=-v ./cdbfasta |\
	  sed 's/^Invalid argument:.*//' |\
	  sed 's/manual page for cdbfasta version .*/Creates an index file for records from a multi-fasta file./' \
	  > $(CURDIR)/debian/`dh_listpackages`/usr/share/man/man1/cdbfasta.1
	help2man --no-info --no-discard-stderr --version-option=-v ./cdbyank |\
	  sed 's/^Invalid argument:.*//' |\
	  sed 's/manual page for cdbyank version .*/Query an index file created with cdbfasta./' \
	  > $(CURDIR)/debian/`dh_listpackages`/usr/share/man/man1/cdbyank.1
debian/watch0000644000000000000000000000016111624177613010225 0ustar
version=3
#Watch file is broken due to no proper versioning on upstream
http://sf.net/cdbfasta/cdbfasta\.tar\.gz
debian/cdbfasta_usage.html0000644000000000000000000002505612064076203013017 0ustar
cdb tools for fasta files

CDB (Constant DataBase) indexing and retrieval tools for multi-FASTA files

This is a brief introduction to a pair of platform-independent, file-based hashing tools (cdbfasta and cdbyank) that create indices for quick retrieval of particular sequences from large multi-FASTA files. The latest version has an option to compress data records in order to save space. The index files are now architecture independent: the same index file can be created and used on many different Unix platforms (32-bit or 64-bit, big-endian or little-endian) and even on Windows.

Content:

   1. Typical usage
   2. Retrieving sequence ranges or deflines
   3. Data compression option
   4. Development notes
 

1. Typical usage

Use cdbfasta to create the index file for a multi-FASTA file and cdbyank to pull records based on that index file. A usage message is displayed if cdbfasta or cdbyank is run without any parameters (or with -h). In order to create an index file, the name of the fasta file to be indexed must be provided:
cdbfasta <fasta_file>
The fasta file can be specified with the whole path (if it's not in the current directory), e.g.
cdbfasta /usr/local/db/GUDB.human
By default, cdbfasta creates an index file with the same path and name as the database file but with the .cidx suffix added to the original name. So in the example above, a file GUDB.human.cidx will be created in /usr/local/db/. By default, the key for a FASTA record is the first space-delimited token following the ">" starting character of the definition line. For example, if a FASTA record had a defline like this:

>AA141526

...then we can use the string 'AA141526' with cdbyank to retrieve the full FASTA record associated with that sequence name:

cdbyank -a 'AA141526' /usr/local/db/GUDB.human.cidx
Sometimes all the space-delimited tokens in the defline need to be declared as keys in the index file, all pointing to the same fasta record. cdbfasta does this when given the "-m" switch.
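
The two keying modes described above can be illustrated with a minimal Python sketch. It is not cdbfasta's code: the function names default_key and all_keys are hypothetical, and splitting on any whitespace (rather than on the space character only) is an assumption of the sketch.

```python
from typing import List


def default_key(defline: str) -> str:
    """Return the first whitespace-delimited token after the leading '>'."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    tokens = defline[1:].split()
    if not tokens:
        raise ValueError("the definition line contains no key")
    return tokens[0]


def all_keys(defline: str) -> List[str]:
    """Return every token of the defline, as described for the -m switch."""
    if not defline.startswith(">"):
        raise ValueError("a FASTA definition line must start with '>'")
    return defline[1:].split()


if __name__ == "__main__":
    print(default_key(">AA141526 some description"))
    print(all_keys(">AA141526 alias1 alias2"))
```

In the default mode only the first token would point to the record; in the -m mode each returned token would.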

For long and complex fasta accessions (for example: EGAD|61|GP|186739|gb|AAA63210.1||M60828), the index file can be created in such a way that cdbyank does not need the full string in order to retrieve the sequence: the first "<db>|<accession>" pair (i.e. the substring ending just before the second '|' character; EGAD|61 in the example above) is enough. To enable this feature, cdbfasta has two alternative options:

  • -c : the index file is built only by storing the "shortcut key" (the first "db|accession" pair found in the defline of each fasta record). In this case, cdbyank will only be able to accept these "shortcut" accessions for record retrieval.
  • -C : the index file is built by storing both the "shortcut key" and the full key (which is considered to end at the first space character in the defline). In this case, two strings are stored as keys for each fasta record, so either of them can be used as an accession for retrieval of the same record with cdbyank.
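
As a rough illustration of the two options, the Python sketch below derives the keys that would be stored for one defline. The function names are hypothetical, and the behaviour for a key with fewer than two '|' characters (the whole key is used as its own shortcut) is an assumption of the sketch, not something the text above specifies.

```python
from typing import List


def full_key(defline: str) -> str:
    """Full key: the text between '>' and the first space of the defline."""
    return defline[1:].split(" ", 1)[0]


def shortcut_key(key: str) -> str:
    """First 'db|accession' pair: the substring before the second '|'."""
    # Assumption: a key with fewer than two '|' is returned unchanged.
    return "|".join(key.split("|")[:2])


def keys_to_store(defline: str, option: str) -> List[str]:
    """Keys stored for one record under the -c or -C option."""
    full = full_key(defline)
    short = shortcut_key(full)
    if option == "c":
        return [short]
    if option == "C":
        return [short] if short == full else [short, full]
    raise ValueError("option must be 'c' or 'C'")


if __name__ == "__main__":
    line = ">EGAD|61|GP|186739|gb|AAA63210.1||M60828 example record"
    print(keys_to_store(line, "c"))
    print(keys_to_store(line, "C"))
```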
In order to retrieve records from the database file, cdbyank should be provided with the name of the index file created previously with cdbfasta, e.g.:
cdbyank -a 'human|Z98492' /usr/local/db/GUDB.human.cidx
A list of accessions is expected at stdin if the -a option is not provided, e.g.:
cat seq_list | cdbyank /usr/local/db/GUDB.human.cidx
This way the output will be a series of fasta records at stdout; redirecting this output to a file produces a multi-FASTA file. By default, cdbyank locates the database file by stripping the '.cidx' suffix off the index filename. This is not enforced, however: with the -d option, cdbyank uses a user-provided database file with the given index file. In the example above, if the index file "GUDB.human.cidx" is moved into another directory, a cdbyank command (in that other directory) can be issued like this:
cdbyank -a 'human|Z98492' -d /usr/local/db/GUDB.human GUDB.human.cidx
The position of the index file in the list of arguments of cdbyank is not enforced. For the -a usage, the error status returned by cdbyank to the shell will be 1 if the given key was not found and 0 for success.
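
The rule for locating the database file can be sketched in Python as follows. The function name database_path is hypothetical, and raising an error when the index name lacks the '.cidx' suffix is a choice made for the sketch; the text above does not say what cdbyank does in that case.

```python
from typing import Optional


def database_path(index_path: str, override: Optional[str] = None) -> str:
    """Return the database file that belongs to an index file.

    'override' plays the role of the -d option: when given, it wins.
    """
    if override is not None:
        return override
    suffix = ".cidx"
    if index_path.endswith(suffix):
        return index_path[: -len(suffix)]
    # Assumption of this sketch: refuse to guess for other file names.
    raise ValueError("index file name does not end in '.cidx'")


if __name__ == "__main__":
    print(database_path("/usr/local/db/GUDB.human.cidx"))
    print(database_path("GUDB.human.cidx", override="/usr/local/db/GUDB.human"))
```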

The total number of fasta records indexed and the list of the keys stored in a specific cdb index file can be retrieved with cdbyank's -n and -l switches, respectively. This information is obtained from the index file directly (the database file is not needed for that). There is also a -s option that displays a summary of the indexing information stored in the index at index time: the initial name of the fasta file, its size, how the index was created (e.g. whether the -m (multiple keys) option or the -c or -C (shortcut keys) option was given), the number of keys stored in the file, and the number of fasta records indexed. The latter is the same number that the -n option returns.

As an extra feature, cdbfasta and cdbyank can also be used in special cases where a database has different records with the same key (non-unique keys). Although performance degrades a little, cdbfasta is able to index such files, but by default cdbyank only outputs the first record found. If you want all the records sharing the same key (accession) to be retrieved and displayed, give the -x option to cdbyank.
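
The difference between the default lookup and the -x lookup can be pictured with a small in-memory model in Python. This is only an analogy: the names build_index and lookup are hypothetical, and a real cdb index stores file offsets on disk rather than positions in a Python list.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Record = Tuple[str, str]  # (key, record text)


def build_index(records: List[Record]) -> Dict[str, List[int]]:
    """Map each key to the positions of all records that carry it."""
    index: Dict[str, List[int]] = defaultdict(list)
    for position, (key, _text) in enumerate(records):
        index[key].append(position)
    return index


def lookup(index: Dict[str, List[int]], records: List[Record],
           key: str, all_matches: bool = False) -> List[str]:
    """Return the first matching record, or every match when all_matches is set."""
    positions = index.get(key, [])
    if not all_matches:
        positions = positions[:1]
    return [records[p][1] for p in positions]


if __name__ == "__main__":
    data = [("seqA", ">seqA v1\nACGT\n"), ("seqB", ">seqB\nGGCC\n"),
            ("seqA", ">seqA v2\nTTTT\n")]
    idx = build_index(data)
    print(lookup(idx, data, "seqA"))
    print(lookup(idx, data, "seqA", all_matches=True))
```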
 

2. Retrieving sequence ranges or only the defline


There are two cdbyank options added for convenience. The -F option returns only the definition line (the first line) of each requested FASTA record. The -R option is intended for FASTA files containing actual genetic sequences (nucleotide or protein) and expects each retrieval command to have the following format (space delimited):

<key>  <left_coordinate>  <right_coordinate>

For example, if we only want to retrieve the sequence range 24...178 (letter numbering starts at 1) from the sequence named 'human|Z98492', the cdbyank command would look like this:

cdbyank -a 'human|Z98492 24 178' -R GUDB.human.cidx
Multiple sequence ranges can be extracted this way by providing a file in which each line follows the format above (key followed by the two coordinates). As before, such a file can be piped into cdbyank with the -R option to pull the specified range from each of the sequences listed in the input file.
 
cat seqlistranges | cdbyank -R GUDB.human.cidx
Note that this range option works by actually parsing and looping through the characters of the retrieved record internally, so performance is poor when a range near the end of a very large record is pulled.
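
To make the coordinate convention concrete, here is a minimal Python sketch of a range query. It assumes that both coordinates are 1-based and inclusive and that the smaller one comes first, as in the 24...178 example; the function names are hypothetical and the sketch works on a sequence already held in memory with its line breaks removed.

```python
from typing import Tuple


def parse_range_query(line: str) -> Tuple[str, int, int]:
    """Split one query line into the key and its two coordinates."""
    key, first, second = line.split()[:3]
    return key, int(first), int(second)


def subsequence(sequence: str, start: int, end: int) -> str:
    """Return letters start..end, counting from 1 and including both ends."""
    if not 1 <= start <= end:
        raise ValueError("expected 1 <= start <= end")
    return sequence[start - 1:end]


if __name__ == "__main__":
    print(parse_range_query("human|Z98492 24 178"))
    print(subsequence("ACGTACGT", 2, 4))
```

Under these assumptions, the range 24...178 yields 155 letters.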

3. Data compression option

The indexing program cdbfasta has the -z <compressed_db> option, which creates a compressed file <compressed_db> from the data in the given input file and at the same time creates an index file for this new compressed database, named <compressed_db>.cidxz. The original input file can then be discarded, as it can be recovered at any point later from the <compressed_db> file by using the -z option of cdbyank.
Because each record is compressed separately, compression is poor if the records are small. Compression is only advised when:
  • data records are large enough for the compression algorithm to become efficient (at least 1KB per record, the more the better)
  • only random access is needed to the data records (so the original file can be discarded)
The compression can be quite slow for large files and there is also some performance penalty for cdbyank as it has to decompress the retrieved records on the fly. The input data for cdbfasta compression can be collected from stdin if '-' is used instead of a file name:
cat my_data_files* | cdbfasta - -z mydata.cdbz
This option is useful especially when the total size of input data files is extremely large (over the file-system limits or over the 4GB internal limit of cdbfasta) while the compressed output can be small enough to fall under such limits.
With compressed databases, cdbyank can be used normally, without extra options, as it auto-detects the compression (from the index file info) and activates on-the-fly decompression of the retrieved records. The -F and -R options are the only ones not yet supported for compressed records.

4. Development notes

These tools were developed in C++, based on the publicly available cdb ("constant database") code written by D.J. Bernstein (http://cr.yp.to/djb.html). "Constant databases" are those that we don't need to add to or remove records from. The original C source was (rather crudely) wrapped into C++ classes and adjusted to automatically index fasta records and to create an external index instead of compacting the original data file like the original cdb library code does.  Also the "endianness" is now checked at runtime and the bytes are swapped accordingly such that the file offsets and record sizes are always read/written in the same way in the index file.
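
The idea of an architecture-independent index can be shown with Python's struct module: if offsets and sizes are always written in one fixed byte order, the bytes on disk are the same on every machine. The entry layout below (two unsigned 32-bit values) and the choice of little-endian order are assumptions made for illustration only; the text above does not state which order or layout the index actually uses.

```python
import struct
import sys
from typing import Tuple

# Assumption for this sketch: little-endian, two unsigned 32-bit fields.
ENTRY_FORMAT = "<II"


def pack_entry(offset: int, size: int) -> bytes:
    """Serialize a (file offset, record size) pair in a fixed byte order."""
    return struct.pack(ENTRY_FORMAT, offset, size)


def unpack_entry(data: bytes) -> Tuple[int, int]:
    """Read the pair back; the result does not depend on the host CPU."""
    offset, size = struct.unpack(ENTRY_FORMAT, data)
    return offset, size


if __name__ == "__main__":
    print("host byte order:", sys.byteorder)
    raw = pack_entry(1, 258)
    print(raw.hex())
    print(unpack_entry(raw))
```

The C++ tools reach the same result by testing the host byte order at runtime and swapping bytes when needed; struct does the equivalent work here through the explicit "<" prefix.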
The compression option uses zlib's "deflate" method. The program uses deflate() with Z_FULL_FLUSH after each record, such that random record decompression is possible after the first [dummy] record is decompressed internally.
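
The full-flush technique can be tried out with Python's zlib module. The sketch below is not the gcdbz code: the function names and the content of the dummy record are hypothetical, and the chunks are kept in a list instead of being written to a file with their offsets stored in an index.

```python
import zlib
from typing import List

DUMMY = b"dummy\n"  # hypothetical placeholder for the first [dummy] record


def compress_records(records: List[bytes]) -> List[bytes]:
    """Compress records in one stream, with a full flush after each one."""
    compressor = zlib.compressobj()
    chunks = []
    for record in [DUMMY] + list(records):
        # Z_FULL_FLUSH ends the chunk on a byte boundary and resets the
        # compressor state, so later chunks never refer back to this one.
        chunks.append(compressor.compress(record)
                      + compressor.flush(zlib.Z_FULL_FLUSH))
    return chunks  # chunks[0] holds the stream header and the dummy record


def read_record(chunks: List[bytes], number: int) -> bytes:
    """Decompress record 'number' (1-based) without touching the others."""
    decompressor = zlib.decompressobj()
    decompressor.decompress(chunks[0])  # consume the header and the dummy
    return decompressor.decompress(chunks[number])


if __name__ == "__main__":
    recs = [b">a\nACGTACGT\n", b">b\nGGGGCCCC\n", b">c\nTTTTAAAA\n"]
    stored = compress_records(recs)
    print(read_record(stored, 3))
```

Because every chunk is self-contained after the flush, only the dummy chunk and the requested chunk need to be read, which is what makes random access possible.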
The index file contains an info chunk (actually stored at the end of the file) which holds summary data and flags about the indexing process (the -s option of cdbyank shows this info). Since the compression option was added, cdbyank always tries to read this information first (before opening the data file) in order to determine whether the data records are compressed.

Please let me know if you notice any problems with these tools.

--
Geo Pertea
geo.pertea@gmail.com
06/09/2003
 

debian/copyright0000644000000000000000000000264512064076104011130 0ustar
Format: http://www.debian.org/doc/packaging-manuals/copyright-format/1.0/
Upstream-Name: cdbfasta
Upstream-Contact: Geo Pertea ,
Source: http://sourceforge.net/projects/cdbfasta/files/

Files: *
Copyright: © 2002-2010 Geo Pertea ,
           Valentin Antonescu
           The Institute for Genomic Research
License: PD
 Date: Mon, 17 Dec 2012 11:26:54 -0500
 From: Geo
 To: andreas@an3as.eu
 CC: valentin antonescu
 Subject: Re: Fwd: Please forward to Geo Pertea: What exat license for cdbfasta?
 .
 Hi Andreas,
 I don't think about the licensing issue much and this may be related to the
 fact that the cdbfasta code if heavily based on Daniel J. Bernstein 's cdb
 source code which has been notably license-free .
 However, since DJB released his individual source code files to public domain
 we also consider the whole cdbfasta code to be public domain, free and open
 source software.
 .
 Thanks,
 --geo

Files: debian/*
Copyright: © 2010 Steffen Moeller
           © 2012 Andreas Tille
License: GPL-3
 On Debian systems, the full text of the GNU General Public License version 3
 can be found in the file `/usr/share/common-licenses/GPL-3'.
debian/source/0000755000000000000000000000000011551625337010476 5ustar
debian/source/format0000644000000000000000000000001411500166521011671 0ustar
3.0 (quilt)
debian/get-orig-source0000755000000000000000000000067011624177613012142 0ustar
#!/bin/bash
set -u
set -e

# Get original source and set version based on date
# Arbitrarily chose Heanet SF mirror

mkdir -vp ../tarballs
wget -a /dev/stdout -P ../tarballs -N -S http://heanet.dl.sourceforge.net/project/cdbfasta/cdbfasta.tar.gz | tee ../tarballs/wget.log
FILEDATE=`grep '^ Last-Modified: ' ../tarballs/wget.log | cut -b 18- | xargs -0 date +'%Y%m%d' -d`
ln -s tarballs/cdbfasta.tar.gz ../cdbfasta_$FILEDATE.orig.tar.gz
debian/changelog0000644000000000000000000000176112064117035011044 0ustar
cdbfasta (0.99-20100722-1) unstable; urgency=low

  * Initial upload to Debian (Closes: #696233)
  * debian/cdbfasta_usage.html: Add separate upstream documentation
  * debhelper 9 (control+compat) + some manual changes for hardening

 -- Andreas Tille Tue, 18 Dec 2012 16:41:28 +0100

cdbfasta (0.981-20100722-1ubuntu3) lucid; urgency=low

  * Moved manpage generation to debian/rules as per advice from Charles Plessy
  * Quick and dirty workaround for Lintian false positive (error message was
    being seen as evidence of statically linked ZLib).

 -- Tim Booth Mon, 11 Apr 2011 14:43:39 +0100

cdbfasta (0.981-20100722-1ubuntu2) lucid; urgency=low

  * Continued work on package - added basic manapages
  * Changed description in control file

 -- Tim Booth Fri, 08 Apr 2011 18:05:51 +0100

cdbfasta (0.981-20100722-1) UNRELEASED; urgency=low

  * First preparation

 -- Steffen Moeller Thu, 02 Dec 2010 19:02:35 +0100
debian/patches/0000755000000000000000000000000012064107307010615 5ustar
debian/patches/workaround-lintian-false-positive0000644000000000000000000000124511624177613017331 0ustar
# Lintian uses some string detection heuristics to spot statically linked Zlib
# and this string triggers the warning. This change keeps Lintian happy and makes
# the error message gramatically correct.
--- a/gcdbz.cpp
+++ b/gcdbz.cpp
@@ -65,7 +65,7 @@
   err = deflate(&zstream, Z_FULL_FLUSH);
   zsize=zstream.total_out-t_out;
   if ((err !=Z_OK && err!=Z_STREAM_END) || zsize<=0)
-    GError("GCdbz error: deflate 1st record failed! (err=%d)\n", err);
+    GError("GCdbz error: deflating 1st record failed! (err=%d)\n", err);
   //now write the header and the dummy record
   //in case this was not done before:
   gcvt_uint=(endian_test())? &uint32_sun : &uint32_x86;
debian/patches/hardening.patch0000644000000000000000000000115112064107307013573 0ustar
Description: Regard hardening CFLAGS setting
 Unfortunately I did not succeeded in also injecting LDFLAGS in the same
 manner. It was simply ignored. So I decided to override the LINKER
 variable in debian/rules.
Author: Andreas Tille
Date: Tue, 18 Dec 2012 16:41:28 +0100

--- a/Makefile
+++ b/Makefile
@@ -33,9 +33,9 @@ else
 endif
 
 ifeq ($(findstring nommap,$(MAKECMDGOALS)),)
-  CFLAGS = $(DBGFLAGS) $(BASEFLAGS)
+  CFLAGS := $(DBGFLAGS) $(BASEFLAGS) $(CFLAGS)
 else
-  CFLAGS = $(DBGFLAGS) $(BASEFLAGS) -DNO_MMAP
+  CFLAGS := $(DBGFLAGS) $(BASEFLAGS) -DNO_MMAP $(CFLAGS)
 endif
 
 %.o : %.c
debian/patches/series0000644000000000000000000000006212064077006012032 0ustar
workaround-lintian-false-positive
hardening.patch
debian/doc-base0000644000000000000000000000107012064075623010571 0ustar
Document: cdbfasta
Title: CDB (Constant DataBase) indexing and retrieval tools for multi-FASTA files
Author: Geo Pertea
Abstract: Constant DataBase indexing and retrieval tools for multi-FASTA files
 CDB (Constant DataBase) can be used for creating indices for quick retrieval
 of any particular sequences from large multi-FASTA files. It has the option to
 compress data records in order to save space.
Section: Science/Biology

Format: html
Files: /usr/share/doc/cdbfasta/cdbfasta_usage.html
Index: /usr/share/doc/cdbfasta/cdbfasta_usage.html
debian/compat0000644000000000000000000000000212064076371010372 0ustar
9
debian/dirs0000644000000000000000000000002311551654050010047 0ustar
usr/share/man/man1
debian/docs0000644000000000000000000000004212064075352010041 0ustar
README
debian/cdbfasta_usage.html