FALCON-0.1.3/MANIFEST.in

include src/c/*

FALCON-0.1.3/README.md

Falcon
===========

Falcon: a set of tools for fast aligning long reads for consensus and assembly

The Falcon tool kit is a collection of simple code that I use for studying
efficient assembly algorithms for haploid and diploid genomes. It has some
back-end code implemented in C for speed and some simple front-end code
written in Python for convenience.

Please take a look at the `readme.md` file inside the `examples` directory. It
shows how to do an assembly using `HBAR-DTK` + `Falcon` on Amazon EC2 with a
`StarCluster` setup. If anyone knows of anything comparable to `StarCluster`
for Google Compute Engine, please let me know. I can build a VM there too.

FILES
-----

Here is a brief description of the files in the package.

Several C files implementing sequence matching, alignment and consensus:

    kmer_lookup.c  # kmer match code for quickly identifying potential hits
    DW_banded.c    # function for detailed sequence alignment
                   # It is based on Eugene Myers' paper
                   # "An O(ND) difference algorithm and its variations", 1986,
                   # http://dx.doi.org/10.1007/BF01840446
    falcon.c       # functions for generating consensus sequences for a set of
                   # multiple sequence alignments
    common.h       # header file for common declarations

A python wrapper library using Python's ctypes to call the C functions:

    falcon_kit.py

Some python scripts for (1) overlapping reads, (2) generating consensus, and
(3) generating assembly contigs:

    falcon_overlap.py  # an overlapper
    falcon_wrap.py     # generates consensus from a group of reads
    get_rdata.py       # a utility for preparing data for falcon_wrap.py
    falcon_asm.py      # takes the overlapping information and the sequences to
                       # generate assembled contigs
    falcon_fixasm.py   # a script analyzing the assembly graph and breaking
                       # contigs at potential mis-assembly points
    remove_dup_ctg.py  # a utility to remove duplicated contigs from the
                       # assembly results

INSTALLATION
------------

You need to install `pbcore` and `networkx` first. You might also want to
install `HBAR-DTK` if you want to assemble genomes from raw PacBio data. On a
Linux box, you should be able to use the standard `python setup.py install` to
compile the C code and install the python package. There is no standard way to
install the shared objects from the C code inside a python package, so I did
some hacking to make it work. It might have some unexpected behavior. You can
simply install the `.so` files in a path where the operating system can find
them (e.g. by setting the environment variable `LD_LIBRARY_PATH`) and remove
all path prefixes in the Python `ctypes` `CDLL` function calls.
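For illustration, here is a minimal sketch of how the `ctypes`/`CDLL` mechanism
described above can drive `DW_align.so` by hand. The struct layout and the
`align` signature follow `src/c/common.h`; the library name/path and the toy
sequences are my assumptions (the real wrapper shipped with the package is
`falcon_kit.py`):

    from ctypes import CDLL, POINTER, Structure, c_char_p, c_int, c_long

    class Alignment(Structure):
        # mirrors the `alignment` struct in src/c/common.h (seq_coor_t == long int)
        _fields_ = [("aln_str_size", c_long), ("dist", c_long),
                    ("aln_q_s", c_long), ("aln_q_e", c_long),
                    ("aln_t_s", c_long), ("aln_t_e", c_long),
                    ("q_aln_str", c_char_p), ("t_aln_str", c_char_p)]

    lib = CDLL("DW_align.so")  # assumes the .so is on a path the loader searches
    lib.align.argtypes = [c_char_p, c_long, c_char_p, c_long, c_long, c_int]
    lib.align.restype = POINTER(Alignment)
    lib.free_alignment.argtypes = [POINTER(Alignment)]

    q, t = b"ATTTACGGCATTAGC", b"ATTACGGCATTAGC"   # toy query/target
    aln = lib.align(q, len(q), t, len(t), 100, 1)  # band_tolerance=100, get_aln_str=1
    print(aln.contents.dist, aln.contents.q_aln_str, aln.contents.t_aln_str)
    lib.free_alignment(aln)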
EXAMPLES
--------

Example for generating pre-assembled reads:

    python get_rdata.py queries.fofn targets.fofn m4.fofn 72 0 16 8 64 50 50 | falcon_wrap.py > p-reads-0.fa

The positional parameters are:

    bestn      : 72
    group_id   : 0
    num_chunk  : 16
    min_cov    : 8
    max_cov    : 64
    trim_align : 50
    trim_plr   : 50

It is designed to be used with the m4 alignment information generated by
blasr + HBAR_WF2.py (https://github.com/PacificBiosciences/HBAR-DTK).

Example for generating overlap data:

    falcon_overlap.py --min_len 4000 --n_core 24 --d_core 3 preads.fa > preads.ovlp

Example for generating an assembly:

    falcon_asm.py preads.ovlp preads.fa

The following files will be generated by `falcon_asm.py` in the same directory:

    full_string_graph.adj  # the adjacent nodes of the edges in the full string graph
    string_graph.gexf      # the gexf file of the string graph for graph visualization
    string_graph.adj       # the adjacent nodes of the edges in the string graph after transitive reduction
    edges_list             # full edge list
    paths                  # paths for the unitigs
    unit_edges.dat         # path and sequence of the unitigs
    uni_graph.gexf         # unitig graph in gexf format
    unitgs.fa              # fasta file of the unitigs
    all_tigs_paths         # paths for all final contigs (= primary contigs + associated contigs)
    all_tigs.fa            # fasta file for all contigs
    primary_tigs_paths     # paths for all primary contigs
    primary_tigs.fa        # fasta file for the primary contigs
    asm_graph.gexf         # the assembly graph where the edges are the contigs
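Since `networkx` is already a required dependency, the `.gexf` outputs can be
inspected programmatically as well as in a GUI viewer. A small sketch
(`read_gexf` is a standard networkx call; the file name comes from the list
above, and the branching query is only an illustration):

    import networkx as nx

    sg = nx.read_gexf("string_graph.gexf")  # produced by falcon_asm.py
    print(sg.number_of_nodes(), sg.number_of_edges())

    # nodes are read ends such as "00099576_1:B"; as one example of a query,
    # count the nodes where the graph branches
    branching = [n for n in sg if sg.degree(n) > 2]
    print(len(branching))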
Although I have tested this tool kit on genomes up to 150 Mb and obtained
reasonably good assembly results, it is still highly experimental and is not
meant for novice users. If you would like to try it out, you will very likely
need to learn more of the details and be able to tweak the code to adapt it to
your computation cluster. I hope I can provide more details and clean the code
up a little in the future so it can be useful for more people.

The principle of the layout algorithm is also available at
https://speakerdeck.com/jchin/string-graph-assembly-for-diploid-genomes-with-long-reads

ABOUT THE LICENSE
------------------

The major part of the coding work was done in my own time and on my own
MacBook(R) Air. However, as a PacBio(R) employee, most of the testing was done
with data generated by PacBio and with PacBio's computational resources, so it
is fair that the code is released with PacBio's version of an open source
license. If you are from a competitor and try to take advantage of any open
source code from PacBio, the only way you can really justify such a practice
is to release your real data to the public and your code as open source too.
Also, releasing this code to the public is fully at my own discretion. If my
employer has any concern about this, I might have to pull it off.

Standard PacBio Open Source License that is associated with this package:

    #################################################################################$$
    # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
    #
    # All rights reserved.
    #
    # Redistribution and use in source and binary forms, with or without
    # modification, are permitted (subject to the limitations in the
    # disclaimer below) provided that the following conditions are met:
    #
    # * Redistributions of source code must retain the above copyright
    # notice, this list of conditions and the following disclaimer.
    #
    # * Redistributions in binary form must reproduce the above
    # copyright notice, this list of conditions and the following
    # disclaimer in the documentation and/or other materials provided
    # with the distribution.
    #
    # * Neither the name of Pacific Biosciences nor the names of its
    # contributors may be used to endorse or promote products derived
    # from this software without specific prior written permission.
    #
    # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
    # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
    # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
    # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
    # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
    # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
    # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
    # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
    # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
    # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
    # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
    # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
    # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
    # SUCH DAMAGE.
    #################################################################################$$

--Jason Chin, Dec 16, 2013

FALCON-0.1.3/doc/file_format_note.md

Quick Note on FALCON Assembly Output Format
============================================

After running `falcon_asm.py`, the following files will be generated:

- `edges_list`: the list of edges in the assembled string graph
- `unit_edge_paths`: the path of each unitig
- `unit_edges.dat`: the path and the sequence of each unitig
- `unitgs.fa`: fasta file of all unitigs
- `all_tigs_paths`: the paths of all contigs
- `all_tigs.fa`: the sequences of all contigs
- `primary_tigs_paths`: the paths of the primary contigs
- `primary_tigs.fa`: the sequences of the initial primary contigs
- `bundle_edges`: the edges and paths of each "string bundle"

After running `falcon_fixasm.py`, it generates the following files:

- `primary_tigs_c.fa`: the final primary contigs
- `primary_tigs_paths_c`: the paths of the final primary contigs
- `all_tiling_path_c`: the "tiling" path of all contigs
- `primary_tigs_node_pos_c`: the positions of the nodes in each of the primary contigs

The format of each node is the identifier of the DNA fragment followed by `:B`
or `:E`, indicating which end of the read the node corresponds to.

The `edges_list` file has a simple 4-column format: `in_node out_node
edge_label overlap_length`. Here is an example of how edges are represented in
the `edges_list` file:

    00099576_1:B 00101043_0:B 00101043_0:1991-0 14333
    00215514_0:E 00025025_0:B 00025025_0:99-0 14948
    00223367_0:E 00146924_0:B 00146924_0:1188-0 8452
    00205542_0:E 00076625_0:B 00076625_0:396-0 11067

The `edge_label`, e.g. `00101043_0:1991-0`, encodes the stretch of the DNA
fragment that the edge corresponds to. The edge `00099576_1:B -> 00101043_0:B`
carries the sequence of read `00101043_0` from base 1991 to base 0.
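A few lines of Python are enough to pull these records into a usable form.
This parser is only a sketch written against the 4-column format and label
convention just described (the helper name is made up):

    # hypothetical helper for the edges_list format described above
    def parse_edge_line(line):
        in_node, out_node, edge_label, overlap_len = line.split()
        # an edge_label such as "00101043_0:1991-0" is the read id, then
        # the begin-end coordinates of the edge sequence within that read
        read_id, span = edge_label.split(":")
        begin, end = (int(x) for x in span.split("-"))
        return in_node, out_node, read_id, begin, end, int(overlap_len)

    with open("edges_list") as f:
        for line in f:
            print(parse_edge_line(line))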
The `unit_edge_paths` file contains the path of each unitig. Each line
represents one unitig. For example, the unitig `00001c` is represented as:

    >00001c-00169881_0:B-00121915_0:E-133 00169881_0:B 00201238_0:E 00137179_0:E 00142410_0:B 00223493_0:B 00208425_0:B 00102538_0:E 00160115_0:E ... 00122905_0:E 00121915_0:E

The full unitig id `00001c-00169881_0:B-00121915_0:E-133` includes the unique
serial number `00001c`, the begin node `00169881_0:B` and the end node
`00121915_0:E`, followed by the number of nodes (133) in the path. The rest of
the fields list the full path, node by node.

The `primary_tigs_paths` and `all_tigs_paths` files have the same format as
`unit_edge_paths`, except that the edges in the path are unitig edges rather
than edges in the original string graph.

The `unit_edges.dat` file contains not only the begin node, the end node and
the path of each unitig but also the full sequence of each unitig. It has a
simple 4-column format: `begin_node`, `end_node`, `path`, `sequence`. The
nodes in the path are delimited by `-`.

The sequence identifiers in `all_tigs.fa` also encode the relationship between
different contigs. For example:

    $ grep ">" all_tigs.fa | head -15
    >0000-0000 2e8a7078_130260_0:B-02eca7b8_135520_0:E
    >0000-0001 6edbcd5c_128868_0:E-3353572d_72448_963:E
    >0000-0002 2f1c350c_15083_0:E-8c92434f_60400_0:E
    >0000-0003 02eca7b8_135520_0:B-02030999_5577_0:B
    >0000-0004-u 53756d78_87035_13099:B-d850f3f2_135807_0:E
    >0000-0005-u 80ae02b0_43730_1168:B-4901e842_5163_2833:B
    >0000-0006-u e1709413_155764_0:E-e55b636f_50757_0:E
    >0000-0007-u e56a70f0_80897_1520:E-06734432_150537_0:E
    >0000-0008-u 1ab64aad_59082_807:E-6f9ad27e_23458_5638:E
    >0000-0009-u 1a88ddf4_21715_0:B-9eb4f7d7_79023_11041:E
    >0000-0010-u ada57c82_24446_0:E-4ce44ebc_41426_0:E
    >0000-0011-u 49704ee2_54679_0:B-a9ced3cc_90191_1410:E
    >0000-0012-u b3728b6f_59022_233:E-bd1579e4_160424_0:B

All these sequences have the same first field, `0000`, which means all these
contigs are initialized from the same "string bundle". If the second field is
`0000`, that sequence is the primary contig of the bundle. The rest are the
"associated contigs". The second field of each header line (after the
whitespace) simply indicates the begin and the end node of the contig.

After running `falcon_fixasm.py`, some of the primary contigs may be broken
apart into smaller pieces. For example:

    $ grep ">" primary_tigs_c.fa | head -15
    >0000_00
    >0001_00
    >0001_01
    >0001_02
    >0002_00
    >0002_01

In this case, the initial primary contig `0000` (`0000-0000` in the
`all_tigs.fa` file) is intact. However, `0001-0000` has been broken into 3
primary contigs: `0001_00`, `0001_01`, and `0001_02`.

Some of the associated contigs might be caused by sequencing / consensus
errors or by missing overlap information. Running `falcon_dedup.py` compares
the associated contigs to the corresponding sequences in the primary contigs.
If the identity is high, namely no large-scale variants are found, they are
removed. The MUMmer3 (Nucmer) package is used for this step and is therefore
required. `falcon_dedup.py` generates a file called `a_nodup.fa` which
contains the non-redundant associated contigs.

Input File Format For FalconSense
---------------------------------

The `falcon_sense.py` script generates consensus from a set of raw sequences.
The input is a stream of sequences in which each row has two columns.
Different sets of reads are delimited by a `- -` row, and the file should be
terminated by a `+ +` row. Here is an example:

    seed_id1 ACTACATACATACTTA...
    read_id2 TCTGGCAACACTACTTA...
    ...
    - -
    seed_id2 ACTACATACATACTTA...
    read_id3 TCTGGCAACACTACTTA...
    ...
    - -
    + +

In this case, if there is enough coverage to correct `seed_id1` and
`seed_id2`, `falcon_sense.py` will generate two consensus sequences (labeled
with `seed_id1` and `seed_id2`) in fasta format on `stdout`.
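For instance, a driver could stream grouped reads into `falcon_sense.py` in
this format. A minimal sketch (Python 2, matching the rest of the package; the
read groups and the bare invocation without command-line options are
assumptions):

    import subprocess

    # hypothetical groups: each is a seed read followed by its supporting reads
    groups = [[("seed_id1", "ACTACATACATACTTA"), ("read_id2", "TCTGGCAACACTACTTA")],
              [("seed_id2", "ACTACATACATACTTA"), ("read_id3", "TCTGGCAACACTACTTA")]]

    proc = subprocess.Popen(["falcon_sense.py"], stdin=subprocess.PIPE)
    for group in groups:
        for name, seq in group:
            proc.stdin.write("%s %s\n" % (name, seq))
        proc.stdin.write("- -\n")  # closes one set of reads
    proc.stdin.write("+ +\n")      # terminates the whole stream
    proc.stdin.close()
    proc.wait()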
Final Note
----------

1. Typically, the size of `unitgs.fa` will be roughly twice the genome size,
since the file contains both dual edges from each overlap. In the process of
the assembly, only one of the dual edges will be used in the final contigs.

2. The relation between the associated contigs and the primary contigs can be
identified simply from the begin and end nodes of the associated contigs. One
can easily construct the corresponding sequences in the primary contigs to
identify the variants between them.

3. One can construct a unitig graph from the `unit_edge_paths` file. This
graph is typically much smaller than the initial string graph, which makes it
more convenient to visualize for understanding the assembly/genome structure.

4. The `-` and `:` characters are used as delimiters for parsing, so the
initial read identifiers should not contain these two characters.

FALCON-0.1.3/examples/Dmel_asm.md

Dmel Assembly with FALCON on Amazon EC2
=========================================

Preparation for Running StarCluster
-----------------------------------

I use a development version of StarCluster, since the stable version does not
support the kind of instance that we need to use in AWS EC2. You can install
the development version by directly cloning StarCluster's GitHub repository.
The following is a simple example of installing the development version. You
might have to install other python packages that StarCluster depends on.

```
git clone https://github.com/jtriley/StarCluster.git
cd StarCluster
# you can check out the exact revision that I am using for this document
git checkout 4149bbed292b0298478756d778d8fbf1dd210daf
python setup.py install
```

For using StarCluster to create an SGE cluster in AWS EC2, I assume you
already know how to create an AWS EC2 account and have gone through the
tutorial on running VMs in EC2.

I have built a public EC2 EBS snapshot. You should create a new EBS volume
using the `PacBio_Dmel_Asm / snap-19e7a0df` snapshot. It already contains the
raw sequence fasta files and an example assembly.

Here is an example of the configuration for StarCluster:

```
[aws info]
aws_access_key_id = your_access_key
aws_secret_access_key = your_secret_access_key
aws_user_id = your_user_id

[volume DMEL]
volume_id=your_dmel_data_EBS_id #e.g volume_id=vol-c9df3b85
mount_path=/mnt/dmel_asm

[cluster falcon-pre-asm]
keyname = starcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
master_image_id = ami-ef3c0e86
master_instance_type = c3.8xlarge
node_image_id = ami-ef3c0e86
node_instance_type = c3.8xlarge
availability_zone = us-east-1a
volumes = DMEL

[cluster falcon-bigmem]
keyname = starcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
master_image_id = ami-73d2d21a
master_instance_type = cr1.8xlarge
node_image_id = ami-73d2d21a
node_instance_type = cr1.8xlarge
availability_zone = us-east-1a
volumes = DMEL

[global]
default_template = falcon-bigmem
ENABLE_EXPERIMENTAL=True
```

I set up two cluster configurations for the different parts of the assembly
process. If you want to run end-to-end on one kind of instance, you can just
use `falcon-bigmem` for the whole assembly. It costs a little bit more. The
AMI images (ami-ef3c0e86 and ami-73d2d21a) are pre-built with most of the
packages necessary for the assembly work.
If you would like to build your own, you can check this script:

```
https://raw.github.com/PacificBiosciences/FALCON/v0.1.1/examples/install_note.sh
```

Get preassembled reads
------------------------

"Pre-assembly" is the process of error-correcting PacBio reads to generate
"preassembled reads" (p-reads), which have good enough accuracy to be
assembled directly by traditional Overlap-Layout-Consensus assembly
algorithms. In this instruction, we use an experimental code, `falcon_qrm.py`,
to match the reads for error correction. It is much faster than using `blasr`
for the same purpose, but it may not be as robust as `blasr` for generating
high-quality results yet, as many statistical properties of the algorithm have
not been fully studied.

First, let's start an EC2 cluster of one node to set up a few things by
running the following `starcluster` command:

```
starcluster start -c falcon-pre-asm falcon
```

Once the cluster is built, one can log in to the master node with:

```
starcluster sshmaster falcon
```

We need the following steps to set up the running environment:

1. update the SGE environment

```
cd /mnt/dmel_asm/sge_setup
bash sge_setup.sh
```

2. set up the HBAR-DTK environment

```
. /home/HBAR_ENV/bin/activate
```

3. update HBAR-DTK and falcon_asm

```
cd /mnt/dmel_asm/packages/pbtools.hbar-dtk-0.1.5
python setup.py install
cd /mnt/dmel_asm/packages/falcon_kit-0.1.1
#edit falcon_asm.py to set the identity threshold for overlapping at 98%; this is already done in the EBS snapshot
python setup.py install
```

If you want to do an assembly in `/mnt/dmel_asm/new_asm/`, just clone the
configuration from `/mnt/dmel_asm/asm_template/` to `/mnt/dmel_asm/new_asm/`:

```
cd /mnt/dmel_asm/
cp -a asm_template/ new_asm/
cd new_asm/
```

An example of the assembly result can be found in `/mnt/dmel_asm/asm_example`.

You can start the pre-assembly stage by running the `HBAR_WF3.py` script as
follows:

```
python HBAR_WF3.py HBAR_step1.cfg
```

It will take a while to prepare the fasta files for pre-assembly. Once that is
done, SGE jobs for matching reads will be submitted. Once the SGE jobs are
submitted, you can add more nodes to run the jobs concurrently and speed up
the process by issuing this command on your local host:

    starcluster addnode -n 15 falcon # add 15 nodes

When all nodes are up, you can try to run the load balancer so that once the
jobs are done, the nodes can be terminated automatically to save some money:

    starcluster loadbalance -k 9 -K -m 16 -n 1 falcon

I found I had to comment out one line of code in `starcluster/plugins/sge.py`
to make it work properly for removing unused nodes:

    class SGEPlugin(clustersetup.DefaultClusterSetup):

        def _remove_from_sge(self, node):
            #comment out the following line in the code
            #master.ssh.execute('qconf -de %s' % node.alias)

If you use 16 nodes, it takes about 4 hours to finish all jobs. When all
pre-assembly jobs finish, the cluster will be terminated automatically, but
the results will be kept in the EBS volume. The generated p-reads will be in
`/mnt/dmel_asm/new_asm/2-preads-falcon/pread_*.fa`.

Assembling the p-reads
------------------------

We use a different instance type, which has more memory, to assemble the
genome. We only need one node for the assembly part. We still use SGE, as the
code was written to run end-to-end assembly on a general SGE cluster.

First, start a single-node cluster by running this command on the local host:

```
starcluster start -c falcon-bigmem falcon
```

Repeat the setup process:

```
cd /mnt/dmel_asm/sge_setup
bash sge_setup.sh
. /home/HBAR_ENV/bin/activate
cd /mnt/dmel_asm/packages/pbtools.hbar-dtk-0.1.5
python setup.py install
cd /mnt/dmel_asm/packages/falcon_kit-0.1.1
#edit falcon_asm.py to set the identity threshold for overlapping at 98%; this is already done in the EBS snapshot
python setup.py install
```

You can start the assembly stage by running the `HBAR_WF3.py` script as
follows:

```
cd /mnt/dmel_asm/new_asm/
python HBAR_WF3.py HBAR_step2.cfg
```

It takes about two hours for the assembly process to finish. The results will
be in `/mnt/dmel_asm/new_asm/3-asm-falcon`. Here is a list of the output
files:

```
full_string_graph.adj  # the adjacent nodes of the edges in the full string graph
string_graph.gexf      # the gexf file of the string graph for graph visualization
string_graph.adj       # the adjacent nodes of the edges in the string graph after transitive reduction
edges_list             # full edge list
paths                  # paths for the unitigs
unit_edges.dat         # path and sequence of the unitigs
uni_graph.gexf         # unitig graph in gexf format
unitgs.fa              # fasta file of the unitigs
all_tigs_paths         # paths for all final contigs (= primary contigs + associated contigs)
all_tigs.fa            # fasta file for all contigs
primary_tigs_paths     # paths for all primary contigs
primary_tigs.fa        # fasta file for the primary contigs
primary_tigs_paths_c   # paths for all primary contigs, with detectable mis-assemblies broken
primary_tigs_c.fa      # fasta file for the primary contigs, with detectable mis-assemblies broken
asm_graph.gexf         # the assembly graph where the edges are the contigs
```

There might be redundant contigs. The following script can be used to remove
them:

```
export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23
nucmer -mum all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null
show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids
remove_dup_ctg.py
cat p-tigs_nodup.fa a-tigs_nodup.fa > pa-tigs_nodup.fa
```

The non-redundant set of contigs in `pa-tigs_nodup.fa` will be suitable for
further correction by the Quiver algorithm.

- Jason Chin, March 9, 2014

FALCON-0.1.3/examples/HBAR.cfg

[General]

# list of the initial bas.h5 files
input_fofn = input.fofn

# The length cutoff used for seed reads used for initial mapping
length_cutoff = 10000

# The length cutoff used for seed reads used for pre-assembly
length_cutoff_pr = 10000

# The read quality cutoff used for seed reads
RQ_threshold = 0.75

# SGE job option for distributed mapping
sge_option_dm = -pe smp 32 -q all.q

# SGE job option for m4 filtering
sge_option_qf = -pe smp 4 -q all.q

# SGE job option for pre-assembly
sge_option_pa = -pe smp 32 -q all.q

# SGE job option for CA
sge_option_ca = -pe smp 8 -q all.q

# SGE job option for Quiver
sge_option_qv = -pe smp 32 -q all.q

# blasr options for initial read-read mapping for each chunk (do not specify the "-out" option).
# One might need to tune the bestn parameter to match the number of distributed chunks to get more optimized results
blasr_opt = -nCandidates 64 -minMatch 12 -maxLCPLength 15 -bestn 48 -minPctIdentity 75.0 -maxScore -1000 -nproc 32

# This is used for running quiver
SEYMOUR_HOME = /mnt/secondary/Smrtpipe/builds/Assembly_Mainline_Nightly_Archive/build470-116466/

# The number of best alignment hits used for pre-assembly
# It should be about the same as the final PLR coverage; slightly higher might be OK.
bestn = 64

# target choices are "pre_assembly", "draft_assembly", "all"
# "mapping": initial mapping
# "pre_assembly" : generate the pre-assembly for any long read assembler to use
# "draft_assembly": automatically submit the CA assembly job when pre-assembly is done
# "all" : submit a job for using Quiver to do the final polish; not working yet
target = pre_assembly

# number of chunks for pre-assembly.
preassembly_num_chunk = 1

q_chunk_size = 1
t_chunk_size = 3

# "tmpdir" is for preassembly. A lot of small files are created and deleted during this process.
# It would be great to use a ramdisk for this. Setting tmpdir to an NFS mount will probably give very bad performance.
tmpdir = /tmp

# "big_tmpdir" is for quiver; better on a big disk
big_tmpdir = /tmp

# various trimming parameters
min_cov = 8
max_cov = 64
trim_align = 50
trim_plr = 50

# number of processes used by blasr during the preassembly process
q_nproc = 16

concurrent_jobs = 1

FALCON-0.1.3/examples/StarCluster.cfg

[aws info]
aws_access_key_id = your_key
aws_secret_access_key = your_access_key
aws_user_id = your_user_id

[key starcluster]
key_location = ~/.ec2/starcluster.rsa

[cluster falcon]
#The AMI image is based on ami-765b3e1f us-east-1 starcluster-base-ubuntu-12.04-x86_64
keyname = starcluster
cluster_size = 1
cluster_user = sgeadmin
cluster_shell = bash
master_image_id = ami-ef3c0e86
master_instance_type = c3.8xlarge
node_image_id = ami-ef3c0e86
node_instance_type = c3.8xlarge
availability_zone = us-east-1c

[global]
default_template = falcon
ENABLE_EXPERIMENTAL=True

FALCON-0.1.3/examples/install_note.sh

# This is the script that will build everything needed to generate an assembly
# on top of the StarCluster Ubuntu AMI

HBAR_ROOT=/home
mkdir -p $HBAR_ROOT/HBAR_ENV
export HBAR_HOME=$HBAR_ROOT/HBAR_ENV/
sudo apt-get install python-virtualenv
virtualenv -p /usr/bin/python2.7 $HBAR_HOME
cd $HBAR_HOME
. bin/activate
pip install numpy==1.6.2
sudo apt-get install python-dev
pip install numpy==1.6.2

wget http://www.hdfgroup.org/ftp/HDF5/prev-releases/hdf5-1.8.9/src/hdf5-1.8.9.tar.gz
tar zxvf hdf5-1.8.9.tar.gz
cd hdf5-1.8.9
./configure --prefix=$HBAR_HOME --enable-cxx
make install
cd ..

wget http://h5py.googlecode.com/files/h5py-2.0.1.tar.gz
tar zxvf h5py-2.0.1.tar.gz
cd h5py-2.0.1
python setup.py build --hdf5=$HBAR_HOME
python setup.py install
cd ..

pip install git+https://github.com/PacificBiosciences/pbcore.git#pbcore
sudo apt-get install git
pip install git+https://github.com/PacificBiosciences/pbcore.git#pbcore
pip install git+https://github.com/PacificBiosciences/pbdagcon.git#pbdagcon
pip install git+https://github.com/PacificBiosciences/pbh5tools.git#pbh5tools
pip install git+https://github.com/cschin/pypeFLOW.git#pypeflow
pip install rdflib==3.4.0
pip install git+https://github.com/PacificBiosciences/HBAR-DTK.git#hbar-dtk
pip install git+https://github.com/PacificBiosciences/FALCON.git#falcon

git clone https://github.com/PacificBiosciences/blasr.git
cd blasr
export HDF5INCLUDEDIR=/home/HBAR_ENV/include/
export HDF5LIBDIR=/home/HBAR_ENV/lib/
make
cp alignment/bin/blasr ../bin/
cp alignment/bin/sawriter ../bin/
cp pbihdfutils/bin/samFilter ../bin
cp pbihdfutils/bin/samtoh5 ../bin
cd ..

wget http://downloads.sourceforge.net/project/boost/boost/1.47.0/boost_1_47_0.tar.gz
tar zxvf boost_1_47_0.tar.gz
cd boost_1_47_0/
bash bootstrap.sh
./b2 install -j 24 --prefix=$HBAR_ROOT/HBAR_ENV/boost
cd ..
sudo apt-get install libpcre3 libpcre3-dev
wget http://downloads.sourceforge.net/project/swig/swig/swig-2.0.11/swig-2.0.11.tar.gz
tar zxvf swig-2.0.11.tar.gz
cd swig-2.0.11
./configure --prefix=$HBAR_ROOT/HBAR_ENV
make
make install
cd ..

git clone https://github.com/PacificBiosciences/ConsensusCore.git
cd ConsensusCore/
python setup.py install --swig=$HBAR_ROOT/HBAR_ENV/bin/swig --boost=$HBAR_ROOT/HBAR_ENV/boost/include/
cd ..

pip install git+https://github.com/PacificBiosciences/GenomicConsensus.git#GenomicConsensus
pip install git+https://github.com/PacificBiosciences/pbalign#pbalign

wget http://downloads.sourceforge.net/project/mummer/mummer/3.23/MUMmer3.23.tar.gz
tar zxvf MUMmer3.23.tar.gz
cd MUMmer3.23/
make install
cd ..
export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23

wget http://downloads.sourceforge.net/project/samtools/samtools/0.1.19/samtools-0.1.19.tar.bz2
tar jxvf samtools-0.1.19.tar.bz2
cd samtools-0.1.19
make
cp samtools ../bin
cd ..

FALCON-0.1.3/examples/readme.md

Running an Amazon EC2 instance that has HBAR-DTK + Falcon pre-installed
=======================================================================

1. Install the latest version of StarCluster:

```
git clone https://github.com/jtriley/StarCluster.git
cd StarCluster
python setup.py install #better in virtualenv
```

The stable version of StarCluster does not support the `c3` instance type. For
assembly, using one node of the `c3.8xlarge` instance type is most convenient.
In my test, I can finish a single E. coli genome in roughly one hour. Namely,
one can assemble a bacterial genome for less than 5 bucks.

2. Use `StarCluster.cfg` as the configuration file for `StarCluster` to set up
a `falcon` cluster.

3. Start the cluster:

```
starcluster start falcon
```

4. Log in to the cluster:

```
starcluster sshmaster falcon
```

5. Set up SGE:

```
cd /home/sge_setup
bash sge_setup.sh
```

6. There are already existing assembly results in `/home/Ecoli_ASM/`. Here I
show how to reproduce them. First, create a new assembly working directory in
`/mnt`, set it up, and run HBAR_WF3.py to get preassembled reads:

```
cd /mnt
mkdir test_asm
cd test_asm
cp /home/Ecoli_ASM/HBAR.cfg .
cp /home/Ecoli_ASM/input.fofn .
source /home/HBAR_ENV/bin/activate
HBAR_WF3.py HBAR.cfg
```

7. The next part of the assembly does not start automatically yet. The
detailed steps are in the `run_asm.sh` script, which one can use to get
contigs and consensus:

```
cp /home/Ecoli_ASM/run_asm.sh .
bash run_asm.sh
```

The consensus result is in `/mnt/consensus.fasta`. Since we did not do any
consensus after the unitig step, one more round of quiver consensus may
further improve the final assembly accuracy.

8. A yeast (S. cerevisiae W303) data set is also included in the AMI. One can
try to assemble it with a larger cluster setting.
9. Here is the result of a timing test:

```
(HBAR_ENV)root@master:/mnt/test_asm# time HBAR_WF3.py HBAR.cfg
Your job 1 ("mapping_task_q00002_t000011416727c") has been submitted
Your job 2 ("qf_task_q00002a3e75f4c") has been submitted
Your job 3 ("mapping_task_q00003_t00001b667b504") has been submitted
Your job 4 ("qf_task_q000036974ef22") has been submitted
Your job 5 ("mapping_task_q00001_t000017bf52d9c") has been submitted
Your job 6 ("qf_task_q000010b31d960") has been submitted
Your job 7 ("pa_task_000001ee38aee") has been submitted

real    26m51.030s
user    1m10.152s
sys     0m11.993s

(HBAR_ENV)root@master:/mnt/test_asm# time bash run_asm.sh
[WARNING] This .cmp.h5 file lacks some of the QV data tracks that are required
for optimal performance of the Quiver algorithm. For optimal results use the
ResequencingQVs workflow in SMRTPortal with bas.h5 files from an instrument
using software version 1.3.1 or later.

real    13m2.945s
user    244m44.322s
sys     2m7.032s
```

For better results, one might run `quiver` twice. It is possible to get the
whole assembly within one hour (~ 26 + 13 * 2 = 52 minutes). With the overhead
of setting up, file transfer, etc., one can in principle assemble a bacterial
genome on EC2 for less than 5 bucks.

-- Jason Chin, 01/18/2014

FALCON-0.1.3/examples/run_asm.sh

# This script does the assembly and generates the quiver consensus after one gets preassembled reads.
# Modification will be needed for a larger genome or a different computational cluster setup.
# It should be run within the assembly working directory.

mkdir 3-asm-falcon/
cd 3-asm-falcon/
cat ../2-preads-falcon/pread_*.fa > preads.fa
falcon_overlap.py --min_len 8000 --n_core 24 --d_core 1 preads.fa > preads.ovlp
falcon_asm.py preads.ovlp preads.fa
falcon_fixasm.py
export PATH=$PATH:/home/HBAR_ENV/MUMmer3.23
nucmer -maxmatch all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null
show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids
remove_dup_ctg.py
cat p-tigs_nodup.fa a-tigs_nodup.fa > pa-tigs_nodup.fa
cat p-tigs_nodup.fa a-tigs_nodup.fa > /mnt/pa-tigs_nodup.fa
find /home/data/Ecoli/ -name "*.bax.h5" > /mnt/h5_input.fofn
cd /mnt
pbalign.py --forQuiver --nproc 32 --tmpDir /mnt --maxHits 1 h5_input.fofn pa-tigs_nodup.fa output.cmp.h5
samtools faidx pa-tigs_nodup.fa
quiver -j 24 output.cmp.h5 -r pa-tigs_nodup.fa -o variants.gff -o consensus.fasta

FALCON-0.1.3/setup.py

#!/usr/bin/env python

from setuptools import setup
from distutils.core import Extension

setup(name='falcon_kit',
      version='0.1.3',
      description='a small toolkit for DNA sequence alignment, overlapping, and assembly',
      author='Jason Chin',
      author_email='jchin@pacificbiosciences.com',
      packages=['falcon_kit'],
      package_dir={'falcon_kit':'src/py/'},
      ext_modules=[Extension('falcon_kit.DW_align', ['src/c/DW_banded.c'],
                             extra_link_args=["-fPIC", "-O3"]),
                   Extension('falcon_kit.kmer_lookup', ['src/c/kmer_lookup.c'],
                             extra_link_args=["-fPIC", "-O3"]),
                   Extension('falcon_kit.falcon', ['src/c/DW_banded.c', 'src/c/kmer_lookup.c', 'src/c/falcon.c'],
                             extra_link_args=["-fPIC", "-O3"]),
                  ],
      scripts = ["src/py_scripts/falcon_asm.py",
                 "src/py_scripts/falcon_asm_dev.py",
                 "src/py_scripts/falcon_overlap.py",
                 "src/py_scripts/falcon_overlap2.py",
                 "src/py_scripts/falcon_qrm.py",
                 "src/py_scripts/falcon_fixasm.py",
                 "src/py_scripts/falcon_dedup.py",
"src/py_scripts/falcon_ucns_data.py", "src/py_scripts/falcon_utgcns.py", "src/py_scripts/falcon_sense.py", "src/py_scripts/get_rdata.py", "src/py_scripts/remove_dup_ctg.py"], zip_safe = False, install_requires=[ "pbcore >= 0.6.3", "networkx >= 1.7" ] ) FALCON-0.1.3/src/000077500000000000000000000000001237512144000132435ustar00rootroot00000000000000FALCON-0.1.3/src/c/000077500000000000000000000000001237512144000134655ustar00rootroot00000000000000FALCON-0.1.3/src/c/DW_banded.c000077500000000000000000000251201237512144000154430ustar00rootroot00000000000000 /* * ===================================================================================== * * Filename: DW_banded.c * * Description: A banded version for the O(ND) greedy sequence alignment algorithm * * Version: 0.1 * Created: 07/20/2013 17:00:00 * Revision: none * Compiler: gcc * * Author: Jason Chin, * Company: * * ===================================================================================== #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
#################################################################################$$
 */

/* Header names reconstructed from usage (calloc/qsort/bsearch, printf, bool). */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <stdbool.h>
#include "common.h"

int compare_d_path(const void * a, const void * b)
{
    const d_path_data2 * arg1 = a;
    const d_path_data2 * arg2 = b;
    if (arg1->d - arg2->d == 0) {
        return arg1->k - arg2->k;
    } else {
        return arg1->d - arg2->d;
    }
}

void d_path_sort( d_path_data2 * base, unsigned long max_idx)
{
    qsort(base, max_idx, sizeof(d_path_data2), compare_d_path);
}

d_path_data2 * get_dpath_idx( seq_coor_t d, seq_coor_t k, unsigned long max_idx, d_path_data2 * base)
{
    d_path_data2 d_tmp;
    d_path_data2 *rtn;
    d_tmp.d = d;
    d_tmp.k = k;
    rtn = (d_path_data2 *) bsearch( &d_tmp, base, max_idx, sizeof(d_path_data2), compare_d_path);
    //printf("dp %ld %ld %ld %ld %ld %ld %ld\n", (rtn)->d, (rtn)->k, (rtn)->x1, (rtn)->y1, (rtn)->x2, (rtn)->y2, (rtn)->pre_k);
    return rtn;
}

void print_d_path( d_path_data2 * base, unsigned long max_idx)
{
    unsigned long idx;
    for (idx = 0; idx < max_idx; idx++){
        printf("dp %ld %ld %ld %ld %ld %ld %ld %ld\n", idx, (base+idx)->d, (base+idx)->k,
               (base+idx)->x1, (base+idx)->y1, (base+idx)->x2, (base+idx)->y2, (base+idx)->pre_k);
    }
}

alignment * align(char * query_seq, seq_coor_t q_len,
                  char * target_seq, seq_coor_t t_len,
                  seq_coor_t band_tolerance,
                  int get_aln_str)
{
    seq_coor_t * V;
    seq_coor_t * U;  // array of matched bases for each "k"
    seq_coor_t k_offset;
    seq_coor_t d;
    seq_coor_t k, k2;
    seq_coor_t best_m;  // the best "matches" for each d
    seq_coor_t min_k, new_min_k;
    seq_coor_t max_k, new_max_k;
    seq_coor_t pre_k;
    seq_coor_t x, y;
    seq_coor_t cd;
    seq_coor_t ck;
    seq_coor_t cx, cy, nx, ny;
    seq_coor_t max_d;
    seq_coor_t band_size;
    unsigned long d_path_idx = 0;
    unsigned long max_idx = 0;
    d_path_data2 * d_path;
    d_path_data2 * d_path_aux;
    path_point * aln_path;
    seq_coor_t aln_path_idx;
    alignment * align_rtn;
    seq_coor_t aln_pos;
    seq_coor_t i;
    bool aligned = false;

    //printf("debug: %ld %ld\n", q_len, t_len);
    //printf("%s\n", query_seq);

    max_d = (int) (0.3*(q_len + t_len));
    band_size = band_tolerance * 2;
    V = calloc( max_d * 2 + 1, sizeof(seq_coor_t) );
    U = calloc( max_d * 2 + 1, sizeof(seq_coor_t) );
    k_offset = max_d;

    // We should probably use a hashmap to store the backtracking information to save memory allocation time
    // This O(MN) block allocation scheme is convenient for now, but it is slower for very long sequences
    d_path = calloc( max_d * (band_size + 1 ) * 2 + 1, sizeof(d_path_data2) );

    aln_path = calloc( q_len + t_len + 1, sizeof(path_point) );

    align_rtn = calloc( 1, sizeof(alignment));
    align_rtn->t_aln_str = calloc( q_len + t_len + 1, sizeof(char));
    align_rtn->q_aln_str = calloc( q_len + t_len + 1, sizeof(char));
    align_rtn->aln_str_size = 0;
    align_rtn->aln_q_s = 0;
    align_rtn->aln_q_e = 0;
    align_rtn->aln_t_s = 0;
    align_rtn->aln_t_e = 0;
    //printf("max_d: %lu, band_size: %lu\n", max_d, band_size);
    best_m = -1;
    min_k = 0;
    max_k = 0;
    d_path_idx = 0;
    max_idx = 0;
    for (d = 0; d < max_d; d ++ ) {
        if (max_k - min_k > band_size) {
            break;
        }

        for (k = min_k; k <= max_k;  k += 2) {
            if ( k == min_k || k != max_k && V[ k - 1 + k_offset ] < V[ k + 1 + k_offset] ) {
                pre_k = k + 1;
                x = V[ k + 1 + k_offset];
            } else {
                pre_k = k - 1;
                x = V[ k - 1 + k_offset] + 1;
            }
            y = x - k;
            d_path[d_path_idx].d = d;
            d_path[d_path_idx].k = k;
            d_path[d_path_idx].x1 = x;
            d_path[d_path_idx].y1 = y;

            while ( x < q_len && y < t_len && query_seq[x] == target_seq[y] ){
                x++;
                y++;
            }

            d_path[d_path_idx].x2 = x;
            d_path[d_path_idx].y2 = y;
            d_path[d_path_idx].pre_k = pre_k;
            d_path_idx ++;

            V[ k + k_offset ] = x;
            U[ k + k_offset ] = x + y;

            if ( x + y > best_m) {
                best_m = x + y;
            }

            if ( x >= q_len || y >= t_len) {
                aligned = true;
                max_idx = d_path_idx;
                break;
            }
        }

        // For banding
        new_min_k = max_k;
        new_max_k = min_k;

        for (k2 = min_k; k2 <= max_k;  k2 += 2) {
            if (U[ k2 + k_offset] >= best_m - band_tolerance ) {
                if ( k2 < new_min_k ) {
                    new_min_k = k2;
                }
                if ( k2 > new_max_k ) {
                    new_max_k = k2;
                }
            }
        }

        max_k = new_max_k + 1;
        min_k = new_min_k - 1;

        // For no banding
        // max_k ++;
        // min_k --;

        // For debugging
        // printf("min_max_k,d, %ld %ld %ld\n", min_k, max_k, d);

        if (aligned == true) {
            align_rtn->aln_q_e = x;
            align_rtn->aln_t_e = y;
            align_rtn->dist = d;
            align_rtn->aln_str_size = (x + y + d) / 2;
            align_rtn->aln_q_s = 0;
            align_rtn->aln_t_s = 0;

            d_path_sort(d_path, max_idx);
            //print_d_path(d_path, max_idx);

            if (get_aln_str > 0) {
                cd = d;
                ck = k;
                aln_path_idx = 0;
                while (cd >= 0 && aln_path_idx < q_len + t_len + 1) {
                    d_path_aux = (d_path_data2 *) get_dpath_idx( cd, ck, max_idx, d_path);
                    aln_path[aln_path_idx].x = d_path_aux -> x2;
                    aln_path[aln_path_idx].y = d_path_aux -> y2;
                    aln_path_idx ++;
                    aln_path[aln_path_idx].x = d_path_aux -> x1;
                    aln_path[aln_path_idx].y = d_path_aux -> y1;
                    aln_path_idx ++;
                    ck = d_path_aux -> pre_k;
                    cd -= 1;
                }
                aln_path_idx --;
                cx = aln_path[aln_path_idx].x;
                cy = aln_path[aln_path_idx].y;
                align_rtn->aln_q_s = cx;
                align_rtn->aln_t_s = cy;
                aln_pos = 0;
                while ( aln_path_idx > 0 ) {
                    aln_path_idx --;
                    nx = aln_path[aln_path_idx].x;
                    ny = aln_path[aln_path_idx].y;
                    if (cx == nx && cy == ny){
                        continue;
                    }
                    if (nx == cx && ny != cy){ //advance in y
                        for (i = 0; i < ny - cy; i++) {
                            align_rtn->q_aln_str[aln_pos + i] = '-';
                        }
                        for (i = 0; i < ny - cy; i++) {
                            align_rtn->t_aln_str[aln_pos + i] = target_seq[cy + i];
                        }
                        aln_pos += ny - cy;
                    } else if (nx != cx && ny == cy){ //advance in x
                        for (i = 0; i < nx - cx; i++) {
                            align_rtn->q_aln_str[aln_pos + i] = query_seq[cx + i];
                        }
                        for (i = 0; i < nx - cx; i++) {
                            align_rtn->t_aln_str[aln_pos + i] = '-';
                        }
                        aln_pos += nx - cx;
                    } else {
                        for (i = 0; i < nx - cx; i++) {
                            align_rtn->q_aln_str[aln_pos + i] = query_seq[cx + i];
                        }
                        for (i = 0; i < ny - cy; i++) {
                            align_rtn->t_aln_str[aln_pos + i] = target_seq[cy + i];
                        }
                        aln_pos += ny - cy;
                    }
                    cx = nx;
                    cy = ny;
                }
                align_rtn->aln_str_size = aln_pos;
            }
            break;
        }
    }

    free(V);
    free(U);
    free(d_path);
    free(aln_path);
    return align_rtn;
}

void free_alignment(alignment * aln)
{
    free(aln->q_aln_str);
    free(aln->t_aln_str);
    free(aln);
}

FALCON-0.1.3/src/c/Makefile

DW_align.so: DW_banded.c common.h
	gcc DW_banded.c -O3 -shared -fPIC -o DW_align.so

kmer_lookup.so: kmer_lookup.c common.h
	gcc kmer_lookup.c -O3 -shared -fPIC -o kmer_lookup.so

#falcon: DW_banded.c common.h kmer_lookup.c falcon.c
#	gcc DW_banded.c kmer_lookup.c falcon.c -O4 -o falcon -fPIC

falcon.so: falcon.c common.h DW_banded.c kmer_lookup.c
	gcc DW_banded.c kmer_lookup.c falcon.c -O3 -shared -fPIC -o falcon.so

#falcon2.so: falcon.c common.h DW_banded_2.c kmer_lookup.c
#	gcc DW_banded_2.c kmer_lookup.c falcon.c -O3 -shared -fPIC -o falcon2.so

clean:
	rm falcon *.so

all: DW_align.so kmer_lookup.so falcon.so
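Once `falcon.so` is built by the rule above, it can be smoke-tested from
Python through `ctypes`, much as `falcon_kit.py` does. This is only a sketch:
the signatures follow `generate_consensus` in `falcon.c` and `consensus_data`
in `common.h` (both below), while the parameter values and the synthetic
identical "reads" are my assumptions.

    # Python 2, matching the rest of the package
    import random
    from ctypes import CDLL, POINTER, Structure, c_char_p, c_double, c_uint, c_ulong

    class ConsensusData(Structure):
        # mirrors the `consensus_data` struct in src/c/common.h
        _fields_ = [("sequence", c_char_p), ("eff_cov", POINTER(c_uint))]

    falcon = CDLL("falcon.so")
    falcon.generate_consensus.argtypes = [POINTER(c_char_p), c_uint, c_uint, c_uint,
                                          c_ulong, c_ulong, c_double]
    falcon.generate_consensus.restype = POINTER(ConsensusData)
    falcon.free_consensus_data.argtypes = [POINTER(ConsensusData)]

    random.seed(42)
    seed = "".join(random.choice("ACGT") for _ in range(1000))
    reads = [seed] * 6                     # identical "reads"; real p-reads differ
    seqs = (c_char_p * len(reads))(*reads)

    # min_cov=1, K=8, window=12, threshold=0, min_idt=0.7 are plausible toy values
    cns = falcon.generate_consensus(seqs, len(reads), 1, 8, 12, 0, 0.7)
    print(len(cns.contents.sequence))
    falcon.free_consensus_data(cns)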
FALCON-0.1.3/src/c/Makefile.osx

DW_align.so: DW_banded.c common.h
	gcc DW_banded.c -O3 -shared -fPIC -o DW_align.so -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib

kmer_lookup.so: kmer_lookup.c common.h
	gcc kmer_lookup.c -O3 -shared -fPIC -o kmer_lookup.so -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib

falcon: DW_banded.c common.h kmer_lookup.c falcon.c
	gcc DW_banded.c kmer_lookup.c falcon.c -O4 -o falcon -fPIC -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib

falcon.so: falcon.c common.h DW_banded.c kmer_lookup.c
	gcc DW_banded.c kmer_lookup.c falcon.c -O3 -shared -fPIC -o falcon.so -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/include/ -L/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.8.sdk/usr/lib

all: DW_align.so kmer_lookup.so falcon.so falcon

FALCON-0.1.3/src/c/common.h

/*
 * =====================================================================================
 *
 *       Filename:  common.h
 *
 *    Description:  Common declarations for the code base
 *
 *        Version:  0.1
 *        Created:  07/16/2013 07:46:23 AM
 *       Revision:  none
 *       Compiler:  gcc
 *
 *         Author:  Jason Chin,
 *        Company:
 *
 * =====================================================================================

#################################################################################$$
# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
#
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted (subject to the limitations in the
# disclaimer below) provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above
# copyright notice, this list of conditions and the following
# disclaimer in the documentation and/or other materials provided
# with the distribution.
#
# * Neither the name of Pacific Biosciences nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#################################################################################$$
 */

typedef long int seq_coor_t;

typedef struct {
    seq_coor_t aln_str_size;
    seq_coor_t dist;
    seq_coor_t aln_q_s;
    seq_coor_t aln_q_e;
    seq_coor_t aln_t_s;
    seq_coor_t aln_t_e;
    char * q_aln_str;
    char * t_aln_str;
} alignment;

typedef struct {
    seq_coor_t pre_k;
    seq_coor_t x1;
    seq_coor_t y1;
    seq_coor_t x2;
    seq_coor_t y2;
} d_path_data;

typedef struct {
    seq_coor_t d;
    seq_coor_t k;
    seq_coor_t pre_k;
    seq_coor_t x1;
    seq_coor_t y1;
    seq_coor_t x2;
    seq_coor_t y2;
} d_path_data2;

typedef struct {
    seq_coor_t x;
    seq_coor_t y;
} path_point;

typedef struct {
    seq_coor_t start;
    seq_coor_t last;
    seq_coor_t count;
} kmer_lookup;

typedef unsigned char base;
typedef base * seq_array;
typedef seq_coor_t seq_addr;
typedef seq_addr * seq_addr_array;

typedef struct {
    seq_coor_t count;
    seq_coor_t * query_pos;
    seq_coor_t * target_pos;
} kmer_match;

typedef struct {
    seq_coor_t s1;
    seq_coor_t e1;
    seq_coor_t s2;
    seq_coor_t e2;
    long int score;
} aln_range;

typedef struct {
    char * sequence;
    unsigned int * eff_cov;
} consensus_data;

kmer_lookup * allocate_kmer_lookup(seq_coor_t);
void init_kmer_lookup( kmer_lookup *, seq_coor_t );
void free_kmer_lookup(kmer_lookup *);

seq_array allocate_seq(seq_coor_t);
void init_seq_array( seq_array, seq_coor_t);
void free_seq_array(seq_array);

seq_addr_array allocate_seq_addr(seq_coor_t size);
void free_seq_addr_array(seq_addr_array);

aln_range * find_best_aln_range(kmer_match *, seq_coor_t, seq_coor_t, seq_coor_t);
void free_aln_range( aln_range *);

kmer_match * find_kmer_pos_for_seq( char *, seq_coor_t, unsigned int K, seq_addr_array, kmer_lookup * );
void free_kmer_match( kmer_match * );  /* fixed: was a duplicate free_kmer_lookup declaration */

void add_sequence( seq_coor_t, unsigned int, char *, seq_coor_t, seq_addr_array, seq_array, kmer_lookup *);
void mask_k_mer(seq_coor_t, kmer_lookup *, seq_coor_t);

alignment * align(char *, seq_coor_t, char *, seq_coor_t, seq_coor_t, int);
void free_alignment(alignment *);

void free_consensus_data(consensus_data *);

FALCON-0.1.3/src/c/falcon.c

/*
 * =====================================================================================
 *
 *       Filename:  falcon.c
 *
 *    Description:
 *
 *        Version:  0.1
 *        Created:  07/20/2013 17:00:00
 *       Revision:  none
 *       Compiler:  gcc
 *
 *         Author:  Jason Chin,
 *        Company:
 *
 * =====================================================================================

#################################################################################$$
# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
#
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted (subject to the limitations in the
# disclaimer below) provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above
# copyright notice, this list of conditions and the following
# disclaimer in the documentation and/or other materials provided
# with the distribution.
#
# * Neither the name of Pacific Biosciences nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#################################################################################$$
 */

/* Header names reconstructed from usage (calloc/qsort, printf/fflush, strlen/strcmp, UINT_MAX). */
#include <stdlib.h>
#include <stdio.h>
#include <string.h>
#include <limits.h>
#include <stdbool.h>
#include "common.h"

typedef struct {
    seq_coor_t t_pos;
    unsigned int delta;
    char q_base;
    unsigned int q_id;
} align_tag_t;

typedef struct {
    seq_coor_t len;
    align_tag_t * align_tags;
} align_tags_t;

typedef struct {
    seq_coor_t len;
    char * name;
    char * seq;
} consensusn_seq_t;

align_tags_t * get_align_tags( char * aln_q_seq,
                               char * aln_t_seq,
                               seq_coor_t aln_seq_len,
                               aln_range * range,
                               unsigned long q_id,
                               unsigned long local_match_count_window,
                               unsigned long local_match_count_threshold,
                               seq_coor_t t_offset)
{
#define LONGEST_INDEL_ALLOWED 6
    char q_base;
    char t_base;
    align_tags_t * tags;
    seq_coor_t i, j, jj, k;
    seq_coor_t match_count;

    tags = calloc( 1, sizeof(align_tags_t) );
    tags->len = aln_seq_len;
    tags->align_tags = calloc( aln_seq_len + 1, sizeof(align_tag_t) );
    i = range->s1 - 1;
    j = range->s2 - 1;
    match_count = 0;
    jj = 0;
    for (k = 0; k < local_match_count_window && k < aln_seq_len; k++) {
        if (aln_q_seq[k] == aln_t_seq[k] ) {
            match_count ++;
        }
    }
    for (k = 0; k < aln_seq_len; k++) {
        if (aln_q_seq[k] != '-') {
            i ++;
            jj ++;
        }
        if (aln_t_seq[k] != '-') {
            j ++;
            jj = 0;
        }

        if (local_match_count_threshold > 0) {
            if (k < aln_seq_len - local_match_count_window && aln_q_seq[k + local_match_count_window] == aln_t_seq[k + local_match_count_window] ) {
                match_count ++;
            }
            if (k > local_match_count_window && aln_q_seq[k - local_match_count_window] == aln_t_seq[k - local_match_count_window] ) {
                match_count --;
            }
            if (match_count < 0) {
                match_count = 0;
            }
        }

        if ( j + t_offset >= 0) {
            (tags->align_tags[k]).t_pos = j + t_offset;
            (tags->align_tags[k]).delta = jj;
            if (local_match_count_threshold > 0 && jj == 0 && match_count < local_match_count_threshold) {
                (tags->align_tags[k]).q_base = '*';
            } else {
                (tags->align_tags[k]).q_base = aln_q_seq[k];
            }
            (tags->align_tags[k]).q_id = q_id;
        }
        //if (jj > LONGEST_INDEL_ALLOWED) {
        //    break;
        //}
    }
    // sentinel at the end
    //k = aln_seq_len;
    tags->len = k;
    (tags->align_tags[k]).t_pos = -1;
    (tags->align_tags[k]).delta = -1;
    (tags->align_tags[k]).q_base = ' ';
    (tags->align_tags[k]).q_id = UINT_MAX;
    return tags;
}

void free_align_tags( align_tags_t * tags)
{
    free( tags->align_tags );
    free( tags );
}

int compare_tags(const void * a, const void * b)
{
    const align_tag_t * arg1 = a;
    const align_tag_t * arg2 = b;
    if (arg1->delta - arg2->delta == 0) {
        return arg1->q_base - arg2->q_base;
    } else {
        return arg1->delta - arg2->delta;
    }
}

consensus_data * get_cns_from_align_tags( align_tags_t ** tag_seqs, unsigned long n_tag_seqs, unsigned t_len, unsigned min_cov )
{
    seq_coor_t i, j, t_pos, tmp_pos;
    unsigned int * coverage;
    unsigned int * local_nbase;
    unsigned int * aux_index;
    unsigned int cur_delta;
    unsigned int counter[5] = {0, 0, 0, 0, 0};
    unsigned int k;
    unsigned int max_count;
    unsigned int max_count_index;
    seq_coor_t consensus_index;
    seq_coor_t c_start, c_end, max_start;
    unsigned int cov_score, max_cov_score;
    consensus_data * consensus;
    //char * consensus;
    align_tag_t ** tag_seq_index;

    coverage = calloc( t_len, sizeof(unsigned int) );
    local_nbase = calloc( t_len, sizeof(unsigned int) );
    aux_index = calloc( t_len, sizeof(unsigned int) );
    tag_seq_index = calloc( t_len, sizeof(align_tag_t *) );

    for (i = 0; i < n_tag_seqs; i++) {
        for (j = 0; j < tag_seqs[i]->len; j++) {
            if (tag_seqs[i]->align_tags[j].delta == 0 && tag_seqs[i]->align_tags[j].q_base != '*') {
                t_pos = tag_seqs[i]->align_tags[j].t_pos;
                coverage[ t_pos ] ++;
            }
            local_nbase[ tag_seqs[i]->align_tags[j].t_pos ] ++;
        }
    }

    for (i = 0; i < t_len; i++) {
        tag_seq_index[i] = calloc( local_nbase[i] + 1, sizeof(align_tag_t) );
    }

    for (i = 0; i < n_tag_seqs; i++) {
        for (j = 0; j < tag_seqs[i]->len; j++) {
            t_pos = tag_seqs[i]->align_tags[j].t_pos;
            tag_seq_index[ t_pos ][ aux_index[ t_pos ] ] = tag_seqs[i]->align_tags[j];
            aux_index[ t_pos ] ++;
        }
    }

    consensus_index = 0;
    consensus = calloc( 1, sizeof(consensus_data) );
    consensus->sequence = calloc( t_len * 2 + 1, sizeof(char) );
    consensus->eff_cov = calloc( t_len * 2 + 1, sizeof(unsigned int) );

    for (i = 0; i < t_len; i++) {
        qsort(tag_seq_index[i], local_nbase[i], sizeof(align_tag_t), compare_tags);
        cur_delta = 0;
        for (j = 0; j <= local_nbase[i]; j++) {
            max_count = 0;
            max_count_index = 0;
            if (j == local_nbase[i] || tag_seq_index[i][j].delta != cur_delta) {
                for (k = 0; k < 5; k ++) {
                    if (counter[k] > max_count) {
                        max_count = counter[k];
                        max_count_index = k;
                    }
                    //reset counter
                    counter[k] = 0;
                    cur_delta = tag_seq_index[i][j].delta;
                }
                if (max_count > coverage[i] * 0.5) {
                    switch (max_count_index) {
                        case 0:
                            if (coverage[i] < min_cov + 1) {
                                consensus->sequence[consensus_index] = 'a';
                            } else {
                                consensus->sequence[consensus_index] = 'A';
                            }
                            consensus->eff_cov[consensus_index] = coverage[i];
                            consensus_index ++;
                            break;
                        case 1:
                            if (coverage[i] < min_cov + 1) {
                                consensus->sequence[consensus_index] = 'c';
                            } else {
                                consensus->sequence[consensus_index] = 'C';
                            }
                            consensus->eff_cov[consensus_index] = coverage[i];
                            consensus_index ++;
                            break;
                        case 2:
                            if (coverage[i] < min_cov + 1) {
                                consensus->sequence[consensus_index] = 'g';
                            } else {
                                consensus->sequence[consensus_index] = 'G';
                            }
                            consensus->eff_cov[consensus_index] = coverage[i];
                            consensus_index ++;
                            break;
                        case 3:
                            if (coverage[i] < min_cov + 1) {
                                consensus->sequence[consensus_index] = 't';
                            } else {
                                consensus->sequence[consensus_index] = 'T';
                            }
                            consensus->eff_cov[consensus_index] = coverage[i];
                            consensus_index ++;
                            break;
                        default:
                            break;
                    }
                    //printf("c:%c\n", consensus[consensus_index-1]);
                }
            }
            if (j == local_nbase[i]) break;
            switch (tag_seq_index[i][j].q_base) {
                case 'A': counter[0] ++; break;
                case 'C': counter[1] ++; break;
                case 'G': counter[2] ++; break;
                case 'T': counter[3] ++; break;
                case '-': counter[4] ++; break;
                default: break;
            }
            /*
            printf("%ld %ld %ld %u %c %u\n", i, j, tag_seq_index[i][j].t_pos,
                                             tag_seq_index[i][j].delta,
                                             tag_seq_index[i][j].q_base,
                                             tag_seq_index[i][j].q_id);
            */
        }
    }
    //printf("%s\n", consensus);
    for (i = 0; i < t_len; i++) {
        free(tag_seq_index[i]);
    }
    free(tag_seq_index);
    free(aux_index);
    free(coverage);
    free(local_nbase);
    return consensus;
}

//const unsigned int K = 8;

consensus_data * generate_consensus( char ** input_seq,
                                     unsigned int n_seq,
                                     unsigned min_cov,
                                     unsigned K,
                                     unsigned long local_match_count_window,
local_match_count_threshold, double min_idt) { unsigned int i, j, k; unsigned int seq_count; unsigned int aligned_seq_count; kmer_lookup * lk_ptr; seq_array sa_ptr; seq_addr_array sda_ptr; kmer_match * kmer_match_ptr; aln_range * arange_; aln_range * arange; alignment * aln; align_tags_t * tags; align_tags_t ** tags_list; //char * consensus; consensus_data * consensus; double max_diff; max_diff = 1.0 - min_idt; seq_count = n_seq; //for (j=0; j < seq_count; j++) { // printf("seq_len: %u %u\n", j, strlen(input_seq[j])); //}; fflush(stdout); tags_list = calloc( seq_count, sizeof(align_tags_t *) ); lk_ptr = allocate_kmer_lookup( 1 << (K * 2) ); sa_ptr = allocate_seq( (seq_coor_t) strlen( input_seq[0]) ); sda_ptr = allocate_seq_addr( (seq_coor_t) strlen( input_seq[0]) ); add_sequence( 0, K, input_seq[0], strlen(input_seq[0]), sda_ptr, sa_ptr, lk_ptr); //mask_k_mer(1 << (K * 2), lk_ptr, 16); aligned_seq_count = 0; for (j=1; j < seq_count; j++) { //printf("seq_len: %ld %u\n", j, strlen(input_seq[j])); kmer_match_ptr = find_kmer_pos_for_seq(input_seq[j], strlen(input_seq[j]), K, sda_ptr, lk_ptr); #define INDEL_ALLOWENCE_0 6 arange = find_best_aln_range(kmer_match_ptr, K, K * INDEL_ALLOWENCE_0, 5); // narrow band to avoid aligning through big indels //printf("1:%ld %ld %ld %ld\n", arange_->s1, arange_->e1, arange_->s2, arange_->e2); //arange = find_best_aln_range2(kmer_match_ptr, K, K * INDEL_ALLOWENCE_0, 5); // narrow band to avoid aligning through big indels //printf("2:%ld %ld %ld %ld\n\n", arange->s1, arange->e1, arange->s2, arange->e2); #define INDEL_ALLOWENCE_1 400 if (arange->e1 - arange->s1 < 100 || arange->e2 - arange->s2 < 100 || abs( (arange->e1 - arange->s1 ) - (arange->e2 - arange->s2) ) > INDEL_ALLOWENCE_1) { free_kmer_match( kmer_match_ptr); free_aln_range(arange); continue; } //printf("%ld %s\n", strlen(input_seq[j]), input_seq[j]); //printf("%ld %s\n\n", strlen(input_seq[0]), input_seq[0]); #define INDEL_ALLOWENCE_2 150 aln = align(input_seq[j]+arange->s1, arange->e1 - arange->s1 , input_seq[0]+arange->s2, arange->e2 - arange->s2 , INDEL_ALLOWENCE_2, 1); if (aln->aln_str_size > 500 && ((double) aln->dist / (double) aln->aln_str_size) < max_diff) { tags_list[aligned_seq_count] = get_align_tags( aln->q_aln_str, aln->t_aln_str, aln->aln_str_size, arange, j, local_match_count_window, local_match_count_threshold, 0); aligned_seq_count ++; } /*** for (k = 0; k < tags_list[j]->len; k++) { printf("%ld %d %c\n", tags_list[j]->align_tags[k].t_pos, tags_list[j]->align_tags[k].delta, tags_list[j]->align_tags[k].q_base); } ***/ free_aln_range(arange); free_alignment(aln); free_kmer_match( kmer_match_ptr); } consensus = get_cns_from_align_tags( tags_list, aligned_seq_count, strlen(input_seq[0]), min_cov ); //free(consensus); free_seq_addr_array(sda_ptr); free_seq_array(sa_ptr); free_kmer_lookup(lk_ptr); for (j=0; j < aligned_seq_count; j++) { free_align_tags(tags_list[j]); } free(tags_list); return consensus; } consensus_data * generate_utg_consensus( char ** input_seq, seq_coor_t *offset, unsigned int n_seq, unsigned min_cov, unsigned K, double min_idt) { unsigned int i, j, k; unsigned int seq_count; unsigned int aligned_seq_count; aln_range * arange; alignment * aln; align_tags_t * tags; align_tags_t ** tags_list; //char * consensus; consensus_data * consensus; double max_diff; seq_coor_t utg_len; seq_coor_t r_len; max_diff = 1.0 - min_idt; seq_count = n_seq; /*** for (j=0; j < seq_count; j++) { printf("seq_len: %u %u\n", j, strlen(input_seq[j])); }; fflush(stdout); ***/ tags_list = calloc( 
seq_count+1, sizeof(align_tags_t *) ); utg_len = strlen(input_seq[0]); aligned_seq_count = 0; arange = calloc( 1, sizeof(aln_range) ); arange->s1 = 0; arange->e1 = strlen(input_seq[0]); arange->s2 = 0; arange->e2 = strlen(input_seq[0]); tags_list[aligned_seq_count] = get_align_tags( input_seq[0], input_seq[0], strlen(input_seq[0]), arange, 0, 12, 0, 0); aligned_seq_count += 1; for (j=1; j < seq_count; j++) { arange->s1 = 0; arange->e1 = strlen(input_seq[j])-1; arange->s2 = 0; arange->e2 = strlen(input_seq[j])-1; r_len = strlen(input_seq[j]); //printf("seq_len: %u %u\n", j, r_len); if ( offset[j] < 0) { if ((r_len + offset[j]) < 128) { continue; } if ( r_len + offset[j] < utg_len ) { //printf("1: %ld %u %u\n", offset[j], r_len, utg_len); aln = align(input_seq[j] - offset[j], r_len + offset[j] , input_seq[0], r_len + offset[j] , 500, 1); } else { //printf("2: %ld %u %u\n", offset[j], r_len, utg_len); aln = align(input_seq[j] - offset[j], utg_len , input_seq[0], utg_len , 500, 1); } offset[j] = 0; } else { if ( offset[j] > utg_len - 128) { continue; } if ( offset[j] + r_len > utg_len ) { //printf("3: %ld %u %u\n", offset[j], r_len, utg_len); aln = align(input_seq[j], utg_len - offset[j] , input_seq[0]+offset[j], utg_len - offset[j], 500, 1); } else { //printf("4: %ld %u %u\n", offset[j], r_len, utg_len); aln = align(input_seq[j], r_len , input_seq[0]+offset[j], r_len , 500, 1); } } if (aln->aln_str_size > 500 && ((double) aln->dist / (double) aln->aln_str_size) < max_diff) { tags_list[aligned_seq_count] = get_align_tags( aln->q_aln_str, aln->t_aln_str, aln->aln_str_size, arange, j, 12, 0, offset[j]); aligned_seq_count ++; } free_alignment(aln); } free_aln_range(arange); consensus = get_cns_from_align_tags( tags_list, aligned_seq_count, utg_len, 0 ); //free(consensus); for (j=0; j < aligned_seq_count; j++) { free_align_tags(tags_list[j]); } free(tags_list); return consensus; } void free_consensus_data( consensus_data * consensus ){ free(consensus->sequence); free(consensus->eff_cov); free(consensus); } /*** void main() { unsigned int j; char small_buffer[1024]; char big_buffer[65536]; char ** input_seq; char ** seq_id; int seq_count; char * consensus; input_seq = calloc( 501, sizeof(char *)); seq_id = calloc( 501, sizeof(char *)); while(1) { seq_count = 0; while (1) { scanf("%s", small_buffer); seq_id[seq_count] = calloc( strlen(small_buffer) + 1, sizeof(char)); strcpy(seq_id[seq_count], small_buffer); scanf("%s", big_buffer); input_seq[seq_count] = calloc( strlen(big_buffer) + 1 , sizeof(char)); strcpy(input_seq[seq_count], big_buffer); if (strcmp(seq_id[seq_count], "+") == 0) { break; } if (strcmp(seq_id[seq_count], "-") == 0) { break; } //printf("%s\n", seq_id[seq_count]); seq_count += 1; if (seq_count > 500) break; } //printf("sc: %d\n", seq_count); if (seq_count < 10 && strcmp(seq_id[seq_count], "-") != 0 ) continue; if (seq_count < 10 && strcmp(seq_id[seq_count], "-") == 0 ) break; consensus = generate_consensus(input_seq, seq_count, 8, 8); if (strlen(consensus) > 500) { printf(">%s\n%s\n", seq_id[0], consensus); } fflush(stdout); free(consensus); for (j=0; j < seq_count; j++) { free(seq_id[j]); free(input_seq[j]); }; } for (j=0; j < seq_count; j++) { free(seq_id[j]); free(input_seq[j]); }; free(seq_id); free(input_seq); } ***/ FALCON-0.1.3/src/c/kmer_lookup.c000077500000000000000000000436711237512144000161760ustar00rootroot00000000000000/* * ===================================================================================== * * Filename: kmer_count.c * * Description: * * Version: 0.1 
* Created: 07/20/2013 17:00:00 * Revision: none * Compiler: gcc * * Author: Jason Chin, * Company: * * ===================================================================================== #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
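
 Typical call sequence for this module (an illustrative sketch only, not
 part of the original source; the names target_seq/query_seq and the
 lengths and parameter values are hypothetical):

     unsigned int K = 8;
     kmer_lookup * lk = allocate_kmer_lookup( 1 << (K * 2) );
     seq_array sa = allocate_seq( t_len );
     seq_addr_array sda = allocate_seq_addr( t_len );
     add_sequence( 0, K, target_seq, t_len, sda, sa, lk );

     kmer_match * km = find_kmer_pos_for_seq( query_seq, q_len, K, sda, lk );
     aln_range * ar = find_best_aln_range( km, K, K * 6, 5 );
     // ar->s1..ar->e1 (query) and ar->s2..ar->e2 (target) bound the match band
     free_aln_range( ar );
     free_kmer_match( km );
     free_seq_addr_array( sda );
     free_seq_array( sa );
     free_kmer_lookup( lk );
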
#################################################################################$$
*/

#include <stdio.h>
#include <stdlib.h>
#include <limits.h>
#include "common.h"

const unsigned int KMERMATCHINC = 10000;

int compare_seq_coor(const void * a, const void * b) {
    const seq_coor_t * arg1 = a;
    const seq_coor_t * arg2 = b;
    return  (* arg1) - (* arg2);
}

kmer_lookup * allocate_kmer_lookup ( seq_coor_t size ) {
    kmer_lookup * kl;
    seq_coor_t i;
    //printf("%lu is allocated for kmer lookup\n", size);
    kl = (kmer_lookup *) malloc( size * sizeof(kmer_lookup) );
    init_kmer_lookup( kl, size);
    return kl;
}

void init_kmer_lookup ( kmer_lookup * kl, seq_coor_t size ) {
    seq_coor_t i;
    //printf("%lu is allocated for kmer lookup\n", size);
    for (i = 0; i < size; i++) {
        kl[i].start = LONG_MAX;
        kl[i].last = LONG_MAX;
        kl[i].count = 0;
    }
}

void free_kmer_lookup( kmer_lookup * ptr ) {
    free(ptr);
}

seq_array allocate_seq(seq_coor_t size) {
    seq_array sa;
    sa = (seq_array) malloc( size * sizeof(base) );
    init_seq_array( sa, size);
    return sa;
}

void init_seq_array( seq_array sa, seq_coor_t size) {
    seq_coor_t i;
    for (i = 0; i < size; i++) {
        sa[i] = 0xff;
    }
}

void free_seq_array( seq_array sa) {
    free(sa);
}

seq_addr_array allocate_seq_addr(seq_coor_t size) {
    return (seq_addr_array) calloc( size, sizeof(seq_coor_t) );
}

void free_seq_addr_array( seq_addr_array sda) {
    free(sda);
}

/* pack K consecutive 2-bit bases into one integer key */
seq_coor_t get_kmer_bitvector(base * sa, unsigned int K) {
    unsigned int i;
    seq_coor_t kmer_bv = 0;
    for (i = 0; i < K; i++) {
        kmer_bv <<= 2;
        kmer_bv |= (seq_coor_t) sa[i];
    }
    return kmer_bv;
}

/* Add a sequence to the lookup table.  Every occurrence of a k-mer is
 * chained through sda[]: lk[kmer].start is the first occurrence,
 * lk[kmer].last the most recent one, and sda[p] points from occurrence p
 * forward to the next one, which is how find_kmer_pos_for_seq() below
 * walks the hit list. */
void add_sequence ( seq_coor_t start,
                    unsigned int K,
                    char * seq,
                    seq_coor_t seq_len,
                    seq_addr_array sda,
                    seq_array sa,
                    kmer_lookup * lk ) {
    seq_coor_t i;
    seq_coor_t kmer_bv;

    for (i = 0; i < seq_len; i++) {
        switch ( seq[i] ) {
            case 'A': sa[ start + i ] = 0; break;
            case 'C': sa[ start + i ] = 1; break;
            case 'G': sa[ start + i ] = 2; break;
            case 'T': sa[ start + i ] = 3;
        }
    }

    for (i = 0; i < seq_len - K; i++) {
        kmer_bv = get_kmer_bitvector( sa + start + i, K );
        if (lk[kmer_bv].start == LONG_MAX) {
            lk[kmer_bv].start = start + i;
            lk[kmer_bv].last = start + i;
        } else {
            sda[ lk[kmer_bv].last ] = start + i;
            lk[kmer_bv].last = start + i;
        }
        lk[kmer_bv].count += 1;
    }
}

/* drop k-mers that occur more than threshold times */
void mask_k_mer(seq_coor_t size, kmer_lookup * kl, seq_coor_t threshold) {
    seq_coor_t i;
    for (i = 0; i < size; i++) {
        if (kl[i].count > threshold) {
            kl[i].start = LONG_MAX;
            kl[i].last = LONG_MAX;
            //kl[i].count = 0;
        }
    }
}

kmer_match * find_kmer_pos_for_seq( char * seq, seq_coor_t seq_len, unsigned int K,
                                    seq_addr_array sda,
                                    kmer_lookup * lk) {
    seq_coor_t i;
    seq_coor_t kmer_bv;
    seq_coor_t kmer_mask;
    seq_coor_t kmer_pos;
    seq_coor_t next_kmer_pos;
    unsigned int half_K;
    seq_coor_t kmer_match_rtn_allocation_size = KMERMATCHINC;
    kmer_match * kmer_match_rtn;
    base * sa;

    kmer_match_rtn = (kmer_match *) malloc( sizeof(kmer_match) );
    kmer_match_rtn->count = 0;
    kmer_match_rtn->query_pos = (seq_coor_t *) calloc( kmer_match_rtn_allocation_size, sizeof( seq_coor_t ) );
    kmer_match_rtn->target_pos = (seq_coor_t *) calloc( kmer_match_rtn_allocation_size, sizeof( seq_coor_t ) );

    sa = calloc( seq_len, sizeof(base) );

    kmer_mask = 0;
    for (i = 0; i < K; i++) {
        kmer_mask <<= 2;
        kmer_mask |= 0x00000003;
    }

    for (i = 0; i < seq_len; i++) {
        switch ( seq[i] ) {
            case 'A': sa[ i ] = 0; break;
            case 'C': sa[ i ] = 1; break;
            case 'G': sa[ i ] = 2; break;
            case 'T': sa[ i ] = 3;
        }
    }

    kmer_bv = get_kmer_bitvector(sa, K);
    half_K = K >> 1;
    for (i = 0; i < seq_len - K; i += half_K) {
        kmer_bv = get_kmer_bitvector(sa + i, K);
        if (lk[kmer_bv].start == LONG_MAX) {  //for high count k-mers
            continue;
        }
        kmer_pos = lk[ kmer_bv ].start;
        next_kmer_pos = sda[ kmer_pos ];
        kmer_match_rtn->query_pos[ kmer_match_rtn->count ] = i;
        kmer_match_rtn->target_pos[ kmer_match_rtn->count ] = kmer_pos;
        kmer_match_rtn->count += 1;
        if (kmer_match_rtn->count > kmer_match_rtn_allocation_size - 1000) {
            kmer_match_rtn_allocation_size += KMERMATCHINC;
            kmer_match_rtn->query_pos = (seq_coor_t *) realloc( kmer_match_rtn->query_pos,
                                                                kmer_match_rtn_allocation_size * sizeof(seq_coor_t) );
            kmer_match_rtn->target_pos = (seq_coor_t *) realloc( kmer_match_rtn->target_pos,
                                                                 kmer_match_rtn_allocation_size * sizeof(seq_coor_t) );
        }
        while ( next_kmer_pos > kmer_pos ){
            kmer_pos = next_kmer_pos;
            next_kmer_pos = sda[ kmer_pos ];
            kmer_match_rtn->query_pos[ kmer_match_rtn->count ] = i;
            kmer_match_rtn->target_pos[ kmer_match_rtn->count ] = kmer_pos;
            kmer_match_rtn->count += 1;
            if (kmer_match_rtn->count > kmer_match_rtn_allocation_size - 1000) {
                kmer_match_rtn_allocation_size += KMERMATCHINC;
                kmer_match_rtn->query_pos = (seq_coor_t *) realloc( kmer_match_rtn->query_pos,
                                                                    kmer_match_rtn_allocation_size * sizeof(seq_coor_t) );
                kmer_match_rtn->target_pos = (seq_coor_t *) realloc( kmer_match_rtn->target_pos,
                                                                     kmer_match_rtn_allocation_size * sizeof(seq_coor_t) );
            }
        }
    }
    free(sa);
    return kmer_match_rtn;
}

void free_kmer_match( kmer_match * ptr) {
    free(ptr->query_pos);
    free(ptr->target_pos);
    free(ptr);
}

aln_range* find_best_aln_range(kmer_match * km_ptr,
                               seq_coor_t K,
                               seq_coor_t bin_size,
                               seq_coor_t count_th) {
    seq_coor_t i;
    seq_coor_t j;
    seq_coor_t q_min, q_max, t_min, t_max;
    seq_coor_t * d_count;
    seq_coor_t * q_coor;
    seq_coor_t *
t_coor; aln_range * arange; long int d, d_min, d_max; long int cur_score; long int max_score; long int max_k_mer_count; long int max_k_mer_bin; seq_coor_t cur_start; seq_coor_t cur_pos; seq_coor_t max_start; seq_coor_t max_end; seq_coor_t kmer_dist; arange = calloc(1 , sizeof(aln_range)); q_min = LONG_MAX; q_max = 0; t_min = LONG_MAX; t_max = 0; d_min = LONG_MAX; d_max = LONG_MIN; for (i = 0; i < km_ptr->count; i++ ) { if ( km_ptr -> query_pos[i] < q_min) { q_min = km_ptr->query_pos[i]; } if ( km_ptr -> query_pos[i] > q_max) { q_max = km_ptr->query_pos[i]; } if ( km_ptr -> target_pos[i] < t_min) { t_min = km_ptr->target_pos[i]; } if ( km_ptr -> query_pos[i] > t_max) { t_max = km_ptr->target_pos[i]; } d = (long int) km_ptr->query_pos[i] - (long int) km_ptr->target_pos[i]; if ( d < d_min ) { d_min = d; } if ( d > d_max ) { d_max = d; } } //printf("%lu %ld %ld\n" , km_ptr->count, d_min, d_max); d_count = calloc( (d_max - d_min)/bin_size + 1, sizeof(seq_coor_t) ); q_coor = calloc( km_ptr->count, sizeof(seq_coor_t) ); t_coor = calloc( km_ptr->count, sizeof(seq_coor_t) ); for (i = 0; i < km_ptr->count; i++ ) { d = (long int) (km_ptr->query_pos[i]) - (long int) (km_ptr->target_pos[i]); d_count[ (d - d_min)/ (long int) bin_size ] += 1; q_coor[i] = LONG_MAX; t_coor[i] = LONG_MAX; } j = 0; max_k_mer_count = 0; max_k_mer_bin = LONG_MAX; for (i = 0; i < km_ptr->count; i++ ) { d = (long int) (km_ptr->query_pos[i]) - (long int) (km_ptr->target_pos[i]); if ( d_count[ (d - d_min)/ (long int) bin_size ] > max_k_mer_count) { max_k_mer_count = d_count[ (d - d_min)/ (long int) bin_size ]; max_k_mer_bin = (d - d_min)/ (long int) bin_size; } } //printf("k_mer: %lu %lu\n" , max_k_mer_count, max_k_mer_bin); if ( max_k_mer_bin != LONG_MAX && max_k_mer_count > count_th ) { for (i = 0; i < km_ptr->count; i++ ) { d = (long int) (km_ptr->query_pos[i]) - (long int) (km_ptr->target_pos[i]); if ( abs( ( (d - d_min)/ (long int) bin_size ) - max_k_mer_bin ) > 5 ) { continue; } if (d_count[ (d - d_min)/ (long int) bin_size ] > count_th) { q_coor[j] = km_ptr->query_pos[i]; t_coor[j] = km_ptr->target_pos[i]; //printf("d_count: %lu %lu\n" ,i, d_count[(d - d_min)/ (long int) bin_size]); //printf("coor: %lu %lu\n" , q_coor[j], t_coor[j]); j ++; } } } if (j > 1) { arange->s1 = q_coor[0]; arange->e1 = q_coor[0]; arange->s2 = t_coor[0]; arange->e2 = t_coor[0]; arange->score = 0; max_score = 0; cur_score = 0; cur_start = 0; for (i = 1; i < j; i++) { cur_score += 32 - (q_coor[i] - q_coor[i-1]); //printf("deltaD, %lu %ld\n", q_coor[i] - q_coor[i-1], cur_score); if (cur_score < 0) { cur_score = 0; cur_start = i; } else if (cur_score > max_score) { arange->s1 = q_coor[cur_start]; arange->s2 = t_coor[cur_start]; arange->e1 = q_coor[i]; arange->e2 = t_coor[i]; max_score = cur_score; arange->score = max_score; //printf("%lu %lu %lu %lu\n", arange.s1, arange.e1, arange.s2, arange.e2); } } } else { arange->s1 = 0; arange->e1 = 0; arange->s2 = 0; arange->e2 = 0; arange->score = 0; } // printf("free\n"); free(d_count); free(q_coor); free(t_coor); return arange; } aln_range* find_best_aln_range2(kmer_match * km_ptr, seq_coor_t K, seq_coor_t bin_width, seq_coor_t count_th) { seq_coor_t * d_coor; seq_coor_t * hit_score; seq_coor_t * hit_count; seq_coor_t * last_hit; seq_coor_t max_q, max_t; seq_coor_t s, e, max_s, max_e, max_span, d_s, d_e, delta, d_len; seq_coor_t px, py, cx, cy; seq_coor_t max_hit_idx; seq_coor_t max_hit_score, max_hit_count; seq_coor_t i, j; seq_coor_t candidate_idx, max_d, d; aln_range * arange; arange = calloc(1 , 
sizeof(aln_range)); d_coor = calloc( km_ptr->count, sizeof(seq_coor_t) ); max_q = -1; max_t = -1; for (i = 0; i < km_ptr->count; i++ ) { d_coor[i] = km_ptr->query_pos[i] - km_ptr->target_pos[i]; max_q = max_q > km_ptr->query_pos[i] ? max_q : km_ptr->query_pos[i]; max_t = max_t > km_ptr->target_pos[i] ? max_q : km_ptr->target_pos[i]; } qsort(d_coor, km_ptr->count, sizeof(seq_coor_t), compare_seq_coor); s = 0; e = 0; max_s = -1; max_e = -1; max_span = -1; delta = (long int) ( 0.05 * ( max_q + max_t ) ); d_len = km_ptr->count; d_s = -1; d_e = -1; while (1) { d_s = d_coor[s]; d_e = d_coor[e]; while (d_e < d_s + delta && e < d_len-1) { e += 1; d_e = d_coor[e]; } if ( max_span == -1 || e - s > max_span ) { max_span = e - s; max_s = s; max_e = e; } s += 1; if (s == d_len || e == d_len) { break; } } if (max_s == -1 || max_e == -1 || max_e - max_s < 32) { arange->s1 = 0; arange->e1 = 0; arange->s2 = 0; arange->e2 = 0; arange->score = 0; free(d_coor); return arange; } last_hit = calloc( km_ptr->count, sizeof(seq_coor_t) ); hit_score = calloc( km_ptr->count, sizeof(seq_coor_t) ); hit_count = calloc( km_ptr->count, sizeof(seq_coor_t) ); for (i = 0; i < km_ptr->count; i++ ) { last_hit[i] = -1; hit_score[i] = 0; hit_count[i] = 0; } max_hit_idx = -1; max_hit_score = 0; for (i = 0; i < km_ptr->count; i ++) { cx = km_ptr->query_pos[i]; cy = km_ptr->target_pos[i]; d = cx - cy; if ( d < d_coor[max_s] || d > d_coor[max_e] ) continue; j = i - 1; candidate_idx = -1; max_d = 65535; while (1) { if ( j < 0 ) break; px = km_ptr->query_pos[j]; py = km_ptr->target_pos[j]; d = px - py; if ( d < d_coor[max_s] || d > d_coor[max_e] ) { j--; continue; } if (cx - px > 320) break; //the number here controling how big alignment gap to be considered if (cy > py && cx - px + cy - py < max_d && cy - py <= 320 ) { max_d = cx - px + cy - py; candidate_idx = j; } j--; } if (candidate_idx != -1) { last_hit[i] = candidate_idx; hit_score[i] = hit_score[candidate_idx] + (64 - max_d); hit_count[i] = hit_count[candidate_idx] + 1; if (hit_score[i] < 0) { hit_score[i] = 0; hit_count[i] = 0; } } else { hit_score[i] = 0; hit_count[i] = 0; } if (hit_score[i] > max_hit_score) { max_hit_score = hit_score[i]; max_hit_count = hit_count[i]; max_hit_idx = i; } } if (max_hit_idx == -1) { arange->s1 = 0; arange->e1 = 0; arange->s2 = 0; arange->e2 = 0; arange->score = 0; free(d_coor); free(last_hit); free(hit_score); free(hit_count); return arange; } arange->score = max_hit_count + 1; arange->e1 = km_ptr->query_pos[max_hit_idx]; arange->e2 = km_ptr->target_pos[max_hit_idx]; i = max_hit_idx; while (last_hit[i] != -1) { i = last_hit[i]; } arange->s1 = km_ptr->query_pos[i]; arange->s2 = km_ptr->target_pos[i]; free(d_coor); free(last_hit); free(hit_score); free(hit_count); return arange; } void free_aln_range( aln_range * arange) { free(arange); } FALCON-0.1.3/src/py/000077500000000000000000000000001237512144000136735ustar00rootroot00000000000000FALCON-0.1.3/src/py/__init__.py000066400000000000000000000036061237512144000160110ustar00rootroot00000000000000 #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ from .falcon_kit import * FALCON-0.1.3/src/py/falcon_kit.py000066400000000000000000000163321237512144000163630ustar00rootroot00000000000000 #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. 
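#
# Usage sketch for the ctypes bindings defined below (illustrative only;
# the sequences and parameter values here are hypothetical, and the seed
# read must come first in the list):
#
#   from ctypes import c_char_p
#   from falcon_kit import falcon
#   seqs = ["ACGTACGT...", "ACGAACGT...", "ACGTACGA..."]
#   seqs_ptr = (c_char_p * len(seqs))(*seqs)
#   cns = falcon.generate_consensus(seqs_ptr, len(seqs), 8, 8, 12, 24, 0.70)
#   print cns[0].sequence   # lower-case bases mark low-coverage positions
#   falcon.free_consensus_data(cns)
#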
IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ from ctypes import * import os module_path = os.path.split(__file__)[0] seq_coor_t = c_long base_t = c_uint8 class KmerLookup(Structure): _fields_ = [("start", seq_coor_t), ("last", seq_coor_t), ("count", seq_coor_t)] class KmerMatch(Structure): _fields_ = [ ("count", seq_coor_t), ("query_pos", POINTER(seq_coor_t)), ("target_pos", POINTER(seq_coor_t)) ] class AlnRange(Structure): _fields_ = [ ("s1", seq_coor_t), ("e1", seq_coor_t), ("s2", seq_coor_t), ("e2", seq_coor_t), ("score", c_long) ] class ConsensusData(Structure): _fields_ = [ ("sequence", c_char_p), ("eff_cov", POINTER(c_uint)) ] kup = CDLL(os.path.join(module_path, "kmer_lookup.so")) kup.allocate_kmer_lookup.argtypes = [seq_coor_t] kup.allocate_kmer_lookup.restype = POINTER(KmerLookup) kup.init_kmer_lookup.argtypes = [POINTER(KmerLookup), seq_coor_t] kup.free_kmer_lookup.argtypes = [POINTER(KmerLookup)] kup.allocate_seq.argtypes = [seq_coor_t] kup.allocate_seq.restype = POINTER(base_t) kup.init_seq_array.argtypes = [POINTER(base_t), seq_coor_t] kup.free_seq_array.argtypes = [POINTER(base_t)] kup.allocate_seq_addr.argtypes = [seq_coor_t] kup.allocate_seq_addr.restype = POINTER(seq_coor_t) kup.free_seq_addr_array.argtypes = [POINTER(seq_coor_t)] kup.add_sequence.argtypes = [ seq_coor_t, c_uint, POINTER(c_char), seq_coor_t, POINTER(seq_coor_t), POINTER(c_uint8), POINTER(KmerLookup) ] kup.mask_k_mer.argtypes =[ c_long, POINTER(KmerLookup), c_long ] kup.find_kmer_pos_for_seq.argtypes = [ POINTER(c_char), seq_coor_t, c_uint, POINTER(seq_coor_t), POINTER(KmerLookup)] kup.find_kmer_pos_for_seq.restype = POINTER(KmerMatch) kup.free_kmer_match.argtypes = [ POINTER(KmerMatch) ] kup.find_best_aln_range.argtypes = [POINTER(KmerMatch), seq_coor_t, seq_coor_t, seq_coor_t] kup.find_best_aln_range.restype = POINTER(AlnRange) kup.find_best_aln_range2.argtypes = [POINTER(KmerMatch), seq_coor_t, seq_coor_t, seq_coor_t] kup.find_best_aln_range2.restype = POINTER(AlnRange) kup.free_aln_range.argtypes = [POINTER(AlnRange)] class Alignment(Structure): """ typedef struct { seq_coor_t aln_str_size ; seq_coor_t dist ; seq_coor_t aln_q_s; seq_coor_t aln_q_e; seq_coor_t aln_t_s; seq_coor_t aln_t_e; char * q_aln_str; char * t_aln_str; } alignment; """ _fields_ = [ ("aln_str_size", seq_coor_t), ("dist", seq_coor_t), ("aln_q_s", seq_coor_t), ("aln_q_e", seq_coor_t), ("aln_t_s", seq_coor_t), ("aln_t_e", seq_coor_t), ("q_aln_str", c_char_p), ("t_aln_str", c_char_p)] DWA = CDLL(os.path.join(module_path, "DW_align.so")) DWA.align.argtypes = [ POINTER(c_char), c_long, POINTER(c_char), c_long, c_long, c_int ] DWA.align.restype = POINTER(Alignment) DWA.free_alignment.argtypes = [POINTER(Alignment)] falcon = CDLL(os.path.join(module_path,"falcon.so")) falcon.generate_consensus.argtypes = [POINTER(c_char_p), c_uint, c_uint, c_uint, c_uint, c_uint, c_double ] falcon.generate_consensus.restype = POINTER(ConsensusData) falcon.free_consensus_data.argtypes = [ 
POINTER(ConsensusData) ] def get_alignment(seq1, seq0): K = 8 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) sa_ptr = kup.allocate_seq( len(seq0) ) sda_ptr = kup.allocate_seq_addr( len(seq0) ) kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr) kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] aln_range_ptr = kup.find_best_aln_range(kmer_match_ptr, K, K*10, 50) #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] ) kup.free_kmer_match(kmer_match_ptr) aln_range = aln_range_ptr[0] s1, e1, s2, e2 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2 kup.free_aln_range(aln_range_ptr) if e1 - s1 > 500: #s1 = 0 if s1 < 14 else s1 - 14 #s2 = 0 if s2 < 14 else s2 - 14 e1 = len(seq1) if e1 >= len(seq1)-2*K else e1 + K*2 e2 = len(seq0) if e2 >= len(seq0)-2*K else e2 + K*2 alignment = DWA.align(seq1[s1:e1], e1-s1, seq0[s2:e2], e2-s2, 100, 0) #print seq1[s1:e1] #print seq0[s2:e2] #if alignment[0].aln_str_size > 500: #aln_str1 = alignment[0].q_aln_str #aln_str0 = alignment[0].t_aln_str aln_size = alignment[0].aln_str_size aln_dist = alignment[0].dist aln_q_s = alignment[0].aln_q_s aln_q_e = alignment[0].aln_q_e aln_t_s = alignment[0].aln_t_s aln_t_e = alignment[0].aln_t_e #print "X,",alignment[0].aln_q_s, alignment[0].aln_q_e #print "Y,",alignment[0].aln_t_s, alignment[0].aln_t_e #print aln_str1 #print aln_str0 DWA.free_alignment(alignment) kup.free_seq_addr_array(sda_ptr) kup.free_seq_array(sa_ptr) kup.free_kmer_lookup(lk_ptr) if e1 - s1 > 500 and aln_size > 500: return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist else: return None FALCON-0.1.3/src/py_scripts/000077500000000000000000000000001237512144000154425ustar00rootroot00000000000000FALCON-0.1.3/src/py_scripts/falcon_asm.py000077500000000000000000001245011237512144000201240ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. 
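#
# A note on the node naming used throughout this script (an illustrative
# sketch, not part of the original file): every read contributes two
# nodes, "<read_id>:B" (beginning) and "<read_id>:E" (end), so each
# overlap edge has a reverse-complement mirror, e.g.
#
#   reverse_end("read0123:E")   # -> "read0123:B"
#
# A hypothetical same-strand overlap f -> g yields the edge ("f:E", "g:E")
# together with its mirror ("g:B", "f:B"), which is why the edge-removal
# routines below always mark edges in mirrored pairs.
#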
IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ from pbcore.io import FastaReader import networkx as nx import os import shlex import sys import subprocess DEBUG_LOG_LEVEL = 0 class SGNode(object): """ class representing a node in the string graph """ def __init__(self, node_name): self.name = node_name self.out_edges = [] self.in_edges = [] def add_out_edge(self, out_edge): self.out_edges.append(out_edge) def add_in_edge(self, in_edge): self.in_edges.append(in_edge) class SGEdge(object): """ class representing an edge in the string graph """ def __init__(self, in_node, out_node): self.in_node = in_node self.out_node = out_node self.attr = {} def set_attribute(self, attr, value): self.attr[attr] = value def reverse_end( node_id ): node_id, end = node_id.split(":") new_end = "B" if end == "E" else "E" return node_id + ":" + new_end class StringGraph(object): """ class representing the string graph """ def __init__(self): self.nodes = {} self.edges = {} self.n_mark = {} self.e_reduce = {} self.repeat_overlap = {} def add_node(self, node_name): """ add a node into the graph by given a node name """ if node_name not in self.nodes: self.nodes[node_name] = SGNode(node_name) def add_edge(self, in_node_name, out_node_name, **attributes): """ add an edge into the graph by given a pair of nodes """ if (in_node_name, out_node_name) not in self.edges: self.add_node(in_node_name) self.add_node(out_node_name) in_node = self.nodes[in_node_name] out_node = self.nodes[out_node_name] edge = SGEdge(in_node, out_node) self.edges[ (in_node_name, out_node_name) ] = edge in_node.add_out_edge(edge) out_node.add_in_edge(edge) edge = self.edges[ (in_node_name, out_node_name) ] for k, v in attributes.items(): edge.attr[k] = v def init_reduce_dict(self): for e in self.edges: self.e_reduce[e] = False def mark_chimer_edge(self): for e_n, e in self.edges.items(): v = e_n[0] w = e_n[1] overlap_count = 0 for w_out_e in self.nodes[w].out_edges: w_out_n = w_out_e.out_node.name if (v, w_out_n) in self.edges: overlap_count += 1 for v_in_e in self.nodes[v].in_edges: v_in_n = v_in_e.in_node.name if (v_in_n, w) in self.edges: overlap_count += 1 if self.e_reduce[ (v, w) ] != True: if overlap_count == 0: self.e_reduce[(v, w)] = True #print "XXX: chimer edge %s %s removed" % (v, w) v, w = reverse_end(w), reverse_end(v) self.e_reduce[(v, w)] = True #print "XXX: chimer edge %s %s removed" % (v, w) def mark_spur_edge(self): for v in self.nodes: if len(self.nodes[v].out_edges) > 1: for out_edge in self.nodes[v].out_edges: w = out_edge.out_node.name if len(self.nodes[w].out_edges) == 0 and self.e_reduce[ (v, w) ] != True: #print "XXX: spur edge %s %s removed" % (v, w) self.e_reduce[(v, w)] = True v2, w2 = reverse_end(w), reverse_end(v) #print "XXX: spur edge %s %s removed" % (v2, w2) self.e_reduce[(v, w)] = True if len(self.nodes[v].in_edges) > 1: for in_edge in self.nodes[v].in_edges: w = in_edge.in_node.name if len(self.nodes[w].in_edges) == 0 and self.e_reduce[ (w, v) ] != 
True: #print "XXX: spur edge %s %s removed" % (w, v) self.e_reduce[(w, v)] = True v2, w2 = reverse_end(w), reverse_end(v) #print "XXX: spur edge %s %s removed" % (w2, v2) self.e_reduce[(w, v)] = True def mark_tr_edges(self): """ transitive reduction """ n_mark = self.n_mark e_reduce = self.e_reduce FUZZ = 500 for n in self.nodes: n_mark[n] = "vacant" for n_name, node in self.nodes.items(): out_edges = node.out_edges if len(out_edges) == 0: continue out_edges.sort(key=lambda x: x.attr["length"]) for e in out_edges: w = e.out_node n_mark[ w.name ] = "inplay" max_len = out_edges[-1].attr["length"] max_len += FUZZ for e in out_edges: e_len = e.attr["length"] w = e.out_node if n_mark[w.name] == "inplay": w.out_edges.sort( key=lambda x: x.attr["length"] ) for e2 in w.out_edges: if e2.attr["length"] + e_len < max_len: x = e2.out_node if n_mark[x.name] == "inplay": n_mark[x.name] = "eliminated" for e in out_edges: e_len = e.attr["length"] w = e.out_node w.out_edges.sort( key=lambda x: x.attr["length"] ) if len(w.out_edges) > 0: x = w.out_edges[0].out_node if n_mark[x.name] == "inplay": n_mark[x.name] = "eliminated" for e2 in w.out_edges: if e2.attr["length"] < FUZZ: x = e2.out_node if n_mark[x.name] == "inplay": n_mark[x.name] = "eliminated" for out_edge in out_edges: v = out_edge.in_node w = out_edge.out_node if n_mark[w.name] == "eliminated": e_reduce[ (v.name, w.name) ] = True #print "XXX: tr edge %s %s removed" % (v.name, w.name) v_name, w_name = reverse_end(w.name), reverse_end(v.name) e_reduce[(v_name, w_name)] = True #print "XXX: tr edge %s %s removed" % (v_name, w_name) n_mark[w.name] = "vacant" def mark_best_overlap(self): """ find the best overlapped edges """ best_edges = set() for v in self.nodes: out_edges = self.nodes[v].out_edges if len(out_edges) > 0: out_edges.sort(key=lambda e: e.attr["score"]) e = out_edges[-1] best_edges.add( (e.in_node.name, e.out_node.name) ) in_edges = self.nodes[v].in_edges if len(in_edges) > 0: in_edges.sort(key=lambda e: e.attr["score"]) e = in_edges[-1] best_edges.add( (e.in_node.name, e.out_node.name) ) if DEBUG_LOG_LEVEL > 1: print "X", len(best_edges) for e_n, e in self.edges.items(): v = e_n[0] w = e_n[1] if self.e_reduce[ (v, w) ] != True: if (v, w) not in best_edges: self.e_reduce[(v, w)] = True #print "XXX: in best edge %s %s removed" % (v, w) v2, w2 = reverse_end(w), reverse_end(v) #print "XXX: in best edge %s %s removed" % (v2, w2) self.e_reduce[(v2, w2)] = True def get_out_edges_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].out_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) return rtn def get_in_edges_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].in_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) return rtn def get_best_out_edge_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].out_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) rtn.sort(key=lambda e: e.attr["score"]) return rtn[-1] def get_best_in_edge_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].in_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) rtn.sort(key=lambda e: e.attr["score"]) return rtn[-1] RCMAP = dict(zip("ACGTacgtNn-","TGCAtgcaNn-")) def generate_seq_from_path(sg, seqs, path): subseqs = [] r_id, end = path[0].split(":") count = 0 for i in range( len( path ) -1 ): w_n, v_n = path[i:i+2] edge = sg.edges[ 
(w_n, v_n ) ] read_id, coor = edge.attr["label"].split(":") b,e = coor.split("-") b = int(b) e = int(e) if b < e: subseqs.append( seqs[read_id][b:e] ) else: subseqs.append( "".join( [RCMAP[c] for c in seqs[read_id][b:e:-1]] ) ) return "".join(subseqs) def reverse_path( path ): new_path = [] for n in list(path[::-1]): rid, end = n.split(":") new_end = "B" if end == "E" else "E" new_path.append( rid+":"+new_end) return new_path def generate_unitig(sg, seqs, out_fn, connected_nodes = None): """ given a string graph:sg and the sequences: seqs, write the unitig fasta file into out_fn the funtion return a reduct graph representing the reduce string graph where the edges are unitigs some extra files generated: unit_edges.dat : an easy to parse file for unitig data unit_edge_paths : the file contains the information of the path of all unitigs uni_graph.gexf: the unitig graph in gexf format for visulization """ G = SGToNXG(sg) if connected_nodes != None: connected_nodes = set(sg.nodes) out_fasta = open(out_fn, "w") nodes_for_tig = set() sg_edges = set() for v, w in sg.edges: if sg.e_reduce[(v, w)] != True: sg_edges.add( (v, w) ) count = 0 edges_in_tigs = set() uni_edges = {} path_f = open("unit_edge_paths","w") uni_edge_f = open("unit_edges.dat", "w") while len(sg_edges) > 0: v, w = sg_edges.pop() #nodes_for_tig.remove(n) upstream_nodes = [] c_node = v p_in_edges = sg.get_in_edges_for_node(c_node) p_out_edges = sg.get_out_edges_for_node(c_node) while len(p_in_edges) == 1 and len(p_out_edges) == 1: p_node = p_in_edges[0].in_node upstream_nodes.append(p_node.name) if (p_node.name, c_node) not in sg_edges: break p_in_edges = sg.get_in_edges_for_node(p_node.name) p_out_edges = sg.get_out_edges_for_node(p_node.name) c_node = p_node.name upstream_nodes.reverse() downstream_nodes = [] c_node = w n_out_edges = sg.get_out_edges_for_node(c_node) n_in_edges = sg.get_in_edges_for_node(c_node) while len(n_out_edges) == 1 and len(n_in_edges) == 1: n_node = n_out_edges[0].out_node downstream_nodes.append(n_node.name) if (c_node, n_node.name) not in sg_edges: break n_out_edges = sg.get_out_edges_for_node(n_node.name) n_in_edges = sg.get_in_edges_for_node(n_node.name) c_node = n_node.name whole_path = upstream_nodes + [v, w] + downstream_nodes count += 1 subseq = generate_seq_from_path(sg, seqs, whole_path) uni_edges.setdefault( (whole_path[0], whole_path[-1]), [] ) uni_edges[(whole_path[0], whole_path[-1])].append( ( whole_path, subseq ) ) print >> uni_edge_f, whole_path[0], whole_path[-1], "-".join(whole_path), subseq print >>path_f, ">%05dc-%s-%s-%d %s" % (count, whole_path[0], whole_path[-1], len(whole_path), " ".join(whole_path)) print >>out_fasta, ">%05dc-%s-%s-%d" % (count, whole_path[0], whole_path[-1], len(whole_path)) print >>out_fasta, subseq for i in range( len( whole_path ) -1 ): w_n, v_n = whole_path[i:i+2] try: sg_edges.remove( (w_n, v_n) ) except KeyError: #if an edge is already deleted, ignore it pass r_whole_path = reverse_path( whole_path ) count += 1 subseq = generate_seq_from_path(sg, seqs, r_whole_path) uni_edges.setdefault( (r_whole_path[0], r_whole_path[-1]), [] ) uni_edges[(r_whole_path[0], r_whole_path[-1])].append( ( r_whole_path, subseq ) ) print >> uni_edge_f, r_whole_path[0], r_whole_path[-1], "-".join(r_whole_path), subseq print >>path_f, ">%05dc-%s-%s-%d %s" % (count, r_whole_path[0], r_whole_path[-1], len(r_whole_path), " ".join(r_whole_path)) print >>out_fasta, ">%05dc-%s-%s-%d" % (count, r_whole_path[0], r_whole_path[-1], len(r_whole_path)) print >>out_fasta, subseq for i in 
range( len( r_whole_path ) -1 ): w_n, v_n = r_whole_path[i:i+2] try: sg_edges.remove( (w_n, v_n) ) except KeyError: #if an edge is already deleted, ignore it pass path_f.close() uni_edge_f.close() #uni_graph = nx.DiGraph() #for n1, n2 in uni_edges.keys(): # uni_graph.add_edge(n1, n2, count = len( uni_edges[ (n1,n2) ] )) #nx.write_gexf(uni_graph, "uni_graph.gexf") out_fasta.close() return uni_edges def neighbor_bound(G, v, w, radius): """ test if the node v and the node w are connected within a radius in graph G """ g1 = nx.ego_graph(G, v, radius=radius, undirected=False) g2 = nx.ego_graph(G, w, radius=radius, undirected=False) if len(set(g1.edges()) & set(g2.edges())) > 0: return True else: return False def is_branch_node(G, n): """ test whether the node n is a "branch node" which the paths from any of two of its offsprings do not intersect within a given radius """ out_edges = G.out_edges([n]) n2 = [ e[1] for e in out_edges ] is_branch = False for i in range(len(n2)): for j in range(i+1, len(n2)): v = n2[i] w = n2[j] if neighbor_bound(G, v, w, 10) == False: is_branch = True break if is_branch == True: break return is_branch def get_bundle( path, u_graph ): """ find a sub-graph contain the nodes between the start and the end of the path inputs: u_graph : a unitig graph returns: bundle_graph: the whole bundle graph bundle_paths: the paths in the bundle graph sub_graph2_edges: all edges of the bundle graph """ p_start, p_end = path[0], path[-1] p_nodes = set(path) p_edges = set(zip(path[:-1], path[1:])) u_graph_r = u_graph.reverse() down_path = nx.ego_graph(u_graph, p_start, radius=len(p_nodes), undirected=False) up_path = nx.ego_graph(u_graph_r, p_end, radius=len(p_nodes), undirected=False) subgraph_nodes = set(down_path) & set(up_path) sub_graph = nx.DiGraph() for v, w in u_graph.edges_iter(): if v in subgraph_nodes and w in subgraph_nodes: if (v, w) in p_edges: sub_graph.add_edge(v, w, color = "red") else: sub_graph.add_edge(v, w, color = "black") sub_graph2 = nx.DiGraph() tips = set() tips.add(path[0]) sub_graph_r = sub_graph.reverse() visited = set() ct = 0 is_branch = is_branch_node(sub_graph, path[0]) #if the start node is a branch node if is_branch: n = tips.pop() e = sub_graph.out_edges([n])[0] #pick one path the build the subgraph sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"]) if e[1] not in visited: last_node = e[1] visited.add(e[1]) r_id, orientation = e[1].split(":") orientation = "E" if orientation == "B" else "E" visited.add( r_id +":" + orientation) if not is_branch_node(sub_graph_r, e[1]): tips.add(e[1]) while len(tips) != 0: n = tips.pop() out_edges = sub_graph.out_edges([n]) if len(out_edges) == 1: e = out_edges[0] sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"]) last_node = e[1] if e[1] not in visited: visited.add(e[1]) r_id, orientation = e[1].split(":") orientation = "E" if orientation == "B" else "E" visited.add( r_id +":" + orientation) if not is_branch_node(sub_graph_r, e[1]): tips.add(e[1]) else: is_branch = is_branch_node(sub_graph, n) if not is_branch: for e in out_edges: sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"]) last_node = e[1] if e[1] not in visited: r_id, orientation = e[1].split(":") visited.add(e[1]) orientation = "E" if orientation == "B" else "E" visited.add( r_id +":" + orientation) if not is_branch_node(sub_graph_r, e[1]): tips.add(e[1]) ct += 1 last_node = None longest_len = 0 sub_graph2_nodes = sub_graph2.nodes() sub_graph2_edges = sub_graph2.edges() new_path = 
[path[0]] for n in sub_graph2_nodes: if len(sub_graph2.out_edges(n)) == 0 : path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight") path_len = len(path_t) if path_len > longest_len: last_node = n longest_len = path_len new_path = path_t if last_node == None: for n in sub_graph2_nodes: path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight") path_len = len(path_t) if path_len > longest_len: last_node = n longest_len = path_len new_path = path_t path = new_path # clean up sub_graph2 according to new begin and end sub_graph2_r = sub_graph2.reverse() down_path = nx.ego_graph(sub_graph2, path[0], radius=len(path), undirected=False) up_path = nx.ego_graph(sub_graph2_r, path[-1], radius=len(path), undirected=False) subgraph_nodes = set(down_path) & set(up_path) for v in sub_graph2_nodes: if v not in subgraph_nodes: sub_graph2.remove_node(v) if DEBUG_LOG_LEVEL > 1: print "new_path", path[0], last_node, len(sub_graph2_nodes), path bundle_paths = [path] p_nodes = set(path) p_edges = set(zip(path[:-1], path[1:])) sub_graph2_nodes = sub_graph2.nodes() sub_graph2_edges = sub_graph2.edges() nodes_idx = dict( [ (n[1], n[0]) for n in enumerate(path) ] ) # create a list of subpath that has no branch non_branch_subpaths = [] wi = 0 vi = 0 v = path[0] while v != path[-1] and wi < len(path)-1: wi += 1 w = path[wi] while len( sub_graph2.successors(w) ) == 1 and len( sub_graph2.predecessors(w) ) == 1 and wi < len(path)-1: wi += 1 w = path[wi] if len( sub_graph2.successors(v) )!= 1 or len( sub_graph2.predecessors(w) )!= 1: branched = True else: branched = False if not branched: non_branch_subpaths.append( path[vi:wi+1] ) v = w vi = wi # create the accompany_graph that has the path of the alternative subpaths associate_graph = nx.DiGraph() for v, w in sub_graph2.edges_iter(): if (v, w) not in p_edges: associate_graph.add_edge(v, w, n_weight = sub_graph2[v][w]["n_weight"]) if DEBUG_LOG_LEVEL > 1: print "associate_graph size:", len(associate_graph) print "non_branch_subpaths",len(non_branch_subpaths), non_branch_subpaths # construct the bundle graph associate_graph_nodes = set(associate_graph.nodes()) bundle_graph = nx.DiGraph() bundle_graph.add_path( path ) for i in range(len(non_branch_subpaths)-1): if len(non_branch_subpaths[i]) == 0 or len( non_branch_subpaths[i+1] ) == 0: continue e1, e2 = non_branch_subpaths[i: i+2] v = e1[-1] w = e2[0] if v == w: continue in_between_node_count = nodes_idx[w] - nodes_idx[v] if v in associate_graph_nodes and w in associate_graph_nodes: try: a_path = nx.shortest_path(associate_graph, v, w, "n_weight") except nx.NetworkXNoPath: continue bundle_graph.add_path( a_path ) bundle_paths.append( a_path ) return bundle_graph, bundle_paths, sub_graph2_edges def get_bundles(u_edges): """ input: all unitig edges output: the assembled primary_tigs.fa and all_tigs.fa """ ASM_graph = nx.DiGraph() out_f = open("primary_tigs.fa", "w") main_tig_paths = open("primary_tigs_paths","w") sv_tigs = open("all_tigs.fa","w") sv_tig_paths = open("all_tigs_paths","w") max_weight = 0 for v, w in u_edges: x = max( [len(s[1]) for s in u_edges[ (v,w) ] ] ) if DEBUG_LOG_LEVEL > 1: print "W", v, w, x if x > max_weight: max_weight = x in_edges = {} out_edges = {} for v, w in u_edges: in_edges.setdefault(w, []) out_edges.setdefault(w, []) in_edges[w].append( (v, w) ) out_edges.setdefault(v, []) in_edges.setdefault(v, []) out_edges[v].append( (v, w) ) u_graph = nx.DiGraph() for v,w in u_edges: u_graph.add_edge(v, w, n_weight = max_weight - max( 
[len(s[1]) for s in u_edges[ (v,w) ] ] ) ) bundle_edge_out = open("bundle_edges","w") bundle_index = 0 G = u_graph.copy() visited_u_edges = set() while len(G) > 0: root_nodes = set() for n in G: if G.in_degree(n) == 0: root_nodes.add(n) if len(root_nodes) == 0: if G.in_degree(n) != 1: root_nodes.add(n) if len(root_nodes) == 0: root_nodes.add( G.nodes()[0] ) candidates = [] for n in list(root_nodes): sp =nx.single_source_shortest_path_length(G, n) sp = sp.items() sp.sort(key=lambda x : x[1]) longest = sp[-1] if DEBUG_LOG_LEVEL > 2: print "L", n, longest[0] if longest[0].split(":")[0] == n.split(":")[0]: #avoid a big loop continue candidates.append ( (longest[1], n, longest[0]) ) if len(candidates) == 0: print "no more candiate", len(G.edges()), len(G.nodes()) if len(G.edges()) > 0: path = G.edges()[0] print path else: break else: candidates.sort() candidate = candidates[-1] if candidate[1] == candidate[2]: G.remove_node(candidate[1]) continue path = nx.shortest_path(G, candidate[1], candidate[2], "n_weight") if DEBUG_LOG_LEVEL > 1: print "X", path[0], path[-1], len(path) cmp_edges = set() g_edges = set(G.edges()) new_path = [] tail = True # avioid confusion due to long palindrome sequence if len(path) > 2: for i in range( 0, len( path ) - 1 ): v_n, w_n = path[i:i+2] new_path.append(v_n) # the comment out code below might be useful for filter out some high connectivity nodes #if (v_n, w_n) in cmp_edges or\ # len(u_graph.out_edges(w_n)) > 5 or\ # len(u_graph.in_edges(w_n)) > 5: if (v_n, w_n) in cmp_edges: tail = False break r_id, end = v_n.split(":") end = "E" if end == "B" else "B" v_n2 = r_id + ":" + end r_id, end = w_n.split(":") end = "E" if end == "B" else "B" w_n2 = r_id + ":" + end if (w_n2, v_n2) in g_edges: cmp_edges.add( (w_n2, v_n2) ) if tail: new_path.append(w_n) else: new_path = path[:] if len(new_path) > 1: path = new_path if DEBUG_LOG_LEVEL > 2: print "Y", path[0], path[-1], len(path) bundle_graph, bundle_paths, bundle_graph_edges = get_bundle( path, G ) for bg_edge in bundle_graph_edges: print >> bundle_edge_out, bundle_index, "edge", bg_edge[0], bg_edge[1] for path_ in bundle_paths: print >>bundle_edge_out, "path", bundle_index, " ".join(path_) edges_to_be_removed = set() if DEBUG_LOG_LEVEL > 2: print "Z", bundle_paths[0][0], bundle_paths[0][-1] print bundle_index, len(path), len(bundle_paths[0]), len(bundle_paths), len(bundle_graph_edges) if len(bundle_graph_edges) > 0: ASM_graph.add_path(bundle_paths[0], ctg="%04d" % bundle_index) extra_u_edges = [] print >> main_tig_paths, ">%04d %s" % ( bundle_index, " ".join(bundle_paths[0]) ) subseqs = [] for i in range(len(bundle_paths[0]) - 1): v, w = bundle_paths[0][i:i+2] edges_to_be_removed.add( (v,w) ) uedges = u_edges[ (v,w) ] uedges.sort( key= lambda x: len(x[0]) ) subseqs.append( uedges[-1][1] ) visited_u_edges.add( "-".join(uedges[-1][0]) ) for ue in uedges: if "-".join(ue[0]) not in visited_u_edges: visited_u_edges.add("-".join(ue[0])) extra_u_edges.append(ue) seq = "".join(subseqs) sv_tig_idx = 0 print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(bundle_paths[0]) ) if len(seq) > 0: print >> out_f, ">%04d %s-%s" % (bundle_index, bundle_paths[0][0], bundle_paths[0][-1]) print >> out_f, seq print >> sv_tigs, ">%04d-%04d %s-%s" % (bundle_index, sv_tig_idx, bundle_paths[0][0], bundle_paths[0][-1]) print >> sv_tigs, "".join(subseqs) sv_tig_idx += 1 for sv_path in bundle_paths[1:]: print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(sv_path) ) ASM_graph.add_path(sv_path, 
ctg="%04d" % bundle_index) subseqs = [] for i in range(len(sv_path) - 1): v, w = sv_path[i:i+2] edges_to_be_removed.add( (v,w) ) uedges = u_edges[ (v,w) ] uedges.sort( key= lambda x: len(x[0]) ) subseqs.append( uedges[-1][1] ) visited_u_edges.add( "-".join(uedges[-1][0]) ) for ue in uedges: if "-".join(ue[0]) not in visited_u_edges: visited_u_edges.add("-".join(ue[0])) extra_u_edges.append(ue) seq = "".join(subseqs) if len(seq) > 0: print >> sv_tigs, ">%04d-%04d %s-%s" % (bundle_index, sv_tig_idx, sv_path[0], sv_path[-1]) print >> sv_tigs, "".join(subseqs) sv_tig_idx += 1 for u_path, seq in extra_u_edges: #u_path = u_path.split("-") ASM_graph.add_edge(u_path[0], u_path[-1], ctg="%04d" % bundle_index) print >> sv_tig_paths, ">%04d-%04d-u %s" % ( bundle_index, sv_tig_idx, " ".join(u_path) ) print >> sv_tigs, ">%04d-%04d-u %s-%s" % (bundle_index, sv_tig_idx, u_path[0], u_path[-1]) print >> sv_tigs, seq sv_tig_idx += 1 bundle_index += 1 else: #TODO, consolidate code here v,w = path uedges = u_edges[ (v,w) ] uedges.sort( key= lambda x: len(x[0]) ) subseqs.append( uedges[-1][1] ) seq = "".join(subseqs) print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(paths) ) print >> sv_tigs, ">%04d-%04d-u %s-%s" % (bundle_index, sv_tig_idx, path[0], path[-1]) print >> sv_tigs, seq sv_tig_idx += 1 bundle_index += 1 bundle_graph_edges = zip(path[:-1],path[1:]) #clean up the graph edges = set(G.edges()) edges_to_be_removed |= set(bundle_graph_edges) if DEBUG_LOG_LEVEL > 2: print "BGE",bundle_graph_edges edge_remove_count = 0 for v, w in edges_to_be_removed: if (v, w) in edges: G.remove_edge( v, w ) edge_remove_count += 1 if DEBUG_LOG_LEVEL > 2: print "remove edge", bundle_index, w, v edges = set(G.edges()) for v, w in edges_to_be_removed: r_id, end = v.split(":") end = "E" if end == "B" else "B" v = r_id + ":" + end r_id, end = w.split(":") end = "E" if end == "B" else "B" w = r_id + ":" + end if (w, v) in edges: G.remove_edge( w, v ) edge_remove_count += 1 if DEBUG_LOG_LEVEL > 2: print "remove edge", bundle_index, w, v if edge_remove_count == 0: break nodes = G.nodes() for n in nodes: if G.in_degree(n) == 0 and G.out_degree(n) == 0: G.remove_node(n) if DEBUG_LOG_LEVEL > 2: print "remove node", n sv_tig_paths.close() sv_tigs.close() main_tig_paths.close() out_f.close() bundle_edge_out.close() return ASM_graph def SGToNXG(sg): G=nx.DiGraph() max_score = max([ sg.edges[ e ].attr["score"] for e in sg.edges if sg.e_reduce[e] != True ]) out_f = open("edges_list","w") for v, w in sg.edges: if sg.e_reduce[(v, w)] != True: ##if 1: out_degree = len(sg.nodes[v].out_edges) G.add_node( v, size = out_degree ) G.add_node( w, size = out_degree ) label = sg.edges[ (v, w) ].attr["label"] score = sg.edges[ (v, w) ].attr["score"] print >>out_f, v, w, label, score G.add_edge( v, w, label = label, weight = 0.001*score, n_weight = max_score - score ) #print in_node_name, out_node_name out_f.close() return G if __name__ == "__main__": import argparse parser = argparse.ArgumentParser(description='a example string graph assembler that is desinged for handling diploid genomes') parser.add_argument('overlap_file', help='a file that contains the overlap information.') parser.add_argument('read_fasta', help='the file that contains the sequence to be assembled') parser.add_argument('--min_len', type=int, default=4000, help='minimum length of the reads to be considered for assembling') parser.add_argument('--min_idt', type=float, default=96, help='minimum alignment identity of the reads to be considered for 
assembling') parser.add_argument('--disable_chimer_prediction', action="store_true", default=False, help='you may want to disable this as some reads can be falsely identified as chimers in low coverage case') args = parser.parse_args() overlap_file = args.overlap_file read_fasta = args.read_fasta seqs = {} # load all p-reads into memory f = FastaReader(read_fasta) for r in f: seqs[r.name] = r.sequence.upper() G=nx.Graph() edges =set() overlap_data = [] contained_reads = set() overlap_count = {} # loop through the overlapping data to load the data in the a python array # contained reads are identified with open(overlap_file) as f: for l in f: l = l.strip().split() #work around for some ill formed data recored if len(l) != 13: continue f_id, g_id, score, identity = l[:4] if f_id == g_id: # don't need self-self overlapping continue if g_id not in seqs: continue if f_id not in seqs: continue score = int(score) identity = float(identity) contained = l[12] if contained == "contained": contained_reads.add(f_id) continue if contained == "contains": contained_reads.add(g_id) continue if contained == "none": continue if identity < args.min_idt: # only take record with >96% identity as overlapped reads continue #if score > -2000: # continue f_strain, f_start, f_end, f_len = (int(c) for c in l[4:8]) g_strain, g_start, g_end, g_len = (int(c) for c in l[8:12]) # only used reads longer than the 4kb for assembly if f_len < args.min_len: continue if g_len < args.min_len: continue # double check for proper overlap if f_start > 24 and f_len - f_end > 24: # allow 24 base tolerance on both sides of the overlapping continue if g_start > 24 and g_len - g_end > 24: continue if g_strain == 0: if f_start < 24 and g_len - g_end > 24: continue if g_start < 24 and f_len - f_end > 24: continue else: if f_start < 24 and g_start > 24: continue if g_start < 24 and f_start > 24: continue overlap_data.append( (f_id, g_id, score, identity, f_strain, f_start, f_end, f_len, g_strain, g_start, g_end, g_len) ) overlap_count[f_id] = overlap_count.get(f_id,0)+1 overlap_count[g_id] = overlap_count.get(g_id,0)+1 overlap_set = set() sg = StringGraph() for od in overlap_data: f_id, g_id, score, identity = od[:4] if f_id in contained_reads: continue if g_id in contained_reads: continue f_s, f_b, f_e, f_l = od[4:8] g_s, g_b, g_e, g_l = od[8:12] overlap_pair = [f_id, g_id] overlap_pair.sort() overlap_pair = tuple( overlap_pair ) if overlap_pair in overlap_set: # don't allow duplicated records continue else: overlap_set.add(overlap_pair) if g_s == 1: # revered alignment, swapping the begin and end coordinates g_b, g_e = g_e, g_b # build the string graph edges for each overlap if f_b > 24: if g_b < g_e: """ f.B f.E f -----------> g -------------> g.B g.E """ if f_b == 0 or g_e - g_l == 0: continue sg.add_edge( "%s:B" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), length = abs(f_b-0), score = -score) sg.add_edge( "%s:E" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_e, g_l), length = abs(g_e-g_l), score = -score) else: """ f.B f.E f -----------> g <------------- g.E g.B """ if f_b == 0 or g_e == 0: continue sg.add_edge( "%s:E" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), length = abs(f_b -0), score = -score) sg.add_edge( "%s:E" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, g_e, 0), length = abs(g_e- 0), score = -score) else: if g_b < g_e: """ f.B f.E f -----------> g -------------> g.B g.E """ if g_b == 0 or f_e - f_l == 0: continue sg.add_edge( "%s:B" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, 
g_b, 0), length = abs(g_b - 0), score = -score) sg.add_edge( "%s:E" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), length = abs(f_e-f_l), score = -score) else: """ f.B f.E f -----------> g <------------- g.E g.B """ if g_b - g_l == 0 or f_e - f_l ==0: continue sg.add_edge( "%s:B" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_b, g_l), length = abs(g_b - g_l), score = -score) sg.add_edge( "%s:B" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), length = abs(f_e - f_l), score = -score) sg.init_reduce_dict() if not args.disable_chimer_prediction: sg.mark_chimer_edge() sg.mark_spur_edge() sg.mark_tr_edges() # mark those edges that transitive redundant if DEBUG_LOG_LEVEL > 1: print sum( [1 for c in sg.e_reduce.values() if c == True] ) print sum( [1 for c in sg.e_reduce.values() if c == False] ) sg.mark_best_overlap() # mark those edges that are best overlap edges if DEBUG_LOG_LEVEL > 1: print sum( [1 for c in sg.e_reduce.values() if c == False] ) G = SGToNXG(sg) #nx.write_gexf(G, "string_graph.gexf") # output the raw string string graph for visuliation nx.write_adjlist(G, "string_graph.adj") # write out the whole adjacent list of the string graph u_edges = generate_unitig(sg, seqs, out_fn = "unitgs.fa") # reduct to string graph into unitig graph ASM_graph = get_bundles(u_edges ) # get the assembly #nx.write_gexf(ASM_graph, "asm_graph.gexf") FALCON-0.1.3/src/py_scripts/falcon_asm_dev.py000077500000000000000000001120641237512144000207630ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
#################################################################################$$ from pbcore.io import FastaReader import networkx as nx import os import shlex import sys import subprocess class SGNode(object): def __init__(self, node_name): self.name = node_name self.out_edges = [] self.in_edges = [] def add_out_edge(self, out_edge): self.out_edges.append(out_edge) def add_in_edge(self, in_edge): self.in_edges.append(in_edge) class SGEdge(object): def __init__(self, in_node, out_node): self.in_node = in_node self.out_node = out_node self.attr = {} def set_attribute(self, attr, value): self.attr[attr] = value class StringGraph(object): def __init__(self): self.nodes = {} self.edges = {} self.n_mark = {} self.e_reduce = {} self.repeat_overlap = {} def add_node(self, node_name): if node_name not in self.nodes: self.nodes[node_name] = SGNode(node_name) def add_edge(self, in_node_name, out_node_name, **attributes): if (in_node_name, out_node_name) not in self.edges: self.add_node(in_node_name) self.add_node(out_node_name) in_node = self.nodes[in_node_name] out_node = self.nodes[out_node_name] edge = SGEdge(in_node, out_node) self.edges[ (in_node_name, out_node_name) ] = edge in_node.add_out_edge(edge) out_node.add_in_edge(edge) edge = self.edges[ (in_node_name, out_node_name) ] for k, v in attributes.items(): edge.attr[k] = v def mark_tr_edges(self): n_mark = self.n_mark e_reduce = self.e_reduce FUZZ = 500 for n in self.nodes: n_mark[n] = "vacant" for e in self.edges: e_reduce[e] = False for n_name, node in self.nodes.items(): out_edges = node.out_edges if len(out_edges) == 0: continue out_edges.sort(key=lambda x: x.attr["length"]) for e in out_edges: w = e.out_node n_mark[ w.name ] = "inplay" max_len = out_edges[-1].attr["length"] #longest_edge = out_edges[-1] max_len += FUZZ for e in out_edges: e_len = e.attr["length"] w = e.out_node if n_mark[w.name] == "inplay": w.out_edges.sort( key=lambda x: x.attr["length"] ) for e2 in w.out_edges: if e2.attr["length"] + e_len < max_len: x = e2.out_node if n_mark[x.name] == "inplay": n_mark[x.name] = "eliminated" for e in out_edges: e_len = e.attr["length"] w = e.out_node w.out_edges.sort( key=lambda x: x.attr["length"] ) if len(w.out_edges) > 0: x = w.out_edges[0].out_node if n_mark[x.name] == "inplay": n_mark[x.name] = "eliminated" for e2 in w.out_edges: if e2.attr["length"] < FUZZ: x = e2.out_node if n_mark[x.name] == "inplay": n_mark[x.name] = "eliminated" for out_edge in out_edges: v = out_edge.in_node w = out_edge.out_node if n_mark[w.name] == "eliminated": e_reduce[ (v.name, w.name) ] = True n_mark[w.name] = "vacant" def mark_repeat_overlap(self): repeat_overlap = self.repeat_overlap in_degree = {} for n in self.nodes: c = 0 for e in self.nodes[n].in_edges: v = e.in_node w = e.out_node if self.e_reduce[(v.name, w.name)] == False: c += 1 in_degree[n] = c #print n,c #print len([x for x in in_degree.items() if x[1]>1]) for e_n, e in self.edges.items(): v = e.in_node w = e.out_node if self.e_reduce[(v.name, w.name)] == False: repeat_overlap[ (v.name, w.name) ] = False else: repeat_overlap[ (v.name, w.name) ] = True for n in self.nodes: if len(self.nodes[n].out_edges) < 2: continue min_in_deg = None for e in self.nodes[n].out_edges: v = e.in_node w = e.out_node #print n, v.name, w.name if self.e_reduce[ (v.name, w.name) ] == True: continue if min_in_deg == None: min_in_deg = in_degree[w.name] continue if in_degree[w.name] < min_in_deg: min_in_deg = in_degree[w.name] #print n, w.name, in_degree[w.name] for e in self.nodes[n].out_edges: v = e.in_node 
w = e.out_node assert (v.name, w.name) in self.edges if in_degree[w.name] > min_in_deg: if self.e_reduce[(v.name, w.name)] == False: repeat_overlap[ (v.name, w.name) ] = True for e_n, e in self.edges.items(): v = e.in_node w = e.out_node if repeat_overlap[ (v.name, w.name) ] == True: self.e_reduce[(v.name, w.name)] == True def mark_best_overlap(self): best_edges = set() for v in self.nodes: out_edges = self.nodes[v].out_edges if len(out_edges) > 0: out_edges.sort(key=lambda e: e.attr["score"]) e = out_edges[-1] best_edges.add( (e.in_node.name, e.out_node.name) ) in_edges = self.nodes[v].in_edges if len(in_edges) > 0: in_edges.sort(key=lambda e: e.attr["score"]) e = in_edges[-1] best_edges.add( (e.in_node.name, e.out_node.name) ) print "X", len(best_edges) for e_n, e in self.edges.items(): v = e_n[0] w = e_n[1] if self.e_reduce[ (v, w) ] != True: if (v, w) not in best_edges: self.e_reduce[(v, w)] = True def mark_best_overlap_2(self): best_edges = set() for e in self.edges: v, w = e if w == self.get_best_out_edge_for_node(v).out_node.name and\ v == self.get_best_in_edge_for_node(w).in_node.name: best_edges.add( (v, w) ) for e_n, e in self.edges.items(): v = e_n[0] w = e_n[1] if self.e_reduce[ (v, w) ] != True: if (v, w) not in best_edges: self.e_reduce[(v, w)] = True #print sum( [1 for e_n in self.edges if self.e_reduce[ e_n ] == False] ) def get_out_edges_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].out_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) return rtn def get_in_edges_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].in_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) return rtn def get_best_out_edge_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].out_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) rtn.sort(key=lambda e: e.attr["score"]) return rtn[-1] def get_best_in_edge_for_node(self, name, mask=True): rtn = [] for e in self.nodes[name].in_edges: v = e.in_node w = e.out_node if self.e_reduce[ (v.name, w.name) ] == False: rtn.append(e) rtn.sort(key=lambda e: e.attr["score"]) return rtn[-1] RCMAP = dict(zip("ACGTacgtNn-","TGCAtgcaNn-")) def generate_contig_from_path(sg, seqs, path): subseqs = [] r_id, end = path[0].split(":") if end == "B": subseqs= [ "".join( [RCMAP[c] for c in seqs[r_id][::-1]] ) ] else: subseqs=[ seqs[r_id] ] count = 0 for i in range( len( path ) -1 ): w_n, v_n = path[i:i+2] edge = sg.edges[ (w_n, v_n ) ] read_id, coor = edge.attr["label"].split(":") b,e = coor.split("-") b = int(b) e = int(e) if b < e: subseqs.append( seqs[read_id][b:e] ) else: subseqs.append( "".join( [RCMAP[c] for c in seqs[read_id][b:e:-1]] ) ) return "".join(subseqs) def generate_unitig(sg, seqs, out_fn, connected_nodes = None): G = SGToNXG(sg) if connected_nodes != None: connected_nodes = set(sg.nodes) out_fasta = open(out_fn, "w") nodes_for_tig = set() sg_edges = set() for v, w in sg.edges: if sg.e_reduce[(v, w)] != True: sg_edges.add( (v, w) ) count = 0 edges_in_tigs = set() uni_edges = {} path_f = open("paths","w") uni_edge_f = open("unit_edges.dat", "w") while len(sg_edges) > 0: v, w = sg_edges.pop() #nodes_for_tig.remove(n) upstream_nodes = [] c_node = v p_in_edges = sg.get_in_edges_for_node(c_node) p_out_edges = sg.get_out_edges_for_node(c_node) while len(p_in_edges) == 1 and len(p_out_edges) == 1: p_node = p_in_edges[0].in_node upstream_nodes.append(p_node.name) if 
(p_node.name, c_node) not in sg_edges: break sg_edges.remove( (p_node.name, c_node) ) p_in_edges = sg.get_in_edges_for_node(p_node.name) p_out_edges = sg.get_out_edges_for_node(p_node.name) c_node = p_node.name upstream_nodes.reverse() downstream_nodes = [] c_node = w n_out_edges = sg.get_out_edges_for_node(c_node) n_in_edges = sg.get_in_edges_for_node(c_node) while len(n_out_edges) == 1 and len(n_in_edges) == 1: n_node = n_out_edges[0].out_node downstream_nodes.append(n_node.name) if (c_node, n_node.name) not in sg_edges: break sg_edges.remove( (c_node, n_node.name) ) n_out_edges = sg.get_out_edges_for_node(n_node.name) n_in_edges = sg.get_in_edges_for_node(n_node.name) c_node = n_node.name whole_path = upstream_nodes + [v, w] + downstream_nodes #print len(whole_path) count += 1 subseqs = [] for i in range( len( whole_path ) - 1): v_n, w_n = whole_path[i:i+2] edge = sg.edges[ (v_n, w_n ) ] edges_in_tigs.add( (v_n, w_n ) ) #print n, next_node.name, e.attr["label"] read_id, coor = edge.attr["label"].split(":") b,e = coor.split("-") b = int(b) e = int(e) if b < e: subseqs.append( seqs[read_id][b:e] ) else: try: subseqs.append( "".join( [RCMAP[c] for c in seqs[read_id][b:e:-1]] ) ) except: print seqs[read_id] uni_edges.setdefault( (whole_path[0], whole_path[-1]), [] ) uni_edges[(whole_path[0], whole_path[-1])].append( ( whole_path, "".join(subseqs) ) ) print >> uni_edge_f, whole_path[0], whole_path[-1], "-".join(whole_path), "".join(subseqs) print >>path_f, ">%05dc-%s-%s-%d %s" % (count, whole_path[0], whole_path[-1], len(whole_path), " ".join(whole_path)) print >>out_fasta, ">%05dc-%s-%s-%d" % (count, whole_path[0], whole_path[-1], len(whole_path)) print >>out_fasta,"".join(subseqs) path_f.close() uni_edge_f.close() uni_graph = nx.DiGraph() for n1, n2 in uni_edges.keys(): uni_graph.add_edge(n1, n2, weight = len( uni_edges[ (n1,n2) ] )) nx.write_gexf(uni_graph, "uni_graph.gexf") out_fasta.close() return uni_edges def neighbor_bound(G, v, w, radius): g1 = nx.ego_graph(G, v, radius=radius, undirected=False) g2 = nx.ego_graph(G, w, radius=radius, undirected=False) if len(set(g1.edges()) & set(g2.edges())) > 0: return True else: return False def is_branch_node(G, n): out_edges = G.out_edges([n]) n2 = [ e[1] for e in out_edges ] is_branch = False for i in range(len(n2)): for j in range(i+1, len(n2)): v = n2[i] w = n2[j] if neighbor_bound(G, v, w, 10) == False: is_branch = True break if is_branch == True: break return is_branch def get_bundle( path, u_graph, u_edges ): # find a sub-graph contain the nodes between the start and the end of the path p_start, p_end = path[0], path[-1] p_nodes = set(path) p_edges = set(zip(path[:-1], path[1:])) u_graph_r = u_graph.reverse() down_path = nx.ego_graph(u_graph, p_start, radius=len(p_nodes), undirected=False) up_path = nx.ego_graph(u_graph_r, p_end, radius=len(p_nodes), undirected=False) subgraph_nodes = set(down_path) & set(up_path) #print len(path), len(down_path), len(up_path), len(bundle_nodes) sub_graph = nx.DiGraph() for v, w in u_graph.edges_iter(): if v in subgraph_nodes and w in subgraph_nodes: if (v, w) in p_edges: sub_graph.add_edge(v, w, color = "red") else: sub_graph.add_edge(v, w, color = "black") sub_graph2 = nx.DiGraph() tips = set() tips.add(path[0]) sub_graph_r = sub_graph.reverse() visited = set() ct = 0 is_branch = is_branch_node(sub_graph, path[0]) #if the start node is a branch node if is_branch: n = tips.pop() e = sub_graph.out_edges([n])[0] #pick one path the build the subgraph sub_graph2.add_edge(e[0], e[1], n_weight = 
u_graph[e[0]][e[1]]["n_weight"]) if e[1] not in visited: last_node = e[1] visited.add(e[1]) r_id, orientation = e[1].split(":") orientation = "E" if orientation == "B" else "E" visited.add( r_id +":" + orientation) if not is_branch_node(sub_graph_r, e[1]): tips.add(e[1]) while len(tips) != 0: n = tips.pop() #print "n", n out_edges = sub_graph.out_edges([n]) #out_edges = u_graph.out_edges([n]) #print out_edges if len(out_edges) == 1: e = out_edges[0] sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"]) last_node = e[1] if e[1] not in visited: visited.add(e[1]) r_id, orientation = e[1].split(":") orientation = "E" if orientation == "B" else "E" visited.add( r_id +":" + orientation) if not is_branch_node(sub_graph_r, e[1]): #if not is_branch_node(u_graph_r, e[1]): tips.add(e[1]) else: is_branch = is_branch_node(sub_graph, n) #is_branch = is_branch_node(u_graph, n) if not is_branch: for e in out_edges: sub_graph2.add_edge(e[0], e[1], n_weight = u_graph[e[0]][e[1]]["n_weight"]) last_node = e[1] if e[1] not in visited: r_id, orientation = e[1].split(":") visited.add(e[1]) orientation = "E" if orientation == "B" else "E" visited.add( r_id +":" + orientation) if not is_branch_node(sub_graph_r, e[1]): #if not is_branch_node(u_graph_r, e[1]): tips.add(e[1]) ct += 1 #print ct, len(tips) last_node = None longest_len = 0 sub_graph2_nodes = sub_graph2.nodes() sub_graph2_edges = sub_graph2.edges() new_path = [path[0]] for n in sub_graph2_nodes: if len(sub_graph2.out_edges(n)) == 0 : path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight") path_len = len(path_t) if path_len > longest_len: last_node = n longest_len = path_len new_path = path_t if last_node == None: for n in sub_graph2_nodes: path_t = nx.shortest_path(sub_graph2, source = path[0], target = n, weight = "n_weight") path_len = len(path_t) if path_len > longest_len: last_node = n longest_len = path_len new_path = path_t #new_path = nx.shortest_path(sub_graph2, path[0], last_node, "n_weight") path = new_path print "new_path", path[0], last_node, len(sub_graph2_nodes), path bundle_paths = [path] p_nodes = set(path) p_edges = set(zip(path[:-1], path[1:])) nodes_idx = dict( [ (n[1], n[0]) for n in enumerate(path) ] ) # create a list of subpath that has no branch non_branch_subpaths = [ [] ] non_branch_edges = set() mtg_edges = set() for i in range(len(path)-1): v, w = path[i:i+2] if len(sub_graph2.successors(v)) == 1 and len(sub_graph2.predecessors(w)) == 1: non_branch_subpaths[-1].append( (v, w) ) non_branch_edges.add( (v, w) ) else: if len(non_branch_subpaths[-1]) != 0: non_branch_subpaths.append([]) # create the accompany_graph that has the path of the alternative subpaths associate_graph = nx.DiGraph() for v, w in sub_graph2.edges_iter(): if (v, w) not in p_edges: associate_graph.add_edge(v, w, n_weight = sub_graph2[v][w]["n_weight"]) #print "associate_graph size:", len(associate_graph) #print "non_branch_subpaths", non_branch_subpaths # construct the bundle graph associate_graph_nodes = set(associate_graph.nodes()) bundle_graph = nx.DiGraph() bundle_graph.add_path( path ) for i in range(len(non_branch_subpaths)-1): if len(non_branch_subpaths[i]) == 0 or len( non_branch_subpaths[i+1] ) == 0: continue e1, e2 = non_branch_subpaths[i: i+2] v = e1[-1][-1] w = e2[0][0] if v == w: continue #print v, w in_between_node_count = nodes_idx[w] - nodes_idx[v] if v in associate_graph_nodes and w in associate_graph_nodes: try: #print "p2",v, w, nx.shortest_path(accommpany_graph, v, w) #print "p1",v, w, 
nx.shortest_path(bundle_graph, v, w) a_path = nx.shortest_path(associate_graph, v, w, "n_weight") except nx.NetworkXNoPath: continue bundle_graph.add_path( a_path ) bundle_paths.append( a_path ) #bundle_graph_nodes = bundle_graph.nodes() return bundle_graph, bundle_paths, sub_graph2_edges def get_bundles(u_edges): ASM_graph = nx.DiGraph() out_f = open("primary_tigs.fa", "w") main_tig_paths = open("primary_tigs_paths","w") sv_tigs = open("all_tigs.fa","w") sv_tig_paths = open("all_tigs_paths","w") max_weight = 0 for v, w in u_edges: x = max( [len(s[1]) for s in u_edges[ (v,w) ] ] ) print "W", v, w, x if x > max_weight: max_weight = x in_edges = {} out_edges = {} for v, w in u_edges: in_edges.setdefault(w, []) out_edges.setdefault(w, []) in_edges[w].append( (v, w) ) out_edges.setdefault(v, []) in_edges.setdefault(v, []) out_edges[v].append( (v, w) ) u_graph = nx.DiGraph() for v,w in u_edges: u_graph.add_edge(v, w, n_weight = max_weight - max( [len(s[1]) for s in u_edges[ (v,w) ] ] ) ) bundle_index = 0 G = u_graph.copy() visited_u_edges = set() while len(G) > 0: root_nodes = set() for n in G: if G.in_degree(n) != 1 or G.out_degree(n) !=1 : root_nodes.add(n) if len(root_nodes) == 0: root_nodes.add( G.nodes()[0] ) candidates = [] for n in list(root_nodes): sp =nx.single_source_shortest_path_length(G, n) sp = sp.items() sp.sort(key=lambda x : x[1]) longest = sp[-1] print "L", n, longest[0] if longest[0].split(":")[0] == n.split(":")[0]: #avoid a big loop continue candidates.append ( (longest[1], n, longest[0]) ) if len(candidates) == 0: print "no more candiate", len(G.edges()), len(G.nodes()) if len(G.edges()) > 0: path = G.edges()[0] else: break else: candidates.sort() candidate = candidates[-1] if candidate[1] == candidate[2]: G.remove_node(candidate[1]) continue path = nx.shortest_path(G, candidate[1], candidate[2], "n_weight") print "X", path[0], path[-1], len(path) cmp_edges = set() g_edges = set(G.edges()) new_path = [] tail = True # avioid confusion due to long palindrome sequence for i in range( 0, len( path ) - 1 ): v_n, w_n = path[i:i+2] new_path.append(v_n) #if (v_n, w_n) in cmp_edges or\ # len(u_graph.out_edges(w_n)) > 5 or\ # len(u_graph.in_edges(w_n)) > 5: if (v_n, w_n) in cmp_edges: tail = False break r_id, end = v_n.split(":") end = "E" if end == "B" else "B" v_n2 = r_id + ":" + end r_id, end = w_n.split(":") end = "E" if end == "B" else "B" w_n2 = r_id + ":" + end if (w_n2, v_n2) in g_edges: cmp_edges.add( (w_n2, v_n2) ) if tail: new_path.append(w_n) if len(new_path) > 1: path = new_path print "Y", path[0], path[-1], len(path) #bundle_graph, bundle_paths, bundle_graph_edges = get_bundle( path, u_graph, u_edges ) bundle_graph, bundle_paths, bundle_graph_edges = get_bundle( path, G, G.edges() ) print "Z", bundle_paths[0][0], bundle_paths[0][-1] print bundle_index, len(path), len(bundle_paths[0]), len(bundle_paths), len(bundle_graph_edges) if len(bundle_graph_edges) > 0: #ASM_graph.add_path(bundle_paths[0], ctg="%04d" % bundle_index) extra_u_edges = [] print >> main_tig_paths, ">%04d %s" % ( bundle_index, " ".join(bundle_paths[0]) ) subseqs = [] for i in range(len(bundle_paths[0]) - 1): v, w = bundle_paths[0][i:i+2] uedges = u_edges[ (v,w) ] uedges.sort( key= lambda x: len(x[0]) ) subseqs.append( uedges[-1][1] ) visited_u_edges.add( "-".join(uedges[-1][0]) ) for ue in uedges: if "-".join(ue[0]) not in visited_u_edges: visited_u_edges.add("-".join(ue[0])) extra_u_edges.append(ue) seq = "".join(subseqs) if len(seq) > 0: print >> out_f, ">%04d %s-%s" % (bundle_index, 
bundle_paths[0][0], bundle_paths[0][-1]) print >> out_f, seq sv_tig_idx = 0 for sv_path in bundle_paths: print >> sv_tig_paths, ">%04d-%04d %s" % ( bundle_index, sv_tig_idx, " ".join(sv_path) ) ASM_graph.add_path(sv_path, ctg="%04d" % bundle_index) subseqs = [] for i in range(len(sv_path) - 1): v, w = sv_path[i:i+2] uedges = u_edges[ (v,w) ] uedges.sort( key= lambda x: len(x[0]) ) subseqs.append( uedges[-1][1] ) visited_u_edges.add( "-".join(uedges[-1][0]) ) for ue in uedges: if "-".join(ue[0]) not in visited_u_edges: visited_u_edges.add("-".join(ue[0])) extra_u_edges.append(ue) seq = "".join(subseqs) if len(seq) > 0: print >> sv_tigs, ">%04d-%04d %s-%s" % (bundle_index, sv_tig_idx, sv_path[0], sv_path[-1]) print >> sv_tigs, "".join(subseqs) sv_tig_idx += 1 for u_path, seq in extra_u_edges: #u_path = u_path.split("-") ASM_graph.add_edge(u_path[0], u_path[-1], ctg="%04d" % bundle_index) print >> sv_tig_paths, ">%04d-%04d-u %s" % ( bundle_index, sv_tig_idx, " ".join(u_path) ) print >> sv_tigs, ">%04d-%04d-u %s-%s" % (bundle_index, sv_tig_idx, u_path[0], u_path[-1]) print >> sv_tigs, seq sv_tig_idx += 1 bundle_index += 1 else: bundle_graph_edges = zip(path[:-1],path[1:]) else: bundle_graph_edges = zip(path[:-1],path[1:]) #clean up the graph edges = set(G.edges()) edges_to_be_removed = list(set(bundle_graph_edges)) print "BGE",bundle_graph_edges edge_remove_count = 0 for v, w in edges_to_be_removed: if (v, w) in edges: G.remove_edge( v, w ) edge_remove_count += 1 print "remove edge", w, v edges = set(G.edges()) for v, w in edges_to_be_removed: r_id, end = v.split(":") end = "E" if end == "B" else "B" v = r_id + ":" + end r_id, end = w.split(":") end = "E" if end == "B" else "B" w = r_id + ":" + end if (w, v) in edges: G.remove_edge( w, v ) edge_remove_count += 1 print "remove edge", w, v if edge_remove_count == 0: print "premature termination", len(edges), len(G.nodes()) break nodes = G.nodes() for n in nodes: if G.in_degree(n) == 0 and G.out_degree(n) == 0: G.remove_node(n) print "remove node", n sv_tig_paths.close() sv_tigs.close() main_tig_paths.close() out_f.close() return ASM_graph def SGToNXG(sg): G=nx.DiGraph() max_score = max([ sg.edges[ e ].attr["score"] for e in sg.edges if sg.e_reduce[e] != True ]) out_f = open("edges_list","w") for v, w in sg.edges: if sg.e_reduce[(v, w)] != True: ##if 1: out_degree = len(sg.nodes[v].out_edges) G.add_node( v, size = out_degree ) G.add_node( w, size = out_degree ) label = sg.edges[ (v, w) ].attr["label"] score = sg.edges[ (v, w) ].attr["score"] print >>out_f, v, w, label, score G.add_edge( v, w, label = label, weight = 0.001*score, n_weight = max_score - score ) #print in_node_name, out_node_name out_f.close() return G if __name__ == "__main__": overlap_file = sys.argv[1] read_fasta = sys.argv[2] seqs = {} #f = FastaReader("pre_assembled_reads.fa") f = FastaReader(read_fasta) for r in f: seqs[r.name] = r.sequence.upper() G=nx.Graph() edges =set() overlap_data = [] contained_reads = set() overlap_count = {} with open(overlap_file) as f: for l in f: l = l.strip().split() if len(l) != 13: continue f_id, g_id, score, identity = l[:4] if f_id == g_id: continue if g_id not in seqs: continue if f_id not in seqs: continue score = int(score) identity = float(identity) contained = l[12] if contained == "contained": contained_reads.add(f_id) continue if contained == "contains": contained_reads.add(g_id) continue if contained == "none": continue if identity < 96: continue #if score > -2000: # continue f_strain, f_start, f_end, f_len = (int(c) for c in l[4:8]) 
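# Each overlap record has 13 whitespace-separated columns (an m4-style
# layout, inferred from the unpacking above and below):
#   f_id g_id score identity  f_strand f_start f_end f_len
#   g_strand g_start g_end g_len  contained-flag
# ("strain" in the variable names reads as "strand"); the matching g-read
# fields are unpacked on the next line.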
g_strain, g_start, g_end, g_len = (int(c) for c in l[8:12]) if f_len < 4000: continue if g_len < 4000: continue # double check for proper overlap if f_start > 24 and f_len - f_end > 24: continue if g_start > 24 and g_len - g_end > 24: continue if g_strain == 0: if f_start < 24 and g_len - g_end > 24: continue if g_start < 24 and f_len - f_end > 24: continue else: if f_start < 24 and g_start > 24: continue if g_start < 24 and f_start > 24: continue #if g_strain != 0: # continue overlap_data.append( (f_id, g_id, score, identity, f_strain, f_start, f_end, f_len, g_strain, g_start, g_end, g_len) ) overlap_count[f_id] = overlap_count.get(f_id,0)+1 overlap_count[g_id] = overlap_count.get(g_id,0)+1 overlap_set = set() sg = StringGraph() #G=nx.Graph() for od in overlap_data: f_id, g_id, score, identity = od[:4] if f_id in contained_reads: continue if g_id in contained_reads: continue #if overlap_count.get(f_id, 0) < 3 or overlap_count.get(f_id, 0) > 400: # continue #if overlap_count.get(g_id, 0) < 3 or overlap_count.get(g_id, 0) > 400: # continue f_s, f_b, f_e, f_l = od[4:8] g_s, g_b, g_e, g_l = od[8:12] overlap_pair = [f_id, g_id] overlap_pair.sort() overlap_pair = tuple( overlap_pair ) if overlap_pair in overlap_set: continue else: overlap_set.add(overlap_pair) if g_s == 1: g_b, g_e = g_e, g_b if f_b > 24: if g_b < g_e: """ f.B f.E f -----------> g -------------> g.B g.E """ if f_b == 0 or g_e - g_l == 0: continue sg.add_edge( "%s:B" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), length = abs(f_b-0), score = -score) sg.add_edge( "%s:E" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_e, g_l), length = abs(g_e-g_l), score = -score) else: """ f.B f.E f -----------> g <------------- g.E g.B """ if f_b == 0 or g_e == 0: continue sg.add_edge( "%s:E" % g_id, "%s:B" % f_id, label = "%s:%d-%d" % (f_id, f_b, 0), length = abs(f_b -0), score = -score) sg.add_edge( "%s:E" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, g_e, 0), length = abs(g_e- 0), score = -score) else: if g_b < g_e: """ f.B f.E f -----------> g -------------> g.B g.E """ if g_b == 0 or f_e - f_l == 0: continue sg.add_edge( "%s:B" % f_id, "%s:B" % g_id, label = "%s:%d-%d" % (g_id, g_b, 0), length = abs(g_b - 0), score = -score) sg.add_edge( "%s:E" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), length = abs(f_e-f_l), score = -score) else: """ f.B f.E f -----------> g <------------- g.E g.B """ if g_b - g_l == 0 or f_e - f_l ==0: continue sg.add_edge( "%s:B" % f_id, "%s:E" % g_id, label = "%s:%d-%d" % (g_id, g_b, g_l), length = abs(g_b - g_l), score = -score) sg.add_edge( "%s:B" % g_id, "%s:E" % f_id, label = "%s:%d-%d" % (f_id, f_e, f_l), length = abs(f_e - f_l), score = -score) sg.mark_tr_edges() print sum( [1 for c in sg.e_reduce.values() if c == True] ) print sum( [1 for c in sg.e_reduce.values() if c == False] ) G = SGToNXG(sg) nx.write_adjlist(G, "full_string_graph.adj") sg.mark_best_overlap() print sum( [1 for c in sg.e_reduce.values() if c == False] ) #sg.mark_repeat_overlap() #print sum( [1 for c in sg.repeat_overlap.values() if c == True] ) #print sum( [1 for c in sg.repeat_overlap.values() if c == False] ) #print len(sg.e_reduce), len(sg.repeat_overlap) G = SGToNXG(sg) nx.write_gexf(G, "string_graph.gexf") nx.write_adjlist(G, "string_graph.adj") #generate_max_contig(sg, seqs, out_fn="max_tigs.fa") u_edges = generate_unitig(sg, seqs, out_fn = "unitgs.fa") ASM_graph = get_bundles(u_edges ) nx.write_gexf(ASM_graph, "asm_graph.gexf") 
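# ---------------------------------------------------------------------------
# A minimal, self-contained sketch of the overlap-to-edge rule used in the
# __main__ block above.  The function name `overlap_to_edges` and its return
# format are illustrative assumptions; the four branches mirror the
# sg.add_edge calls above.
def overlap_to_edges(f_id, f_b, f_e, f_l, g_id, g_strand, g_b, g_e, g_l):
    """Return the two string-graph edges implied by one proper overlap.

    Nodes are read ends ("<read>:B" / "<read>:E"); each edge is a tuple
    (in_node, out_node, label), where label = "read:begin-end" names the
    unaligned piece of sequence the edge spans.  Degenerate zero-length
    overhangs are assumed to be filtered out by the caller, as above.
    """
    if g_strand == 1:    # reversed alignment: swap the begin/end coordinates
        g_b, g_e = g_e, g_b
    if f_b > 24:         # f overhangs the overlap on the left
        if g_b < g_e:    # f: ----->  g: ----->  (same direction)
            return [("%s:B" % g_id, "%s:B" % f_id, "%s:%d-%d" % (f_id, f_b, 0)),
                    ("%s:E" % f_id, "%s:E" % g_id, "%s:%d-%d" % (g_id, g_e, g_l))]
        else:            # f: ----->  g: <-----
            return [("%s:E" % g_id, "%s:B" % f_id, "%s:%d-%d" % (f_id, f_b, 0)),
                    ("%s:E" % f_id, "%s:B" % g_id, "%s:%d-%d" % (g_id, g_e, 0))]
    else:                # g overhangs the overlap on the left
        if g_b < g_e:
            return [("%s:B" % f_id, "%s:B" % g_id, "%s:%d-%d" % (g_id, g_b, 0)),
                    ("%s:E" % g_id, "%s:E" % f_id, "%s:%d-%d" % (f_id, f_e, f_l))]
        else:
            return [("%s:B" % f_id, "%s:E" % g_id, "%s:%d-%d" % (g_id, g_b, g_l)),
                    ("%s:B" % g_id, "%s:E" % f_id, "%s:%d-%d" % (f_id, f_e, f_l))]
# For example, a same-strand dovetail where f overhangs on the left
# (f_b = 300 > 24) yields:
#   overlap_to_edges("f", 300, 5000, 5000, "g", 0, 0, 4700, 6000)
#   -> [("g:B", "f:B", "f:300-0"), ("f:E", "g:E", "g:4700-6000")]
# ---------------------------------------------------------------------------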
FALCON-0.1.3/src/py_scripts/falcon_dedup.py000066400000000000000000000070011237512144000204360ustar00rootroot00000000000000
import subprocess
from pbcore.io import FastaReader

def get_matches(seq0, seq1):
    # write both sequences to temporary fasta files and collect the exact
    # match clusters reported by mummer + mgaps
    with open("tmp_seq0.fa", "w") as f:
        print >>f, ">seq0"
        print >>f, seq0
    with open("tmp_seq1.fa", "w") as f:
        print >>f, ">seq1"
        print >>f, seq1
    # /dev/null must be opened for writing to discard stderr; the original
    # omitted the mode
    mgaps_out = subprocess.check_output(
        "mummer -maxmatch -c -b -l 24 tmp_seq0.fa tmp_seq1.fa | mgaps ",
        stderr=open("/dev/null", "w"), shell=True)

    matches = []
    cluster = []
    for l in mgaps_out.split("\n"):
        l = l.strip().split()
        if len(l) == 0:
            continue
        if l[0] == ">":
            seq_id = l[1]
            if len(cluster) != 0:
                matches.append(cluster)
            cluster = []
            continue
        if l[0] == "#":
            if len(cluster) != 0:
                matches.append(cluster)
            cluster = []
            continue
        len_ = int(l[2])
        r_s = int(l[0])
        q_s = int(l[1])
        r_e = r_s + len_
        q_e = q_s + len_
        cluster.append( ((r_s, r_e), (q_s, q_e)) )
    if len(cluster) != 0:
        matches.append(cluster)
    return matches

u_edges = {}
with open("./unit_edges.dat") as f:
    for l in f:
        v, w, path, seq = l.strip().split()
        u_edges.setdefault( (v, w), [] )
        u_edges[ (v, w) ].append( (path, seq) )

p_tig_path = {}
a_tig_path = {}
with open("primary_tigs_paths_c") as f:
    for l in f:
        l = l.strip().split()
        id_ = l[0][1:]
        path = l[1:]
        p_tig_path[id_] = path

with open("all_tigs_paths") as f:
    for l in f:
        l = l.strip().split()
        id_ = l[0][1:]
        path = l[1:]
        a_tig_path[id_] = path

p_tig_seqs = {}
for r in FastaReader("primary_tigs_c.fa"):
    p_tig_seqs[r.name] = r.sequence

a_tig_seqs = {}
for r in FastaReader("all_tigs.fa"):
    a_tig_seqs[r.name.split()[0]] = r.sequence

p_tig_to_node_pos = {}
node_pos = []
with open("primary_tigs_node_pos_c") as f:
    for l in f:
        l = l.strip().split()
        p_tig_to_node_pos.setdefault( l[0], [] )
        p_tig_to_node_pos[l[0]].append( (l[1], int(l[2])) )

duplicate_a_tigs = []
with open("a_nodup.fa", "w") as out_f:
    for p_tig_id in p_tig_path:
        main_path = p_tig_path[p_tig_id]
        main_path_nodes = set(main_path[:])
        p_tig_seq = p_tig_seqs[p_tig_id]
        a_node = []
        a_node_range = []
        a_node_range_map = {}
        node_to_pos = dict( p_tig_to_node_pos[p_tig_id] )
        for id_ in a_tig_path:
            if id_[:4] != p_tig_id[:4]:
                continue
            if id_.split("-")[1] == "0000":
                continue
            a_path = a_tig_path[id_]
            if a_path[0] in main_path_nodes and a_path[-1] in main_path_nodes:
                s, e = node_to_pos[a_path[0]], node_to_pos[a_path[-1]]
                p_seq = p_tig_seq[s:e]
                a_seq = a_tig_seqs[id_]
                seq_match = get_matches(p_seq, a_seq)
                if len(seq_match) > 1:
                    # more than one match cluster: the associated contig is
                    # structurally different from the primary span, keep it
                    print >>out_f, ">" + id_
                    print >>out_f, a_seq
                    continue
                try:
                    r_s, r_e = seq_match[0][0][0][0], seq_match[0][-1][0][1]
                except:
                    # malformed match list: report and skip it (the original
                    # fell through here with r_s/r_e possibly unbound)
                    print "XXX", seq_match
                    continue
                if 1.0 * (r_e - r_s) / (e - s) < 0.98:
                    # the match covers less than 98% of the primary span, so
                    # the associated contig is not a duplicate; the original
                    # compared this 0-1 ratio against 98, which never holds
                    print >>out_f, ">" + id_
                    print >>out_f, a_seq
                    continue
                duplicate_a_tigs.append(id_)
FALCON-0.1.3/src/py_scripts/falcon_fixasm.py000066400000000000000000000170011237512144000206240ustar00rootroot00000000000000
#!/usr/bin/env python
#################################################################################$$
# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
#
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted (subject to the limitations in the
# disclaimer below) provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
# # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ import networkx as nx from pbcore.io import FastaReader def neighbor_bound(G, v, w, radius): g1 = nx.ego_graph(G, v, radius=radius, undirected=False) g2 = nx.ego_graph(G, w, radius=radius, undirected=False) if len(g1) < radius or len(g2) < radius: return True print v, len(g1), w, len(g2), radius if len(set(g1.edges()) & set(g2.edges())) > 0: return True else: return False def is_branch_node(G, n): out_edges = G.out_edges([n]) n2 = [ e[1] for e in out_edges ] is_branch = False for i in range(len(n2)): for j in range(i+1, len(n2)): v = n2[i] w = n2[j] if neighbor_bound(G, v, w, 20) == False: is_branch = True break if is_branch == True: break return is_branch def get_r_path(r_edges, u_path): tiling_path = [] pos = 0 for i in range( len(u_path) - 1): v, w = u_path[i:i+2] r_edge_label, overlap = r_edges[ (v, w) ] r_edge_seq_id, range_ = r_edge_label.split(":") range_ = range_.split("-") s, e = int(range_[0]), int(range_[1]) pos += abs(e-s) tiling_path.append( (pos, w, s, e) ) return tiling_path def get_seq(u_edges, r_edges, path): subseqs = [] pos = [] cur_pos = 0 full_tiling_path = [] for i in range( len(path) - 1): v, w = path[i:i+2] pos.append( (v, cur_pos) ) uedges = u_edges[ (v, w) ] uedges.sort( key= lambda x: len(x[0]) ) subseqs.append( uedges[-1][1] ) r_path = get_r_path( r_edges, uedges[-1][0].split("-") ) r_path = [ ( x[0] + cur_pos, x[1], x[2], x[3]) for x in r_path ] full_tiling_path.extend( r_path ) cur_pos += len( uedges[-1][1] ) pos.append( (w, cur_pos) ) return "".join(subseqs), pos, full_tiling_path u_edges = {} with open("unit_edges.dat") as f: for l in f: v, w, path, seq = l.strip().split() u_edges.setdefault( (v, w), [] ) u_edges[ (v, w) ].append( (path, seq) ) len(u_edges) r_edges = {} with open("edges_list") as f: for l in f: v, w, edge_label, overlap = l.strip().split() r_edges[ (v, w) ] = (edge_label, int(overlap) ) primary_tigs_path = {} primary_path_graph = nx.DiGraph() begin_nodes = {} end_nodes ={} with open("primary_tigs_paths") as f: for l in f: l = l.strip().split() name = l[0][1:] path = l[1:] primary_tigs_path[name] = path if len(path) < 3: continue for i in range(len(path)-1): n1 = path[i].split(":")[0] n2 = 
path[i+1].split(":")[0] primary_path_graph.add_edge( n1, n2) begin_nodes.setdefault(path[0], []) begin_nodes[path[0]].append( name ) end_nodes.setdefault(path[-1], []) end_nodes[path[-1]].append( name ) path_names = primary_tigs_path.keys() path_names.sort() primary_path_graph_r = primary_path_graph.reverse() path_f = open("primary_tigs_paths_c","w") pos_f = open("primary_tigs_node_pos_c", "w") tiling_path_f = open("all_tiling_path_c", "w") with open("primary_tigs_c.fa","w") as out_f: for name in path_names: sub_idx = 0 c_path = [ primary_tigs_path[name][0] ] for v in primary_tigs_path[name][1:]: break_path = False vn = v.split(":")[0] if primary_path_graph.out_degree(vn) > 1: break_path = is_branch_node(primary_path_graph, vn) if primary_path_graph.in_degree(vn) > 1: break_path = is_branch_node(primary_path_graph_r, vn) if break_path: c_path.append(v) seq, pos, full_tiling_path = get_seq(u_edges, r_edges, c_path) for p, w, s, e in full_tiling_path: print >> tiling_path_f, "%s_%02d" % (name, sub_idx), p, w, s, e if len(full_tiling_path) <= 5: continue print >>out_f, ">%s_%02d" % (name, sub_idx) print >>out_f, seq print >>path_f, ">%s_%02d" % (name, sub_idx), " ".join(c_path) #print c_path for node, p in pos: print >> pos_f, "%s_%02d %s %d" % (name, sub_idx, node, p) c_path = [v] sub_idx += 1 else: c_path.append(v) if len(c_path) > 1: seq, pos, full_tiling_path = get_seq(u_edges, r_edges, c_path) for p, w, s, e in full_tiling_path: print >> tiling_path_f, "%s_%02d" % (name, sub_idx), p, w, s, e if len(full_tiling_path) <= 5: continue print >>out_f, ">%s_%02d" % (name, sub_idx) print >>out_f, seq print >>path_f, ">%s_%02d" % (name, sub_idx), " ".join(c_path) for node, p in pos: print >> pos_f, "%s_%02d %s %d" % (name, sub_idx, node, p) with open("all_tigs_paths") as f: for l in f: l = l.strip().split() name = l[0][1:] name = name.split("-") if name[1] == "0000": continue if len(name) == 2: path = l[1:] seq, pos, full_tiling_path = get_seq(u_edges, r_edges, path) for p, w, s, e in full_tiling_path: print >> tiling_path_f, "%s" % ("-".join(name)), p, w, s, e else: path = l[1:] full_tiling_path = get_r_path(r_edges, path) for p, w, s, e in full_tiling_path: print >> tiling_path_f, "%s" % ("-".join(name)), p, w, s, e path_f.close() tiling_path_f.close() pos_f.close() FALCON-0.1.3/src/py_scripts/falcon_overlap.py000077500000000000000000000265541237512144000210250ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. 
THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ from falcon_kit import * from pbcore.io import FastaReader import numpy as np import collections import sys import multiprocessing as mp from multiprocessing import sharedctypes from ctypes import * global sa_ptr, sda_ptr, lk_ptr global q_seqs, seqs RC_MAP = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") ) def get_ovelap_alignment(seq1, seq0): K = 8 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) sa_ptr = kup.allocate_seq( len(seq0) ) sda_ptr = kup.allocate_seq_addr( len(seq0) ) kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr) kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] aln_range_ptr = kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50) #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] ) aln_range = aln_range_ptr[0] kup.free_kmer_match(kmer_match_ptr) s1, e1, s0, e0 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2 e1 += K + K/2 e0 += K + K/2 kup.free_aln_range(aln_range) len_1 = len(seq1) len_0 = len(seq0) if e1 > len_1: e1 = len_1 if e0 > len_0: e0 = len_0 do_aln = False contain_status = "none" #print s0, e0, s1, e1 if e1 - s1 > 500: if s0 < s1 and s0 > 24: do_aln = False elif s1 <= s0 and s1 > 24: do_aln = False elif s1 < 24 and len_1 - e1 < 24: do_aln = True contain_status = "contains" #print "X1" elif s0 < 24 and len_0 - e0 < 24: do_aln = True contain_status = "contained" #print "X2" else: do_aln = True if s0 < s1: s1 -= s0 #assert s1 > 0 s0 = 0 e1 = len_1 #if len_1 - s1 >= len_0: # do_aln = False # contain_status = "contains" # print "X3", s0, e0, len_0, s1, e1, len_1 elif s1 <= s0: s0 -= s1 #assert s1 > 0 s1 = 0 e0 = len_0 #print s0, e0, s1, e1 #if len_0 - s0 >= len_1: # do_aln = False # contain_status = "contained" # print "X4" #if abs( (e1 - s1) - (e0 - s0 ) ) > 200: #avoid overlap alignment for big indels # do_aln = False if do_aln: alignment = DWA.align(seq1[s1:e1], e1-s1, seq0[s0:e0], e0-s0, 500, 0) #print seq1[s1:e1] #print seq0[s2:e2] #if alignment[0].aln_str_size > 500: #aln_str1 = alignment[0].q_aln_str #aln_str0 = alignment[0].t_aln_str aln_size = alignment[0].aln_str_size aln_dist = alignment[0].dist aln_q_s = alignment[0].aln_q_s aln_q_e = alignment[0].aln_q_e aln_t_s = alignment[0].aln_t_s aln_t_e = alignment[0].aln_t_e assert aln_q_e- aln_q_s <= alignment[0].aln_str_size or aln_t_e- aln_t_s <= alignment[0].aln_str_size #print aln_str1 #print aln_str0 if aln_size > 500 and contain_status == "none": contain_status = "overlap" DWA.free_alignment(alignment) kup.free_seq_addr_array(sda_ptr) kup.free_seq_array(sa_ptr) kup.free_kmer_lookup(lk_ptr) if do_aln: if s1 > 1000 and s0 > 1000: 
return 0, 0, 0, 0, 0, 0, "none" if len_1 - (s1+aln_q_e-aln_q_s) > 1000 and len_0 - (s0+aln_t_e-aln_t_s) > 1000: return 0, 0, 0, 0, 0, 0, "none" if e1 - s1 > 500 and do_aln and aln_size > 500: #return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist, x, y return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_dist, contain_status else: return 0, 0, 0, 0, 0, 0, contain_status def get_candidate_aln(hit_input): global q_seqs q_name, hit_index_f, hit_index_r = hit_input q_seq = q_seqs[q_name] rtn = [] hit_index = hit_index_f c = collections.Counter(hit_index) s = [c[0] for c in c.items() if c[1] >50] #s.sort() targets = set() for p in s: hit_id = seqs[p][0] if hit_id in targets or hit_id == q_name: continue targets.add(hit_id) seq1, seq0 = q_seq, q_seqs[hit_id] aln_data = get_ovelap_alignment(seq1, seq0) #rtn = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data #print >>f, name, 0, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0), aln_size, aln_dist rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 0, s1, e1, len(seq1), c_status ) ) r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]]) hit_index = hit_index_r c = collections.Counter(hit_index) s = [c[0] for c in c.items() if c[1] >50] #s.sort() targets = set() for p in s: hit_id = seqs[p][0] if hit_id in targets or hit_id == q_name: continue targets.add(hit_id) seq1, seq0 = r_q_seq, q_seqs[hit_id] aln_data = get_ovelap_alignment(seq1, seq0) #rtn = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data #print >>f, name, 1, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0), aln_size, aln_dist rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 1, len(seq1) - e1, len(seq1)- s1, len(seq1), c_status ) ) return rtn def build_look_up(seqs, K): global sa_ptr, sda_ptr, lk_ptr total_index_base = len(seqs) * 1000 sa_ptr = sharedctypes.RawArray(base_t, total_index_base) c_sa_ptr = cast(sa_ptr, POINTER(base_t)) kup.init_seq_array(c_sa_ptr, total_index_base) sda_ptr = sharedctypes.RawArray(seq_coor_t, total_index_base) c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t)) lk_ptr = sharedctypes.RawArray(KmerLookup, 1 << (K*2)) c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup)) kup.init_kmer_lookup(c_lk_ptr, 1 << (K*2)) start = 0 for r_name, seq in seqs: kup.add_sequence( start, K, seq, 1000, c_sda_ptr, c_sa_ptr, c_lk_ptr) start += 1000 kup.mask_k_mer(1 << (K * 2), c_lk_ptr, 512) #return sda_ptr, sa_ptr, lk_ptr def get_candidate_hits(q_name): global sa_ptr, sda_ptr, lk_ptr global q_seqs K = 14 q_seq = q_seqs[q_name] rtn = [] c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t)) c_sa_ptr = cast(sa_ptr, POINTER(base_t)) c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup)) kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, c_sda_ptr, c_lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index_f = np.array(kmer_match.target_pos[0:count])/1000 kup.free_kmer_match(kmer_match_ptr) r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]]) kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, c_sda_ptr, c_lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index_r = np.array(kmer_match.target_pos[0:count])/1000 kup.free_kmer_match(kmer_match_ptr) return q_name, hit_index_f, hit_index_r def q_names( q_seqs ): for q_name, q_seq in q_seqs.items(): yield q_name def lookup_data_iterator( q_seqs, 
m_pool ): for mr in m_pool.imap( get_candidate_hits, q_names(q_seqs)): yield mr if __name__ == "__main__": import argparse parser = argparse.ArgumentParser(description='a simple multi-processor overlapper for sequence reads') parser.add_argument('fasta_file', help='a fasta file for all pairwise overlapping of the reads within') parser.add_argument('--min_len', type=int, default=4000, help='minimum length of the reads to be considered for overlapping') parser.add_argument('--n_core', type=int, default=1, help='number of processes used for detailed overlapping evalution') parser.add_argument('--d_core', type=int, default=1, help='number of processes used for k-mer matching') args = parser.parse_args() seqs = [] q_seqs = {} f = FastaReader(args.fasta_file) # take one commnad line argument of the input fasta file name if args.min_len < 2200: args.min_len = 2200 idx = 0 for r in f: if len(r.sequence) < args.min_len: continue seq = r.sequence.upper() for start in range(0, len(seq), 1000): if start+1000 > len(seq): break seqs.append( (r.name, seq[start: start+1000]) ) idx += 1 #seqs.append( (r.name, seq[:1000]) ) seqs.append( (r.name, seq[-1000:]) ) idx += 1 q_seqs[r.name] = seq total_index_base = len(seqs) * 1000 pool = mp.Pool(args.n_core) K = 14 build_look_up(seqs, K) m_pool = mp.Pool(args.d_core) #for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs)): for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs, m_pool)): for h in r: print " ".join([str(x) for x in h]) FALCON-0.1.3/src/py_scripts/falcon_overlap2.py000077500000000000000000000274411237512144000211030ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
#################################################################################$$ from falcon_kit import * from pbcore.io import FastaReader import numpy as np import collections import sys import multiprocessing as mp from multiprocessing import sharedctypes from ctypes import * global sa_ptr, sda_ptr, lk_ptr global q_seqs,t_seqs, seqs RC_MAP = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") ) def get_ovelap_alignment(seq1, seq0): K = 8 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) sa_ptr = kup.allocate_seq( len(seq0) ) sda_ptr = kup.allocate_seq_addr( len(seq0) ) kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr) kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] aln_range_ptr = kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50) #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] ) aln_range = aln_range_ptr[0] kup.free_kmer_match(kmer_match_ptr) s1, e1, s0, e0 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2 e1 += K + K/2 e0 += K + K/2 kup.free_aln_range(aln_range) len_1 = len(seq1) len_0 = len(seq0) if e1 > len_1: e1 = len_1 if e0 > len_0: e0 = len_0 do_aln = False contain_status = "none" #print s0, e0, s1, e1 if e1 - s1 > 500: if s0 < s1 and s0 > 24: do_aln = False elif s1 <= s0 and s1 > 24: do_aln = False elif s1 < 24 and len_1 - e1 < 24: do_aln = True contain_status = "contains" #print "X1" elif s0 < 24 and len_0 - e0 < 24: do_aln = True contain_status = "contained" #print "X2" else: do_aln = True if s0 < s1: s1 -= s0 #assert s1 > 0 s0 = 0 e1 = len_1 #if len_1 - s1 >= len_0: # do_aln = False # contain_status = "contains" # print "X3", s0, e0, len_0, s1, e1, len_1 elif s1 <= s0: s0 -= s1 #assert s1 > 0 s1 = 0 e0 = len_0 #print s0, e0, s1, e1 #if len_0 - s0 >= len_1: # do_aln = False # contain_status = "contained" # print "X4" #if abs( (e1 - s1) - (e0 - s0 ) ) > 200: #avoid overlap alignment for big indels # do_aln = False if do_aln: alignment = DWA.align(seq1[s1:e1], e1-s1, seq0[s0:e0], e0-s0, 500, 0) #print seq1[s1:e1] #print seq0[s2:e2] #if alignment[0].aln_str_size > 500: #aln_str1 = alignment[0].q_aln_str #aln_str0 = alignment[0].t_aln_str aln_size = alignment[0].aln_str_size aln_dist = alignment[0].dist aln_q_s = alignment[0].aln_q_s aln_q_e = alignment[0].aln_q_e aln_t_s = alignment[0].aln_t_s aln_t_e = alignment[0].aln_t_e assert aln_q_e- aln_q_s <= alignment[0].aln_str_size or aln_t_e- aln_t_s <= alignment[0].aln_str_size #print aln_str1 #print aln_str0 if aln_size > 500 and contain_status == "none": contain_status = "overlap" DWA.free_alignment(alignment) kup.free_seq_addr_array(sda_ptr) kup.free_seq_array(sa_ptr) kup.free_kmer_lookup(lk_ptr) if do_aln: if s1 > 1000 and s0 > 1000: return 0, 0, 0, 0, 0, 0, "none" if len_1 - (s1+aln_q_e-aln_q_s) > 1000 and len_0 - (s0+aln_t_e-aln_t_s) > 1000: return 0, 0, 0, 0, 0, 0, "none" if e1 - s1 > 500 and do_aln and aln_size > 500: #return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist, x, y return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_dist, contain_status else: return 0, 0, 0, 0, 0, 0, contain_status def get_candidate_aln(hit_input): global q_seqs, seqs, t_seqs q_name, hit_index_f, hit_index_r = hit_input q_seq = q_seqs[q_name] rtn = [] hit_index = hit_index_f c = collections.Counter(hit_index) s = [c[0] for c in c.items() if c[1] >50] #s.sort() targets = set() for p in s: hit_id = seqs[p][0] if hit_id in targets or hit_id == q_name: continue targets.add(hit_id) 
seq1, seq0 = q_seq, t_seqs[hit_id] aln_data = get_ovelap_alignment(seq1, seq0) #rtn = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data if c_status == "none": continue #print >>f, name, 0, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0), aln_size, aln_dist rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 0, s1, e1, len(seq1), c_status ) ) r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]]) hit_index = hit_index_r c = collections.Counter(hit_index) s = [c[0] for c in c.items() if c[1] >50] #s.sort() targets = set() for p in s: hit_id = seqs[p][0] if hit_id in targets or hit_id == q_name: continue targets.add(hit_id) seq1, seq0 = r_q_seq, t_seqs[hit_id] aln_data = get_ovelap_alignment(seq1, seq0) #rtn = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_dist, c_status = aln_data if c_status == "none": continue #print >>f, name, 1, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0), aln_size, aln_dist rtn.append( ( hit_id, q_name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 1, len(seq1) - e1, len(seq1)- s1, len(seq1), c_status ) ) return rtn def build_look_up(seqs, K): global sa_ptr, sda_ptr, lk_ptr total_index_base = len(seqs) * 1000 sa_ptr = sharedctypes.RawArray(base_t, total_index_base) c_sa_ptr = cast(sa_ptr, POINTER(base_t)) kup.init_seq_array(c_sa_ptr, total_index_base) sda_ptr = sharedctypes.RawArray(seq_coor_t, total_index_base) c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t)) lk_ptr = sharedctypes.RawArray(KmerLookup, 1 << (K*2)) c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup)) kup.init_kmer_lookup(c_lk_ptr, 1 << (K*2)) start = 0 for r_name, seq in seqs: kup.add_sequence( start, K, seq, 1000, c_sda_ptr, c_sa_ptr, c_lk_ptr) start += 1000 kup.mask_k_mer(1 << (K * 2), c_lk_ptr, 256) #return sda_ptr, sa_ptr, lk_ptr def get_candidate_hits(q_name): global sa_ptr, sda_ptr, lk_ptr global q_seqs K = 14 q_seq = q_seqs[q_name] rtn = [] c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t)) c_sa_ptr = cast(sa_ptr, POINTER(base_t)) c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup)) kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, c_sda_ptr, c_lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index_f = np.array(kmer_match.target_pos[0:count])/1000 kup.free_kmer_match(kmer_match_ptr) r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]]) kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, c_sda_ptr, c_lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index_r = np.array(kmer_match.target_pos[0:count])/1000 kup.free_kmer_match(kmer_match_ptr) return q_name, hit_index_f, hit_index_r def q_names( q_seqs ): for q_name, q_seq in q_seqs.items(): yield q_name def lookup_data_iterator( q_seqs, m_pool ): for mr in m_pool.imap( get_candidate_hits, q_names(q_seqs)): yield mr if __name__ == "__main__": import argparse parser = argparse.ArgumentParser(description='a simple multi-processor overlapper for sequence reads') parser.add_argument('query_fa', help='a fasta file to be overlapped with sequence in target') parser.add_argument('target_fa', help='a fasta file as the target sequences for overlapping') parser.add_argument('--min_len', type=int, default=4000, help='minimum length of the reads to be considered for overlapping') parser.add_argument('--n_core', type=int, default=1, help='number of processes used for detailed overlapping evalution') parser.add_argument('--d_core', 
type=int, default=1, help='number of processes used for k-mer matching')
    args = parser.parse_args()

    seqs = []
    q_seqs = {}
    t_seqs = {}
    # read the target sequences; each read is indexed in 1 kb chunks for
    # k-mer matching
    f = FastaReader(args.target_fa)
    if args.min_len < 2200:
        args.min_len = 2200
    idx = 0
    for r in f:
        if len(r.sequence) < args.min_len:
            continue
        seq = r.sequence.upper()
        for start in range(0, len(seq), 1000):
            if start + 1000 > len(seq):
                break
            seqs.append( (r.name, seq[start: start+1000]) )
            idx += 1
        seqs.append( (r.name, seq[-1000:]) )
        idx += 1
        t_seqs[r.name] = seq

    # read the query sequences to be overlapped with the targets
    f = FastaReader(args.query_fa)
    for r in f:
        if len(r.sequence) < args.min_len:
            continue
        seq = r.sequence.upper()
        q_seqs[r.name] = seq

    total_index_base = len(seqs) * 1000
    pool = mp.Pool(args.n_core)
    K = 14
    build_look_up(seqs, K)
    m_pool = mp.Pool(args.d_core)

    #for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs)):
    for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs, m_pool)):
        for h in r:
            print " ".join([str(x) for x in h])
FALCON-0.1.3/src/py_scripts/falcon_qrm.py000077500000000000000000000300541237512144000201420ustar00rootroot00000000000000
#!/usr/bin/env python
#################################################################################$$
# Copyright (c) 2011-2014, Pacific Biosciences of California, Inc.
#
# All rights reserved.
#
# Redistribution and use in source and binary forms, with or without
# modification, are permitted (subject to the limitations in the
# disclaimer below) provided that the following conditions are met:
#
# * Redistributions of source code must retain the above copyright
# notice, this list of conditions and the following disclaimer.
#
# * Redistributions in binary form must reproduce the above
# copyright notice, this list of conditions and the following
# disclaimer in the documentation and/or other materials provided
# with the distribution.
#
# * Neither the name of Pacific Biosciences nor the names of its
# contributors may be used to endorse or promote products derived
# from this software without specific prior written permission.
#
# NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE
# GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC
# BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED
# WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES
# OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
# DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS
# CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
# SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
# LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF
# USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND
# ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY,
# OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
# OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
# SUCH DAMAGE.
#################################################################################$$ from falcon_kit import * from pbcore.io import FastaReader import numpy as np import collections import sys import multiprocessing as mp from multiprocessing import sharedctypes from ctypes import * import math global sa_ptr, sda_ptr, lk_ptr global q_seqs,t_seqs, seqs global n_candidates, max_candidates seqs = [] RC_MAP = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") ) all_fivemers = [] cmap = {0:"A", 1:"T", 2:"C", 3:"G"} for i in range(1024): mer = [] for j in range(5): mer.append( cmap[ i >> (2 *j) & 3 ]) all_fivemers.append("".join(mer)) def fivemer_entropy(seq): five_mer_count = {} for i in range(len(seq)-5): five_mer = seq[i:i+5] five_mer_count.setdefault(five_mer, 0) five_mer_count[five_mer] += 1 entropy = 0.0 for five_mer in all_fivemers: p = five_mer_count.get(five_mer, 0) + 1.0 p /= len(seq) entropy += - p * math.log(p) return entropy def get_alignment(seq1, seq0): K = 8 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) sa_ptr = kup.allocate_seq( len(seq0) ) sda_ptr = kup.allocate_seq_addr( len(seq0) ) kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr) kup.mask_k_mer(1 << (K * 2), lk_ptr, 16) kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] aln_range_ptr = kup.find_best_aln_range2(kmer_match_ptr, K, K*50, 25) #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] ) aln_range = aln_range_ptr[0] kup.free_kmer_match(kmer_match_ptr) s1, e1, s0, e0, km_score = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2, aln_range.score e1 += K + K/2 e0 += K + K/2 kup.free_aln_range(aln_range) len_1 = len(seq1) len_0 = len(seq0) if e1 > len_1: e1 = len_1 if e0 > len_0: e0 = len_0 aln_size = 1 if e1 - s1 > 500: aln_size = max( e1-s1, e0-s0 ) aln_score = int(km_score * 48) aln_q_s = s1 aln_q_e = e1 aln_t_s = s0 aln_t_e = e0 kup.free_seq_addr_array(sda_ptr) kup.free_seq_array(sa_ptr) kup.free_kmer_lookup(lk_ptr) if s1 > 1000 and s0 > 1000: return 0, 0, 0, 0, 0, 0, "none" if len_1 - e1 > 1000 and len_0 - e0 > 1000: return 0, 0, 0, 0, 0, 0, "none" if e1 - s1 > 500 and aln_size > 500: return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_score, "aln" else: return 0, 0, 0, 0, 0, 0, "none" def get_candidate_aln(hit_input): global q_seqs, seqs, t_seqs, q_len global max_candidates global n_candidates q_name, hit_index_f, hit_index_r = hit_input q_seq = q_seqs[q_name] rtn = [] hit_index = hit_index_f c = collections.Counter(hit_index) s = [(c[0],c[1]) for c in c.items() if c[1] > 4] hit_data = {} #hit_ids = set() for p, hit_count in s: hit_id = seqs[p][0] hit_data.setdefault(hit_id, [0, 0 ,0]) hit_data[hit_id][0] += hit_count; if hit_count > hit_data[hit_id][1]: hit_data[hit_id][1] = hit_count hit_data[hit_id][2] += 1 hit_data = hit_data.items() hit_data.sort( key=lambda x:-x[1][0] ) target_count = {} total_hit = 0 for hit in hit_data[:n_candidates]: hit_id = hit[0] hit_count = hit[1][0] target_count.setdefault(hit_id, 0) if target_count[hit_id] > max_candidates: continue if total_hit > max_candidates: continue seq1, seq0 = q_seq, t_seqs[hit_id] aln_data = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_score, c_status = aln_data if c_status == "none": continue target_count[hit_id] += 1 total_hit += 1 rtn.append( ( q_name, hit_id, -aln_score, "%0.2f" % (100.0*aln_score/(aln_size+1)), 0, s1, e1, len(seq1), 0, s2, e2, len(seq0), c_status + " %d" % hit_count ) ) 
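# The block above ranks the forward-strand k-mer hits by their total hit
# count, considers at most n_candidates of them, caps the alignments kept
# per target and in total at max_candidates, and keeps only candidates that
# get_alignment() confirms (c_status != "none"). The block below repeats
# the same ranking and alignment pass on the reverse-complemented query
# (r_q_seq); a confirmed reverse hit is emitted with strand flag 1 and its
# query interval flipped back to forward-strand coordinates as
# (len(seq1) - e1, len(seq1) - s1).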
r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]]) hit_index = hit_index_r c = collections.Counter(hit_index) s = [(c[0],c[1]) for c in c.items() if c[1] > 4] hit_data = {} #hit_ids = set() for p, hit_count in s: hit_id = seqs[p][0] hit_data.setdefault(hit_id, [0, 0 ,0]) hit_data[hit_id][0] += hit_count; if hit_count > hit_data[hit_id][1]: hit_data[hit_id][1] = hit_count hit_data[hit_id][2] += 1 hit_data = hit_data.items() hit_data.sort( key=lambda x:-x[1][0] ) target_count = {} total_hit = 0 for hit in hit_data[:n_candidates]: hit_id = hit[0] hit_count = hit[1][0] target_count.setdefault(hit_id, 0) if target_count[hit_id] > max_candidates: continue if total_hit > max_candidates: continue seq1, seq0 = r_q_seq, t_seqs[hit_id] aln_data = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_score, c_status = aln_data if c_status == "none": continue target_count[hit_id] += 1 total_hit += 1 rtn.append( ( q_name, hit_id, -aln_score, "%0.2f" % (100.0*aln_score/(aln_size+1)), 0, len(seq1) - e1, len(seq1) - s1, len(seq1), 1, s2, e2, len(seq0), c_status + " %d" % hit_count ) ) return rtn def build_look_up(seqs, K): global sa_ptr, sda_ptr, lk_ptr total_index_base = len(seqs) * 1000 sa_ptr = sharedctypes.RawArray(base_t, total_index_base) c_sa_ptr = cast(sa_ptr, POINTER(base_t)) kup.init_seq_array(c_sa_ptr, total_index_base) sda_ptr = sharedctypes.RawArray(seq_coor_t, total_index_base) c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t)) lk_ptr = sharedctypes.RawArray(KmerLookup, 1 << (K*2)) c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup)) kup.init_kmer_lookup(c_lk_ptr, 1 << (K*2)) start = 0 for r_name, seq in seqs: kup.add_sequence( start, K, seq, 1000, c_sda_ptr, c_sa_ptr, c_lk_ptr) start += 1000 kup.mask_k_mer(1 << (K * 2), c_lk_ptr, 1024) #return sda_ptr, sa_ptr, lk_ptr def get_candidate_hits(q_name): global sa_ptr, sda_ptr, lk_ptr global q_seqs K = 14 q_seq = q_seqs[q_name] rtn = [] c_sda_ptr = cast(sda_ptr, POINTER(seq_coor_t)) c_sa_ptr = cast(sa_ptr, POINTER(base_t)) c_lk_ptr = cast(lk_ptr, POINTER(KmerLookup)) kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, c_sda_ptr, c_lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index_f = np.array(kmer_match.target_pos[0:count])/1000 kup.free_kmer_match(kmer_match_ptr) r_q_seq = "".join([RC_MAP[c] for c in q_seq[::-1]]) kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, c_sda_ptr, c_lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index_r = np.array(kmer_match.target_pos[0:count])/1000 kup.free_kmer_match(kmer_match_ptr) return q_name, hit_index_f, hit_index_r def q_names( q_seqs ): for q_name, q_seq in q_seqs.items(): yield q_name def lookup_data_iterator( q_seqs, m_pool ): for mr in m_pool.imap( get_candidate_hits, q_names(q_seqs)): yield mr if __name__ == "__main__": import argparse parser = argparse.ArgumentParser(description='a simple multi-processor overlapper for sequence reads') parser.add_argument('target_fofn', help='a fasta fofn as the target sequences for overlapping') parser.add_argument('query_fofn', help='a fasta fofn to be overlapped with sequence in target') parser.add_argument('--min_len', type=int, default=4000, help='minimum length of the reads to be considered for overlapping') parser.add_argument('--n_core', type=int, default=1, help='number of processes used for detailed overlapping evalution') parser.add_argument('--d_core', type=int, default=1, help='number of processes used for k-mer matching') parser.add_argument('--n_candidates', type=int, 
default=128, help='number of candidates for read matching') parser.add_argument('--max_candidates', type=int, default=64, help='maximum number of read matches to output') args = parser.parse_args() max_candidates = args.max_candidates n_candidates = args.n_candidates q_seqs = {} t_seqs = {} if args.min_len < 1200: args.min_len = 1200 with open(args.target_fofn) as fofn: for fn in fofn: fn = fn.strip() f = FastaReader(fn) # each line of the fofn is the path of an input fasta file for r in f: if len(r.sequence) < args.min_len: continue seq = r.sequence.upper() for start in range(0, len(seq), 1000): if start+1000 > len(seq): break subseq = seq[start: start+1000] #if fivemer_entropy(subseq) < 4: # continue seqs.append( (r.name, subseq) ) subseq = seq[-1000:] #if fivemer_entropy(subseq) < 4: # continue #seqs.append( (r.name, seq[:1000]) ) seqs.append( (r.name, subseq) ) t_seqs[r.name] = seq with open(args.query_fofn) as fofn: for fn in fofn: fn = fn.strip() f = FastaReader(fn) # each line of the fofn is the path of an input fasta file for r in f: seq = r.sequence.upper() #if fivemer_entropy(seq) < 4: # continue q_seqs[r.name] = seq pool = mp.Pool(args.n_core) K = 14 build_look_up(seqs, K) m_pool = mp.Pool(args.d_core) #for r in pool.imap(get_candidate_aln, lookup_data_iterator( q_seqs)): for r in pool.imap(get_candidate_aln, lookup_data_iterator(q_seqs, m_pool)): for h in r: print " ".join([str(x) for x in h]) FALCON-0.1.3/src/py_scripts/falcon_sense.py000066400000000000000000000234661237512144000204640ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE.
#################################################################################$$ from ctypes import * import sys from multiprocessing import Pool import os import falcon_kit module_path = falcon_kit.__path__[0] falcon = CDLL(os.path.join(module_path, "falcon.so")) falcon.generate_consensus.argtypes = [ POINTER(c_char_p), c_uint, c_uint, c_uint, c_uint, c_uint, c_double ] falcon.generate_consensus.restype = POINTER(falcon_kit.ConsensusData) falcon.free_consensus_data.argtypes = [ POINTER(falcon_kit.ConsensusData) ] def get_alignment(seq1, seq0, edge_tolerance = 1000): kup = falcon_kit.kup K = 8 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) sa_ptr = kup.allocate_seq( len(seq0) ) sda_ptr = kup.allocate_seq_addr( len(seq0) ) kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr) kup.mask_k_mer(1 << (K * 2), lk_ptr, 16) kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] aln_range_ptr = kup.find_best_aln_range2(kmer_match_ptr, K, K*50, 25) #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] ) aln_range = aln_range_ptr[0] kup.free_kmer_match(kmer_match_ptr) s1, e1, s0, e0, km_score = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2, aln_range.score e1 += K + K/2 e0 += K + K/2 kup.free_aln_range(aln_range) len_1 = len(seq1) len_0 = len(seq0) if e1 > len_1: e1 = len_1 if e0 > len_0: e0 = len_0 aln_size = 1 if e1 - s1 > 500: aln_size = max( e1-s1, e0-s0 ) aln_score = int(km_score * 48) aln_q_s = s1 aln_q_e = e1 aln_t_s = s0 aln_t_e = e0 kup.free_seq_addr_array(sda_ptr) kup.free_seq_array(sa_ptr) kup.free_kmer_lookup(lk_ptr) if s1 > edge_tolerance and s0 > edge_tolerance: return 0, 0, 0, 0, 0, 0, "none" if len_1 - e1 > edge_tolerance and len_0 - e0 > edge_tolerance: return 0, 0, 0, 0, 0, 0, "none" if e1 - s1 > 500 and aln_size > 500: return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_score, "aln" else: return 0, 0, 0, 0, 0, 0, "none" def get_consensus_without_trim( c_input ): seqs, seed_id, config = c_input min_cov, K, local_match_count_window, local_match_count_threshold, max_n_read, min_idt, edge_tolerance, trim_size = config if len(seqs) > max_n_read: seqs = seqs[:max_n_read] seqs_ptr = (c_char_p * len(seqs))() seqs_ptr[:] = seqs consensus_data_ptr = falcon.generate_consensus( seqs_ptr, len(seqs), min_cov, K, local_match_count_window, local_match_count_threshold, min_idt ) consensus = string_at(consensus_data_ptr[0].sequence)[:] eff_cov = consensus_data_ptr[0].eff_cov[:len(consensus)] falcon.free_consensus_data( consensus_data_ptr ) del seqs_ptr return consensus, seed_id def get_consensus_with_trim( c_input ): seqs, seed_id, config = c_input min_cov, K, local_match_count_window, local_match_count_threshold, max_n_read, min_idt, edge_tolerance, trim_size = config trim_seqs = [] seed = seqs[0] for seq in seqs[1:]: aln_data = get_alignment(seq, seed, edge_tolerance) s1, e1, s2, e2, aln_size, aln_score, c_status = aln_data if c_status == "none": continue if aln_score > 1000 and e1 - s1 > 500: e1 -= trim_size s1 += trim_size trim_seqs.append( (e1-s1, seq[s1:e1]) ) trim_seqs.sort(key = lambda x:-x[0]) #use longest alignment first trim_seqs = [x[1] for x in trim_seqs] if len(trim_seqs) > max_n_read: trim_seqs = trim_seqs[:max_n_read] trim_seqs = [seed] + trim_seqs seqs_ptr = (c_char_p * len(trim_seqs))() seqs_ptr[:] = trim_seqs consensus_data_ptr = falcon.generate_consensus( seqs_ptr, len(trim_seqs), min_cov, K, local_match_count_window, 
local_match_count_threshold, min_idt ) consensus = string_at(consensus_data_ptr[0].sequence)[:] eff_cov = consensus_data_ptr[0].eff_cov[:len(consensus)] falcon.free_consensus_data( consensus_data_ptr ) del seqs_ptr return consensus, seed_id def get_seq_data(config): seqs = [] seed_id = None seqs_data = [] with sys.stdin as f: for l in f: l = l.strip().split() if len(l) != 2: continue if l[0] not in ("+", "-"): if len(l[1]) > 100: if len(seqs) == 0: seqs.append(l[1]) #the "seed" seed_id = l[0] seqs.append(l[1]) elif l[0] == "+": if len(seqs) > 10: yield (seqs, seed_id, config) #seqs_data.append( (seqs, seed_id) ) seqs = [] seed_id = None elif l[0] == "-": #yield (seqs, seed_id) #seqs_data.append( (seqs, seed_id) ) break if __name__ == "__main__": import argparse import re parser = argparse.ArgumentParser(description='a simple multi-processor consensus sequence generator') parser.add_argument('--n_core', type=int, default=24, help='number of processes used for generating consensus') parser.add_argument('--local_match_count_window', type=int, default=12, help='local match window size') parser.add_argument('--local_match_count_threshold', type=int, default=6, help='local match count threshold') parser.add_argument('--min_cov', type=int, default=6, help='minimum coverage to break the consensus') parser.add_argument('--max_n_read', type=int, default=500, help='maximum number of reads used in generating the consensus') parser.add_argument('--trim', action="store_true", default=False, help='trim the input sequences with k-mer sparse dynamic programming to find the mapped range') parser.add_argument('--output_full', action="store_true", default=False, help='output uncorrected regions too') parser.add_argument('--output_multi', action="store_true", default=False, help='output multiple corrected regions') parser.add_argument('--min_idt', type=float, default=0.70, help='minimum identity of the alignments used for correction') parser.add_argument('--edge_tolerance', type=int, default=1000, help='for trimming, if there is an unaligned edge longer than edge_tolerance, ignore the read') parser.add_argument('--trim_size', type=int, default=50, help='the size for trimming both ends from the initial sparse aligned region') good_region = re.compile("[ACGT]+") args = parser.parse_args() exe_pool = Pool(args.n_core) if args.trim: get_consensus = get_consensus_with_trim else: get_consensus = get_consensus_without_trim K = 8 config = args.min_cov, K, args.local_match_count_window, args.local_match_count_threshold,\ args.max_n_read, args.min_idt, args.edge_tolerance, args.trim_size for res in exe_pool.imap(get_consensus, get_seq_data(config)): cns, seed_id = res if args.output_full == True: if len(cns) > 500: print ">"+seed_id+"_f" print cns else: cns = good_region.findall(cns) if len(cns) == 0: continue if args.output_multi == True: seq_i = 0 for cns_seq in cns: if len(cns_seq) > 500: print ">"+seed_id+"_%d" % seq_i print cns_seq seq_i += 1 else: cns.sort(key = lambda x: len(x)) if len(cns[-1]) > 500: print ">"+seed_id print cns[-1] FALCON-0.1.3/src/py_scripts/falcon_ucns_data.py000066400000000000000000000077021237512144000213050ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved.
# # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ import sys import os rcmap = dict(zip("ACGTacgtNn-", "TGCAtgcaNn-")) if __name__ == "__main__": import argparse import re from pbcore.io import FastaReader tiling_path = {} with open("all_tiling_path_c") as f: for l in f: l = l.strip().split() tiling_path.setdefault( l[0], []) offset = int(l[1]) node_id = l[2].split(":") s = int(l[3]) e = int(l[4]) tiling_path[ l[0] ].append( (offset, node_id[0], node_id[1], s, e) ) f = FastaReader("preads.fa") seq_db = {} for r in f: seq_db[r.name] = r.sequence f = FastaReader("primary_tigs_c.fa") p_tigs_db = {} for r in f: p_tigs_db[r.name] = r.sequence for p_tig_id in p_tigs_db: pread_data = {} offsets = [] seqs = [] p_tig = p_tigs_db[p_tig_id] if len(tiling_path[p_tig_id]) <= 5: continue print p_tig_id, 0, p_tig for offset, s_id, end, s, e in tiling_path[p_tig_id]: seq = seq_db[s_id] if end == "B": s, e = e, s offset = offset - len(seq) seq = "".join([rcmap[c] for c in seq[::-1]]) else: offset = offset - len(seq) print s_id, offset, seq print "+ + +" f = FastaReader("a_nodup.fa") a_tigs_db = {} for r in f: a_tigs_db[r.name] = r.sequence for a_tig_id in a_tigs_db: pread_data = {} offsets = [] seqs = [] a_tig = a_tigs_db[a_tig_id] if len(tiling_path[a_tig_id]) <= 5: continue print a_tig_id, 0, a_tig for offset, s_id, end, s, e in tiling_path[a_tig_id]: seq = seq_db[s_id] if end == "B": s, e = e, s offset = offset - len(seq) seq = "".join([rcmap[c] for c in seq[::-1]]) else: offset = offset - len(seq) print s_id, offset, seq print "+ + +" print "- - -" FALCON-0.1.3/src/py_scripts/falcon_utgcns.py000066400000000000000000000112401237512144000206370ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. 
# # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. #################################################################################$$ from ctypes import * import sys from multiprocessing import Pool import os import falcon_kit module_path = falcon_kit.__path__[0] falcon = CDLL(os.path.join(module_path, "falcon.so")) """ consensus_data * generate_utg_consensus( char ** input_seq, seq_coor_t *offset, unsigned int n_seq, unsigned min_cov, unsigned K, double min_idt) { """ falcon.generate_utg_consensus.argtypes = [ POINTER(c_char_p), POINTER(falcon_kit.seq_coor_t), c_uint, c_uint, c_uint, c_double ] falcon.generate_utg_consensus.restype = POINTER(falcon_kit.ConsensusData) falcon.free_consensus_data.argtypes = [ POINTER(falcon_kit.ConsensusData) ] rcmap = dict(zip("ACGTacgtNn-", "TGCAtgcaNn-")) def get_consensus(c_input): t_id, seqs, offsets, config = c_input K = config[0] seqs_ptr = (c_char_p * len(seqs))() seqs_ptr[:] = seqs offset_ptr = (c_long * len(seqs))( *offsets ) consensus_data_ptr = falcon.generate_utg_consensus( seqs_ptr, offset_ptr, len(seqs), 0, K, 0.) 
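# Marshalling notes for the generate_utg_consensus() call above: seqs_ptr
# is a C array of char* built from the Python strings, and offset_ptr
# passes the tiling offsets as a (c_long * n) array (this assumes
# seq_coor_t is a c_long-compatible integer type, consistent with the
# POINTER(falcon_kit.seq_coor_t) argtype declared above). The C side
# returns a heap-allocated consensus_data struct, so the sequence is
# copied out below with string_at(...)[:] before free_consensus_data()
# releases it.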
consensus = string_at(consensus_data_ptr[0].sequence)[:] del seqs_ptr del offset_ptr falcon.free_consensus_data( consensus_data_ptr ) return consensus, t_id def echo(c_input): t_id, seqs, offsets, config = c_input return len(seqs), "test" def get_seq_data(config): seqs = [] offsets = [] seed_id = None with sys.stdin as f: for l in f: l = l.strip().split() if len(l) != 3: continue if l[0] not in ("+", "-"): if len(seqs) == 0: seqs.append(l[2]) #the "seed" offsets.append( int(l[1]) ) seed_id = l[0] else: seqs.append(l[2]) offsets.append( int(l[1]) ) elif l[0] == "+": yield (seed_id, seqs, offsets, config) seqs = [] offsets = [] seed_id = None elif l[0] == "-": break if __name__ == "__main__": import argparse import re parser = argparse.ArgumentParser(description='a simple multi-processor consensus sequence generator') parser.add_argument('--n_core', type=int, default=4, help='number of processes used for generating consensus') args = parser.parse_args() exe_pool = Pool(args.n_core) K = 8 config = (K, ) for res in exe_pool.imap(get_consensus, get_seq_data(config)): #for res in exe_pool.imap(echo, get_seq_data(config)): #for res in map(echo, get_seq_data(config)): #for res in map(get_consensus, get_seq_data(config)): cns, t_id = res print ">"+t_id+"|tigcns" print cns FALCON-0.1.3/src/py_scripts/get_rdata.py000077500000000000000000000153101237512144000177510ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
#################################################################################$$ import sys import glob #import pkg_resources import uuid from datetime import datetime from collections import Counter from multiprocessing import Pool #from pbtools.pbdagcon.q_sense import * import os """ try: __p4revision__ = "$Revision: #4 $" __p4change__ = "$Change: 121571 $" revNum = int(__p4revision__.strip("$").split(" ")[1].strip("#")) changeNum = int(__p4change__.strip("$").split(":")[-1]) __version__ = "%s-r%d-c%d" % ( pkg_resources.require("pbtools.pbhgap")[0].version, revNum, changeNum ) except: __version__ = "pbtools.hbar-dtk-github" """ query_fasta_fn = sys.argv[1] target_fasta_fn = sys.argv[2] m4_fofn = sys.argv[3] bestn = int(sys.argv[4]) group_id = int(sys.argv[5]) num_chunk = int(sys.argv[6]) min_cov = int(sys.argv[7]) max_cov = int(sys.argv[8]) trim_align = int(sys.argv[9]) trim_plr = int(sys.argv[10]) rmap = dict(zip("ACGTNacgt-","TGCANntgca-")) def rc(seq): return "".join([rmap[c] for c in seq[::-1]]) """0x239fb832/0_590 0x722a1e26 -1843 81.6327 0 62 590 590 0 6417 6974 9822 254 11407 -74.5375 -67.9 1""" query_to_target = {} with open(m4_fofn) as fofn: for fn in fofn: fn = fn.strip() with open(fn) as m4_f: for l in m4_f: d = l.strip().split() id1, id2 = d[:2] #if -noSplitSubread not used, we will need the following line #id1 = id1.split("/")[0] if id1 == id2: continue if hash(id2) % num_chunk != group_id: continue if int(d[2]) > -1000: continue if int(d[11]) < 4000: continue query_to_target.setdefault(id1, []) query_to_target[id1].append( (int(d[2]), l) ) target_to_query = {} for id1 in query_to_target: query_to_target[id1].sort() rank = 0 for s, ll in query_to_target[id1][:bestn]: l = ll.strip() d = l.split() id1, id2 = d[:2] target_to_query.setdefault(id2,[]) target_to_query[id2].append( ( (int(d[5])-int(d[6]), int(d[2])), l ) ) #target_to_query[id2].append( ( int(d[2]), l ) ) #rank += 1 from pbcore.io import FastaIO query_data = {} with open(query_fasta_fn) as fofn: for fa_fn in fofn: fa_fn = fa_fn.strip() f_s = FastaIO.FastaReader(fa_fn) for s in f_s: id1 = s.name if id1 not in query_to_target: continue query_data[id1]=s.sequence f_s.file.close() target_data = {} with open(target_fasta_fn) as fofn: for fa_fn in fofn: fa_fn = fa_fn.strip() f_s = FastaIO.FastaReader(fa_fn) for s in f_s: id2 = s.name if hash(id2) % num_chunk != group_id: continue target_data[id2]=s.sequence f_s.file.close() ec_data = [] base_count = Counter() r_count =0 for id2 in target_to_query: if len(target_to_query[id2])<10: continue if id2 not in target_data: continue ref_data = (id2, target_data[id2]) ref_len = len(target_data[id2]) base_count.clear() base_count.update( target_data[id2] ) if 1.0*base_count.most_common(1)[0][1]/ref_len > 0.8: # don't do preassmbly if a read is of >80% of the same base continue read_data = [] query_alignment = target_to_query[id2] query_alignment.sort() # get better alignment total_bases = 0 max_cov_bases = max_cov * ref_len * 1.2 #min_cov_bases = min_cov * ref_len * 3 for rank_score, l in query_alignment: rank, score = rank_score #score = rank_score l = l.split() id1 = l[0] #if -noSplitSubread not used, we will need the following line #id1 = id1.split("/")[0] q_s = int(l[5]) + trim_align q_e = int(l[6]) - trim_align strand = int(l[8]) t_s = int(l[9]) t_e = int(l[10]) t_l = int(l[11]) #if strand == 1: # t_s, t_e = t_l - t_e, t_l - t_s # t_s += trim_align # t_e -= trim_align if q_e - q_s < 400: continue total_bases += q_e - q_s if total_bases > max_cov_bases: break q_seq = 
query_data[id1][q_s:q_e] read_data.append( ( "%s/0/%d_%d" % (id1, q_s, q_e), q_s, q_e, q_seq, strand, t_s, t_e) ) if len(read_data) > 5: r_count += 1 t_id, t_seq = ref_data t_len = len(t_seq) print t_id, t_seq for r in read_data: q_id, q_s, q_e, q_seq, strand, t_s, t_e = r if strand == 1: q_seq = rc(q_seq) print q_id, q_seq #if r_count > 600: # break print "+ +" print "- -" #output_dir,dumb = os.path.split( os.path.abspath( output_file ) ) #output_log = open ( os.path.join( output_dir, "j%02d.log" % group_id ), "w" ) FALCON-0.1.3/src/py_scripts/overlapper.py000066400000000000000000000202001237512144000201650ustar00rootroot00000000000000from falcon_kit import kup, falcon, DWA, get_consensus, get_alignment from pbcore.io import FastaReader import numpy as np import collections import sys seqs = [] q_seqs = {} f = FastaReader(sys.argv[1]) # take one commnad line argument of the input fasta file name for r in f: if len(r.sequence) < 6000: continue seq = r.sequence.upper() seqs.append( (r.name, seq[:500], seq[-500:] ) ) q_seqs[r.name] = seq total_index_base = len(seqs) * 1000 print total_index_base sa_ptr = kup.allocate_seq( total_index_base ) sda_ptr = kup.allocate_seq_addr( total_index_base ) K=14 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) start = 0 for r_name, prefix, suffix in seqs: kup.add_sequence( start, K, prefix, 500, sda_ptr, sa_ptr, lk_ptr) start += 500 kup.add_sequence( start, K, suffix, 500, sda_ptr, sa_ptr, lk_ptr) start += 500 #kup.mask_k_mer(1 << (K * 2), lk_ptr, 256) kup.mask_k_mer(1 << (K * 2), lk_ptr, 64) def get_alignment(seq1, seq0): K = 8 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) sa_ptr = kup.allocate_seq( len(seq0) ) sda_ptr = kup.allocate_seq_addr( len(seq0) ) kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr) kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] aln_range = kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50) #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] ) kup.free_kmer_match(kmer_match_ptr) s1, e1, s2, e2 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2 if e1 - s1 > 500: #s1 = 0 if s1 < 14 else s1 - 14 #s2 = 0 if s2 < 14 else s2 - 14 e1 = len(seq1) if e1 >= len(seq1)-2*K else e1 + K*2 e2 = len(seq0) if e2 >= len(seq0)-2*K else e2 + K*2 alignment = DWA.align(seq1[s1:e1], e1-s1, seq0[s2:e2], e2-s2, 100, 0) #print seq1[s1:e1] #print seq0[s2:e2] #if alignment[0].aln_str_size > 500: #aln_str1 = alignment[0].q_aln_str #aln_str0 = alignment[0].t_aln_str aln_size = alignment[0].aln_str_size aln_dist = alignment[0].dist aln_q_s = alignment[0].aln_q_s aln_q_e = alignment[0].aln_q_e aln_t_s = alignment[0].aln_t_s aln_t_e = alignment[0].aln_t_e assert aln_q_e- aln_q_s <= alignment[0].aln_str_size or aln_t_e- aln_t_s <= alignment[0].aln_str_size #print aln_str1 #print aln_str0 DWA.free_alignment(alignment) kup.free_seq_addr_array(sda_ptr) kup.free_seq_array(sa_ptr) kup.free_kmer_lookup(lk_ptr) if e1 - s1 > 500 and aln_size > 500: return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist else: return None def get_ovelap_alignment(seq1, seq0): K = 8 lk_ptr = kup.allocate_kmer_lookup( 1 << (K * 2) ) sa_ptr = kup.allocate_seq( len(seq0) ) sda_ptr = kup.allocate_seq_addr( len(seq0) ) kup.add_sequence( 0, K, seq0, len(seq0), sda_ptr, sa_ptr, lk_ptr) kmer_match_ptr = kup.find_kmer_pos_for_seq(seq1, len(seq1), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] aln_range = 
kup.find_best_aln_range(kmer_match_ptr, K, K*5, 50) #x,y = zip( * [ (kmer_match.query_pos[i], kmer_match.target_pos[i]) for i in range(kmer_match.count )] ) kup.free_kmer_match(kmer_match_ptr) s1, e1, s0, e0 = aln_range.s1, aln_range.e1, aln_range.s2, aln_range.e2 len_1 = len(seq1) len_0 = len(seq0) do_aln = False contain_status = "none" if e1 - s1 > 500: if s1 < 100 and len_1 - e1 < 100: do_aln = False contain_status = "contains" elif s0 < 100 and len_0 - e0 < 100: do_aln = False contain_status = "contained" else: do_aln = True if s0 < s1: s1 -= s0 #assert s1 > 0 s0 = 0 e1 = len_1 e0 = len_1 - s1 if len_1 - s1 < len_0 else len_0 if e0 == len_0: do_aln = False contain_status = "contained" if s1 <= s0: s0 -= s1 #assert s1 > 0 s1 = 0 e0 = len_0 e1 = len_0 - s0 if len_0 - s0 < len_1 else len_1 if e1 == len_1: do_aln = False contain_status = "contains" if do_aln: alignment = DWA.align(seq1[s1:e1], e1-s1, seq0[s0:e0], e0-s0, 500, 0) #print seq1[s1:e1] #print seq0[s2:e2] #if alignment[0].aln_str_size > 500: #aln_str1 = alignment[0].q_aln_str #aln_str0 = alignment[0].t_aln_str aln_size = alignment[0].aln_str_size aln_dist = alignment[0].dist aln_q_s = alignment[0].aln_q_s aln_q_e = alignment[0].aln_q_e aln_t_s = alignment[0].aln_t_s aln_t_e = alignment[0].aln_t_e assert aln_q_e- aln_q_s <= alignment[0].aln_str_size or aln_t_e- aln_t_s <= alignment[0].aln_str_size #print aln_str1 #print aln_str0 if aln_size > 500: contain_status = "overlap" DWA.free_alignment(alignment) kup.free_seq_addr_array(sda_ptr) kup.free_seq_array(sa_ptr) kup.free_kmer_lookup(lk_ptr) if e1 - s1 > 500 and do_aln and aln_size > 500: #return s1, s1+aln_q_e-aln_q_s, s2, s2+aln_t_e-aln_t_s, aln_size, aln_dist, x, y return s1, s1+aln_q_e-aln_q_s, s0, s0+aln_t_e-aln_t_s, aln_size, aln_dist, contain_status else: return 0, 0, 0, 0, 0, 0, contain_status rc_map = dict( zip("ACGTacgtNn-", "TGCAtgcaNn-") ) with open("test_ovlp.dat","w") as f: for name, q_seq in q_seqs.items(): kmer_match_ptr = kup.find_kmer_pos_for_seq(q_seq, len(q_seq), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index = np.array(kmer_match.target_pos[0:count])/500 kup.free_kmer_match(kmer_match_ptr) c = collections.Counter(hit_index) s = [c[0] for c in c.items() if c[1] >50] #s.sort() targets = set() for p in s: hit_id = seqs[p/2][0] if hit_id in targets or hit_id == name: continue targets.add(hit_id) seq1, seq0 = q_seq, q_seqs[hit_id ] rtn = get_ovelap_alignment(seq1, seq0) #rtn = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_dist, c_status = rtn #print >>f, name, 0, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0), aln_size, aln_dist print >>f, hit_id, name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 0, s1, e1, len(seq1), c_status r_q_seq = "".join([rc_map[c] for c in q_seq[::-1]]) kmer_match_ptr = kup.find_kmer_pos_for_seq(r_q_seq, len(r_q_seq), K, sda_ptr, lk_ptr) kmer_match = kmer_match_ptr[0] count = kmer_match.count hit_index = np.array(kmer_match.target_pos[0:count])/500 kup.free_kmer_match(kmer_match_ptr) c = collections.Counter(hit_index) s = [c[0] for c in c.items() if c[1] >50] #s.sort() targets = set() for p in s: hit_id = seqs[p/2][0] if hit_id in targets or hit_id == name: continue targets.add(hit_id) seq1, seq0 = r_q_seq, q_seqs[hit_id] rtn = get_ovelap_alignment(seq1, seq0) #rtn = get_alignment(seq1, seq0) if rtn != None: s1, e1, s2, e2, aln_size, aln_dist, c_status = rtn #print >>f, name, 1, s1, e1, len(seq1), hit_id, 0, s2, e2, len(seq0), 
aln_size, aln_dist print >>f, hit_id, name, aln_dist - aln_size, "%0.2f" % (100 - 100.0*aln_dist/(aln_size+1)), 0, s2, e2, len(seq0), 1, len(seq1) - e1, len(seq1)- s1, len(seq1), c_status FALCON-0.1.3/src/py_scripts/remove_dup_ctg.py000077500000000000000000000060251237512144000210240ustar00rootroot00000000000000#!/usr/bin/env python #################################################################################$$ # Copyright (c) 2011-2014, Pacific Biosciences of California, Inc. # # All rights reserved. # # Redistribution and use in source and binary forms, with or without # modification, are permitted (subject to the limitations in the # disclaimer below) provided that the following conditions are met: # # * Redistributions of source code must retain the above copyright # notice, this list of conditions and the following disclaimer. # # * Redistributions in binary form must reproduce the above # copyright notice, this list of conditions and the following # disclaimer in the documentation and/or other materials provided # with the distribution. # # * Neither the name of Pacific Biosciences nor the names of its # contributors may be used to endorse or promote products derived # from this software without specific prior written permission. # # NO EXPRESS OR IMPLIED LICENSES TO ANY PARTY'S PATENT RIGHTS ARE # GRANTED BY THIS LICENSE. THIS SOFTWARE IS PROVIDED BY PACIFIC # BIOSCIENCES AND ITS CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED # WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES # OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE # DISCLAIMED. IN NO EVENT SHALL PACIFIC BIOSCIENCES OR ITS # CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, # SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT # LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF # USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND # ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, # OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT # OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF # SUCH DAMAGE. 
#################################################################################$$ import pbcore.io import sys """nucmer -maxmatch all_tigs.fa all_tigs.fa -p all_tigs_self >& /dev/null""" """show-coords -o -H -T all_tigs_self.delta | grep CONTAINS | awk '$7>96' | awk '{print $9}' | sort -u > all_tigs_duplicated_ids""" id_to_remove = set() with open("all_tigs_duplicated_ids") as f: for l in f: l = l.strip().split("-") major, minor = l[:2] id_to_remove.add ( (major, minor) ) f = pbcore.io.FastaReader("all_tigs.fa") with open("a-tigs_nodup.fa", "w") as f_out: for r in f: major, minor = r.name.split()[0].split("-")[:2] if minor == "0000": continue if (major, minor) in id_to_remove: continue if len(r.sequence) < 500: continue print >>f_out, ">"+r.name print >>f_out, r.sequence f = pbcore.io.FastaReader("primary_tigs_c.fa") with open("p-tigs_nodup.fa", "w") as f_out: for r in f: major, minor = r.name.split()[0].split("_")[:2] if (major, "0000") in id_to_remove: continue if len(r.sequence) < 500: continue print >>f_out, ">"+r.name print >>f_out, r.sequence FALCON-0.1.3/src/utils/000077500000000000000000000000001237512144000144035ustar00rootroot00000000000000FALCON-0.1.3/src/utils/fetch_preads.py000066400000000000000000000034051237512144000174060ustar00rootroot00000000000000from pbcore.io import FastaReader import networkx as nx import sys u_graph = nx.DiGraph() u_edges = {} with open("./unit_edges.dat") as f: for l in f: v, w, path, seq = l.strip().split() u_edges.setdefault( (v, w), [] ) u_edges[ (v, w) ].append( (path, seq) ) u_graph.add_edge(v, w) len(u_edges) u_graph_r = u_graph.reverse() p_tig_path = {} a_tig_path = {} with open("primary_tigs_paths_c") as f: for l in f: l = l.strip().split() id_ = l[0][1:] path = l[1:] p_tig_path[id_] = path with open("all_tigs_paths") as f: for l in f: l = l.strip().split() id_ = l[0][1:] path = l[1:] a_tig_path[id_] = path p_ugraph = nx.DiGraph() p_sgraph = nx.DiGraph() p_tig_id = sys.argv[1] main_path = p_tig_path["%s_00" % p_tig_id] all_nodes = set(main_path[:]) main_path_nodes = set(main_path[:]) p_ugraph.add_path(main_path) for id_ in a_tig_path: if id_[:4] == p_tig_id: a_path = a_tig_path[id_] if a_path[0] in main_path_nodes and a_path[-1] in main_path_nodes: p_ugraph.add_path(a_path) for pp in a_path: all_nodes.add(pp) for v, w in u_edges: if v in all_nodes and w in all_nodes: for p, s in u_edges[(v,w)]: p = p.split("-") p_sgraph.add_path(p) #print p for pp in p: all_nodes.add(pp) nx.write_gexf(p_ugraph, "p_ugraph.gexf") nx.write_gexf(p_sgraph, "p_sgraph.gexf") preads = FastaReader(sys.argv[2]) all_nodes_ids = set( [s.split(":")[0] for s in list(all_nodes)] ) with open("p_sgraph_nodes.fa","w") as f: for r in preads: if r.name in all_nodes_ids: print >>f, ">"+r.name print >>f, r.sequence FALCON-0.1.3/test_data/000077500000000000000000000000001237512144000144245ustar00rootroot00000000000000FALCON-0.1.3/test_data/t1.fa000077500000000000000000000217201237512144000152650ustar00rootroot00000000000000>30a5633d_129405_0 
AAAAGAGAGAGATCGCCCAATTTGGATTACAGTTAGGCACGCCGCTTGTTTTTTTTTTTATTTGCTTTTCGCAGAAAGGTTCTTTCCTTTAATCAGCGCCTCTTTGATTAATGGCGTCTCCGGCAATTGACAGGATTTGTTGTTTTGCAGTAAAAGGAGAAAAAAAATGAGTATGCCACGAATAACTAGAAATAGGGCTAAAAATGTTGCCAAGATCTTTGTGGCTCGGCCAGAGACAAGCGAGCAATGAGACAAAATTGGTCGCCAGATTTTTCTCTTTCTTTTGGATTTTTTTTTTTCTTATTTTCCAATGCCGTCTGCGGCATTCAAATATGCAACAGCAAAGGGCGCGGAAAAAGCAAGGAAAAATGGTGAAAATGGGGTTGGGTGAGAGATGCCTGGGCATGCCAAAGTAGCTGCCAATTTATTTTGGGCATTTTGCTTGGCTGATAGTTGGCCATCTTTATACTCTTCCCAAAAGTGTGAAAGAATATATGGAAACTAATATTATTAAATCTCTGATGGAGACTTTACGTTTATTTACAAAAATTTGAACCACGGATTTCTTTGACGAATTCAAAATTTAAACGCAGAGTATTGGTTACTTTATTATTATTATATATATCGTTGTGTGATTCTATTTTTTTTTTTCCTGTCGCCGGTTTCTTCTCGTTTTTATCGCGACATTTTTTCTGCGAGATTTTTTTTTTTTTATTTTTTTTTTTTTTTTTTTTTTTTTTGTGCAAAGGGAAAGTGGAATTCCTGGCATAATCGGGCATTGATACGCGGCACAAGAAATAAGTGGCAGAATGGACTTTTTCGCCTTTGTTCGCCGATAGACTTGGGCCACGTTTTATGGCCACGCCCCCTCTTAACCCATGAGAGCACCGCCTGGTCCTGCACGTGATGTGAAAGAATCGCTATGGCCAGCTGAGCCCAGAACGTGTCCTTGCCACCAAACTCCACGCAACTGGCAACTCGAATTCAGAATCTCAGCTGAGTCGACGTCTGGCCAGTGACACTGAGAACAAAAAAAGTGTATAACCTAGCAAAAAAAAAAACAAGAACCCCTTAGTTGGCATAAGTTAAAGATAATAAAAATAAACGCTTGCAAACGGCAAACAATAGCTCAAGTTTGCACTTGATAAAGTGAGTATCAACTATAGAAATGCTATTTTTTTGGCTTTTCGAGTTGGTTTTTTTTGGCAGTGAAGTCCTAATGATAAGTCTAAACTTTTTTTGAGGGGTGGTTAAGTTGCGCTAGGGGTTATAGGTTATATGCCGGAACTCTTGGTTTGCAGGCATGTGAGATTCATGTGACAAATGAGCCAGCAGCCAGGATTCAAGAGATGCGGGAAAGAAAGAAATTTGTAGAATAAAAAAGCAGTTTGGCAGTCTGGCGAGCAAAGGGGGCAGTAGGCGGATATGCCATAAGAAATGCATTGGCAAGGCATTGTGTGAATATCAATATTTGATCAAGAGAGGCTAGCGAAAAATCCCATTCAACGTTCGTTCCATTTACAATTTAGATTTCATTTTTGGGGCGCTTTGTTGTAACATAAAATTAAATAACCACCAAGAAACAAGATGAATACTTGTATATTTACGGGCGATAAAAGTGTCTTTTAAGCCGGCCAAATGGCACTTTAAAGGCCAAATGCCCCCTCCCCCCCTTTCTGGCAAGTGAGCGAATTTCGCCATCACTGATTGACTGCCATAAAGCGAATAATTCAAAGTTCCCTCGACAAACTTTCCTTTTGTTCCTGCCATTTCTTTTCCTTTCTTTTATGGGCTCTCGTGTTTATTTTTTTTCACTCGCTTGCCTTGATTTTTTTTTTTTGTTTATTTTTTTATTTTTTTTTTGGCTTTGGGCGGACTTGGCCCTAGAATTTTATTTTTATGACTCACGCACTTTTGGCGGCTTCCGCTTTGGAGCTGATGCTCCATCAAATGGAGGGGCCATCAACACCTGCTTAGAGTAAAAACGTGGGGTGGCAGTTCGTTGCAAGCGCGAGAGTGTAAGGACCAAAGTGCGAAATATTTATTTCGCTGGCAGAGAAAGCTTGGCTTTGAGTATGGGCCGGGCCTGTCTGCGGACTCATTTGAAATATAAATTTTGAAAACAAAACCAAGGCATATAGCACTGGAAAAAAAAAACGGAGGTTGGCACACGGCAGCCGCAAATGTAAATAAAAAGCGCTAACACTACGAAAAAACAAAAACTCAAACAGAAGGGCAAAAAGCCATCAAAATGAAGGGGCCAAGATTGAAGGGTTGTTAAGTTCTGCAAAATAGGGGGCAGTGATTTGCAGCCCCCCCTATCCCCCCTTCCCACCGGACATCCAAACACACACTGCCAAACTTTCGCGATGCGGCGGCTTAAAACGAAGTTTACGTGCATGGCAAGCGAGTAATTTTCCATTTTGCCGTTTGATTTTTACCTGACCATTTGTTTGGGGCCATTGCAATTGCCAAACAGTGGGCAAAAGTACAGTAAAACATGGTTTTGCATGTGTGAAGAATCAAGGCGCATACCACTTAAAAAAAAATGAAACAGTACAGTGCGAACTGGAAAAGCACGAACAGCGGTAAAGCTAAACAATGCAATTATAAATGATTAAAAATGAATGAATGAATGAAGTGATGAGGGCGTTGCTTTTTTGTAATTAAAAGGTGCCACTTTGCTCTTAACGGGCGTTCACTGGTAAGTCGCCCAGCGGCCCTTTTAAGCTTATTTACTGCTTTATTATATTTAAACACTACACCCGTGCCCCAAAAACTGGGCACAGCTTTTCGCCGCTGCATATCGCCGCCCCTGACAGAGCCCACCTAACCCCGCACCCTCGAAAGTTCGACGCCCCCCCCCCTCCCCTCCCCTGCGCCCAAGACGAGAAATGAATCCCCGAATAACAACAAACAACGATTCCCTCGCCTGTCGCGAGTATTGGAGGCGGGCTTCAAATATATATTTATTTTTGCGGCAGCATTCCGTGGGCAGTCAGCCAAAACACCCCCGCCCCCCCCTTCCCTTACCCTTTCACCCACTGAGGCATCCTGAGGCGGCGAATCTCCTGGCTAGCCAGCCCCCGCACTCCCCTCCCTCCCCTCCTCGTATGGCCGCATCCAACATGAAATGGCTTGCAAATCGTAACTCCTGTCGCGACCGTCCGCGTTGACAGCAAATGGTTTTTCCTTTCTTTTTTATTTTCTGTTTTCGGTGTCTTTTTTTTGTTGTTTGCTGCCGTATTTTTTATCGTTTATTTTTCATTATTTATTTATTTATTTAGCATTGCTTGTGCGATTTCGTTATAATTTATGCTGTAACGCATTTCTTTGCGCCTTAGTAGGCGAGCACGTAAAAACATTCTTAATCAGGCAAGTTTTCCGGGTTTCCGCTTTCACTTTTTTTTTTTTTTTCTGGCTAGCTGCCAACAAATAAAATAGGGACAGCCACTACAAAGAAGTCAGAGCTTTTAGCAACAGTTTGGATCTTAAAGTATTATTTGAAAAGCAAACTTTATTGGCTCTGAATGTAATGTAAAAGAGCGTTCGGAA
GAAAGTGGAGATAAGAGTGATTTAAATAAAGCCATTCACGGCTTGAGAGTAATGCGAAAATTCGCCTCGAATTCATCCCAAAAGTTACTCGAACATTGATTTAAGGCCGCAACGCCCACCTGGCAGGGCTCGATTGAAGAACACAAAATGAACCCGGGCCAACCTTTTCCAAAATTGGTAAATTTGCGAAATTCAAGGCGTTTTTTTGAGGTTTGGCGAACCCTTTAACCTCCTCCTTTTAACGAGCCGCCAAAATGTCGCGGGCACCGAGTCTTGTCAACTTATTTTAGTGCTCGCCGCAAAAGACCGCAATCACAAACATACAAACAAGATGGGAAATATTAGCTAAAGAGTGAGACGAGACATATGGGCGGGCGGGAGAGAAGAGAGGGAGTGAGTGGTGAGATGGACATATATGCACCATATGTACATGTACGGGTATATGGCGCGGGAAAACAATGCACTGGCAACACGAAATCCAAAAAAGTGAACTAAAGTTCTAGAGGCGGAAATTTCGCAGCGACCGCAAGCGACGCAGAAAACAGAACCGAAAACGCAAAGCCAACCAACAGCTAGAGAGCTGGCGTTGGGTTAACGTGGGCGTTAGGGGGTTAAGAGGCGGCTGCCACTAAACTTTACGCATTACGCCCCCCACGTTGTCTGCGCACACTACACAGCACACACACCACACACCTTGACCGACAGTGGGGCTGCGCATAAAGGCATTAAATATTAGTTTTCATATGCCATATATCTAGATATATTTATATGAAAAGGTCGCAACGAATAGGTGGCCAAAAATGAGTATTGAGTTACATTCATTGGGCGGAGTTTGAACACTAAGGAATGTATTGTATTATTTTTGACGACGCGAGTGTATGGAACGTCTGACCATTTGGCGACCAGCGGCAGCTGGAAAAGCTCTCAGCAAAAAGAAGGCAAAACAAAGCGAAGGAAATAAAGAATAAAGTGGGGTTGTTTACGACTTTTCACAGAAAATTCATTGTCAAGCCGTGTGTGCGAGAAAATGAGGGAAAATGGGGGAAATGAGGGAAAAATGGTGGAAATGGAAGGGCGCGAATGAGTTGGGAGAATGAGTGGACTGCACTGACACGAAACTGTTGACGAAGTCGGGCCACTTCAAACAGTTGTTCCGCGCCACCCAACGACCTTCCGAGGCCGTTTATTAAAAGCTAATTAAAATTTTAAAACATTTATAGAACAAGCGGAATGAAGAGGAACGTTCGGGGGGAGGGGGGACGTGGCGGGGCTGGCAGGGAAGCAATACACTGGCAAGGACATTCAAGGGGAGGCGAGTCCTTAAAATTGGCCGTAATTAACGTTTGGCAACAACGCAGCAACCACCAAGCGCCATTCTGAATTTTGTTTTAGTAAGCCGGACGAACAAAACCGGAAAATACTTTATGCAATTTTTGTTTTTTTTTGTTTTCTTAAATGCAATATTTAGTGGCCTTTTCCAATTGAAGGCAGCAAAAACGTGATGTGCTACTACCAGCGCCTCTTTTCTTAAGAAGAGATCCTTGGAAAAATCGCCAAGCAATTCGGCTTTCGGCTGACCTCGTATAACTTTCGTTTGCCGGATTTTTCGCCGGCTGTATTTCCGCTGGCATCCTTCTAGACAAAGATCCTTGTGCGTCCGGCACTTCATCCCAATCCCTTGCTGCTCGCCATATCCCTTTCTGAGGCTGTTCCATTTTGGGTTTCTGCGAGGCATTCACTAACTGGCAGGAGCGTGATTTTGCTCCCTCTATTTTTCGCCTTTATCCAGTCAGATTTTCATTTTTTTTCATTTTTTTTTTTACTTCTTTTGATTTGCCATTTCTGCTGCTGACATCGATTTGGTCCGCTTTTCAAAGTGGCAGCAATTCTTTTATTTCCTTTTTGCCACACTCTCCATCTCGGCTCGCGGATTCTTTTGGTTTATCTTGCGCGATTTTTTTTTTTTTTTTTTGTTGAATATCGTTGCGCGACATTTGCATAAGTCCGCATAGAGCCACGTCCCCCCGCCCTGCTGATCCAAACGTTTTGAGAGTGTGCACTGGAAAATTGAAAAAGTCTTTCGTTGATGCAAAACGAAGAGTTGCGCACTTTATATAATATATTCAAAAAAGATAAAATGTGCATTATTGTCGAAATTCAACGATTTATAGCTGCCCTTTTGGGCTGTGTAACATATATGGTTATATAACACATACAGACACCCATGTGCTCTTTATTATTTACACTTGCACCAAGCGCACCACATGCCACTATTTCATCCACCTTGCAACTCTAGTCGCCCCCTCCCCATTCAACTTATTGGGCCACTCATGCGCAAGGACAAAATGCTTTTGCAAATTGATTGAAAGGGCAGCATCGGCTAGATAACTAATAATATATTTTTGTAATAATTATGCCAAAATCTATTAAAATTTATCTATGGAGGAGGAAACCGAGAACAATTTAACATCATTTTCATTTTTTTGTTTTTTTTTTTTTTTGAGATAAAATTCCTGCATGAAATTAAACAAAATACACTGCTTATTCGATTGGTAGCGCAAAATATGAAATTCTATAAGCATAATTTTCATTTTCGCATTTTTATGCTACGAATTGCTCGCCGGTTCTTCTTCGTTTTTCGCTTCAACGCAGGGCGCAGGAATGTATGCAACATTTTTTTGTTATTCAACTGCACCAAATTATTTTGGGAAAAAATAAAAACAGCAGCAGGGCAAAAAAAAAAAAAAAAAGGCAGAGAAACCAAATGGGCATGTCCTAAATTGATGACAGGGCAAAACATGGGCAAAGCAAAGAAATAGGAGAAGTAGTGAGCCAATATACAACTAGTGGGGGTAAATCGAAACTAAACACAAAACAAAAGCTAAACGAAACGCAAAACAAAAAAACAATAAAAAAAGAAAAACGGAAAACGCAACGAAGTTTTTGCACAATTTTAGTGCACACGTTGTTGCATTTATATGTTGTTATTGCCGCTGGTAAATTGACTTAACTGTACTGCAAACAGCAATAATAGCAACGCCCCCACACTGGCCTACACTTGGAAAAAAACAAGGCACAAAAATCATTTTTGACGTTTGAAGTCATTGAAAGCTGCCAGTTGTGCAAGGTCGCCATCAAGCAAAAGGGGGTCAACTGCAAACTTTATGATGATCCTAGCATATTCTAGCATAAATATTTTAGTCATAGAGAAAAAAGAAAGGATTTTTTTTTTCATGTGCACAAAAAAGCATTAGAAAAATCCCCGCTCCTTGTCTAACCCTTATTTTATTTCGTCCTACTGGTTAGGGAACTTGTGGGCCATTATTGGATCCACTGCATATAAGGAGAGATATTCACCTTAACAGAGGGGCCGTAACCTCGCTCGAACTCTCGACGCCCTGCCCTTTCGGAACACCCCGCTTCGCGCCTTCCGCCCATGGGCAATGCCTAATACGGGTGACTTTATTTAATTTAATGCCATTTTGTGCCTGCCCCCACACAAAAAAGGGATGAAGCCTGAAGGAAGCGAACAACCGCAGGAGCAAGCAAAGCTTA
ATATTCAATAATAATAATACGAGCATAGACTGTGCGATAAGATAGGAAAACATCAAATGCCACAAAGGGCGAAGCAGGATGGGCTTCTTAAAAGGCGATATGCTGTTGTGTATTCAAAAAAAAAAAAAAAAATATTTAAAAAAAAAAAAATAAAAAAAATATATAATGGAGCAAAAGAGTGGTTTATGCGTTACATAAATCGCCTTGAATATACATATATATGTATATAGCATATATTTATTGTGACTATGAAAAATATTGCAATAATTTAGCCGTGCTTTCATTCACAGTTAAGTCGTCGCCCTCAAATGGGCAGCGAACTTCTTAGCACTTTTCAATGTGCACTCAGCAATAAGCTTTAAACAATTCTTTAATATGTTTTCATTTTGCTTGGATCGTGCAGCCTTTTAGCTGTGGGCACTTCCACATGTTGACACTGCGCAAAATACGCAAATGATAAGTGTGGCACAAAAAAATTCAAAAAATGGCTAATGCTACATTGCAGTAACATTTATTTCTGCACTTTTGGCTCATTTAAGTAGCATATTAATGATTTCGTTTATCTCCCGCTTGAAAAAGCGTTTTTCAGATTCATACGTTAACGAAAAATTGTTAGGCATGTTACTCGTATTCATTACTGCGCTGAATGATTGCATTCGATAGTCGCTGCAGTTTGCAGCTATTCCCATGTAAAGTTGTGTGCTGTGTGTGCGGTGCAAGTCTGTGTGCCGTGTGTGTGTTAAAAAAAAAAAGAGAGAGAATTGTTTATTAAATACAGGCAGAAGGCGGAAGGCGCGGCGCATTCGCCTCCTTCTTGCACAATAAACACCAAAAATACAAAGAGCACAACAACAACAACAACAACGAGAAGACACATGTCCGCTTTGGGCGCCTTGTCTAACACCCCGCAACCCGCCCTCGAAACCCACTAAATTCCCAAGGACATACAGGCCCTTGTTTGCACATGAAATAGACACGCACCACACGTGCAATTACCAAGCACCCACCCCCCCACCCCTTTTGTGCTTTTCTTACTTTTCAGTTACTTTGCTTAAACGCTTTTAAATTGTTTTTCCATTTTTCCGTCTTTCCTTCTTTAATTATTTTTTTTTTTTTTTTTTTATTTTTGTTCGTTGTTGCTTTTGGGCAAGGAAAATTTGTAACCTAAGCGCATTTTGGGAATTTAAATGCATTTTATGAGCCTTTTCGTATGCTCCAGGACCCTCGTCCCTTTTTCCTCGATTTTCCATTAAGGGATACACTGAGTGAGCAGAGAAGCGGCAAAAAGATATGTTATGACAGTGTCCAAGTCGCAATTTTCAAAAAGGAGGCATGGCACTAATAATGTACGCTTACTAATAAAAAAAAAATAAAAAATTAAAAAAAAAATAACGCAAAGGAACTGCTAATCGTGCAACAATCACCATGGCTTTACGGAAAATGTTAACAAGTTTCATATTGAAAATGCAACAAAGTACGCGCTTGCCAAACGAACTTTATGCAAAATATCGTTGTAATTACGGCACTTCAACGTAACCAATTTAAGCCAACGAAGCACTGATGGGCTAGTAATGAAATATTTCGCCAGAAGTTAATTACATCACAGTTTAATCGTTCGTTGCCAGGAATTCGCCGGATTCCAATTGAATCAATTTACAAATTAGAAAATTTCACAGTTGCTCAATCACTATGCCAGGCCGGCAGCGCCATCTAGTGTAATATTGTGAAGGGCACTCTCATGCAGCCGTGGGTCAGACATTTGAACTTTGGCGGCGGGTAATTTTTGCTGGGGTGTAGAAGTTGGGGGGGGTTCATTTAAAGGGGCCCACTACACTTAACAAAAATATTTAAGAATTATCCCTGTGCTTGCTGCCCAAAAGGAAAAAAGGCCCCCTTCAAATAATAAATATATGGCACCTGGTGCGGCCACTGTTCTAAATATTAGAGGTGCCAAAAAGGGATGATGGGTCCGGTGGGTGTTGTTTGTGGACGGTGCCGCTTTGCTTAATTGTTATAGCGCAAACATTTCGTGTCATCC FALCON-0.1.3/test_data/t1.fofn000077500000000000000000000000101237512144000156140ustar00rootroot00000000000000./t1.fa FALCON-0.1.3/test_data/t2.fa000077500000000000000000000742141237512144000152740ustar00rootroot00000000000000>5d64830a_48915_0 
AGTAGAGATCATCTAAACTTTGGTGGTATTTGGCTAACTTGCTTATGTACACATATTAATTTAATTATACGAGTAAACTATTTCCATATTAGCGTATAGCAGCTACGCATAGTTTATAGAACAATAAAAATGAAATATTTTCGGCGACTTTGAACAAATGACGCTTTAGGGGCCTAACGGAGTATTTTTATGTGATAGACGATTTTTTGGCGGGCCAAAAAAAATAAAAGGGAAATTGGTGCTGCGCATAAAATTGAAAGCAGGCTTGCCCTCCAACCCCGCGTCTGCCCTCCCCCCCCCCCCCGCAGATCAAGAGATTATGCTATCCCGCAATAATTCGCGCCTTGCCCGCTTAACTACGTTGGCCATGCGTCGGGGGCGGGCGTCTATGCAATGGTTCAATTGGGCGTTGACTGGCCGCTGGCTAGTGTAAGCCCAGTTTTGCGGCTTATTGCCGCTACTCGGCTCGGGCAATCACATCGAGGTCATTAATCAGAACACGACAGCCCAAAGCGGCAGCGTCATGCCGCCGCACGTTTAACCCCCCTTGGCGATGCGTACTTGGGAAATCAGCTCATAGAGGTCAGAGGTATTGTATAATTGGGCTTTATGCGAAAAATCAGCGAGCCATCAAATCCCTTGATTCTACAGATACATATGTGTAAACATATGGGCTATTGTCCATCTTTATTTTACTCCCTACTTTAGGTAGTAGCACCGCTTGCAATGTCCATGCAAGCATTTTTCTAAATTATTTCCCTTGTGTTTTGTTTATTTTGTTTTCTTACTCTATGATATTTTTTTTTTATTTATTGTATTTATTTTTTTTTTGGCTTGTGTACGTTTTTATTAGTTTTTAGTTAGGGGCCGAAAAATGTTCTTTTGTTCGATTTGTGGCTGCATTGGCGAGATATATCAAATTAAATTGCAATTTGTTTTTGTAATTGGAGCTTGTCATATTGCTGTTGATGTTGTTGCTCACGATTCGTTTATAGCTATTGTTCCGGCTGTTTTTCAGTTAAGCGAATAAAATGGAATGCGCACACACATGGGCAGAGTGAAAAGAGTAGAAATATGAGAAAATTGACATATAACTCAAGAAAAATTATGCTGATCAAATTGTTATAAACATTGATTTTCCGCTATTTCCTGCTTTTTTTTTTTTGTTCGCTATTTTTCAGGTGAAGTTGGCAAAGCAGAGATTTCGAGGTGCTTTCATGGCTGTGTAAACTTGATAAGTGAGTGTAGCTTTGGTAATTTAAGTTATTTTTTCAACTGGCGAAGTACCCACTAACAAATAAACTGCCAAGAAAATACGCGCATTTATTTATAAAAATGAAAATGAGGAAATGCGAAAAAAAAGAAAAAATAACAAAGCGCCAATCTTCACTATCCAATAAATTTTATTCACTTTTTCGCGCTATGGAGTGTTGCGGCATCTATTTTTCGCAATTTTTTCATGCCAGGCCTCTAGATAATTTTTGTGGGCAGGCTTTGTTCGAAGACAGAGTGCGGCAAGAAGGTGAAATTCATTTTTAATTAAAACATAGCCTAGCGCGCACGCGCAAAAAAGTAAAGGCACGCTAAAAAAATAAAAAAAATTTGCGAAACATAACAATAGGGATATGAATATACACGATAATATATTAGTTGCATAATACAAGTTAGCTATGTGATATATATGCCGAATAACGCACAGGCCCAGCGGGTTTTTTTTTTTTGCCATGTTTCAGTCAATTCATTTCGTGTAGAACCTGTTAGGGCTTTGTGTGCAGTTTTTTTTTTTTAGCACAGCTTCTTCTAATTATTTTTATACTTTTTCCTGGCGAGGCAGCTTTTCAATTTGCCGTGGAAAACGCCCGTTGATTGCCTATTCATTTGCCATTTCGGCGCATGCATCCCTTAATCAGAGGGCACATTTCTGGCTTTACATGGCTTGCCGTAATTTGAAAGTTGGCAAAAATTACAGGTGAAAAGAGACTGCTTAGAAAGTTTTGCCGGGCATGGCAAAGCGCGGCAAATAAATAAGCAAGAAGGATGTAGGAAAAGCGCCTGCGCTGAGAAAAGCGCTGCACGAGAGGAGTGAACAACGAGTGGAAAACGGTGGTAAAATGTCGCAGCATTATTGACTTCAAAAAAATTCTAACAGCAGCACGACAAAAGAGAGGCAGAAAGTAAAATGCCGTTGCAAAAGGTAAAGCCAAGGATATTTATGAAATTAGAAATGCGGGTAGAAAGTATGTTCTGTAGATTTGTAGCAGTCTTTGCATCGGGATATATATTTCTTTAGAAGCTATGAAAAAAAAAAATTCTATATTAAATTACAAGCATTGAAGTGGCTTAAAAAATATATAAGGCTCCCGCTGCTTATGTTAGGAGCTAAAATCTCCTATGAACTTATAAGCCTATTTAGAATGTGCATAGAAATTAATATATATAATATATTTTTTTTTTTCTTATTTAAAGAAAGTTTTTTTTCCCTATTTTGGCGCAGTCCAGTGTCACTAATGGCCCTGACCTTGACTGTGGGCCCTGTGTTGAGGCTTTAAGGCAGCTTATCTGCTGGCTGAAATGTCTATTGGATATCGCGTTCTGGCATCAAGTCGTGGTCAGCACCCATGATGGATCCATTCTTAACTCCGCAAACGCCCTGTTCCGGGCTAGACGCTTACCATATCCAAGTTGCAAAGTGAGGCTATTTCGTGGGGGTGGTGGGGCATTTAATTAAGGCGGGGCGTGGGCAATTGATTATAATCGGGTGCCTAACCCGAATAGCGAAATGTACAGAACAATAACGAATTTCGATATCCAAAAACGGTTATCTCTTAAATGAAATGCGAATTCGCAAACCAATGTGATTGCAATAATGAGGGCAGAAATAACGATAAACAAGCAGTATAAATAAATTAGCATAAAAGGGTGTTGTATGGTCAAAACAGCGGCGTGACCGTTGCCGAACATTTGTTCTAAGGGCAGCAGCTTTTGCTCTTGAATAATGATTAATTTATTGCTAATTTTCGCTGCTCATCATTAATTTGCGAACAGTCAAATGTAAATATGCGGAGCCGCCAGGGCGCAAATAATTCTACTGGCAATATAGAAATCCTAAATTACGCAAATAAAACGGAATTTGCGGTTAATGTATTATCGTTCAATCAGCACGGAGGTAACACTAGTAAATTTCACAGAATTTATAAGCTAGTTATATATGGTTAAAATCGAGAAATTTCCTGCTCGAGCACTTTTTAGGCAAAAAAAAAAAGTATTTTTTTGTTTTTTTATTTTTTTTTTTTTTTGCTTATAATTATCGTGATAAAAGTGTAACAGTTATGTACTTCATCTAGTTGTTTTTCAATCGGCAATTTTCTTACGCACCTTTGCCATAGGGTTGAATTAGTTTTGCTTAACAGCTTCTGTTTGGGTCTTATTCGTTATATTTAACAAATTAATTGTGAGCAGAGTGCGGGCGCATTGCAAACTTTACTTTTCGGCCGTCTAAGTCAAGGCAGATACTTCGAAAACATTTTCAATTGTCATATATTATTTCAAAAACGT
TTTCCGCCGAATTTGAATACAGTTAGGCACGCCGCTTGTTTTTTTTTTTTTATGGGCTTTTCAGAAAGGTTCTTTGCTTTAATGCACGCGCTCTTGTGATTAGATGGTGCTGCGGCAATTGAAATTTTGAATTTGTTGTTTGGGCTAGTAAAAGGAGAATAAAAAAATCGAGTAGCGCCACGAAAATAGATAATTAGGGTAAAAAGTTGCGAGATCTTTGTTGTGCTCGCAGAGGACAAGCGACAATGAGAGCAAATTGTCGCAGATTTTTTCTCTTTCTTTTGATTTTTTTTTCTTATTTTTATGCGCGCTGGCGGCATTCAAATATGGCAACAGCAAAGCGGCGGAAAAGCAGGTTAAAAATGGTGAAAATGGGTTGGGGGGAGAGATGCCTGGGCTATGCCTTAAGGTAGCTAGGCCATTTATTTTGCATTTTGGTTGCTGATATGTGCCATGTTTATACTCTCCGCTAATCAGGTGAAAGAATATATGGAAACTAATATTTTAGAATCTGCTGATGGAACTTTTAACTGTTTATTTAAGCAGTATAATTGAAGGCGCAGCGGTATTTTCTTTGACGTAATTCACATTTTAACGGCAGAGTATTGGTAGTTATATATTATTTATATATTCGTTTGTGTGATTCTATTTTTTTTTTCTGTCGCGGTTTCTTCCTCGTTTTATCGCGAACATTTTTCTGCGAGTATTTTTTTTTTTTTTTTTTTTATTTTTTGTGCAAGGAAAGTGGAGATTGCCCTGGCATATAATCGCAATTGATACGGGCCACAAAAAAAGAGGGCAGAGGTACTTTTCGCTTTGTTCGCTCGATAGACAATTGGTACGTTTTATGGCATCGCCTCTGCTTAACCGGCATGAGAGCACCGGCGGCTGGTCCCTGCACGTGTGTGAATTAGAAGTCGCTAATTGCAGCTGAGCCAGAACGTGTCCTTCCAGCCGAACTCCACGTGGCAACTGCAATCGAATTCAGATGCTTGCAGCTGGAGTCGACGTCTGCCGCAGATGACACTGAGTAACGAAAAAAGTGTTATAACGCTAGCAATAAGAAAAAAGCAAGATAGCCGGTAGTTGGCATAAGTTTAAAGTATTAATAAAAATAACGCTTGCAACGCAACAATAGCTCGTTTGCACTTGATAAAGTGGAGTATCAACTATATGATTATGCTTATTTTCTTTGGCTTTTCGAGTTGGTTTTTTTTTGCAGTGAGTGCCTAATGATAAAGTCTACTCTAAACTTTTTGCGGGTGGTTAAATTGGGCCAGGGGTTATGGTTATATCCGGAACGTGCTGGTTGGGCTATGTGAGTTGCATGGTTGACAAATGAGCCTAAGGCAGGCGCAGGATTCTAAGGAGTCGTCGAAGTAAAGGAAATTTGGTAGTATAAACCAGCATGTTGTTTTGCTAAGTCCTGGGCGTAGCTAATAGGGGCAGTAAGGGGAGTATGCATAAGAAATGCATTGCAGGGCTATTGTGTGATATCAAGTATTTGATACAGAGGCTAGCAACAGTCCAATTGCACGTTCGTCGCATTTACAATTTAGATTTTGCATATTTTGGGGGCGGCTTGTTTGTAACATAAATTAAAAACTAACCAGCTAAGAAACAAGATGAATATGCTTGTATTTTACGCGATAAAAACAGTTGTTCTTATTAAGCAGGCCCAAATGGCACTTTATAGGCCAAATGCCGCATCCCCCCCCCTTTTCTGGCAAGTGAGCGAATTTCGGCAATGCAGCTGTATTGAACTGCCGCATAAGAGCGAATAATTCAAAGTTCGCTTGAAAAGCTTTCGCCTTTTGTTGCCTTGCATTTCTTTTCCTTTCTTTATGCTCATCTTGTTTATTTTTTTCGACTCGCTTTCGGCCTTGATTTTTTTTTTTTTTTTTTTTTTGGCTTTGCGGAGCTTTGGCCGAGCTAGAATTTATTTTTATGACTGCACGCACTTTTGCGGGGCTTCCGGCTATTGGAGCTGATGCTCCATCTAAATGGAGCATCAACACCTGCTTTTAGAGTAAAAAAAGTGGGTGGCAGTTTCGTTGGCAAGCGGGCGAGAAGTGAAGGACAGTGGGCGAAATATTTATTTCGCTGGGCAGTAGAAAGCTTGGCCTTTGAGTTATGGCGCGCGGGCTGTCTGGCGGAACTGCAATTTGAATAAAATTGGTTGAACAACACTCCAGCATATTTATGCACTTGGGATAAAAAAAAAAATACGGAAGGTTGGGCCACTACGGCCGCGCAGAATGTAACATAAAAGCGCTACACTGGGCGAATACCAAACTCACAAAGGAGCAAAAGCCAGGTCAAAATGAAGGCCAAGATTGAGGGTTTGTTAAGTTTCTGGCTGCAAAATGGGGGGCAGTTGATTTGCAGCGGCGCCATTACCCCCCTATCCCACCGGCATCGCAACGACAACTGCCACTTTTCGCGATGGCGGCGGCTTAAAACGAAAGTTTATATCGCATGGCAGCGAGTAGGCTTTTGCCATTTTGCCGTTTGAAATTTTTTTACCAGCGCATTTGTTTGGGTCCTATTGCAGATTGCCTAACAGTTGGGAAGTACAGTAAAACATGTTTGTTGGCATGTGTGAAGAATCTAGGGCGCATACATTAAAAAAAATAATAAAAGTACAGTGCGACTTGGAAAAGCCGCAGAACAGCGGCTAGCTAAAAATGCTAATTATATGATTTAAAAATTGAATGAATGATGATGAATGAGGGCGTTCTATTGTAATTAAAAAAGGTGCACTTTTGCTCTGAACGGGCGTTGCTGTAGTCGCCCACGCGGCTTATTAGGCTTATTTACTGCTTATATATTTTAACACTACGCGTGGTGCCCACAAAACTGGGCAGCGGCTTCGCGGACTGGCATAATCGCCGCCGCGCTGACAAACGCCTACCTACCGCCGCACGTGCCGGCTCGAAGTTCGATCCCGCTCGCTCCGCGCTTCGCTGGTCCCGTAGGCCCATATGCGGAGTAATAATGAATCCGCGTCGAATAAACTAACAAGAAACGATTTCCTCGCCTGTTGTCGCGATATTTTTGAGCGGGGCTTCAATGTAAATATTTTATTTTTCGGCAGCACTTGCGCGTGGCAGTACCTAAACCGGCCCCCCCCCGCCTTCCCTTTTCGCTTCACCGCCTGAGGGGCATCCTGAGGCGGCGAATCTCCTGCAGCCAGCGACCGCTACCCTTCCCTCCTCGCGGCTCTGAGATGGCCGCTTCACGATGACAATTGGCTGCACATCGTCTTGCGCTTGTCGCGGGGGGAGCGCCGCCGCGTTGACGCGTAAATGGTTTTACTTTCTTTTTTATTCTGTTTTTTTGCGGTTGTCGTTTTTTTTTGTTGTTGCTGCCGTATTTTTTATCGTTATTTTTCTATTAATTTATTTTTTTTTATTTATTTAGCTATTGCTTGTGGCATTTTTCGTTATATTTATGGCGCTGTGCCTGCATTTCTTGCGGCCTTATAGGGGCACACGAAACATTCTTATGCAGCAAGTTTTGCCGGGTTTTTCGGCGGGCTTTATTTTTTTTTTT
TTTTTCTGGCTAGCTGCCACAACATCAAATAGGGACAGCCACTAAAGTCAGAGCTTTAATACAACAGTTTGGTCTTAAGTATTATTTTGAATAAGCAACTTTTGTCTGTAAATGTAATGGTAAAAGAGCTGCGAATGACAAGTTGGAGTATATAGAGTGATTAAATAAAGCATTCAGTTGAGAGTATGGTCGTATCCCCCATATTCGCTCGCAATGTCATCCCATATAAGTAACTGGCGACATTGATATTTAAGCCGGCAACGCCCACTAGGGCGAGGGCTTGCGATTTGAAGACCAAAATGACCGCGGGCCAGAGCGCTTTTCCATATTGGTAAATTTGCGAATTCAAGCGTTTTGAGGTTATTGGCGAACTTTTACGGCCTCGCTGCCTTTAACCGCCTATAAATTTCCGGGCTAGCGGCAGTCGTTGTCAACGTTATTTAGTGCTCTGGCAGCAAGACCAATCACAAACATACAACCAGATGGGAATTATTAGCTAAGAGTTGAGAGCGAGACATATGGGCGGGCGGGAGAGAGAGAGGGAGTGAGGTTGAGTGAGTATGGACATTATATCACATATGTGCATGTAGCGTATATGGCGCGGGAAAACAAGCACTGGCGTAACTACGGTCAATCCAAAAGTGAACTAACCGTTCAGAGGCGGAATTTCGCATGCAGAGAACCGATGCGACCAAGAAAAGCAGAAGGCCGAAACCAAAGCTCTAACCAGCTACAGCTGCGTGGTTATACGTGGGCGTTAGGGGTTAAGGAGGGCGGCTGCATACTTAATAAGCATTGACGCTGCATACGGCCATCGTTGTGCTGGCAGGGCTTACACACACAGCACAGCACGACTAGCTTGACCGACAGTATGGGTGCTCGCATGATAGCATTACATATAGTTTTGCATATGCATAATATCTTAGATATATTATAATGATAAGGTAGCAACGAATATAGGTGGCCAAATTGCGTATTGAGTACATTTAATTGCGGAGTTTGAAGCACTAGTAATGTATTGTATTATTTTGCGCGCGGGCGAGTGTATGGTACGTCTGACCTATGTGGGCGTACGCAGCGCAGCTGGAAAGCTCTCTACATAAAGATAGCAAACAAAGCAGGAAGGAAATAAAGATAAAGTGGGCGACGAGCCTTTTCATCGAAAATTCTATTGTCCAGGCCGTGTGTGCCGAGAAAAAATGAGGGAAATATGGGGGAAATGAGGGGAAAATGGTGGAAATGGCCGGGGCGCGAATTGAGTGGGAGAATGAGTGGAACGCATGCACTGACATCGTAACTGTTGACGAAGTCGGGCGGCTACTTTCAAAGTTTTTGCGCCAAACCCACACAGAGCCTCGCGGCCGTTTTATTAAACTTATTAAAAATGTTTATAAAATTTATAGAAAATCGGAATGAAGAGGAACGTTTCGGGGGTGAGGGGGGACGGTTTGGCGGGGTCTGGCACACTAGGGAAGGCAATACACTGGCATAGGACATTCAAAGGAGCGAGTTCATTTTTAATTGCGCGTAGATTTAACGTTGGGCTATTAACGCAGCAACTCGGGCACGCAGAGCGCCCATGTCTGAATTTTGTTTTAGTACGAGAACAAAAACGGAATACCTTTTATGGCATTTTTTTTGTTTTTTTTTTGTTTTTGCTTTGCAATATTTTAGTGCTTCAAATTGCAGGCAGCATAAAAACGTGTGAGTGCTACTACTCTAGCGGGCTCTTTTCTTAGAAGAGTATCCTTGGAACATCGCCAGCTTTGTCGCATTTCGCTGACGCTCGTATAAGCTTTCGTTTGCGAGATTTCGCGCGTGTATTTCGCGCTTTGGCTATCTTCGTAGGAAGCCAAGAATCGCTTGTGCGTGCCGGGACTCCTGCAATTCGTAGCTGCTCGCTCATTATCCGTTTGCTGAGCTGTTCTATTTTGGGTTCCTGCGAGGACATTCACAAACTGGCAAGGTAGCGTGATTTTGCTCTCCTGTCTATTTCGCCTTATCCAGCAGATTTTCATTTTTTTTTCAATTTTTTTTTTTAGCTTCTCTTTGATTTGCATTTCTGCTGCTGACTATCGAATTATGGGGGCCGCTTTTCAATGTGCACAATTCTGCTTTCATTTTTTTCGCTTTTGGCGAAGCAAACACACATGCTCTCTGCGCTCGCGGTATTCTTTTGTTTTCTTCGCGATTTTTTTTTTTTTGTTGATATCGTTGCGCGACTATTGTGCATAAGTCGCATAGAGCCCGCCCCGCCCTGCTGATCTACGTTTTTTTGAGATGTGTGCACCTGGAAAATGAAAAGTCCTTGCGTTTGAATCGCAAAAGAAGGAGTATTGCGGCAGCTTTTATATACATATATTTCAAAAGTATAAAATGTGGATTATTGTGCTAAATTCAACGAGGTTTATAGCTGCGCCTTTTTGGGGCTGTGTAACATATAGTGGTATTATATCACAGGTCATATGCAGACCCCATTGCTCTTATTCTTTTAGCACTTGCAGCGCTACCGCGGGCAGCAGCTGCCAACTATTTCCTATCGGCCCCTGCAACTTCTAGTCCGCCTCTCCCTTCAGCCTATTGCCCGTGCATGGCGCAAGGATCATATAAATGGCTTTTGCATATTGAAATTGCAGGCGGCGGCTTCGGCTAGATAATACTCAATATATTTTTGTAATATTATGCCAAAAATCTATTAAATATTTATGGCTATGGAGGAGGAATACCGAGAAACAATTACACATCCCATTTTTTCATTTTTTTTGTTTTTTTTTTTTTTTTTGAGTAAATTTCGCTGCGATGAAATTAAAGCAAAAGATATCATCTTTATTCGAATGGTAGCTGCAAAATATGAAATTCTTATACTCATATTTTTCATTTTTCGCATTTTATCTACGAATTGCTCGCCGGTTCTTCGTTTCGGTTGCCATCCGCAGGGCGGGAGAATGTATGCATTACGATTTTTTTTGTTATTACTGCTAGCTACATTTATTTTGGGAAAAAATAAAACAGAGCAGGGCAAAAAAAAAAAAAAGCTAGAGAAACGGCAAATGCATGTCTCTAATTGTATGCCAGGGCAAACATGGGGGTGCAAAGCAAAGAATTAGGAGAAGGTAGTGGTAGCAATAACAACTATAGGGGGTAAAATCGTAAACTAGCTAATACAATACAGCTTAAAACCGAAAGCCAAAACTAAAATACAAGAAAAAAAAAATAGAAAGCGATGAAACGCAACGAAGTTTTGGCCACAATTTAGTGCACATTGTGCCTATTTTATTGTTGTTATTGCGCTTGTACACACTTGAGCGTAAGCTGTACTTGCAACAGGCAATAATAGCAGCGCCCCAACATGCCTCCTGGAAAAAACCCCAAGGCACAAATCATTTTTGACGTTTGAAGTCTTTGAAAGCTTGCAGTGTTGTGGCATAGGTCGCCTAATGAGCAGCACGGTCAACAACTGGCAAACTTTATGTATGATGAGCTAGCATATTCCTTAGGATATATTTTTAGTCAGTAGAGAATACGAAAGGATTTTTTTTTCATTGTGGCACACCT
AAGGCATTAGAAATCCCCTCTTGTTCTAACTTTTTATTTGGCGTCCTACTGGTTTAGGGAGAAGCTTGTGGCTCTAATTATTGATGCCACTGGTATATAGGAGAGACTATTCCACGCTTAACAGATGGGGGTCGTCACGCGTCGCTCTAACTCTGCTGCCCCTGCGCCTTGTATCGAAGCAGCCCCCTTCGCCCGGCCGCTGCATGGATGCTATATACGGGTCTTTATTTAATTAATGCGCATTTTGTGGGCTGCCACGCTCCATCACAAAAAAGGGTTGAAGGGCGAAGGAACAACACCGCTAGGTAGCTATACAACTTCAATATTATAATAAATAATGACGAGCTAAATAGACTGTTGCGATTAGATAGGAAAACACATCAAATGCCAGCAAAAGGGCGTAGCAGGATGGGGCTTCTTAAAAAGGCGGATATGCTGTTGTGTATTCAAATAATAAAAAAATAATATTAAAAAAAAAAATATAATAAAAAAGTATATAATGGATGCAAAGAGTGGTTTTATGCGTTACATTAATATCGCCTTGATAATATACAGTATATTGGTATATAGCTATTATTTATGTGGACTAGTGAAAATATTGCATTTTTAAGCCGTCGGCTTTCAAATTCACTTAAGTTCGCGCTTCAAATGCAGCGACTTCTTGCACTTTGCAATGTGCGCACGTCAGCAAATAAGCTTGTAGCATATTTCTTAATATGTTTTGCATTTTGCTTGGATACGTGCACGCTTTTTAGGCTGTGCGCAGCTTCAGCTGTTGACACTGGCGCTAAAATAGCGGCAAATTGATAGTGTGGGAAAAAATTCAAAAAATGGGTCATTATGCACATTTGTCGTAACCGTATATTCTGGCAGCTTTTGGCTCAGTTTAAGTACATTATAATTATTTTCGTTTATGCTCTTGAAAAGCGTTTTTCCGAATTCATACGTTAACGGAAAATTGTTACTAGTGTACTCGGTATCTCATTAAGTGCGTGATGATTCATTCGATGTGCGCTGGCTTAGTTGGCAGGGCTATTGCAGCGATGGTAAAAAGTTGTGTGTGGTGTGTGCGTGCAATCTGTGTGGCTTGTGTGTGTGTTTAAATAAACAAAGAGAAGAGATTGTTTATTATAGCAGCAGACGGGCGGAAGGCAGCGGGCGCATTCGTCCTTCTTGGGCCTAAGCAATAAGGACACCAAATAACGCAAAAGGAGGGAACAACAACCATCAAAACAAGCGAGAATAACAACAGTGTCGCGCTTGGGCGCCTTGTTTAAGCCACAGCCGTGCGCAACCGCGCTCTCTGAAAGCGCCATCTAAATTCAGGACACAACAGGCGCCTGTTGCAGCATGATCGACATACGCACAATACGCACGGTGCTAATTAGCGGCAGGGCACGCCACCCCCCGCCCCTTGGTGCTTTCTTAGCTTTTCAGTAGATTTGCTTCATATACGCTTTAAATTGTTTTCGGCATTTTTCGGCGTCTTTCCTTTCTTATTTATTTTTTTTTGTTTTTTTATTTTTTGTTGCGTTGTTGTTTTGGGGCAATGGAAAATTTGTAGCTAAGGGCGCATTTGGGGAATTTATAACTCCATTTTAGTGAGGTCCTTTTCGTTAAATCGCCCAGGAACGCTCGTCCTTTTTCCATGATTTTGACCTATTAAGGGATGTACACTGAGTGAGCTAGTAGCACCAAGAAAGATATGTTTATGCCAGTGTCGCTTAAGTCGCAGCATTTTCGCACGGGAGGATGCAACGTAATATGTAACGGCTTTATGCTAAATAAAAAAAGTAATAAAAAAACAATTACTAAAAAATAAAATACAAAGCAATGCTATCGCTACATATCAGCTGGCTTACGGAAAATGTTAGCAGGTTTCATTAATTGATAAATGCAAACCAAAGTACGCGCTTGCGGCACGAACTTTATGCAAAATTATGCGTTGTTAAATTAACGGCAGCTTCACGTAACAAATTTAAGGCCACCGAGCAGCTGATTGGGTAGGTAATGAAATATTTCGCAGCATGCAGTTAATTACAACAGTTTAACTGCGTTCGTGTTGCCGGAATTGCGCGGCGGATTCGCATTATTGAATGCAATTTACAAATTCGACTTTCACCTTGGCCTTCATCAGCTATGCAGCGTCAGCGGCGCATCTAGTGTAAATATTGTGAAGCACTCTCATGCAGCCGTGGTCAACCGCAAAAATCTTTGAAACTTTGAAGCGTAATTTGTTGCTGGGGTGTAGAAGTGGGGGGGTTCATTGTTAAGGGTGCCTGCTTAGCTACACTTTAACGGAAAAATATTAAGCATTATCCCTGTGCTTGTCTGCCAAAGGTTAAAAGGCGCCCCTTCAATAATAAATCTATGCACCTGGGTGGGCGCAGCTGTTCTAATATTGAGAAGGTGCCAAAATATGGGAGTGATTGGTCGGTGGGTGTTGGTGTGGGCGGTGCGCTTGGGCTTCATTAGGTTATAGCGGCAACAATTTGCGTGTGCATGCGGCATTAGCTAGGCTTATGGCTCTTGGTCGCGTAGAATATCGATTTATTGCGAAGCTCATAGCTCGAATGTGGTGGGGGTAAGATGGCCCAGGGTTAAGGGATTCATGGGGTTCAGGGGTTTTGGCTAGTAGAGCTGCGCCATATATGCATGAGGTATCGCACATTCAACTAAGCATAGGGCAGGCTAGGAGATGCAAGGTGGCTAATGTGGGGCCAGTGCGGATTTGAAAGATAATAGCGATATGGTCGACACGCAATACTTTCAAAGTAGCGTTAAGGAATGATTGGGCAACTATTCTGAATATGCAGTATTATCTATTGCAGATAGATTATGATTTATATAAATTTTGATAAAGTAGAAGTCAATTTGAGCGGGAAAATATTCTGGTAGTTTAGCTTGTTTAGTTTTTCAGCTTGAGGAGTTAATGCCGCTCCAGGATCCGGGCGAATACGCTAATGCGCCCGCTTACCTCGAAATTTTCGCCAACACTACAGTAGCATAATTGAATTTAGTGTGAGCCGTAGCCGTTCGTGCCTCATTGCCATTCGAAAGCTGGATGAGGTTCAAATTGAAGTTGGGGCATTAACATTTGGACATCGTCAGGAGCGCAGGACATTTTCGGTCACGATGGTGTGGCAGACGTGGCTAAATACAAAATGGGCTACAAGCAGCAGTCGGTACGACAGCCTGCGAAGAAATGGGGGTGGGGCATTTTAAATTGATGTAGCCCAGCTCTTGAGGTCAAAACCTTTCACGGATTGCTTTTCAGTGAATGCCTACACACGAGATGTTGTGCATACATTAAAGGCAAAAAGGTTTTCGTCTTGTTCGTACAGCGCGCCGCCTGTGTTTCTCTGTAAAAGGCAACGTTGCTTGCTTTTGTTTTGCCGGAGCCTTAGACGGGCTAATATGCAGCGAATTTTGTTATTAATATTCAATTAATAAGGCTTGGCATGGCTGGAAGCGCAGCGGGGCGTGCGCGTTGATAAATGCGAAGGCGTCGGCGAAAGTTTTCCTGTACTCGCCTTTGCGGTTTCCTCTTCGTCG
TATGCGGCCGCAAAAAAAAAACAAAAAAAAATAATAACAAAAAAAAAATTGATGCATTTATTGGCCGAGCACGTCTCCGGTGCCTGCTGGCCTCGACCAGCGCACGAGCTAATTTATTAGCTGCCTACTGATGAGCGGCAAGAGAGAGGACATCGTTGCTAGCCAGCTGAAGCACAATAAAATTGTGATTACGATTTCTGACGCCATTTTGGTTGCTGCCATTGGAAGAGTGTCCTTAAACTGTTCGCACTTTCTTGTTTTTTTTTCAATTTAAATAGTATAAATAGCTGGTAATGCGAAAAAAATATGTTTGAAGCTGAATTTTTGTTGAACGGTATTTTTACATCGACTATCTTTAGGTTAGTTCGGCTGCTGTGAAAGTAAAGCTTCCGCAAACTCAAATAAAAAAGGATAAATAAACACTATGACAATCGCGAGGAAATACGCGTGCAAGTGTGTGTGTGTGTGTGTGTGGTGTTAATTTGCTGCTAATTGCTAAGTATGTGTGTGTGGGCGTAGTTTTGCCAAAGAGGAGGAAAAACTTTTTGTTTACGGCCAATGCTAATCCTGGGAATGAAAGAGTATTTGCAGCCGAAAAATTGCTAATTGCGTGGGGATCTAATTAACATAATGTTGGGAAGCACACTCCCACTCCTGTTACCTTTTCTTTTCCTTCTTTTTTTGCCAACGTCATCTTATGACACCTGCAGCTTGTGTGGGTGTAGGTGTGTTCTGTGTATTTTGCTTGTGTGTGTCTATTAGCGTTTTCATGGCATGTGTTTTGTGTCCGCTTTCTGTCTACCGGTTAAGTTGCCGTTTAACAGCAAAGTAAAATACAACGTAACATATGTACGCAGTTTTTTGTGTTATAAAAACTCATTAGGAAGCATCTTAATTATTATATTTAAATTTCAATTATTATAATGCATTTTATTTGATCAGCTATTGAAATTGCTTCGTGTGTTTTTATTTTGATTACTTCTGGATAATAGCATTACATCCTCTTGGCACGCAGAATTCGTAAGCTACTTTGGTGGTTTTTTTTTTTTTTGCACGGCTGTCCCTTTCTGTCGTCAGTAATAGCGTGTCTGCTGCGTCTGCTTTGCTTTTTCTGGTGGCATTTTCAGAGCAAGCGTGCGTTCTCGACAATAGCATATCAAAAACAGTTAGTACCTAAAATAAACCTAAAACTCAAGTTACCTAAGATAAGGGCAGACAAGAACGCCGACAGTCAGGCAATTTAAATTAAAGCAAAAAAAACGCAGGCGAAATTCTGGAAAATGTAGGAAAGTTGGGGGTGAAGGGGGGAGGGGGGGGGGGGTGTGTGTGTGTGTGTGTGGAGAAAAAGAGCACAGAAACGAAAAGGAATGGGTTACTCGTGCAGGGGTAAACGACGTTTTGTTCGAAATTAAAGTGTACGCTGCGCTTGATTATGGTTCTATATTTTCGAAAAAAAAGTACAGGTCAGCAGGCAATTATACGCATAATATGTGCATTAGAAAACCCGATTGCACGAGTACACACTAAATTTTAACGATAACCAACGAGTGGGCTGGCTTCACGGATATGGAACATGCGAGGCGAAGAATTATGCAATGGGCTTAACCAAAAACGAACCCACATTTAACTGCAATCGCAAAAGCATAAGCGAATCCTAATCGAAGGTTAATAAAACCTTCATATTTTCGCCTGGGCTGGCGCACTTCACATAGCTGGGCTGGAGCATATGCGTGAGTGATGTATGTATATTCATGATGCTTGGGACAATAATAAACCAGTGAGCACAGGCGATGGCCAGACTGGCGAAAAGTATTTTAATGTCTCGCCTGAAGTAATAAGCGGAAAGTCGCTGGGGTGGCAGGGGATAAAGGTAAGAAAAAGTTAAGTTCCAATCCTCGCGGTCCGCAAGCATCGCTAAAAGAAATGGTGGAATAATGGTGAAAAATGTGGTGTGGGGGTCCGGGGGGTGTGTGGGGTCTAGAAGAGCTATCGGGGAAGTGGTTTTTCTTATGCGCAACCGGACCTGTAAACCGCAACTAAAACATAAAGTTATGTGTGCAGCGCGGGAAAAATCATGTCAGATCATTTCCTGGTACAAGCTTTTGGCCGAAGGAGCACATGGGGTCCTGCCAAACACTGCGGAGGGGAGGGTGAAAGGGTGGGGGGGGTGGAATTGGGGTTGGGGGCCCTGCCACTCGAGCCAAAGCCGCTTTCATTTCGGCTGGCGCTTTTAATGGTCATAAAAATTTTATTATAAATTTTTTAATAATTTAAGGCAAGCATATTGCTGCGCTATAGCGCAGTAGCTAACCTAAAGTGTTTTTCCACAGTTTTTTCTTTGCCTTTTGTCAGAGCCACCCCCCCCCTCCTGCTCTGCTCGCTGGCTCGGCTGCTCCCTTCTTTTTAATCACTGGCATATACGCCTGCCGGGCATTTGCCGGCTCTATATAATGCCGCGACTCGCTTCTGGGGGACGGACTAAGATACGGAGCGGACGAGCGAAATATGCGGACGAAGTGGTAGGCGCTAGTAGTCGCAGTCAAAGTGGCATATATGCGTAATCATGGCGCTGGTATAACAAGGCATACAGAGGGCTAGGCAGAGGCAGACACACAAAATAACGGGGCGAGTATAAGTCAATCGAGAAGGTACTATATGTACGTAGCTATGTATAGGCAATGGCAATATCCTGGAAAACTATAATGCAGTCCACTACTAGTCTTACTAGTGCCGAGCTATGGGCCACTCGTTTTTTTTTAGTTTAGGAATGCCATTATTGCTATACTGTTTTAAATTAAAAATAAGTAGATGGGAAATCAATTTGCTTTAAACTTATGCGTGCCTTAGTCAAATAATACCCTTTAATCATTTTATTGCGCTAAGTGTAGAAACCATGAAAGCAAAAGCAACAGAAATAGCAAAGGATAGCTCGGCCGTATATATAGCTATAGCACGCAGCTAAAGCGATAGTGATAGCGGAATTGTTCGGATATATGTATGTATATGCTGTCCGCTTGCGTGCGCGTTTCGTTTGGCGTTTTAAGCCGCGATTCGAGGGCGTTTTCAGTCGCGTTTTGTCTATTTTGCGCGCTTTTTCGTTTTCCGTTTATTGGTTTTCGCACCTTGTTTTGTTTTTTTTGTTGTTTTTTTTTGGCCCTTAGGCGCAAGGTGGTGGCCCAATTTGATGAATCGTTAGCATTACGGGCCAGTGGCTTAGATGGCGAAAGTAGATATATGGCTGGTAAAAACTGGGCAACTCCTTCCAGACTGTTTCAGAGGATGTGGCAAAAGGGAAAAAACAAGCTTCAGCATGTCTCAACGTTGGCGACCTAACTCGATAAGGAGTTTAAATTGTTTTGGGCCTAGCAATTGCATGAAACTTATATCGTATATAATATGTACTATATGCACCAACAAAGCATATTTAATTCTTTATTTTCATAAGTGTGAATAGCTTAATTGTACTTGCTGGCTTTAAATATTCTGCGCGTTGTAGTACCGGACTACAATAAAAAAAAATTAAAAAAAAAAGAACTGAAGGGGCGAGAAAAATCG
AAAGGCGCTATAACAAAAGCCGCAATGCCAAGGGCGAAGCAGGGCAGGCTGAGCAAAGGGGATGAAAGGAAACTAGGGGTGTGGTCCTATGTCCGTGGCATATGTCGCATTACAAATGTTTGGTCGGCCTGTTTGTTTATGGGGCGCACATTGCAGTCGTAGGTAAAATGTACACGAAACCGGCTAAACGGACAGTAAAAAATAAAAAAAAAAATTGAACAGAGCAGTAATGTGTGAATAAACCAGCGTTCAGAATATCGCATGGAGAATATTGTGCAAAACGGCTACAAGGCTTGGCAAAGTTTGCAACGATAATTTTTAGTTTTAGTGAGTAGGATTGCAAGTGAAAGAACGGTAAGCTGTTAATAGTTTGTAGTAAGAAAATTTGGGGACAATGATTTACATAGTATTTTTTTTTTTTAGATGTATTCTTGGATGTCATTTGTTTGTTTGTTTCTTTCGGTGCACAATAATGCAGTAGCATTCACGGCTTTAGGCGGCGTTGCAGTTGCAGTCCGCTGCAGTTGTGGTCCCCAGAATCGCTAGACAGCTGACAAGCCGACATATTTTCGCTTTCGTAATTACCAGCGTTCGCCCCGAACACATCCTCTTACACTGGCTCTATTTTATTTGGCTTAATACTTGACGCGCTACAAACCCGAATTTCATTGCATTCGTTTGACAAATGGCTGAGAAGTAAATAACAAAAAAAAAAAAAAATAAATCCAAACGATAGAAAAAAGTAAAAGGTGGAACGGTCGAAAAGAAAAGGGGATTTTAAACAGTCGGTTATGTCGGCATTTTGGCTAGCGACCTCGCTTGTAATTATATTTTTTAATGGGCATTGCAGCTTGGCAATTCGTTGCATTCATTGCGAAAAAGCTGCAAGCTGTAATTCATTCCAGGGAAAATCGGCAATTAAATCGCGAAAATAAAGAGTGAAAAACAAGAAGAATTTTTAAGCAGATTTTAACGATAATGCTAACTGTTCTAAAGTCACTAATCCAATTATACTACTTATTATTCATTATTAATAAATGCTAAGCCTAACTTTTATATTGCAATCTTCGCATTTCTTTCAGGCTGAGTTAGCACCATTTGCGCTGGAAATGAAGCGAATGCGCAGCAGTCGCCCTTCGCTGGTTGAATGGGCCATTTGATCGGCTTGTGGCTGAATCTGACGCTGGCTCCTGCGCAACCGATTGCCGGTGCCACGCCATAATCGAAGAAGAACGAGTCCAGCGCGAGAGCTCCACGCACGGTGACTCATGCGCTCCTCTCCTGGCATTTTCCTGGCATGCCTCGTCGAGTTGGCTCGTCTACATCGGCGGGCAGTCATCGTCAGCGAAATGATGAAAAGCAAAGCCGATAGGGCCGGCGAAAATGTGGTCTTAGGAACTTTGGGCCATGGAGGGCCGTCGGATCAGACCCGGCGTCGTTGTTACTCGTAACGTTGCCATGCTCGGGTTATGGAGATAAGTTGGTGCCCTGCGAAATGGACGGCAAAGATGAGCTTCGGTGCTGGGCGCAAATCGCAATTTGCAGCGCTTGAACGCTACTGTCGATGGTCGGCAGCGATGTAGGATGGGTTGACTTCTGACATTGGACAATTTATCGACGCTAATGCTGGACGAAAACGATGCCAAGTACAGTGTCGATAGTGGGTCGCAGGTCACAGGCGAGCAGGGGTATTGCGTTCGCTGATTGGCGCGGCAAGCTAACCGTTCTCCGTTCCGGCAGCGAGGCGGGTCCGAAATCACACAGCAAGGTGACTATGCTGGTGCCGGCACCGAAGATGCGTGGAAATCGAAAATTGGAGTGCGTGTCGGCCAGGCGGCAAGCGCCGCTAGACCGAAATCACTGTGATTGACGGCTGGGCTAATGTGCGCTAACCAAGGTATTTGAAATATGTAAGGAACGCGCTGGCTCGATTCGCAAGTGCGATTACAGCCAGACTCGTATACTAGAATATTGGCGGCCGCAAGAAGGAAGCATTCACAACAGCGAGCGTTTGCACGGTGCGGCAGGCGCAATAAACGCGCGCGATGCGAACCTATAGATCGGCCAAACTGGCTGCCTGCGAGGGTGAAATATGGCGCCGGCAAGGTCATCGTTTCGGTGTGGGTGTGCATTGGCTGGTGGCAAATTCCGAAGTGCGCGCGAGGTAATACTAAGCTGTGCAGGCCGACGAGCCAATCCGCGATGAGCTCAGGCTTATCGTTGTTCATGCAACGATGGAGCTATATGACGTGCGGAGACTTTACCACTAGATTGGTAAGTGCATCAAAAGTCAAATAGGAAAAGCGGCTAAAAACGCAGTTGTTAAAGTTGCATGCGATTTCCTTTCAAGATGCTTAATTACAATGTTGCGAGCTAATATCACGATGGCCATACGTCATAAGTGCGAGTGGTTAATGCGGTGGGCGAAGAGCGAACAGAGGGCTAAGACACTGGACATATTAGTTTGTATAGTACTGCAATGGCCTATATATGGTGAATGATAGTTACAATGCCATGCTAATGAGATTCTACGTTTGCCTATAAAGCAAAGCGCAGTTGGTCATGCTAGTTTTCGGGCGCAGCGTCCGCGTTAGGGCGTGGATAGCGCGATCTGGCGCCACCGTTAGCATGCGTTGCGATGTGGCTGGCAATTCGCGGAAGCCTGTAAATCGAATGGATCAGCGAGAACTCCGCTCAGGTGCGTCAAAACTTGCGTTCTGGGGCTAACTGAGCATTCAAGTTAGCATATTCAAATCTCGGCGGCAAGCGTTGTTGGCGTTGCAGCGGAACTGAAATTGAATAGTGAGGCAGCGAAAGGCGGCAGGTCGCTAGCTGTTGAAAGGCTGTGTAATGGATTTTCCGCGAAATCGGTGCGCGAGGGCATACGCTGATAGTGTGAAAAGGGGCACGGCGGGATTATGCAATGCGGCAGCAAGGTGCATATTTGGCCGGAGTGGGGTCGGTTTAGTAAGATGCGATTGTTTGGCGCTTAGCATACGTGGCGAAGGGCGGAGCATATACTTGGCTGTCGTTCGAATGGGCTCAGAGCAATTATCAACTATGAGCAGCGCGAATGCCGATTATATATATATTCGAGAGGGCTATCCCATTTGCTCGGAGGGCCGTGCGAGCGGCGCTTGATTATTCGCGATAGAAGGCGACACATTTTGGCATATACAATTGTATCAGTGTATGATATTCGTATGGCGGTGATTCGTTGTTTAATAACATATGGCTACGCGAACGCGGGCAAGCATAGCCTGCGTTCTGCTGGTCGTTTATGGGCTGCCATGTTTTTTGTGCGTGGCATCATTCTTGATGACCTCGTAATGATTATTATGCTGTGTAGCCGCAGCGAGCGGCCCGTCGCAAGAAGGCCCCCATGCCCAGCGATGTCATCCGGAGGGCATCACGCGGCGGTGATAGCTTAAACGAATTGAAGATGGCGAGCTACGCTCGAAGGCCTTTATGATGTGGAATACGTCGAGCGGGCGGCGATGGACTGGCCAATAATGCTACCATGCAAGCGTCGATGCCCGTGTCAGATGAAGGGGG
CCATCACTTGTGTGCCACTAGCCGGACCCGTTAAGTTCGTATGAGCGCTTTTGCGGGCGATTTCGGTGGCGATCGTTACAATCGAGCAGTGTCACATTAAGGGAATCTAAGAATCAAGCATGGAGACCGCCTACAAGGGTTAGTGCCACAGGCGCATGGCTATATGCCCATTAGGTTGAATATGCCGCTCGACTATAAGTCGCGCGGAGCGTGTGAGGGTGCTGGCGTGGTGTGGCAGCGGCGGCGTTGCAAGTTGAAGAAGCGGTGGCAATGAACTGCGGGCCGACGTTGCCGATCGGCAAACTCGCGGCTGCGCACCGTGAATATGGGCGTGGCTGGCAATGGGGGCGGTGCAGTTTGCCGCCGCGCATCAGGCGCAGCGAGATTCAACTAAGTCGCTAGGAATGCAACGGATTTCTCGGTGAGTTTAAATCTATAATATAATATTAGTATATATATATTTTAGGGCAATGAAGACTGTATTGATGGATCTTTACGACTTTCGGCAGGAACAACGGTCGCTGCTGTTGCAGTAATGGTGATCGATAGCGTTTGCGGCGCTATCTACTGGTAATCCCTATTTAAGGGACGAAGCTCCTCGCTAGTAATGCGCGCGCCTGCGCGGCGTGAGCACGCGCAATCGCGCGCGCCTAACGCCCGCCCGCCCCCTAATTCATGCCGGCTCGCGGCAGCGGCCAGCCGCCTCATGCACGGCCAAGGCGGACATCCAGGGCATTTGTGGGCGGCGCTGTGGATCACCACAGTGCCCGTGGATGAATGTGAACATTAATGCGGGCGGTAGGCGGTGGCTGACGCCGCATGGCGGTGGCGGCGGCGTGGGCGGTGTGGCGTTGGGGAGCTGGTGGGCACGTCAGCGCAGGCAGTTGCCAACTTGACGCCAGGCAGCATATCACTGCGGCGCAGGCCCCTAACGCGGGCGGGCGGCGTCGGCAACGAGCGGGCCAGTGGCGCCAGAGTCCGTCCGGTTCAGTTTTATATGTCGCATAATGGCAAAGGACACACGCAGTAAAAAGGACGTCTGCCAGCGTCATGGTTTAAAGCTACAAAGATAGATATACACAAGCAGCATCTTTTCTATATTATATATACTATATAGTAGAAGGCATAGCATAATGGCATGAAATGTATTTAAAAGCCGAACTTAGTGTGTAAACTTTTTTTTTGTCGCTATCGAATATTAGCGTATATTATTCTGTTCAAGCGGTGTGCTGAACTTTCGAAAAGATGCTTGAGTTGTAAATGCAACTGACTTGGATTTTATGCTGATATGGCTGAAATTCGACAGCTCCAGAATGTGAACTGCTTTCGTTTGTAAAATTTGTAAGGGTACAGCTCCGCAAATAACAAAAAAAAAGAGAAAAGAGTTCAAGCACATGCCGCTTGAGTACAAAAGGAATGTGTATCAAAAACGAAAAACAAAACACCAAAAAAACACAGAAGAAAACAATAACCGTATTGTGATTTATATGTATTAGGCGTTAATTATTTTAGATGGCTGAGCTCGATGAGTTTCGCAAGGCGGCAAATTTTATCACTAGTGGGGCTAAATCTTTTTAGCTAGCTTATGCGAGTAAAAAAATCGTATATCATTAAATCACGAAACGGCGTATCGAAAATGAACAAGTAATGTAAAAACCCAAAAAACACAAAAGCAACTACAAAACTTGAAAAAGTTTCATGTTTAAACAGCGTGTTGCAATACAATTTTAAAACGGAAATCGAATCAATTCTAATTATATGACATTATGACAAAACAAAAGCCTGTGTAGTTAGTTGGACAGCTATAAAAACAAAGCAAAATAATAAATAATATAAAAAGCAAACGAAAAAATGACTAAGTAAAGGCAGCAGCTAGCAGAAAGCAAAACTAATTTTTGCTAAGGCCAAAATTTATGCCCAATTTTGTTAGACATTAATTACGTATAGCATCATTAACGAAACGGCACTGAAATGGCAGTGAGAGAAAAATATTTAGTATTAAGGTGGCTGTGCTTGGCATCTGTGAGCGCCGGTCAAGAACGAGGCGCTAAAAGTACATAGTATAACTATTATTTTGGCACACAAACAAACTAAATTGAAATGAGATTGAAAACTATATTTTTAACGGGATGCTACGTAATTATGTGGCAATTTAATCGATAATGCCGCCATAGCCGCTATAAAATACTAAACAAAAAGATGTCGCAAAATATTTGGTGATTTAGGCGAGTAAAGTGAATAAAATTAGAAAACCTAGCCAACATTATTATACTATATATAATATATATATAGTTGCCACTTTCCAACTTTTTGATAGCGGTTTTTTGCAGTGGGAAGCATCTTCAGCATCCATGGTGGAAAAGTTGAAATATTTTTTTGTGCTTAGCCTAAGCGCGTAAATGATATTTGAAGCTTAAATCATTATTATGTAAAAGAGCTAATGAAATATTGTATATCGTATTATATATATATATATCATATATATAGACATAAATGTGTATGTCAACGCTATGAATTAAAAATGCTAGGCTAAGATTAGAGAGAATGTTTGTATATTTTTTTTTTTTTGTCTGCCTGAGTTTAGGTTTAAGGCTATTTATAGAACGTAAATGGAAGCATTTAAAAGGTACAGTAAAACTTAAAAAGCTTCCAGAAAAATTGTAAAACCATTTGAGGCGATCTACCAAAAAAAAAAAAACGCAATTGAAGTGTAAAGCGTTATGTTTAACGGACGGAAATATGCTGGGCCTTATGATTCCTCGTTGGGTCTTTATTCACTAAGGCCAATACTAGTATTATTAATTAACTAATTTATTGTTTGCATATTTTATTTTTTGTTTTTTCTCTTTCGAAAGAATATACATATGTATGTATTTATTCAAAGTAATTGCTGAAAATGCTTATAATAATATGCCGGTTCGTTAAACTGATTTATAGGCACTCGGAATCTAATTCTTATTTTTCCGTCGGCTGCAATTGCGCGAGCTTTTTAACCTTGGGATTAGCGTTCGATAAATAAGAGATAAGTATGGTGTTACATATTTTTTTTTGTTTTTCCGATAATTGTCAATACGATTCAATTATAGGCGTTCAATTATATAGGAGCTTACACACGAAAACGAAATAGAAATTCAAATGAAATCGCTATTAAGGCCTAAGTAGTAATGATAATTGCAGGAAGCGGAGTTGGCGATAAAGAAAGGTACGGAAAAATGATAAATAAGAAGCGTATTGGGGGAGTAAACTAACATAAGTCGAGAACAAGCTCAATATTAAGTTGAGAATAAAGATACGAGAACAAAAGGAAAATAAAATACTTATAAATGCTACACGAGTGCGTTTTTGTCAATCAAATTTATGGGCACCATTTTTCCAAGTGCTAGCAGAATTTTACATTGTGCGAATTTAACATTATACATAGCGTAATTTCTACGAATGGCTACATTTGTCGTAATTTTTAGCACTTTTACGATCAGCGTAATTTCTAGCATACGCGGCTAATGTACGTGCCCTAAAGGGCGCCATTGTGCGAAATGGGGAATAGCTACTCATGAC
AAGCGGCTCGGAGCCCGGCTTAACGTTATTTGTTCAGCGTGGCAACCGATCACTGCGCATACGCTGCGAAAGAGCGCGCTGCCAAATGGCTCGCCCGCGCCGATACGGTATTCTTTTTTCTTGCACCGACGCGGTCACACTGCCGATTTGTAAGGCAGATCGCTTTTTCGGCCAGTGAGCGAAACGTTGTGAAGCGACGTAGCGGTTAGCAGACGTAACGTGGATAAGGCGCAGAGCGACGAGTTGTTGCAAGATTATATTTTTTTTTGATGTGTGTGCTATAAACGCAGCGTAATAATCGCGCTGCCGAACAGGATATACAAACAAATCGAATTACATCAAGCAGCCTAATGCATGAAATGAAAGGATGGCCCGCAAGCGGTAAAAGCGTGTTCAGCTAAGTTAGCTTTAAGGAGTGCCTGTCGCAGGGATAGCAACGGAGAGAGGGGCGACAGCAGAGAGCGAGAGAGAGAGTAGGATGAGAAACAGGATTATTCGAAAGTGTATGCTACCTCGAGTGCGCGCAGTTGTGGAGGAGTGAGACGAAAAGCGCGAAGTGCTAAGAATGCGCTAAATGAGGAAGTAGAGCAAATAGCGCTATAGCGTGCGGCGTCGGTTTGTAATTTGAATTATTGTGGCTCGTATGCCTCGCGATAGAGAAAAAGCAAGCCATAAAAGAATACACGAAAAAGCGTTCTTTTTTTTGCCACTTTTTTTTTAATGTTTCGAAAACACGTCGAGCGGTAAATATGTCGCCGTCGTCGGGAGAGTGCTCTCTTAGTTTATCACAATAAGCTTTAAAGTCTGAAGTTCAAAAAGCTCATAAAAACAAGAAACAAAAAACAAAAAACAACTAACGCGCTTAAACAAAGTAAATTGGGCTTAACTAGCTGAGACATAAGCATACTTAACACCTAAAGCTCGCAGTTAAAGCATATCCTCAAAGAGAGACGGCAAGAGGAGGGCGGCTAAAGAAAGCGTAGAGCGAGTAGAGGTGAAAAACTGGAAAAGTGTTAACTAGGAAACTGGATTATCAATGCAATGCGCAGCGCAGCCTAGACGTCGCAGTCGGCGCGACCAAACACATTGTAGTTTTGTTTTGGGATTAACAATAATGCACGCCGTTGGCGTCGCTAGCCGCGTGGCTGACTCTGCTGGCTGTTGACGCTGGCTTTGGCGATTTGCGAAGAAACCGTTCGCGAGGAACTGGTTGAGTAAAGCGAGATCCGCCGATCAGACAACCTAGCGGGGGGAGGGCAGGGTGACATGGGCGGGGTGCGCCCGAAATAGGGTGGTGCACTAAACTGGATTCGCAAGGAGCTCAAAAACGTTTCGGCAAACGCCTTGCCATATAGAACGGTTTTTTAACAGAACCAAAACTATGAGATAAGATGATGATTGTATAGGCTGAAGATGTGGATGTCTCACCCTAATAGTAAGCTGAATCGCATCCATTGTTTTGGTCTTAGATGTCAGTTTTACAATGGTAGCATACTTAAGTATGAAACATACTAAAATTGGTGCAGATTAAAATATAGCCATACATTTGCATGGACTAGTTCCGGCTTAAACCGCGTTGGATAGATGCGCTACAATTGCATAGCTCGGAAATGTTGTCTATAAATTCGGATGCTATTCGCACACATATAAGTGACTGAGCAAAGCCATGGACCAATGGTAAATCCAAAAAGGAAGTGATGCTAAGCTGTTAATCTGGCTAATAAGCCTATAGGGATTGTTGAGAATCCTACCTTAAAGAATTTCACTTACAACGGTTCGCAATTCGTCGCATGCGCGCGCTCCAAATGCGTTCTCTGTGTTTGCTGCGCATTCGCCTTTTTCGAGTCCTGTTCGTTTGGTCTCCTTTTGGGGGCAGCTCCTTTTGCTTCGCTCTTGGGCGCTTGCGCTTTTCATAGCTTGTATTCAGCCGTGTGCTTTCTTTTTGCGGCTGTCTTCTTGGCTTCGGCATTGTGTCCGCTTTTCGACGGTGTCGTTTTTTTCCTCTCTGCATTTCTCTGCTTCTCTCTCTCTGTTGTGCCTCGCTTGATGTCGTTCTGCCGCTCTCGGCTCGATCGCTGTCCGCTCGTCGTCGCTCTGGCTTCCGCCGATGGCTGCATGCTCTCTCAGATTTGGAAACCAGCTCTTTTTCGAAGCAAGAAAGAGAGTTTTTTACGTCGGCTAAAAGGCGAGTTTTTTCCAATTCCGTCTAGCACTGGCATAGGCACTCTTTCTTTGTGTTTTTGCACTGCGCTATTGCTCGCTAGTAGTGAGTGGCGGCCACTTTCCTGTTGCCAGCTTTGGAAGCGAACTCTCCTAGCCTCCACTACCCGCTGTGTTAGTGCCGCATGCACCCAACTGGGGCTTAATTGGGTATGTTGCATTATGGAGCGTTCGTTTGTGCTAATCACTGAAATCAACACAGAATGTAATGCGATGCGGAATCGGATCGGTTGTAGGTGTAGCTTCGGGTGCGATTGACCGGCCAGTAAAATGTTCCACTTTTCGCCTGTAACTCGAAATAGCTCCTTCGACGGAATTCTCGGCTCAACGATGAGAAAGTATGGGGAATGCAGCTCGTATATCATAGTATTTATGACAAATTTGATAATGTCAGAGAATTTTTTGTTGTTGCATTATTCGATAAATCTATGAAATCCATCAATGCCATCAAGCGTTTAAAAGCAATTAAATGATCATTACTAGAAAGTTTTGAATGCCCTGAGCTACGTGGTGGCAATCTTTTATTATTACGCACAGGGTCGATGATCTCTGCTATTTTTTTTATATTTTTAATCGAGCTGCAACCGAAGAGCGTTGAAAGCTATTATGTTTATTAACTTTTACGGCTGGCATTTCCTGAGCTGCGAGTTTCCGCTCATAAACTCCTTTTTTTTCTGGCTATTATAGCCATGGTAAGTCGTGTGATTCTTTGCAGTTTTAGCGCAGAAGGAAGTTGTATGTAAAAGTTTAATAAATCGGTATATAAAAGTTAGCAAATGCATATGCATTATAACAGGTGAAAAAAAATGCTGAAGTAGTATAATAAGAGTGTTAATTTTGATGCAGGATTATATTGGGGCAAAAGACAAAGAAGGGCATCCTATTGCATTGCAGTTCTACTTGCAGAGGTGTCTACTACTGCGGGCTAACAGAAAAAGTGTTCGAGAGATGGAAAAAAAAAACCAAAAAAAAAACAAAAATCTCTGCGATGCTTGCAATGTAGCAATTTTTGTAACCTATTAAGTTCTCACAACTTTAACGCCCGCCATGTGCAATGCCATAAGTTTGTTCAACTTTGTGGAGTCGAACAGCCTGCATTTAGCCACAAATTTAGGGTAATTAATTTTTTTGGGGCAGGTTTAATGTGGCACGTACGGTTTCGAAGTAAAGCGAGACGCGAGAACCAACGCACGACGCATTCGATTTTCATCATATATGTAACACGAACCAGAGCGCATAATAGAAAGAATGTGTTGGAATAAAACGACAGCTGGGGCTCCTCGGAGCCAACCAGTGAATAGCAGCAGCAGGAGAAAGCATAGGAGCAGCGAGAGG
CACGCCCTTGAGTCCTGGCATATTGCAGCATGCGCTCCTTTTTGCTGTTTGTTCACAGCTCAAATTACGATATGCTATGGCATTGGGCATCGCAAACGGCATGGTGTTGATTGTTTGTATGGTGAGCGAGAAGGGAAATGAGGCGGCAATAGAGTAGTAGATAACTTTTGTATCTGCGGCGCTCGTGTCTCGGCTATTTGCCTAGTATATCCTGTGGCGCAGCCCAAAAGCAAGAGTAAAAGACGAAACACGGTGTGTATCGCGTACGACTGGCCCACATTCTTGACTTTTCGTACAGTCAGCGCGCACACGGACGGAATGCCCTATTAGGCGCCTCGCGGCTAGACTTCAAGTCAATATAAATATCCTACATAAGTAACTCTTATGCTATGTACATTATATTACGTATAGTATGTTATGTAGGCTTCTGGCGGCATGAAGTCGGCCTGTATGGCGATGCAATACGCAATCATGCTCATGGTTATGGATCAGTAAAGTGTGCACTTTTATGGTGCGCATCATGATCATCCAGCATTAAAAGTTGCTGCAACCAATTTAATGAATATATAAAATTTAAGAGGGCGAATATGGAGGAGGGCAGCTTAAATTGAAGATGCACTAAGAGATGAGGGTTAGCATTCCATTAATTAAGATTTATGTCTTTGAAAAGGAACGAGTCCGCATGAAAGTCAAAAGCTAGCACGCATAAGTTAAATCCTGAATAATACCATACATCTATACATTTGAATAAATAATGCACTTAAATGATGCATGTAAAGATGTGAGAAGCAATTGTAAGAATGCTTTCAGTTAGTGAGTAATAAGCGCAAGGTAGTCATGTAATACTCTAACTCTACAATTGAATTCGAATCTGCATACGGTTCTATTCCATCAATATATAATAACTTGTAGGATGTATGTCTGAGATGCATTTATTAAGCACCTAAAGCAAAAAACAGGCGCAATTCCACCGCATGCTAAACCCATCTTCGCAAGGACGCGCAGCTTAAAACGCAGATTATTGTGCAAGTATCGTCATATTTGGGCGTTCTTCAATTCATTTGAGACACGGTCACTTATTCGACCATTTAGCTGTCGCATTTATTAATTATTGAGATAACGTCCGCGCGAGATGGCGCAAGCAGGGAAAAAAAAAAAGCGGCTAAGGAGTATTCAGCTTGAAAATATAACTTTGGTCCAATTGCAGTGAATTAAGAAGTACGACTTAATTAAAAAAGTATTCGTAATATGAAATCGGCTAATTTCCTCATGTGTTAAATGCTGCTGTTTACCTCTTGCCTTTCCGGCTTGGCCAAAATAAAACAAAAAACAAAACGCTGTAAGTGATAAAGTAAACTTAAAAAAAACTAAATAATAATATAAAAAGAAAACAAAGTAAGGCAGTAAGATGTCGAGGAATCTTGTTGGCTGGTCAATTGATACTGGGATTGGTGATATGCGGCAGAGCAGTTCGGGCGTTCCGTTTCGTTCGCAGACGATACCGATGTCGATGGCTGCGTATCGCTGTTTTTCGGGAGCTACACCTTTGAAACCAGAATTGCAAGGCTTGCTTGCAGGTAGATTTGCATATGGCAATAGTGCATATGTTGTTTCTCTCAGGGTAATCTTTGGCTGCTCTGCGGTAAATCCCCAATGTTTTTGCGGCTCCTTGTCTGGCGATGATATTTGTTATTTGTTCCCATAGTCATTATGCATCAACATGAAAATCTTTTCACGCAGCCTCTTTCTCGTTCTCTTGCTGCTCTGTGCTTCTCTAGTTTGATATTAGTGTTGACAACTACTTTCAAGTTTTTTGTTTGGGAAAATGCGCTCAAATGGCGCAAAGAATGAATTGAGGTTTTGCTTCTGCTAGCAGATATGTATGTACTGTTTTTTGCTGGCCAAAATAATTTGAAAACAGTCCTCAGCATGTGTCGGCTGGCATTAAATGCGTTTAAGAAAAATCCATGGATTGTATGGATTTAAATCTCAGCTGTCTTGGCTTGGCTTCGGATGGATCTTAAACTTGACTGTGGCTCTCCTAACACTACACTCTACAACTGTGCTTGCCATCTCTCGGCTTTTTATTTTGACTTTTAATGATGGCAGCTGCGAATTAGTGCGCCTGACTTACCTCTTTCTAGGCCTGATGGGCCTAGTTAATGAGTTTTCTTTATGGGCCAACAGCATGTACCTTTTGCGAGTCGTAGATAGCAGGTACTCACCGACAGATAGTATTCGAGTCAAGATACTAAGATACACCAGCAGAGACGAACGAAGGCGCACTGACCGAACAAGTTTTTTAGTTTTTTTTTTTTTTTTCCTTACTGTTTTCTTGGTTTTTGATTGGCTGGTCTCTGTTGGTCTTCTCTGCCTTTGTGTGGCTTTGTCTCGCTTGAAG FALCON-0.1.3/test_data/t2.fofn000077500000000000000000000000101237512144000156150ustar00rootroot00000000000000./t2.fa