pax_global_header00006660000000000000000000000064137505277270014531gustar00rootroot0000000000000052 comment=36d12438d92bd90af0b4e592dc2ba6b6de21bb38 nim-lapper-0.1.7/000077500000000000000000000000001375052772700136025ustar00rootroot00000000000000nim-lapper-0.1.7/.gitignore000066400000000000000000000000241375052772700155660ustar00rootroot00000000000000nimcache src/lapper nim-lapper-0.1.7/LICENSE000066400000000000000000000020621375052772700146070ustar00rootroot00000000000000MIT License Copyright (c) 2017 Brent S. Pedersen Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. nim-lapper-0.1.7/README.md000066400000000000000000000072731375052772700150720ustar00rootroot00000000000000simple, fast interval searches for nim This uses a binary search in a sorted list of intervals along with knowledge of the longest interval. It works when the size of the largest interval is smaller than the average distance between intervals. As that ratio of largest-size::mean-distance increases, the performance decreases. 
On realistic (for my use-case) data, this is 1000 times faster to query results and >5000 times faster to check for presence than a brute-force method. Lapper also has a special case `seek` method when we know that the queries will be in order. This method uses a cursor to indicate that start of the last search and does a linear search from that cursor to find matching intervals. This gives an additional 2-fold speedup over the `find` method. API docs and examples in `nim-doc` format are available [here](https://brentp.github.io/nim-lapper/index.html) See the `Performance` section for how large the intervals can be and still get a performance benefit. To use this, it's simply required that your type have a `start(m) int` and `stop(m) int` method to satisfy the [concept](https://nim-lang.org/docs/manual.html#generics-concepts) used by `Lapper` You can install this with `nimble install lapper`. ## Example ```nim import lapper import strutils # define an appropriate data-type. it must have a `start(m) int` and `stop(m) int` method. #type myinterval = tuple[start:int, stop:int, val:int] # if we want to modify the result, then we have to use a ref object type type myinterval = ref object start: int stop: int val: int proc start(m: myinterval): int {.inline.} = return m.start proc stop(m: myinterval): int {.inline.} = return m.stop proc `$`(m:myinterval): string = return "(start:$#, stop:$#, val:$#)" % [$m.start, $m.stop, $m.val] # create some fake data var ivs = new_seq[myinterval]() for i in countup(0, 100, 10): ivs.add(myinterval(start:i, stop:i + 15, val:0)) # make the Lapper "data-structure" var l = lapify(ivs) var empty:seq[myinterval] assert l.find(10, 20, empty) var notfound = not l.find(200, 300, empty) assert notfound var res = new_seq[myinterval]() # find is the more general case, l.seek gives a speed benefit when consecutive queries are in order. 
echo l.find(50, 70, res) echo res # @[(start: 40, stop: 55, val:0), (start: 50, stop: 65, val: 0), (start: 60, stop: 75, val: 0), (start: 70, stop: 85, val: 0)] for r in res: r.val += 1 # or we can do a function on each overlapping interval l.each_seek(50, 60, proc(a:myinterval) = inc(a.val)) # or l.each_find(50, 60, proc(a:myinterval) = a.val += 10) discard l.seek(50, 70, res) echo res #@[(start:40, stop:55, val:12), (start:50, stop:65, val:12), (start:60, stop:75, val:1)] ``` ## Performance The output of running `bench.nim` (with -d:release) which generates *200K intervals* with positions ranging from 0 to 50 million and max lengths from 10 to 1M is: | max interval size | lapper time | lapper seek time | brute-force time | speedup | seek speedup | each-seek speedup | | ----------------- | ----------- | ---------------- | --------------- | ------- | ------------ | ----------------- | |10|0.06|0.04|387.44|6983.81|9873.11|9681.66| |100|0.05|0.04|384.92|7344.32|10412.97|15200.84| |1000|0.06|0.05|375.37|6250.23|7942.50|15703.24| |10000|0.15|0.14|377.29|2554.61|2702.13|15942.76| |100000|0.99|0.99|377.88|383.36|381.37|16241.61| |1000000|12.52|12.53|425.61|34.01|33.96|17762.58| Note that this is a worst-case scenario as we could also simulate a case where there are few long intervals instead of many large ones as in this case. Even so, we get a 34X speedup with `lapper`. Also note that testing for presence will be even faster than the above comparisons as it returns true as soon as an overlap is found. nim-lapper-0.1.7/bench.nim000066400000000000000000000072721375052772700153760ustar00rootroot00000000000000import lapper import algorithm import math import strutils import random import times #type myinterval = tuple[start:int, stop:int] #proc start(m: myinterval): int {.inline.} = return m.start #proc stop(m: myinterval): int {.inline.} = return m.stop # define an appropriate data-type. it must have a `start(m) int` and `stop(m) int` method. 
#type myinterval = tuple[start:int, stop:int, val:int] # if we want to modify the result, then we have to use a ref object type type myinterval = ref object start: int stop: int val: int proc start(m: myinterval): int {.inline.} = return m.start proc stop(m: myinterval): int {.inline.} = return m.stop proc `$`(m:myinterval): string = return "(start:$#, stop:$#, val:$#)" % [$m.start, $m.stop, $m.val] proc randomi(imin:int, imax:int): int = return imin + random(imax - imin) proc brute_force(ivs: seq[Interval], start:int, stop:int, res: var seq[Interval]) = if res.len != 0: res.set_len(0) for i in ivs: if i.start <= stop and i.stop >= start: res.add(i) proc make_random(n:int, range_max:int, size_min:int, size_max:int): seq[myinterval] = result = new_seq[myinterval](n) for i in 0.. Module lapper

Module lapper

This module provides a simple data structure for fast interval searches. It does not use an interval tree; instead, it assumes that most intervals are of similar length or, more exactly, that the longest interval in the set is not long compared to the average distance between intervals. On any dataset where that is not the case, this method will not perform well. Where the assumption holds (as it often does with genomic data), we can sort by start and use binary search on the starts, accounting for the length of the longest interval. The advantages of this approach are simplicity of implementation and speed. In realistic tests, queries returning the overlapping intervals are 1000 times faster than brute force, and queries that merely check for overlaps are more than 5000 times faster.
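The approach described above can be sketched in a few lines. This is a minimal, self-contained illustration and not the library's own code: the names `Iv`, `firstAtOrAfter`, and `overlapping` are hypothetical, and the query is treated as half-open.

```nim
import algorithm

type Iv = tuple[start: int, stop: int]   # hypothetical interval type

proc firstAtOrAfter(ivs: seq[Iv], pos: int): int =
  ## Binary search: index of the first interval whose start is >= pos.
  ## Assumes `ivs` is sorted by start.
  var lo = 0
  var hi = ivs.len
  while lo < hi:
    let mid = (lo + hi) div 2
    if ivs[mid].start < pos:
      lo = mid + 1
    else:
      hi = mid
  result = lo

proc overlapping(ivs: seq[Iv], maxLen, start, stop: int): seq[Iv] =
  ## Collect intervals overlapping the half-open query start..stop.
  result = @[]
  # An interval that starts before `start - maxLen` cannot reach `start`,
  # so the scan can begin at the first start >= start - maxLen.
  for i in firstAtOrAfter(ivs, start - maxLen) ..< ivs.len:
    let x = ivs[i]
    if x.start >= stop: break        # sorted by start: nothing later can overlap
    if x.stop > start: result.add(x)

var ivs: seq[Iv] = @[]
for s in countup(0, 100, 10):
  ivs.add((start: s, stop: s + 15))
ivs.sort(proc (a, b: Iv): int = cmp(a.start, b.start))

# Track the longest interval so the binary search can account for it.
var maxLen = 0
for iv in ivs:
  maxLen = max(maxLen, iv.stop - iv.start)

let hits = overlapping(ivs, maxLen, 50, 70)
echo hits
```

The scan window grows with `maxLen`, which is why a single very long interval degrades every query.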

The main methods are find and seek; the latter uses a cursor and is very fast when the queries are sorted. This is another innovation in this library that allows an additional ~50% speed improvement when consecutive queries are known to be in sorted order.
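The cursor idea can be illustrated with a simplified sketch. It assumes every query starts at or after the previous one and therefore omits the fallback to binary search that a general implementation needs when a query moves backwards. `Scanner` and `seekCount` are hypothetical names, not part of the library.

```nim
type
  Iv = tuple[start: int, stop: int]
  Scanner = object        # hypothetical stand-in for Lapper
    ivs: seq[Iv]          # sorted by start
    maxLen: int           # length of the longest interval
    cursor: int           # where the previous query began scanning

proc seekCount(s: var Scanner, start, stop: int): int =
  ## Count intervals overlapping the half-open query start..stop.
  ## Assumes `start` never decreases between calls.
  # Advance past intervals that begin too early to reach `start`.
  while s.cursor < s.ivs.len and s.ivs[s.cursor].start < start - s.maxLen:
    inc s.cursor
  # Linear scan from the cursor; no binary search is needed.
  for i in s.cursor ..< s.ivs.len:
    let x = s.ivs[i]
    if x.start >= stop: break
    if x.stop > start: inc result

var ivs: seq[Iv] = @[]
for b in countup(0, 100, 10):
  ivs.add((start: b, stop: b + 15))
var sc = Scanner(ivs: ivs, maxLen: 15, cursor: 0)

let first = sc.seekCount(10, 20)
let second = sc.seekCount(50, 70)
echo first, " ", second, " cursor=", sc.cursor
```

Because the cursor only moves forward, a run of sorted queries touches each early interval once instead of repeating a binary search per query.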

For both find and seek, if the given intervals parameter is nil, the function will only return a boolean indicating whether any interval in the set overlaps the query. This is much faster than filling the result sequence.
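A presence check can stop at the first hit, which is where the extra speed comes from. The sketch below shows only that early exit; it scans from the beginning instead of using a binary search, and `anyOverlap` is a hypothetical name.

```nim
type Iv = tuple[start: int, stop: int]

proc anyOverlap(ivs: seq[Iv], start, stop: int): bool =
  ## True as soon as one interval overlaps the half-open query start..stop.
  ## Assumes `ivs` is sorted by start.
  for x in ivs:
    if x.start >= stop: return false   # everything later starts past the query
    if x.stop > start: return true     # first overlap found, stop scanning
  return false

var ivs: seq[Iv] = @[]
for s in countup(0, 100, 10):
  ivs.add((start: s, stop: s + 15))

let found = anyOverlap(ivs, 10, 20)
let missing = anyOverlap(ivs, 200, 300)
echo found, " ", missing
```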

The example below shows off most of the API of Lapper.

import lapper
import strutils  # provides the `%` format operator used by `$` below

type myinterval = ref object
  start: int
  stop: int
  val: int

proc start(m: myinterval): int {.inline.} = return m.start
proc stop(m: myinterval): int {.inline.} = return m.stop
proc `$`(m:myinterval): string = return "(start:$#, stop:$#, val:$#)" % [$m.start, $m.stop, $m.val]

create some fake data

var ivs = new_seq[myinterval]()
for i in countup(0, 100, 10):
  ivs.add(myinterval(start:i, stop:i + 15, val:0))
make the Lapper "data-structure"
var l = lapify(ivs)
var empty:seq[myinterval]
assert l.find(10, 20, empty)
var notfound = not l.find(200, 300, empty)
assert notfound
var res = new_seq[myinterval]()
find is the more general case, l.seek gives a speed benefit when consecutive queries are in order.
echo l.find(50, 70, res)
echo res
# @[(start:40, stop:55, val:0), (start:50, stop:65, val:0), (start:60, stop:75, val:0)]
# (the interval starting at 70 is excluded because overlaps are half-open)
for r in res:
   r.val += 1
or we can do a function on each overlapping interval
l.each_seek(50, 60, proc(a:myinterval) = inc(a.val))
or
l.each_find(50, 60, proc(a:myinterval) = a.val += 10)
discard l.seek(50, 70, res)
echo res
# @[(start:40, stop:55, val:12), (start:50, stop:65, val:12), (start:60, stop:75, val:1)]

Types

Interval = concept i
    start(i) is int
    stop(i) is int
An object/tuple must implement these 2 methods to use this module
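As an illustration of the requirement, the sketch below redeclares the concept locally so that it runs on its own and adapts a type whose fields have different names. `Gene`, `chromStart`, `chromEnd`, and `length` are hypothetical names used only for this example.

```nim
type
  Interval = concept i     # local copy of the concept, for a standalone example
    start(i) is int
    stop(i) is int

  Gene = object            # hypothetical user type
    chromStart: int
    chromEnd: int

# The two procs are all that is needed to satisfy the concept.
proc start(g: Gene): int = g.chromStart
proc stop(g: Gene): int = g.chromEnd

proc length[T: Interval](iv: T): int =
  ## Works for any type that provides start() and stop().
  iv.stop - iv.start

let g = Gene(chromStart: 100, chromEnd: 250)
echo g.length
```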
Lapper[T] = object
  intervals: seq[T]
  max_len: int
  cursor: int                  ## `cursor` is used internally by ordered find
  
Lapper enables fast interval searches

Procs

proc overlap[T: Interval](a: T; start: int; stop: int): bool {.inline.}
overlap returns true if half-open intervals overlap
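With half-open intervals the stop position is excluded, so two intervals that merely touch do not overlap. A minimal sketch of the test on plain integers (the name `overlaps` is hypothetical):

```nim
proc overlaps(aStart, aStop, qStart, qStop: int): bool =
  ## Half-open overlap: [aStart, aStop) against [qStart, qStop).
  aStop > qStart and aStart < qStop

doAssert overlaps(40, 55, 50, 70)        # shares positions 50..54
doAssert not overlaps(70, 85, 50, 70)    # begins exactly where the query ends
doAssert not overlaps(0, 50, 50, 70)     # ends exactly where the query begins
```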
proc lapify[T: Interval](ivs: var seq[T]): Lapper[T]
create a new Lapper object; ivs will be sorted.
proc len[T: Interval](L: Lapper[T]): int
len returns the number of intervals in the Lapper
proc find[T: Interval](L: Lapper[T]; start: int; stop: int; ivs: var seq[T]): bool
Fill ivs with all intervals in L that overlap start .. stop. If ivs is nil, this just returns true if it finds an interval and false otherwise.
proc each_find[T: Interval](L: Lapper[T]; start: int; stop: int; fn: proc (v: T))
call fn(x) for each interval x in L that overlaps start..stop
proc seek[T: Interval](L: var Lapper[T]; start: int; stop: int; ivs: var seq[T]): bool
Fill ivs with all intervals in L that overlap start .. stop. This method works when queries to this Lapper arrive in sorted (start) order; it uses a linear search from the last query instead of a binary search. If ivs is nil, this just returns true if it finds an interval and false otherwise.
proc each_seek[T: Interval](L: var Lapper[T]; start: int; stop: int; fn: proc (v: T)) {.inline.}
Call fn(x) for each interval x in L that overlaps start..stop. This assumes that subsequent calls to this function will be in sorted order.
nim-lapper-0.1.7/example.nim000066400000000000000000000026631375052772700157510ustar00rootroot00000000000000import lapper import strutils # define an appropriate data-type. it must have a `start(m) int` and `stop(m) int` method. #type myinterval = tuple[start:int, stop:int, val:int] # if we want to modify the result, then we have to use a ref object type type myinterval = ref object start: int stop: int val: int proc start(m: myinterval): int {.inline.} = return m.start proc stop(m: myinterval): int {.inline.} = return m.stop proc `$`(m:myinterval): string = return "(start:$#, stop:$#, val:$#)" % [$m.start, $m.stop, $m.val] # create some fake data var ivs = new_seq[myinterval]() for i in countup(0, 100, 10): ivs.add(myinterval(start:i, stop:i + 15, val:0)) # make the Lapper "data-structure" var l = lapify(ivs) var empty:seq[myinterval] assert l.find(10, 20, empty) var notfound = not l.find(200, 300, empty) assert notfound var res = new_seq[myinterval]() # find is the more general case, l.seek gives a speed benefit when consecutive queries are in order. 
echo l.find(50, 70, res) echo res # @[(start: 40, stop: 55, val:0), (start: 50, stop: 65, val: 0), (start: 60, stop: 75, val: 0), (start: 70, stop: 85, val: 0)] for r in res: r.val += 1 # or we can do a function on each overlapping interval l.each_seek(50, 60, proc(a:myinterval) = inc(a.val)) # or l.each_find(50, 60, proc(a:myinterval) = a.val += 10) discard l.seek(50, 70, res) echo res #@[(start:40, stop:55, val:12), (start:50, stop:65, val:12), (start:60, stop:75, val:1)] nim-lapper-0.1.7/lapper.nimble000066400000000000000000000007531375052772700162620ustar00rootroot00000000000000# Package version = "0.1.7" author = "Brent Pedersen" description = "fast, simple interval overlaps with binary search" license = "MIT" # Dependencies requires "nim >= 0.19.2" #, "nim-lang/c2nim>=0.9.13" srcDir = "src" skipFiles = @["bench.nim", "example.nim"] skipDirs = @["tests"] task test, "run the tests": exec "nim c -d:release --lineDir:on -r src/lapper" task docs, "make docs": exec "nim doc2 src/lapper; mkdir -p docs; mv lapper.html docs/index.html" nim-lapper-0.1.7/nim.cfg000066400000000000000000000000321375052772700150410ustar00rootroot00000000000000path = "$projectPath/src" nim-lapper-0.1.7/src/000077500000000000000000000000001375052772700143715ustar00rootroot00000000000000nim-lapper-0.1.7/src/lapper.nim000066400000000000000000000311271375052772700163650ustar00rootroot00000000000000## This module provides a simple data-structure for fast interval searches. It does not use an interval tree, ## instead, it operates on the assumption that most intervals are of similar length; or, more exactly, that the ## longest interval in the set is not long compared to the average distance between intervals. On any dataset ## where that is not the case, this method will not perform well. For cases where this holds true (as it often ## does with genomic data), we can sort by start and use binary search on the starts, accounting for the length ## of the longest interval. 
The advantage of this approach is simplicity of implementation and speed. In realistic ## tests queries returning the overlapping intervals are 1000 times faster than brute force and queries that merely ## check for the overlaps are > 5000 times faster. ## ## The main methods are `find` and `seek` where the latter uses a cursor and is very fast for cases when the queries ## are sorted. This is another innovation in this library that allows an addition ~50% speed improvement when ## consecutive queries are known to be in sort order. ## ## For both find and seek, if the given intervals parameter is nil, the function will return a boolean indicating if ## any intervals in the set overlap the query. This is much faster than modifying the ## intervals. ## ## The example below shows off most of the API of `Lapper`. ## ## .. code-block:: nim ## import lapper ## type myinterval = ref object ## start: int ## stop: int ## val: int ## ## proc start(m: myinterval): int {.inline.} = return m.start ## proc stop(m: myinterval): int {.inline.} = return m.stop ## proc `$`(m:myinterval): string = return "(start:$#, stop:$#, val:$#)" % [$m.start, $m.stop, $m.val] ## ## create some fake data ## .. code-block:: nim ## var ivs = new_seq[myinterval]() ## for i in countup(0, 100, 10): ## ivs.add(myinterval(start:i, stop:i + 15, val:0)) ## make the Lapper "data-structure" ## .. code-block:: nim ## l = lapify(ivs) ## empty:seq[myinterval] ## .. code-block:: nim ## l.find(10, 20, empty) ## notfound = not l.find(200, 300, empty) ## assert notfound ## .. code-block:: nim ## res = new_seq[myinterval]() ## find is the more general case, l.seek gives a speed benefit when consecutive queries are in order. ## .. code-block:: nim ## echo l.find(50, 70, res) ## echo res ## # @[(start: 40, stop: 55, val:0), (start: 50, stop: 65, val: 0), (start: 60, stop: 75, val: 0), (start: 70, stop: 85, val: 0)] ## for r in res: ## r.val += 1 ## or we can do a function on each overlapping interval ## .. 
code-block:: nim ## l.each_seek(50, 60, proc(a:myinterval) = inc(a.val)) ## or ## .. code-block:: nim ## l.each_find(50, 60, proc(a:myinterval) = a.val += 10) ## .. code-block:: nim ## discard l.seek(50, 70, res) ## echo res ## # @[(start:40, stop:55, val:12), (start:50, stop:65, val:12), (start:60, stop:75, val:1)] import algorithm type Interval* = concept i ## An object/tuple must implement these 2 methods to use this module start(i) is int stop(i) is int Lapper*[T] = object ## Lapper enables fast interval searches intervals: seq[T] max_len*: int cursor: int ## `cursor` is used internally by ordered find template overlap*[T:Interval](a: T, start:int, stop:int): bool = ## overlap returns true if half-open intervals overlap #return a.start < stop and a.stop > start a.stop > start and a.start < stop proc iv_cmp[T:Interval](a, b: T): int = if a.start < b.start: return -1 if b.start < a.start: return 1 return cmp(a.stop, b.stop) proc lapify*[T:Interval](ivs:var seq[T]): Lapper[T] = ## create a new Lapper object; ivs will be sorted. sort(ivs, iv_cmp) result = Lapper[T](max_len: 0, intervals:ivs) for iv in ivs: if iv.stop - iv.start > result.max_len: result.max_len = iv.stop - iv.start proc lowerBound[T:Interval](a: var seq[T], start: int): int = result = a.low var count = a.high - a.low + 1 var step, pos: int while count != 0: step = count div 2 pos = result + step if a[pos].start < start: result = pos + 1 count -= step + 1 else: count = step proc len*[T:Interval](L:Lapper[T]): int {.inline.} = ## len returns the number of intervals in the Lapper L.intervals.len proc empty*[T:Interval](L:Lapper[T]): bool {.inline.} = return L.intervals.len == 0 iterator find*[T:Interval](L:var Lapper[T], start:int, stop:int): T = ## fill ivs with all intervals in L that overlap start .. stop. 
#if ivs.len != 0: ivs.set_len(0) shallow(L.intervals) let off = lowerBound(L.intervals, start - L.max_len) for i in off..L.intervals.high: let x = L.intervals[i] if likely(x.overlap(start, stop)): yield x elif x.start >= stop: break proc find*[T:Interval](L:var Lapper[T], start:int, stop:int, ivs:var seq[T]): bool = ## fill ivs with all intervals in L that overlap start .. stop. #if ivs.len != 0: ivs.set_len(0) shallow(L.intervals) let off = lowerBound(L.intervals, start - L.max_len) var n = 0 for i in off..L.intervals.high: let x = L.intervals[i] if x.overlap(start, stop): if n < ivs.len: ivs[n] = x else: ivs.add(x) n += 1 elif x.start >= stop: break if ivs.len > n: ivs.setLen(n) return len(ivs) > 0 proc count*[T:Interval](L:var Lapper[T], start:int, stop:int): int = ## fill ivs with all intervals in L that overlap start .. stop. shallow(L.intervals) let off = lowerBound(L.intervals, start - L.max_len) for i in off..L.intervals.high: let x = L.intervals[i] if x.overlap(start, stop): result.inc elif x.start >= stop: break proc each_find*[T:Interval](L:var Lapper[T], start:int, stop:int, fn: proc (v:T)) = ## call fn(x) for each interval x in L that overlaps start..stop let off = lowerBound(L.intervals, start - L.max_len) for i in off..L.intervals.high: let x = L.intervals[i] if x.overlap(start, stop): fn(x) elif x.start >= stop: break iterator seek*[T:Interval](L:var Lapper[T], start:int, stop:int): T = if L.cursor == 0 or L.intervals[L.cursor].start > start: L.cursor = lowerBound(L.intervals, start - L.max_len) while (L.cursor + 1) < L.intervals.high and L.intervals[L.cursor + 1].start < (start - L.max_len): L.cursor += 1 let old_cursor = L.cursor for i in L.cursor..L.intervals.high: let x = L.intervals[i] if x.overlap(start, stop): yield x elif x.start >= stop: break L.cursor = old_cursor proc seek*[T:Interval](L:var Lapper[T], start:int, stop:int, ivs:var seq[T]): bool = ## fill ivs with all intervals in L that overlap start .. stop inclusive. 
## this method will work when queries to this lapper are in sorted (start) order ## it uses a linear search from the last query instead of a binary search. ## if ivs is nil, then this will just return true if it finds an interval and false otherwise if ivs.len != 0: ivs.set_len(0) if L.cursor == 0 or L.intervals[L.cursor].start > start: L.cursor = lowerBound(L.intervals, start - L.max_len) let old_cursor = L.cursor while (L.cursor + 1) < L.intervals.high and L.intervals[L.cursor + 1].start < (start - L.max_len): L.cursor += 1 for i in L.cursor..L.intervals.high: let x = L.intervals[i] if x.overlap(start, stop): ivs.add(x) elif x.start >= stop: break L.cursor = old_cursor return ivs.len != 0 proc each_seek*[T:Interval](L:var Lapper[T], start:int, stop:int, fn:proc (v:T)) {.inline.} = ## call fn(x) for each interval x in L that overlaps start..stop ## this assumes that subsequent calls to this function will be in sorted order if L.cursor == 0 or L.cursor >= L.intervals.high or L.intervals[L.cursor].start > start: L.cursor = lowerBound(L.intervals, start - L.max_len) while (L.cursor + 1) < L.intervals.high and L.intervals[L.cursor + 1].start < (start - L.max_len): L.cursor += 1 let old_cursor = L.cursor for i in L.cursor..L.intervals.high: let x = L.intervals[i] if x.start >= stop: break elif x.stop > start: fn(x) L.cursor = old_cursor iterator items*[T:Interval](L: Lapper[T]): T = for i in L.intervals: yield i when isMainModule: import random import times import strutils proc randomi(imin:int, imax:int): int = return imin + rand(imax - imin) proc brute_force(ivs: seq[Interval], start:int, stop:int, res: var seq[Interval]) = if res.len != 0: res.set_len(0) for i in ivs: if i.overlap(start, stop): res.add(i) # example implementation type myinterval = tuple[start:int, stop:int, val:int] proc start(m: myinterval): int {.inline.} = return m.start proc stop(m: myinterval): int {.inline.} = return m.stop proc make_random(n:int, range_max:int, size_min:int, size_max:int): 
seq[myinterval] = result = new_seq[myinterval](n) for i in 0..