stringsearch-0.3.6.6/0000755000000000000000000000000012505631536012554 5ustar0000000000000000stringsearch-0.3.6.6/LICENCE0000644000000000000000000000276512505631536013553 0ustar0000000000000000Copyright (c)2010, Daniel Fischer All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Daniel Fischer nor the names of other contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. stringsearch-0.3.6.6/Setup.hs0000644000000000000000000000005612505631536014211 0ustar0000000000000000import Distribution.Simple main = defaultMain stringsearch-0.3.6.6/stringsearch.cabal0000644000000000000000000001003412505631536016232 0ustar0000000000000000-- stringsearch.cabal auto-generated by cabal init. For additional -- options, see -- http://www.haskell.org/cabal/release/cabal-latest/doc/users-guide/authors.html#pkg-descr. -- The name of the package. Name: stringsearch -- The package version. See the Haskell package versioning policy -- (http://www.haskell.org/haskellwiki/Package_versioning_policy) for -- standards guiding when and how versions should be incremented. Version: 0.3.6.6 Homepage: https://bitbucket.org/dafis/stringsearch Bug-reports: https://bitbucket.org/dafis/stringsearch/issues -- A short (one-line) description of the package. Synopsis: Fast searching, splitting and replacing of ByteStrings -- A longer description of the package. Description: This package provides several functions to quickly search for substrings in strict or lazy ByteStrings. It also provides functions for breaking or splitting on substrings and replacing all occurrences of a substring (the first in case of overlaps) with another. GHC before 6.10 are no longer supported, other compilers only if they support BangPatterns. If you need it to work with other compilers, send a feature request. -- The license under which the package is released. License: BSD3 -- The file containing the license text. License-file: LICENCE -- The package author(s). Author: Daniel Fischer, Chris Kuklewicz, Justin Bailey -- An email address to which users can send suggestions, bug reports, -- and patches. Maintainer: daniel.is.fischer@googlemail.com -- A copyright notice. Copyright: (c) 2007-2011 Daniel Fischer, Chris Kuklewicz, Justin Bailey Category: Text, Search Build-type: Simple -- Extra files to be distributed with the package, such as examples or -- a README. Extra-source-files: CHANGES Tested-with: GHC == 6.10.4, GHC == 6.12.3, GHC == 7.0.2, GHC == 7.0.4, GHC == 7.2.1 -- Constraint on the version of Cabal needed to build this package. Cabal-version: >=1.6 Flag base4 Description: Choose base-4.* Flag base3 Description: Choose base-3.* if base-4 isn't available Default: False Library -- Modules exported by the library. Exposed-modules: Data.ByteString.Search Data.ByteString.Search.BoyerMoore Data.ByteString.Search.DFA Data.ByteString.Search.KarpRabin Data.ByteString.Search.KMP Data.ByteString.Search.KnuthMorrisPratt Data.ByteString.Search.Substitution Data.ByteString.Lazy.Search Data.ByteString.Lazy.Search.DFA Data.ByteString.Lazy.Search.KarpRabin Data.ByteString.Lazy.Search.KMP -- Packages needed in order to build this package. if flag(base4) Build-depends: base >= 4 && < 5, array >= 0.3 && < 0.6, bytestring >= 0.9 && < 1, containers >= 0.3 && < 0.6 else if flag(base3) Build-depends: base >= 3 && < 4, array >= 0.1 && < 0.4, bytestring >= 0.9 && < 1, containers >= 0.1 && < 0.4 else Build-depends: base >= 2 && < 3 Extensions: BangPatterns if flag(base4) ghc-options: -O2 -fspec-constr-count=4 -Wall else ghc-options: -O2 -Wall ghc-prof-options: -auto -- Modules not exported by this package. Other-modules: Data.ByteString.Search.Internal.BoyerMoore Data.ByteString.Search.Internal.KnuthMorrisPratt Data.ByteString.Search.Internal.Utils Data.ByteString.Lazy.Search.Internal.BoyerMoore -- Extra tools (e.g. alex, hsc2hs, ...) needed to build the source. -- Build-tools: source-repository head type: mercurial location: https://bitbucket.org/dafis/stringsearch stringsearch-0.3.6.6/CHANGES0000644000000000000000000000174712505631536013560 0ustar00000000000000000.3.6.6: - Add FlexibleContexts to allow compiling with 7.10.1 0.3.6.2: - move some INLINE pragmas to make it compile with 7.2.1 0.3.6.1: - fix error in docs 0.3.6: - fix typos in haddocks 0.3.5: - set up public repo at bitbucket 0.3.4: - split implementation of Boyer-Moore between strict and lazy targets - set spec-constr-count to 4 for ghc-7 0.3.3: - updated email address 0.3.2: - relaxed constraint on containers package (no incompatible API changes) 0.3.1: - fix spaceleak in split due to pairs holding on to first component - fix docs for splitKeepFront 0.3.0: - improved performance of old KMP searching functions (minor) - changed behaviour for empty patterns - changed return type to [Int] for KMP matching in strict BS - improved performance of old BM searching functions - (minor speedup for ordinary searches, major improvement for worst case) - added new functionality, - breaking, splitting and replacing - new algorithm (DFA) - simultaneous search for multiple patterns (KarpRabin) stringsearch-0.3.6.6/Data/0000755000000000000000000000000012505631536013425 5ustar0000000000000000stringsearch-0.3.6.6/Data/ByteString/0000755000000000000000000000000012505631536015517 5ustar0000000000000000stringsearch-0.3.6.6/Data/ByteString/Search.hs0000644000000000000000000002524612505631536017271 0ustar0000000000000000-- | -- Module : Data.ByteString.Search -- Copyright : Daniel Fischer (2007-2011) -- Chris Kuklewicz -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast overlapping Boyer-Moore search of strict -- 'S.ByteString' values. Breaking, splitting and replacing -- using the Boyer-Moore algorithm. -- -- Descriptions of the algorithm can be found at -- -- and -- -- -- Original authors: Daniel Fischer (daniel.is.fischer at googlemail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Search ( -- * Overview -- $overview -- ** Performance -- $performance -- ** Complexity -- $complexity -- ** Partial application -- $partial -- * Finding substrings indices , nonOverlappingIndices -- * Breaking on substrings , breakOn , breakAfter -- * Replacing , replace -- * Splitting , split , splitKeepEnd , splitKeepFront ) where import qualified Data.ByteString.Search.Internal.BoyerMoore as BM import Data.ByteString.Search.Substitution import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L -- $overview -- -- This module provides functions related to searching a substring within -- a string, using the Boyer-Moore algorithm with minor modifications -- to improve the overall performance and avoid the worst case -- performance degradation of the original Boyer-Moore algorithm for -- periodic patterns. -- -- When searching a pattern in a UTF-8-encoded 'S.ByteString', be aware that -- these functions work on bytes, not characters, so the indices are -- byte-offsets, not character offsets. -- $performance -- -- In general, the Boyer-Moore algorithm is the most efficient method to -- search for a pattern inside a string. The advantage over other algorithms -- (e.g. Naïve, Knuth-Morris-Pratt, Horspool, Sunday) can be made -- arbitrarily large for specially selected patterns and targets, but -- usually, it's a factor of 2–3 versus Knuth-Morris-Pratt and of -- 6–10 versus the naïve algorithm. The Horspool and Sunday -- algorithms, which are simplified variants of the Boyer-Moore algorithm, -- typically have performance between Boyer-Moore and Knuth-Morris-Pratt, -- mostly closer to Boyer-Moore. The advantage of the Boyer-moore variants -- over other algorithms generally becomes larger for longer patterns. For -- very short patterns (or patterns with a very short period), other -- algorithms, e.g. "Data.ByteString.Search.DFA" can be faster (my -- tests suggest that \"very short\" means two, maybe three bytes). -- -- In general, searching in a strict 'S.ByteString' is slightly faster -- than searching in a lazy 'L.ByteString', but for long targets, the -- smaller memory footprint of lazy 'L.ByteString's can make searching -- those (sometimes much) faster. On the other hand, there are cases -- where searching in a strict target is much faster, even for long targets. -- $complexity -- -- Preprocessing the pattern is /O/(@patternLength@ + σ) in time and -- space (σ is the alphabet size, 256 here) for all functions. -- The time complexity of the searching phase for 'indices' -- is /O/(@targetLength@ \/ @patternLength@) in the best case. -- For non-periodic patterns, the worst case complexity is -- /O/(@targetLength@), but for periodic patterns, the worst case complexity -- is /O/(@targetLength@ * @patternLength@) for the original Boyer-Moore -- algorithm. -- -- The searching functions in this module contain a modification which -- drastically improves the performance for periodic patterns. -- I believe that for strict target strings, the worst case is now -- /O/(@targetLength@) also for periodic patterns. -- I may be wrong, though. -- -- The other functions don't have to deal with possible overlapping -- patterns, hence the worst case complexity for the processing phase -- is /O/(@targetLength@) (respectively /O/(@firstIndex + patternLength@) -- for the breaking functions if the pattern occurs). -- $partial -- -- All functions can usefully be partially applied. Given only a pattern, -- the pattern is preprocessed only once, allowing efficient re-use. ------------------------------------------------------------------------------ -- Exported Functions -- ------------------------------------------------------------------------------ -- | @'indices'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- If the pattern is empty, the result is @[0 .. 'length' target]@. -- -- In general, @'not' . 'null' $ 'indices' pat target@ is a much more -- efficient version of 'S.isInfixOf'. {-# INLINE indices #-} indices :: S.ByteString -- ^ Pattern to find -> S.ByteString -- ^ String to search -> [Int] -- ^ Offsets of matches indices = BM.matchSS -- | @'nonOverlappingIndices'@ finds the starting indices of all -- non-overlapping occurrences of the pattern in the target string. -- It is more efficient than removing indices from the list produced -- by 'indices'. {-# INLINE nonOverlappingIndices #-} nonOverlappingIndices :: S.ByteString -- ^ Pattern to find -> S.ByteString -- ^ String to search -> [Int] -- ^ Offsets of matches nonOverlappingIndices = BM.matchNOS -- | @'breakOn' pattern target@ splits @target@ at the first occurrence -- of @pattern@. If the pattern does not occur in the target, the -- second component of the result is empty, otherwise it starts with -- @pattern@. If the pattern is empty, the first component is empty. -- -- @ -- 'uncurry' 'S.append' . 'breakOn' pattern = 'id' -- @ {-# INLINE breakOn #-} breakOn :: S.ByteString -- ^ String to search for -> S.ByteString -- ^ String to search in -> (S.ByteString, S.ByteString) -- ^ Head and tail of string broken at substring breakOn = BM.breakSubstringS -- | @'breakAfter' pattern target@ splits @target@ behind the first occurrence -- of @pattern@. An empty second component means that either the pattern -- does not occur in the target or the first occurrence of pattern is at -- the very end of target. To discriminate between those cases, use e.g. -- 'S.isSuffixOf'. -- -- @ -- 'uncurry' 'S.append' . 'breakAfter' pattern = 'id' -- @ {-# INLINE breakAfter #-} breakAfter :: S.ByteString -- ^ String to search for -> S.ByteString -- ^ String to search in -> (S.ByteString, S.ByteString) -- ^ Head and tail of string broken after substring breakAfter = BM.breakAfterS -- | @'replace' pat sub text@ replaces all (non-overlapping) occurrences of -- @pat@ in @text@ with @sub@. If occurrences of @pat@ overlap, the first -- occurrence that does not overlap with a replaced previous occurrence -- is substituted. Occurrences of @pat@ arising from a substitution -- will not be substituted. For example: -- -- @ -- 'replace' \"ana\" \"olog\" \"banana\" = \"bologna\" -- 'replace' \"ana\" \"o\" \"bananana\" = \"bono\" -- 'replace' \"aab\" \"abaa\" \"aaabb\" = \"aabaab\" -- @ -- -- The result is a /lazy/ 'L.ByteString', -- which is lazily produced, without copying. -- Equality of pattern and substitution is not checked, but -- -- @ -- ('S.concat' . 'L.toChunks' $ 'replace' pat pat text) == text -- @ -- -- holds. If the pattern is empty but not the substitution, the result -- is equivalent to (were they 'String's) @'cycle' sub@. -- -- For non-empty @pat@ and @sub@ a strict 'S.ByteString', -- -- @ -- 'L.fromChunks' . 'Data.List.intersperse' sub . 'split' pat = 'replace' pat sub -- @ -- -- and analogous relations hold for other types of @sub@. {-# INLINE replace #-} replace :: Substitution rep => S.ByteString -- ^ Substring to replace -> rep -- ^ Replacement string -> S.ByteString -- ^ String to modify -> L.ByteString -- ^ Lazy result replace = BM.replaceAllS -- | @'split' pattern target@ splits @target@ at each (non-overlapping) -- occurrence of @pattern@, removing @pattern@. If @pattern@ is empty, -- the result is an infinite list of empty 'S.ByteString's, if @target@ -- is empty but not @pattern@, the result is an empty list, otherwise -- the following relations hold: -- -- @ -- 'S.concat' . 'Data.List.intersperse' pat . 'split' pat = 'id', -- 'length' ('split' pattern target) == -- 'length' ('nonOverlappingIndices' pattern target) + 1, -- @ -- -- no fragment in the result contains an occurrence of @pattern@. {-# INLINE split #-} split :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ Fragments of string split = BM.splitDropS -- | @'splitKeepEnd' pattern target@ splits @target@ after each (non-overlapping) -- occurrence of @pattern@. If @pattern@ is empty, the result is an -- infinite list of empty 'S.ByteString's, otherwise the following -- relations hold: -- -- @ -- 'S.concat' . 'splitKeepEnd' pattern = 'id', -- @ -- -- all fragments in the result except possibly the last end with -- @pattern@, no fragment contains more than one occurrence of @pattern@. {-# INLINE splitKeepEnd #-} splitKeepEnd :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ Fragments of string splitKeepEnd = BM.splitKeepEndS -- | @'splitKeepFront'@ is like 'splitKeepEnd', except that @target@ is split -- before each occurrence of @pattern@ and hence all fragments -- with the possible exception of the first begin with @pattern@. -- No fragment contains more than one non-overlapping occurrence -- of @pattern@. {-# INLINE splitKeepFront #-} splitKeepFront :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ Fragments of string splitKeepFront = BM.splitKeepFrontS stringsearch-0.3.6.6/Data/ByteString/Search/0000755000000000000000000000000012505631536016724 5ustar0000000000000000stringsearch-0.3.6.6/Data/ByteString/Search/BoyerMoore.hs0000644000000000000000000001672112505631536021351 0ustar0000000000000000-- | -- Module : Data.ByteString.Search.BoyerMoore -- Copyright : Daniel Fischer -- Chris Kuklewicz -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast overlapping Boyer-Moore search of both strict and lazy -- 'ByteString' values. -- -- Descriptions of the algorithm can be found at -- -- and -- -- -- Original authors: Daniel Fischer (daniel.is.fischer at googlemail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Search.BoyerMoore {-# DEPRECATED "Use the new interface instead" #-} ( -- * Overview -- $overview -- ** Changes -- $changes -- ** Deprecation -- $deprecation -- ** Parameter and return types -- $types -- ** Lazy ByteStrings -- $lazy -- ** Performance -- $performance -- ** Complexity -- $complexity -- ** Partial application -- $currying -- ** Integer overflow -- $overflow -- * Functions matchLL , matchLS , matchSL , matchSS ) where import Data.ByteString.Search.Internal.BoyerMoore (matchLS, matchSS) import Data.ByteString.Lazy.Search.Internal.BoyerMoore (matchLL, matchSL) -- $overview -- -- This module exists only for backwards compatibility. Nevertheless -- there have been small changes in the behaviour of the functions. -- The module exports four search functions: 'matchLL', 'matchLS', -- 'matchSL', and 'matchSS'. All of them return the list of all -- starting positions of possibly overlapping occurrences of a pattern -- in a string. -- $changes -- -- Formerly, all four functions returned an empty list when passed -- an empty pattern. Now, in accordance with the functions from the other -- modules, @matchXY \"\" target = [0 .. 'length' target]@. -- $deprecation -- -- This module is /deprecated/. You should use the new interface provided -- in "Data.ByteString.Search" resp. "Data.ByteString.Lazy.Search". -- $types -- -- The first parameter is always the pattern string. The second -- parameter is always the target string to be searched. The returned -- list contains the offsets of all /overlapping/ patterns. -- -- A returned @Int@ or @Int64@ is an index into the target string -- which is aligned to the head of the pattern string. Strict targets -- return @Int@ indices and lazy targets return @Int64@ indices. All -- returned lists are computed and returned in a lazy fashion. -- $lazy -- -- 'matchLL' and 'matchLS' take lazy bytestrings as patterns. For -- performance, if the pattern is not a single strict chunk then all -- the the pattern chunks will copied into a concatenated strict -- bytestring. This limits the patterns to a length of (maxBound :: -- Int). -- -- 'matchLL' and 'matchSL' take lazy bytestrings as targets. -- These are written so that while they work they will not retain a -- reference to all the earlier parts of the the lazy bytestring. -- This means the garbage collector would be able to keep only a small -- amount of the target string and free the rest. -- $currying -- -- These functions can all be usefully partially applied. -- Given only a pattern the partially applied version will compute -- the supporting lookup tables only once, allowing for efficient re-use. -- Similarly, the partially applied 'matchLL' and 'matchLS' will compute -- the concatenated pattern only once. -- $performance -- -- In general, the Boyer-Moore algorithm is the most efficient method to -- search for a pattern inside a string. The advantage over other algorithms -- (e.g. Naïve, Knuth-Morris-Pratt, Horspool, Sunday) can be made -- arbitrarily large for specially selected patterns and targets, but -- usually, it's a factor of 2–3 versus Knuth-Morris-Pratt and of -- 6–10 versus the naïve algorithm. The Horspool and Sunday -- algorithms, which are simplified variants of the Boyer-Moore algorithm, -- typically have performance between Boyer-Moore and Knuth-Morris-Pratt, -- mostly closer to Boyer-Moore. The advantage of the Boyer-moore variants -- over other algorithms generally becomes larger for longer patterns. For -- very short patterns (or patterns with a very short period), other -- algorithms, e.g. "Data.ByteString.Search.DFA" can be faster (my -- tests suggest that \"very short\" means two, maybe three bytes). -- -- In general, searching in a strict 'S.ByteString' is slightly faster -- than searching in a lazy 'L.ByteString', but for long targets the -- smaller memory footprint of lazy 'L.ByteStrings' can make searching -- those (sometimes much) faster. On the other hand, there are cases -- where searching in a strict target is much faster, even for long targets. -- -- On 32-bit systems, 'Int'-arithmetic is much faster than 'Int64'-arithmetic, -- so when there are many matches, that can make a significant difference. -- -- Also, the modification to ameliorate the case of periodic patterns -- is defeated by chunk-boundaries, so long patterns with a short period -- and many matches exhibit poor behaviour (consider using @indices@ from -- "Data.ByteString.Lazy.Search.DFA" or "Data.ByteString.Lazy.Search.KMP" -- in those cases, the former for medium-length patterns, the latter for -- long patterns; only 'matchLL' and 'matchSL' suffer from -- this problem, though). -- $complexity -- -- Preprocessing the pattern string is O(@patternLength@). The search -- performance is O(@targetLength@\/@patternLength@) in the best case, -- allowing it to go faster than a Knuth-Morris-Pratt algorithm. With -- a non-periodic pattern the worst case uses O(3\*@targetLength@) -- comparisons. The periodic pattern worst case is quadratic -- O(@targetLength@\*@patternLength@) complexity for the original -- Boyer-Moore algorithm. -- -- The searching functions in this module contain a modification which -- drastically improves the performance for periodic patterns. -- I believe that for strict target strings, the worst case is now -- /O/(@targetLength@) also for periodic patterns and for lazy target -- strings, my semi-educated guess is -- /O/(@targetLength@ * (1 + @patternLength@ \/ @chunkSize@)). -- $overflow -- -- The current code uses @Int@ to keep track of the locations in the -- target string. If the length of the pattern plus the length of any -- strict chunk of the target string is greater or equal to -- @'maxBound'::Int@ then this will overflow causing an error. We try -- to detect this and call 'error' before a segfault occurs. stringsearch-0.3.6.6/Data/ByteString/Search/DFA.hs0000644000000000000000000003057112505631536017660 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} -- | -- Module : Data.ByteString.Search.DFA -- Copyright : Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast search of strict 'S.ByteString' values. Breaking, splitting and -- replacing using a deterministic finite automaton. module Data.ByteString.Search.DFA ( -- * Overview -- $overview -- ** Complexity and performance -- $complexity -- ** Partial application -- $partial -- * Finding substrings indices , nonOverlappingIndices -- * Breaking on substrings , breakOn , breakAfter -- * Replacing , replace -- * Splitting , split , splitKeepEnd , splitKeepFront ) where import Data.ByteString.Search.Internal.Utils (automaton) import Data.ByteString.Search.Substitution import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import qualified Data.ByteString.Lazy.Internal as LI import Data.ByteString.Unsafe (unsafeIndex) import Data.Array.Base (unsafeAt) --import Data.Array.Unboxed import Data.Bits -- $overview -- -- This module provides functions related to searching a substring within -- a string. The searching algorithm uses a deterministic finite automaton -- based on the Knuth-Morris-Pratt algorithm. -- The automaton is implemented as an array of @(patternLength + 1) * σ@ -- state transitions, where σ is the alphabet size (256), so it is -- only suitable for short enough patterns. -- -- When searching a pattern in a UTF-8-encoded 'S.ByteString', be aware that -- these functions work on bytes, not characters, so the indices are -- byte-offsets, not character offsets. -- $complexity -- -- The time and space complexity of the preprocessing phase is -- /O/(@patternLength * σ@). -- The searching phase is /O/(@targetLength@), each target character is -- inspected only once. -- -- In general the functions in this module are slightly faster than the -- corresponding functions using the Knuth-Morris-Pratt algorithm but -- considerably slower than the Boyer-Moore functions. For very short -- patterns or, in the case of 'indices', patterns with a short period -- which occur often, however, times are close to or even below the -- Boyer-Moore times. -- $partial -- -- All functions can usefully be partially applied. Given only a pattern, -- the automaton is constructed only once, allowing efficient re-use. ------------------------------------------------------------------------------ -- Exported Functions -- ------------------------------------------------------------------------------ -- | @'indices'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE indices #-} indices :: S.ByteString -- ^ Pattern to find -> S.ByteString -- ^ String to search -> [Int] -- ^ Offsets of matches indices = strictSearcher True -- | @'nonOverlappingIndices'@ finds the starting indices of all -- non-overlapping occurrences of the pattern in the target string. -- It is more efficient than removing indices from the list produced -- by 'indices'. {-# INLINE nonOverlappingIndices #-} nonOverlappingIndices :: S.ByteString -- ^ Pattern to find -> S.ByteString -- ^ String to search -> [Int] -- ^ Offsets of matches nonOverlappingIndices = strictSearcher False -- | @'breakOn' pattern target@ splits @target@ at the first occurrence -- of @pattern@. If the pattern does not occur in the target, the -- second component of the result is empty, otherwise it starts with -- @pattern@. If the pattern is empty, the first component is empty. -- -- @ -- 'uncurry' 'S.append' . 'breakOn' pattern = 'id' -- @ breakOn :: S.ByteString -- ^ String to search for -> S.ByteString -- ^ String to search in -> (S.ByteString, S.ByteString) -- ^ Head and tail of string broken at substring breakOn pat = breaker where searcher = strictSearcher False pat breaker str = case searcher str of [] -> (str, S.empty) (i:_) -> S.splitAt i str -- | @'breakAfter' pattern target@ splits @target@ behind the first occurrence -- of @pattern@. An empty second component means that either the pattern -- does not occur in the target or the first occurrence of pattern is at -- the very end of target. To discriminate between those cases, use e.g. -- 'S.isSuffixOf'. -- -- @ -- 'uncurry' 'S.append' . 'breakAfter' pattern = 'id' -- @ breakAfter :: S.ByteString -- ^ String to search for -> S.ByteString -- ^ String to search in -> (S.ByteString, S.ByteString) -- ^ Head and tail of string broken after substring breakAfter pat = breaker where !patLen = S.length pat searcher = strictSearcher False pat breaker str = case searcher str of [] -> (str, S.empty) (i:_) -> S.splitAt (i + patLen) str -- | @'replace' pat sub text@ replaces all (non-overlapping) occurrences of -- @pat@ in @text@ with @sub@. If occurrences of @pat@ overlap, the first -- occurrence that does not overlap with a replaced previous occurrence -- is substituted. Occurrences of @pat@ arising from a substitution -- will not be substituted. For example: -- -- @ -- 'replace' \"ana\" \"olog\" \"banana\" = \"bologna\" -- 'replace' \"ana\" \"o\" \"bananana\" = \"bono\" -- 'replace' \"aab\" \"abaa\" \"aaabb\" = \"aabaab\" -- @ -- -- The result is a /lazy/ 'L.ByteString', -- which is lazily produced, without copying. -- Equality of pattern and substitution is not checked, but -- -- @ -- 'S.concat' . 'L.toChunks' $ 'replace' pat pat text == text -- @ -- -- holds. If the pattern is empty but not the substitution, the result -- is equivalent to (were they 'String's) @'cycle' sub@. -- -- For non-empty @pat@ and @sub@ a strict 'S.ByteString', -- -- @ -- 'L.fromChunks' . 'Data.List.intersperse' sub . 'split' pat = 'replace' pat sub -- @ -- -- and analogous relations hold for other types of @sub@. replace :: Substitution rep => S.ByteString -- ^ Substring to replace -> rep -- ^ Replacement string -> S.ByteString -- ^ String to modify -> L.ByteString -- ^ Lazy result replace pat | S.null pat = \sub -> prependCycle sub . flip LI.chunk LI.Empty | otherwise = let !patLen = S.length pat searcher = strictSearcher False pat repl sub = let {-# NOINLINE subst #-} !subst = substitution sub replacer str | S.null str = [] | otherwise = case searcher str of [] -> [str] (i:_) | i == 0 -> subst $ replacer (S.drop patLen str) | otherwise -> S.take i str : subst (replacer (S.drop (i + patLen) str)) in replacer in \sub -> L.fromChunks . repl sub -- | @'split' pattern target@ splits @target@ at each (non-overlapping) -- occurrence of @pattern@, removing @pattern@. If @pattern@ is empty, -- the result is an infinite list of empty 'S.ByteString's, if @target@ -- is empty but not @pattern@, the result is an empty list, otherwise -- the following relations hold: -- -- @ -- 'S.concat' . 'Data.List.intersperse' pat . 'split' pat = 'id', -- 'length' ('split' pattern target) == -- 'length' ('nonOverlappingIndices' pattern target) + 1, -- @ -- -- no fragment in the result contains an occurrence of @pattern@. split :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ Fragments of string split pat | S.null pat = const (repeat S.empty) split pat = splitter where !patLen = S.length pat searcher = strictSearcher False pat splitter str | S.null str = [] | otherwise = splitter' str splitter' str | S.null str = [S.empty] | otherwise = case searcher str of [] -> [str] (i:_) -> S.take i str : splitter' (S.drop (i + patLen) str) -- | @'splitKeepEnd' pattern target@ splits @target@ after each (non-overlapping) -- occurrence of @pattern@. If @pattern@ is empty, the result is an -- infinite list of empty 'S.ByteString's, otherwise the following -- relations hold: -- -- @ -- 'S.concat' . 'splitKeepEnd' pattern = 'id', -- @ -- -- all fragments in the result except possibly the last end with -- @pattern@, no fragment contains more than one occurrence of @pattern@. splitKeepEnd :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ Fragments of string splitKeepEnd pat | S.null pat = const (repeat S.empty) splitKeepEnd pat = splitter where !patLen = S.length pat searcher = strictSearcher False pat splitter str | S.null str = [] | otherwise = case searcher str of [] -> [str] (i:_) -> S.take (i + patLen) str : splitter (S.drop (i + patLen) str) -- | @'splitKeepFront'@ is like 'splitKeepEnd', except that @target@ is split -- before each occurrence of @pattern@ and hence all fragments -- with the possible exception of the first begin with @pattern@. -- No fragment contains more than one non-overlapping occurrence -- of @pattern@. splitKeepFront :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ Fragments of string splitKeepFront pat | S.null pat = const (repeat S.empty) splitKeepFront pat = splitter where !patLen = S.length pat searcher = strictSearcher False pat splitter str | S.null str = [] | otherwise = case searcher str of [] -> [str] (i:rst) | i == 0 -> case rst of [] -> [str] (j:_) -> S.take j str : splitter' (S.drop j str) | otherwise -> S.take i str : splitter' (S.drop i str) splitter' str | S.null str = [] | otherwise = case searcher (S.drop patLen str) of [] -> [str] (i:_) -> S.take (i + patLen) str : splitter' (S.drop (i + patLen) str) ------------------------------------------------------------------------------ -- Searching Function -- ------------------------------------------------------------------------------ strictSearcher :: Bool -> S.ByteString -> S.ByteString -> [Int] strictSearcher _ !pat | S.null pat = enumFromTo 0 . S.length | S.length pat == 1 = let !w = S.head pat in S.elemIndices w strictSearcher !overlap pat = search where !patLen = S.length pat !auto = automaton pat !p0 = unsafeIndex pat 0 !ams = if overlap then patLen else 0 search str = match 0 0 where !strLen = S.length str {-# INLINE strAt #-} strAt :: Int -> Int strAt !i = fromIntegral (unsafeIndex str i) match 0 idx | idx == strLen = [] | unsafeIndex str idx == p0 = match 1 (idx + 1) | otherwise = match 0 (idx + 1) match state idx | idx == strLen = [] | otherwise = let !nstate = unsafeAt auto ((state `shiftL` 8) + strAt idx) !nxtIdx = idx + 1 in if nstate == patLen then (nxtIdx - patLen) : match ams nxtIdx else match nstate nxtIdx stringsearch-0.3.6.6/Data/ByteString/Search/KarpRabin.hs0000644000000000000000000001545612505631536021144 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} -- | -- Module : Data.ByteString.Search.KarpRabin -- Copyright : (c) 2010 Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Simultaneous search for multiple patterns in a strict 'S.ByteString' -- using the Karp-Rabin algorithm. -- -- A description of the algorithm for a single pattern can be found at -- . module Data.ByteString.Search.KarpRabin ( -- * Overview -- $overview -- ** Caution -- $caution -- * Function indicesOfAny ) where import qualified Data.ByteString as S import Data.ByteString.Unsafe (unsafeIndex) import qualified Data.IntMap as IM import Data.Array import Data.Array.Base (unsafeAt) import Data.Word (Word8) import Data.Bits import Data.List (foldl') -- $overview -- -- The Karp-Rabin algorithm works by calculating a hash of the pattern and -- comparing that hash with the hash of a slice of the target string with -- the same length as the pattern. If the hashes are equal, the slice of the -- target is compared to the pattern byte for byte (since the hash -- function generally isn't injective). -- -- For a single pattern, this tends to be more efficient than the naïve -- algorithm, but it cannot compete with algorithms like -- Knuth-Morris-Pratt or Boyer-Moore. -- -- However, the algorithm can be generalised to search for multiple patterns -- simultaneously. If the shortest pattern has length @k@, hash the prefix of -- length @k@ of all patterns and compare the hash of the target's slices of -- length @k@ to them. If there's a match, check whether the slice is part -- of an occurrence of the corresponding pattern. -- -- With a hash-function that -- -- * allows to compute the hash of one slice in constant time from the hash -- of the previous slice, the new and the dropped character, and -- -- * produces few spurious matches, -- -- searching for occurrences of any of @n@ patterns has a best-case complexity -- of /O/(@targetLength@ * @lookup n@). The worst-case complexity is -- /O/(@targetLength@ * @lookup n@ * @sum patternLengths@), the average is -- not much worse than the best case. -- -- The functions in this module store the hashes of the patterns in an -- 'IM.IntMap', so the lookup is /O/(@log n@). Re-hashing is done in constant -- time and spurious matches of the hashes /should be/ sufficiently rare. -- The maximal length of the prefixes to be hashed is 32. -- $caution -- -- Unfortunately, the constant factors are high, so these functions are slow. -- Unless the number of patterns to search for is high (larger than 50 at -- least), repeated search for single patterns using Boyer-Moore or DFA and -- manual merging of the indices is faster. /Much/ faster for less than 40 -- or so patterns. -- -- In summary, this module is more of an interesting curiosity than anything -- else. -- | @'indicesOfAny'@ finds all occurrences of any of several non-empty patterns -- in a strict target string. If no non-empty patterns are given, -- the result is an empty list. Otherwise the result list contains -- the pairs of all indices where any of the (non-empty) patterns start -- and the list of all patterns starting at that index, the patterns being -- represented by their (zero-based) position in the pattern list. -- Empty patterns are filtered out before processing begins. {-# INLINE indicesOfAny #-} indicesOfAny :: [S.ByteString] -- ^ List of non-empty patterns -> S.ByteString -- ^ String to search -> [(Int,[Int])] -- ^ List of matches indicesOfAny pats | null nepats = const [] | otherwise = strictMatcher nepats where nepats = filter (not . S.null) pats ------------------------------------------------------------------------------ -- Workers -- ------------------------------------------------------------------------------ {-# INLINE rehash1 #-} rehash1 :: Int -> Int -> Word8 -> Word8 -> Int rehash1 out h o n = (h `shiftL` 1 - (fromIntegral o `shiftL` out)) + fromIntegral n {-# INLINE rehash2 #-} rehash2 :: Int -> Int -> Word8 -> Word8 -> Int rehash2 out h o n = (h `shiftL` 2 - (fromIntegral o `shiftL` out)) + fromIntegral n {-# INLINE rehash3 #-} rehash3 :: Int -> Int -> Word8 -> Word8 -> Int rehash3 out h o n = (h `shiftL` 3 - (fromIntegral o `shiftL` out)) + fromIntegral n {-# INLINE rehash4 #-} rehash4 :: Int -> Int -> Word8 -> Word8 -> Int rehash4 out h o n = (h `shiftL` 4 - (fromIntegral o `shiftL` out)) + fromIntegral n strictMatcher :: [S.ByteString] -> S.ByteString -> [(Int,[Int])] strictMatcher pats = search where !hLen = minimum (32 : map S.length pats) !shDi = case 32 `quot` hLen of q | q < 4 -> q | otherwise -> 4 !outS = shDi*hLen !patNum = length pats !patArr = listArray (0, patNum - 1) pats {-# INLINE rehash #-} rehash :: Int -> Word8 -> Word8 -> Int rehash = case shDi of 1 -> rehash1 hLen 2 -> rehash2 outS 3 -> rehash3 outS _ -> rehash4 outS hash :: S.ByteString -> Int hash = S.foldl' (\h w -> (h `shiftL` shDi) + fromIntegral w) 0 . S.take hLen !hashMap = foldl' (\mp (h,i) -> IM.insertWith (flip (++)) h [i] mp) IM.empty $ zip (map hash pats) [0 :: Int .. ] search str | strLen < hLen = [] | otherwise = go 0 shash where !strLen = S.length str !maxIdx = strLen - hLen {-# INLINE strAt #-} strAt !i = unsafeIndex str i !shash = hash str go !sI !h = case IM.lookup h hashMap of Nothing -> if sI == maxIdx then [] else go (sI + 1) (rehash h (strAt sI) (strAt (sI + hLen))) Just ps -> let !rst = S.drop sI str {-# INLINE hd #-} hd = strAt sI {-# INLINE more #-} more = if sI == maxIdx then [] else go (sI + 1) (rehash h hd (strAt (sI + hLen))) {-# INLINE okay #-} okay bs = S.isPrefixOf bs rst in case filter (okay . (patArr `unsafeAt`)) ps of [] -> more qs -> seq (length qs) $ (sI,qs) : more stringsearch-0.3.6.6/Data/ByteString/Search/KMP.hs0000644000000000000000000000602012505631536017705 0ustar0000000000000000-- | -- Module : Data.ByteString.Search.KMP -- Copyright : Justin Bailey -- Chris Kuklewicz -- Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast search of strict 'S.ByteString' values using the -- Knuth-Morris-Pratt algorithm. -- -- A description of the algorithm can be found at -- . -- -- Original authors: Justin Bailey (jgbailey at gmail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Search.KMP ( -- * Overview -- $overview -- ** Complexity and Performance -- $complexity -- ** Partial application -- $partial -- * Functions indices , nonOverlappingIndices ) where import Data.ByteString.Search.Internal.KnuthMorrisPratt (matchSS, indicesS) import qualified Data.ByteString as S -- $overview -- -- This module provides two functions for finding the occurrences of a -- pattern in a target string using the Knuth-Morris-Pratt algorithm. -- It exists only for systematic reasons, the functions from -- "Data.ByteString.Search" are much faster, except for very short patterns, -- in which case "Data.ByteString.Search.DFA" provides better functions. -- $complexity -- -- The preprocessing of the pattern is /O/(@patternLength@) in time and space. -- The time complexity of the searching phase is /O/(@targetLength@) for both -- functions. -- -- In most cases, these functions are considerably slower than the -- Boyer-Moore variants, performance is close to that of those from -- "Data.ByteString.Search.DFA". -- $partial -- -- Both functions can be usefully partially applied. Given only a -- pattern, the auxiliary data will be computed only once, allowing for -- efficient re-use. -- | @'indices'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE indices #-} indices :: S.ByteString -- ^ Pattern to find -> S.ByteString -- ^ String to search -> [Int] -- ^ Offsets of matches indices = indicesS -- | @'nonOverlappingIndices'@ finds the starting indices of all -- non-overlapping occurrences of the pattern in the target string. -- It is more efficient than removing indices from the list produced -- by 'indices'. {-# INLINE nonOverlappingIndices #-} nonOverlappingIndices :: S.ByteString -- ^ Pattern to find -> S.ByteString -- ^ String to search -> [Int] -- ^ Offsets of matches nonOverlappingIndices = matchSS stringsearch-0.3.6.6/Data/ByteString/Search/KnuthMorrisPratt.hs0000644000000000000000000000772712505631536022575 0ustar0000000000000000-- | -- Module : Data.ByteString.Search.KnuthMorrisPratt -- Copyright : Justin Bailey -- Chris Kuklewicz -- Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast non-overlapping Knuth-Morris-Pratt search of both strict and -- lazy 'Data.ByteString.ByteString' values. -- -- A description of the algorithm can be found at -- . -- -- Original authors: Justin Bailey (jgbailey at gmail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Search.KnuthMorrisPratt {-# DEPRECATED "Use the new interface instead" #-} ( -- * Overview -- $overview -- ** Changes -- $changes -- ** Deprecation -- $deprecation -- ** Parameter and return types -- $types -- ** Lazy ByteStrings -- $lazy -- * Partial application -- $partial -- * Complexity and Performance -- $complexity -- * Functions matchLL , matchLS , matchSS , matchSL ) where import Data.ByteString.Search.Internal.KnuthMorrisPratt (matchLL, matchLS, matchSL, matchSS) -- $overview -- -- This module exists only for backwards compatibility. Nevertheless -- there have been small changes in the behaviour of the functions. -- The module exports four search functions: 'matchLL', 'matchLS', -- 'matchSL', and 'matchSS'. All of them return the list of all -- starting positions of non-overlapping occurrences of a pattern -- in a string. -- $changes -- -- Formerly, all four functions returned an empty list when passed -- an empty pattern. Now, in accordance with the functions from the other -- modules, @matchXY \"\" target = [0 .. 'length' target]@. -- -- Further, the return type of 'matchLS' and 'matchSS' has changed to -- @['Int']@, since strict 'Data.ByteString.ByteString's are 'Int'-indexed. -- $deprecation -- -- This module is /deprecated/. You should use the new interface provided -- in "Data.ByteString.Search.KMP" and "Data.ByteString.Lazy.Search.KMP" -- or the generally faster functions from "Data.ByteString.Search" and -- "Data.ByteString.Search.DFA", respectively the lazy versions. -- $types -- -- The first parameter is always the pattern string. The second -- parameter is always the target string to be searched. The returned -- list contains the offsets of all /non-overlapping/ patterns. -- -- A returned @Int@ or @Int64@ is an index into the target string -- which is aligned to the head of the pattern string. Strict targets -- return @Int@ indices and lazy targets return @Int64@ indices. All -- returned lists are computed and returned in a lazy fashion. -- $lazy -- -- 'matchLL' and 'matchLS' take lazy bytestrings as patterns. For -- performance, if the pattern is not a single strict chunk then all -- the the pattern chunks will copied into a concatenated strict -- bytestring. This limits the patterns to a length of (maxBound :: -- Int). -- -- 'matchLL' and 'matchSL' take lazy bytestrings as targets. -- These are written so that while they work they will not retain a -- reference to all the earlier parts of the the lazy bytestring. -- This means the garbage collector would be able to keep only a small -- amount of the target string and free the rest. -- $partial -- -- These functions can all be usefully partially applied. Given only a -- pattern, the auxiliary data will be computed only once, allowing for -- efficient re-use. -- $complexity -- -- The preprocessing of the pattern is /O/(@patternLength@) in time and space. -- The time complexity of the searching phase is /O/(@targetLength@) for all -- functions. -- -- In most cases, these functions are considerably slower than the -- Boyer-Moore variants, performance is close to that of those from -- "Data.ByteString.Search.DFA" resp. "Data.ByteString.Lazy.Search.DFA". stringsearch-0.3.6.6/Data/ByteString/Search/Substitution.hs0000644000000000000000000000363212505631536022000 0ustar0000000000000000-- | -- Module : Data.ByteString.Search.Substitution -- Copyright : Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : portable -- -- Class for values to be substituted into strict and lazy 'S.ByteString's -- by the @replace@ functions defined in this package. -- module Data.ByteString.Search.Substitution ( Substitution(..)) where import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import qualified Data.ByteString.Lazy.Internal as LI -- | Type class of meaningful substitutions for replace functions -- on ByteStrings. Instances for strict and lazy ByteStrings are -- provided here. class Substitution a where -- | @'substitution'@ transforms a value to a substitution function. substitution :: a -> ([S.ByteString] -> [S.ByteString]) -- | @'prependCycle' sub lazyBS@ shall prepend infinitely many copies -- of @sub@ to @lazyBS@ without entering an infinite loop in case -- of an empty @sub@, so e.g. -- -- @ -- 'prependCycle' \"\" \"ab\" == \"ab\" -- @ -- -- shall (quickly) evaluate to 'True'. -- For non-empty @sub@, the cycle shall be constructed efficiently. prependCycle :: a -> (L.ByteString -> L.ByteString) instance Substitution S.ByteString where {-# INLINE substitution #-} substitution sub = if S.null sub then id else (sub :) {-# INLINE prependCycle #-} prependCycle sub | S.null sub = id | otherwise = let c = LI.Chunk sub c in const c instance Substitution L.ByteString where {-# INLINE substitution #-} substitution LI.Empty = id substitution (LI.Chunk c t) = (c :) . flip (LI.foldrChunks (:)) t {-# INLINE prependCycle #-} prependCycle sub | L.null sub = id prependCycle sub = let cyc = LI.foldrChunks LI.Chunk cyc sub in const cyc stringsearch-0.3.6.6/Data/ByteString/Search/Internal/0000755000000000000000000000000012505631536020500 5ustar0000000000000000stringsearch-0.3.6.6/Data/ByteString/Search/Internal/BoyerMoore.hs0000644000000000000000000005342212505631536023124 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} {-# OPTIONS_HADDOCK hide, prune #-} -- | -- Module : Data.ByteString.Search.Internal.BoyerMoore -- Copyright : Daniel Fischer -- Chris Kuklewicz -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast overlapping Boyer-Moore search of both strict and lazy -- 'S.ByteString' values. Breaking, splitting and replacing -- using the Boyer-Moore algorithm. -- -- Descriptions of the algorithm can be found at -- -- and -- -- -- Original authors: Daniel Fischer (daniel.is.fischer at googlemail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Search.Internal.BoyerMoore ( matchLS , matchSS -- Non-overlapping , matchNOS -- Replacing substrings -- replacing , replaceAllS -- Breaking on substrings -- breaking , breakSubstringS , breakAfterS -- Splitting on substrings -- splitting , splitKeepEndS , splitKeepFrontS , splitDropS ) where import Data.ByteString.Search.Internal.Utils (occurs, suffShifts, strictify) import Data.ByteString.Search.Substitution import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import qualified Data.ByteString.Lazy.Internal as LI import Data.ByteString.Unsafe (unsafeIndex) import Data.Array.Base (unsafeAt) import Data.Word (Word8) -- overview -- -- This module exports three search functions for searching in strict -- ByteStrings. One for searching non-overlapping occurrences of a strict -- pattern and one each for possibly overlapping occurrences of a lazy -- resp. strict pattern. The common base name is @match@, the suffix -- indicates the type of search to perform. These functions -- return (for a non-empty pattern) a list of all the indices of the target -- string where an occurrence of the pattern begins, if some occurrences -- overlap, all starting indices are reported. The list is produced lazily, -- so not necessarily the entire target string is searched. -- -- The behaviour of these functions when given an empty pattern has changed. -- Formerly, the @matchXY@ functions returned an empty list then, now it's -- @[0 .. 'length' target]@. -- -- Newly added are functions to replace all (non-overlapping) occurrences -- of a pattern within a string, functions to break ByteStrings at the first -- occurrence of a pattern and functions to split ByteStrings at each -- occurrence of a pattern. None of these functions does copying, so they -- don't introduce large memory overhead. -- -- Internally, a lazy pattern is always converted to a strict ByteString, -- which is necessary for an efficient implementation of the algorithm. -- The limit this imposes on the length of the pattern is probably -- irrelevant in practice, but perhaps it should be mentioned. -- This also means that the @matchL*@ functions are mere convenience wrappers. -- Except for the initial 'strictify'ing, there's no difference between lazy -- and strict patterns, they call the same workers. There is, however, a -- difference between strict and lazy target strings. -- For the new functions, no such wrappers are provided, you have to -- 'strictify' lazy patterns yourself. -- caution -- -- When working with a lazy target string, the relation between the pattern -- length and the chunk size can play a big rôle. -- Crossing chunk boundaries is relatively expensive, so when that becomes -- a frequent occurrence, as may happen when the pattern length is close -- to or larger than the chunk size, performance is likely to degrade. -- If it is needed, steps can be taken to ameliorate that effect, but unless -- entirely separate functions are introduced, that would hurt the -- performance for the more common case of patterns much shorter than -- the default chunk size. -- performance -- -- In general, the Boyer-Moore algorithm is the most efficient method to -- search for a pattern inside a string, so most of the time, you'll want -- to use the functions of this module, hence this is where the most work -- has gone. Very short patterns are an exception to this, for those you -- should consider using a finite automaton -- ("Data.ByteString.Search.DFA.Array"). That is also often the better -- choice for searching longer periodic patterns in a lazy ByteString -- with many matches. -- -- Operating on a strict target string is mostly faster than on a lazy -- target string, but the difference is usually small (according to my -- tests). -- -- The known exceptions to this rule of thumb are -- -- [long targets] Then the smaller memory footprint of a lazy target often -- gives (much) better performance. -- -- [high number of matches] When there are very many matches, strict target -- strings are much faster, especially if the pattern is periodic. -- -- If both conditions hold, either may outweigh the other. -- complexity -- -- Preprocessing the pattern is /O/(@patternLength@ + σ) in time and -- space (σ is the alphabet size, 256 here) for all functions. -- The time complexity of the searching phase for @matchXY@ -- is /O/(@targetLength@ \/ @patternLength@) in the best case. -- For non-periodic patterns, the worst case complexity is -- /O/(@targetLength@), but for periodic patterns, the worst case complexity -- is /O/(@targetLength@ * @patternLength@) for the original Boyer-Moore -- algorithm. -- -- The searching functions in this module now contain a modification which -- drastically improves the performance for periodic patterns. -- I believe that for strict target strings, the worst case is now -- /O/(@targetLength@) also for periodic patterns and for lazy target strings, -- my semi-educated guess is -- /O/(@targetLength@ * (1 + @patternLength@ \/ @chunkSize@)). -- I may be very wrong, though. -- -- The other functions don't have to deal with possible overlapping -- patterns, hence the worst case complexity for the processing phase -- is /O/(@targetLength@) (respectively /O/(@firstIndex + patternLength@) -- for the breaking functions if the pattern occurs). -- currying -- -- These functions can all be usefully curried. Given only a pattern -- the curried version will compute the supporting lookup tables only -- once, allowing for efficient re-use. Similarly, the curried -- 'matchLL' and 'matchLS' will compute the concatenated pattern only -- once. -- overflow -- -- The current code uses @Int@ to keep track of the locations in the -- target string. If the length of the pattern plus the length of any -- strict chunk of the target string is greater than -- @'maxBound' :: 'Int'@ then this will overflow causing an error. We -- try to detect this and call 'error' before a segfault occurs. ------------------------------------------------------------------------------ -- Wrappers -- ------------------------------------------------------------------------------ -- matching -- -- These functions find the indices of all (possibly overlapping) -- occurrences of a pattern in a target string. -- If the pattern is empty, the result is @[0 .. length target]@. -- If the pattern is much shorter than the target string -- and the pattern does not occur very near the beginning of the target, -- -- > not . null $ matchSS pattern target -- -- is a much more efficient version of 'S.isInfixOf'. -- | @'matchLS'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- It is a simple wrapper for 'Data.ByteString.Search.indices'. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE matchLS #-} matchLS :: L.ByteString -- ^ Lazy pattern -> S.ByteString -- ^ Strict target string -> [Int] -- ^ Offsets of matches matchLS pat = search where search = strictSearcher True (strictify pat) -- | @'matchSS'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- It is an alias for 'Data.ByteString.Search.indices'. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE matchSS #-} matchSS :: S.ByteString -- ^ Strict pattern -> S.ByteString -- ^ Strict target string -> [Int] -- ^ Offsets of matches matchSS pat = search where search = strictSearcher True pat -- | @'matchNOS'@ finds the indices of all non-overlapping occurrences -- of the pattern in the Strict target string. {-# INLINE matchNOS #-} matchNOS :: S.ByteString -- ^ Strict pattern -> S.ByteString -- ^ Strict target string -> [Int] -- ^ Offsets of matches matchNOS pat = search where search = strictSearcher False pat -- replacing -- -- These functions replace all (non-overlapping) occurrences of a pattern -- in the target string. If some occurrences overlap, the earliest is -- replaced and replacing continues at the index after the replaced -- occurrence, for example -- -- > replaceAllL \"ana\" \"olog\" \"banana\" == \"bologna\", -- > replaceAllS \"abacab\" \"u\" \"abacabacabacab\" == \"uacu\", -- > replaceAllS \"aa\" \"aaa\" \"aaaa\" == \"aaaaaa\". -- -- Equality of pattern and substitution is not checked, but -- -- > pat == sub => 'strictify' (replaceAllS pat sub str) == str, -- > pat == sub => replaceAllL pat sub str == str. -- -- The result is a lazily generated lazy ByteString, the first chunks will -- generally be available before the entire target has been scanned. -- If the pattern is empty, but not the substitution, the result is -- equivalent to @'cycle' sub@. {-# INLINE replaceAllS #-} replaceAllS :: Substitution rep => S.ByteString -- ^ Pattern to replace -> rep -- ^ Substitution string -> S.ByteString -- ^ Target string -> L.ByteString -- ^ Lazy result replaceAllS pat | S.null pat = \sub -> prependCycle sub . flip LI.chunk LI.Empty | otherwise = let repl = strictRepl pat in \sub -> L.fromChunks . repl (substitution sub) -- breaking -- -- Break a string on a pattern. The first component of the result -- contains the prefix of the string before the first occurrence of the -- pattern, the second component contains the remainder. -- The following relations hold: -- -- > breakSubstringX \"\" str = (\"\", str) -- > not (pat `isInfixOf` str) == null (snd $ breakSunbstringX pat str) -- > True == case breakSubstringX pat str of -- > (x, y) -> not (pat `isInfixOf` x) -- > && (null y || pat `isPrefixOf` y) -- | This function has the same semantics as 'S.breakSubstring' -- but is generally much faster. {-# INLINE breakSubstringS #-} breakSubstringS :: S.ByteString -- ^ Pattern to break on -> S.ByteString -- ^ String to break up -> (S.ByteString, S.ByteString) -- ^ Prefix and remainder of broken string breakSubstringS = strictBreak breakAfterS :: S.ByteString -> S.ByteString -> (S.ByteString, S.ByteString) breakAfterS pat | S.null pat = \str -> (S.empty, str) breakAfterS pat = breaker where !patLen = S.length pat searcher = strictSearcher False pat breaker str = case searcher str of [] -> (str, S.empty) (i:_) -> S.splitAt (i + patLen) str -- splitting -- -- These functions implement various splitting strategies. -- -- If the pattern to split on is empty, all functions return an -- infinite list of empty ByteStrings. -- Otherwise, the names are rather self-explanatory. -- -- For nonempty patterns, the following relations hold: -- -- > concat (splitKeepXY pat str) == str -- > concat ('Data.List.intersperse' pat (splitDropX pat str)) == str. -- -- All fragments except possibly the last in the result of -- @splitKeepEndX pat@ end with @pat@, none of the fragments contains -- more than one occurrence of @pat@ or is empty. -- -- All fragments except possibly the first in the result of -- @splitKeepFrontX pat@ begin with @pat@, none of the fragments -- contains more than one occurrence of @patq or is empty. -- -- > splitDropX pat str == map dropPat (splitKeepFrontX pat str) -- > where -- > patLen = length pat -- > dropPat frag -- > | pat `isPrefixOf` frag = drop patLen frag -- > | otherwise = frag -- -- but @splitDropX@ is a little more efficient than that. {-# INLINE splitKeepEndS #-} splitKeepEndS :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ List of fragments splitKeepEndS = strictSplitKeepEnd {-# INLINE splitKeepFrontS #-} splitKeepFrontS :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ List of fragments splitKeepFrontS = strictSplitKeepFront {-# INLINE splitDropS #-} splitDropS :: S.ByteString -- ^ Pattern to split on -> S.ByteString -- ^ String to split -> [S.ByteString] -- ^ List of fragments splitDropS = strictSplitDrop ------------------------------------------------------------------------------ -- Search Functions -- ------------------------------------------------------------------------------ strictSearcher :: Bool -> S.ByteString -> S.ByteString -> [Int] strictSearcher _ !pat | S.null pat = enumFromTo 0 . S.length | S.length pat == 1 = let !w = S.head pat in S.elemIndices w strictSearcher !overlap pat = searcher where {-# INLINE patAt #-} patAt :: Int -> Word8 patAt !i = unsafeIndex pat i !patLen = S.length pat !patEnd = patLen - 1 !maxLen = maxBound - patLen !occT = occurs pat -- for bad-character-shift !suffT = suffShifts pat -- for good-suffix-shift !skip = if overlap then unsafeAt suffT 0 else patLen -- shift after a complete match !kept = patLen - skip -- length of known prefix after full match !pe = patAt patEnd -- last pattern byte for fast comparison {-# INLINE occ #-} occ !w = unsafeAt occT (fromIntegral w) {-# INLINE suff #-} suff !i = unsafeAt suffT i searcher str | maxLen < strLen = error "Overflow in BoyerMoore.strictSearcher" | maxDiff < 0 = [] | otherwise = checkEnd patEnd where !strLen = S.length str !strEnd = strLen - 1 !maxDiff = strLen - patLen {-# INLINE strAt #-} strAt !i = unsafeIndex str i -- After a full match, we know how long a prefix of the pattern -- still matches. Do not re-compare the prefix to prevent O(m*n) -- behaviour for periodic patterns. afterMatch !diff !patI = case strAt (diff + patI) of !c | c == patAt patI -> if patI == kept then diff : let !diff' = diff + skip in if maxDiff < diff' then [] else afterMatch diff' patEnd else afterMatch diff (patI - 1) | patI == patEnd -> checkEnd (diff + 2*patEnd + occ c) | otherwise -> let {-# INLINE badShift #-} badShift = patI + occ c {-# INLINE goodShift #-} goodShift = suff patI !diff' = diff + max badShift goodShift in if maxDiff < diff' then [] else checkEnd (diff + patEnd) -- While comparing the last byte of the pattern, the bad- -- character-shift is always at least as large as the good- -- suffix-shift. Eliminating the unnecessary memory reads and -- comparison speeds things up noticeably. checkEnd !sI -- index in string to compare to last of pattern | strEnd < sI = [] | otherwise = case strAt sI of !c | c == pe -> findMatch (sI - patEnd) (patEnd - 1) | otherwise -> checkEnd (sI + patEnd + occ c) -- Once the last byte has matched, we enter the full matcher -- diff is the offset of the window, patI the index of the -- pattern byte to compare next. findMatch !diff !patI = case strAt (diff + patI) of !c | c == patAt patI -> if patI == 0 -- full match, report then diff : let !diff' = diff + skip in if maxDiff < diff' then [] else if skip == patLen then checkEnd (diff' + patEnd) else afterMatch diff' patEnd else findMatch diff (patI - 1) | otherwise -> let !diff' = diff + max (patI + occ c) (suff patI) in if maxDiff < diff' then [] else checkEnd (diff' + patEnd) ------------------------------------------------------------------------------ -- Breaking Functions -- ------------------------------------------------------------------------------ strictBreak :: S.ByteString -> S.ByteString -> (S.ByteString, S.ByteString) strictBreak pat | S.null pat = \str -> (S.empty, str) | otherwise = breaker where searcher = strictSearcher False pat breaker str = case searcher str of [] -> (str, S.empty) (i:_) -> S.splitAt i str ------------------------------------------------------------------------------ -- Splitting Functions -- ------------------------------------------------------------------------------ strictSplitKeepFront :: S.ByteString -> S.ByteString -> [S.ByteString] strictSplitKeepFront pat | S.null pat = const (repeat S.empty) strictSplitKeepFront pat = splitter where !patLen = S.length pat searcher = strictSearcher False pat splitter str | S.null str = [] | otherwise = case searcher str of [] -> [str] (i:_) | i == 0 -> psplitter str | otherwise -> S.take i str : psplitter (S.drop i str) psplitter !str | S.null str = [] | otherwise = case searcher (S.drop patLen str) of [] -> [str] (i:_) -> S.take (i + patLen) str : psplitter (S.drop (i + patLen) str) strictSplitKeepEnd :: S.ByteString -> S.ByteString -> [S.ByteString] strictSplitKeepEnd pat | S.null pat = const (repeat S.empty) strictSplitKeepEnd pat = splitter where !patLen = S.length pat searcher = strictSearcher False pat splitter str | S.null str = [] | otherwise = case searcher str of [] -> [str] (i:_) -> S.take (i + patLen) str : splitter (S.drop (i + patLen) str) strictSplitDrop :: S.ByteString -> S.ByteString -> [S.ByteString] strictSplitDrop pat | S.null pat = const (repeat S.empty) strictSplitDrop pat = splitter' where !patLen = S.length pat searcher = strictSearcher False pat splitter' str | S.null str = [] | otherwise = splitter str splitter str | S.null str = [S.empty] | otherwise = case searcher str of [] -> [str] (i:_) -> S.take i str : splitter (S.drop (i + patLen) str) ------------------------------------------------------------------------------ -- Replacing Functions -- ------------------------------------------------------------------------------ -- replacing loop for strict ByteStrings, called only for -- non-empty patterns and substitutions strictRepl :: S.ByteString -> ([S.ByteString] -> [S.ByteString]) -> S.ByteString -> [S.ByteString] strictRepl pat = repl where !patLen = S.length pat searcher = strictSearcher False pat repl sub = replacer where replacer str | S.null str = [] | otherwise = case searcher str of [] -> [str] (i:_) | i == 0 -> sub $ replacer (S.drop patLen str) | otherwise -> S.take i str : sub (replacer (S.drop (i + patLen) str)) stringsearch-0.3.6.6/Data/ByteString/Search/Internal/KnuthMorrisPratt.hs0000644000000000000000000002260212505631536024336 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} {-# OPTIONS_HADDOCK hide, prune #-} -- | -- Module : Data.ByteString.Search.Internal.KnuthMorrisPratt -- Copyright : Justin Bailey -- Chris Kuklewicz -- Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast Knuth-Morris-Pratt search of both strict and -- lazy 'S.ByteString' values. -- -- A description of the algorithm can be found at -- . -- Original authors: Justin Bailey (jgbailey at gmail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Search.Internal.KnuthMorrisPratt ( -- * Overview -- $overview -- * Partial application -- $partial -- * Complexity and Performance -- $complexity -- * Finding substrings -- ** Overlapping indicesL , indicesS -- ** Non-overlapping , matchLL , matchLS , matchSL , matchSS ) where import Data.ByteString.Search.Internal.Utils (kmpBorders, strictify) import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import Data.ByteString.Unsafe (unsafeIndex) import Data.Array.Base (unsafeAt) --import Data.Array.Unboxed import Data.Int (Int64) -- $overview -- -- This module exports 6 search functions: 'matchLL', 'matchLS', -- 'matchSL', and 'matchSS', which find the indices of all non-overlapping -- occurrences of a pattern in a target string, and the newly added -- 'indicesL' and 'indicesS' which find the indices of -- all (possibly overlapping) occurrences of the pattern in the target -- string. The performance should be the same when the pattern can't -- overlap, but when the pattern occurs often and can have significant -- overlap, the search excluding the overlap is faster. -- -- In all cases, the list of indices is produced lazily. -- -- The behaviour of the old @matchXY@ functions for an empty pattern has -- changed, formerly they returned an empty list, now all functions -- return @[0 .. 'length' target]@ for an empty pattern. -- -- The return type of the @matchXS@ functions changed to @['Int']@, since -- strict ByteStrings are @'Int'@-indexed. -- -- The trailing @L\/S@ in the function names indicate whether they work -- on lazy or strict ByteStrings. Since all patterns are converted to -- strict ByteStrings for performance reasons, the @matchLX@ add just -- an additional bit of wrapping around the worker in comparison to -- @matchSX@. For the new functions, no such wrapping is provided, you -- have to 'strictify' lazy patterns before feeding them to the searcher. -- The limit on the pattern lengths that the conversion to a strict -- ByteString imposes should be irrelevant in practice. -- -- The functions searching in lazy ByteStrings don't keep any references -- to chunks already traversed. This means the garbage collector can free -- those chunks early and only a small part of the target string needs to -- be in memory. -- $partial -- -- These functions can all be usefully partially applied. Given only a -- pattern, the auxiliary data will be computed only once, allowing for -- efficient re-use. -- $complexity -- -- The preprocessing of the pattern is /O/(@patternLength@) in time and space. -- The time complexity of the searching phase is /O/(@targetLength@) for all -- functions. -- -- In most cases, these functions are considerably slower than the -- Boyer-Moore variants, performance is close to that of those from -- "Data.ByteString.Search.DFA" resp. "Data.ByteString.Lazy.Search.DFA". ------------------------------------------------------------------------------ -- Wrappers -- ------------------------------------------------------------------------------ -- | @'indicesL'@ finds all indices of (possibly overlapping) -- occurrences of the pattern in the target string. {-# INLINE indicesL #-} indicesL :: S.ByteString -- ^ Strict pattern -> L.ByteString -- ^ Lazy target string -> [Int64] -- ^ Offsets of matches indicesL pat = search . L.toChunks where search = matcher True pat -- | @'indicesS'@ finds all indices of (possibly overlapping) -- occurrences of the pattern in the target string. {-# INLINE indicesS #-} indicesS :: S.ByteString -- ^ Strict pattern -> S.ByteString -- ^ Strict target string -> [Int] -- ^ Offsets of matches indicesS pat = search . (:[]) where search = matcher True pat -- | @'matchLL'@ finds the starting indices of all /non-overlapping/ occurrences -- of the pattern in the target string. It is a simple wrapper around -- 'Data.ByteString.Lazy.Search.KMP.nonOverlappingIndices' strictifying -- the pattern. {-# INLINE matchLL #-} matchLL :: L.ByteString -- ^ Lazy pattern -> L.ByteString -- ^ Lazy target string -> [Int64] -- ^ Offsets of matches matchLL pat = search . L.toChunks where !spat = strictify pat search = matcher False spat -- | @'matchLS'@ finds the starting indices of all /non-overlapping/ occurrences -- of the pattern in the target string. It is a simple wrapper around -- 'Data.ByteString.Search.KMP.nonOverlappingIndices' strictifying -- the pattern. {-# INLINE matchLS #-} matchLS :: L.ByteString -- ^ Lazy pattern -> S.ByteString -- ^ Strict target string -> [Int] -- ^ Offsets of matches matchLS pat = search . (:[]) where !spat = strictify pat search = matcher False spat -- | @'matchSS'@ finds the starting indices of all /non-overlapping/ occurrences -- of the pattern in the target string. It is an alias for -- 'Data.ByteString.Search.KMP.nonOverlappingIndices'. {-# INLINE matchSS #-} matchSS :: S.ByteString -- ^ Strict pattern -> S.ByteString -- ^ Strict target string -> [Int] -- ^ Offsets of matches matchSS pat = search . (:[]) where search = matcher False pat -- | @'matchSL'@ finds the starting indices of all /non-overlapping/ occurrences -- of the pattern in the target string. It is an alias for -- 'Data.ByteString.Lazy.Search.KMP.nonOverlappingIndices'. {-# INLINE matchSL #-} matchSL :: S.ByteString -- ^ Strict pattern -> L.ByteString -- ^ Lazy target string -> [Int64] -- ^ Offsets of matches matchSL pat = search . L.toChunks where search = matcher False pat ------------------------------------------------------------------------------ -- Worker -- ------------------------------------------------------------------------------ {-# SPECIALISE matcher :: Bool -> S.ByteString -> [S.ByteString] -> [Int], Bool -> S.ByteString -> [S.ByteString] -> [Int64] #-} matcher :: Integral a => Bool -> S.ByteString -> [S.ByteString] -> [a] matcher _ !pat | S.null pat = (0 :) . go 0 where go _ [] = [] go !prior (!str : rest) = [prior + fromIntegral i | i <- [1 .. l]] ++ go prior' rest where !l = S.length str !prior' = prior + fromIntegral l matcher !overlap pat = searcher 0 0 where !patLen = S.length pat !bords = kmpBorders pat !patH = patAt 0 {-# INLINE misi #-} misi !i = unsafeAt bords i {-# INLINE patAt #-} patAt !i = unsafeIndex pat i !ami = if overlap then misi patLen else 0 searcher _ _ [] = [] searcher !prior !patPos (!str : rest) | patPos == 0 = checkHead 0 | otherwise = findMatch patPos 0 where !strLen = S.length str {-# INLINE strAt #-} strAt !i = unsafeIndex str i checkHead !strI | strI == strLen = searcher (prior + fromIntegral strLen) 0 rest | strAt strI == patH = findMatch 1 (strI + 1) | otherwise = checkHead (strI + 1) findMatch !patI !strI | patI == patLen = (prior + fromIntegral strI - fromIntegral patLen) : if ami == 0 then checkHead strI else findMatch ami strI | strI == strLen = searcher (prior + fromIntegral strLen) patI rest | otherwise = if strAt strI == patAt patI then findMatch (patI + 1) (strI + 1) else case misi patI of 0 -> checkHead strI (-1) -> checkHead (strI + 1) pI -> findMatch pI strI stringsearch-0.3.6.6/Data/ByteString/Search/Internal/Utils.hs0000644000000000000000000003143712505631536022144 0ustar0000000000000000{-# LANGUAGE BangPatterns, FlexibleContexts #-} {-# OPTIONS_HADDOCK hide, prune #-} -- | -- Module : Data.ByteString.Search.Internal.Utils -- Copyright : Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portabiltity : non-portable -- -- Author : Daniel Fischer -- -- Utilities for several searching algorithms. module Data.ByteString.Search.Internal.Utils ( kmpBorders , automaton , occurs , suffShifts , ldrop , ltake , lsplit , release , keep , strictify ) where import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import Data.ByteString.Unsafe (unsafeIndex) import Data.Array.Base (unsafeRead, unsafeWrite, unsafeAt) import Data.Array.ST import Data.Array.Unboxed import Control.Monad (when) import Data.Bits import Data.Word (Word8) ------------------------------------------------------------------------------ -- Preprocessing -- ------------------------------------------------------------------------------ {-# INLINE automaton #-} automaton :: S.ByteString -> UArray Int Int automaton !pat = runSTUArray (do let !patLen = S.length pat {-# INLINE patAt #-} patAt !i = fromIntegral (unsafeIndex pat i) !bord = kmpBorders pat aut <- newArray (0, (patLen + 1)*256 - 1) 0 unsafeWrite aut (patAt 0) 1 let loop !state = do let !base = state `shiftL` 8 inner j | j < 0 = if state == patLen then return aut else loop (state+1) | otherwise = do let !i = base + patAt j s <- unsafeRead aut i when (s == 0) (unsafeWrite aut i (j+1)) inner (unsafeAt bord j) if state == patLen then inner (unsafeAt bord state) else inner state loop 1) -- kmpBorders calculates the width of the widest borders of the prefixes -- of the pattern which are not extensible to borders of the next -- longer prefix. Most entries will be 0. {-# INLINE kmpBorders #-} kmpBorders :: S.ByteString -> UArray Int Int kmpBorders pat = runSTUArray (do let !patLen = S.length pat {-# INLINE patAt #-} patAt :: Int -> Word8 patAt i = unsafeIndex pat i ar <- newArray_ (0, patLen) unsafeWrite ar 0 (-1) let dec w j | j < 0 || w == patAt j = return $! j+1 | otherwise = unsafeRead ar j >>= dec w bordLoop !i !j | patLen < i = return ar | otherwise = do let !w = patAt (i-1) j' <- dec w j if i < patLen && patAt j' == patAt i then unsafeRead ar j' >>= unsafeWrite ar i else unsafeWrite ar i j' bordLoop (i+1) j' bordLoop 1 (-1)) ------------------------------------------------------------------------------ -- Boyer-Moore Preprocessing -- ------------------------------------------------------------------------------ {- Table of last occurrences of bytes in the pattern. For each byte we record the (negated) position of its last occurrence in the pattern except at the last position. Thus, if byte b gives a mismatch at pattern position patPos, we know that we can shift the window right by at least patPos - (last occurrence of b in init pat) or, since we negated the positions, patPos + (occurs pat) If the byte doesn't occur in the pattern, we can shift the window so that the start of the pattern is aligned with the byte after this, hence the default value of 1. Complexity: O(patLen + size of alphabet) -} {- Precondition: non-empty pattern This invariant is guaranteed by not exporting occurs, inside this module, we don't call it for empty patterns. -} {-# INLINE occurs #-} occurs :: S.ByteString -> UArray Int Int occurs pat = runSTUArray (do let !patEnd = S.length pat - 1 {-# INLINE patAt #-} patAt :: Int -> Int patAt i = fromIntegral (unsafeIndex pat i) ar <- newArray (0, 255) 1 let loop !i | i == patEnd = return ar | otherwise = do unsafeWrite ar (patAt i) (-i) loop (i + 1) loop 0) {- Table of suffix-shifts. When a mismatch occurs at pattern position patPos, assumed to be not the last position in the pattern, the suffix u of length (patEnd - patPos) has been successfully matched. Let c be the byte in the pattern at position patPos. If the sub-pattern u also occurs in the pattern somewhere *not* preceded by c, let uPos be the position of the last byte in u for the last of all such occurrences. Then there can be no match if the window is shifted less than (patEnd - uPos) places, because either the part of the string which matched the suffix u is not aligned with an occurrence of u in the pattern, or it is aligned with an occurrence of u which is preceded by the same byte c as the originally matched suffix. If the complete sub-pattern u does not occur again in the pattern, or all of its occurrences are preceded by the byte c, then we can align the pattern with the string so that a suffix v of u matches a prefix of the pattern. If v is chosen maximal, no smaller shift can give a match, so we can shift by at least (patLen - length v). If a complete match is encountered, we can shift by at least the same amount as if the first byte of the pattern was a mismatch, no complete match is possible between these positions. For non-periodic patterns, only very short suffixes will usually occur again in the pattern, so if a longer suffix has been matched before a mismatch, the window can then be shifted entirely past the partial match, so that part of the string will not be re-compared. For periodic patterns, the suffix shifts will be shorter in general, leading to an O(strLen * patLen) worst-case performance. To compute the suffix-shifts, we use an array containing the lengths of the longest common suffixes of the entire pattern and its prefix ending with position pos. -} {- Precondition: non-empty pattern -} {-# INLINE suffShifts #-} suffShifts :: S.ByteString -> UArray Int Int suffShifts pat = runSTUArray (do let !patLen = S.length pat !patEnd = patLen - 1 !suff = suffLengths pat ar <- newArray (0,patEnd) patLen let preShift !idx !j | idx < 0 = return () | suff `unsafeAt` idx == idx + 1 = do let !shf = patEnd - idx fillToShf !i | i == shf = return () | otherwise = do unsafeWrite ar i shf fillToShf (i + 1) fillToShf j preShift (idx - 1) shf | otherwise = preShift (idx - 1) j sufShift !idx | idx == patEnd = return ar | otherwise = do unsafeWrite ar (patEnd - unsafeAt suff idx) (patEnd - idx) sufShift (idx + 1) preShift (patEnd - 1) 0 sufShift 0) {- Table of suffix-lengths. The value of this array at place i is the length of the longest common suffix of the entire pattern and the prefix of the pattern ending at position i. Usually, most of the entries will be 0. Only if the byte at position i is the same as the last byte of the pattern can the value be positive. In any case the value at index patEnd is patLen (since the pattern is identical to itself) and 0 <= value at i <= (i + 1). To keep this part of preprocessing linear in the length of the pattern, the implementation must be non-obvious (the obvious algorithm for this is quadratic). When the index under consideration is inside a previously identified common suffix, we align that suffix with the end of the pattern and check whether the suffix ending at the position corresponding to idx is shorter than the part of the suffix up to idx. If that is the case, the length of the suffix ending at idx is that of the suffix at the corresponding position. Otherwise extend the suffix as far as possible. If the index under consideration is not inside a previously identified common suffix, compare with the last byte of the pattern. If that gives a suffix of length > 1, for the next index we're in the previous situation, otherwise we're back in the same situation for the next index. -} {- Precondition: non-empty pattern -} {-# INLINE suffLengths #-} suffLengths :: S.ByteString -> UArray Int Int suffLengths pat = runSTUArray (do let !patLen = S.length pat !patEnd = patLen - 1 !preEnd = patEnd - 1 {-# INLINE patAt #-} patAt i = unsafeIndex pat i -- last byte for comparisons !pe = patAt patEnd -- find index preceding the longest suffix dec !diff !j | j < 0 || patAt j /= patAt (j + diff) = j | otherwise = dec diff (j - 1) ar <- newArray_ (0, patEnd) unsafeWrite ar patEnd patLen let noSuff !i | i < 0 = return ar | patAt i == pe = do let !diff = patEnd - i !nextI = i - 1 !prevI = dec diff nextI if prevI == nextI then unsafeWrite ar i 1 >> noSuff nextI else do unsafeWrite ar i (i - prevI) suffLoop prevI preEnd nextI | otherwise = do unsafeWrite ar i 0 noSuff (i - 1) suffLoop !pre !end !idx | idx < 0 = return ar | pre < idx = if patAt idx /= pe then unsafeWrite ar idx 0 >> suffLoop pre (end - 1) (idx - 1) else do prevS <- unsafeRead ar end if pre + prevS < idx then do unsafeWrite ar idx prevS suffLoop pre (end - 1) (idx - 1) else do let !prI = dec (patEnd - idx) pre unsafeWrite ar idx (idx - prI) suffLoop prI preEnd (idx - 1) | otherwise = noSuff idx noSuff preEnd) ------------------------------------------------------------------------------ -- Helper Functions -- ------------------------------------------------------------------------------ {-# INLINE strictify #-} strictify :: L.ByteString -> S.ByteString strictify = S.concat . L.toChunks -- drop k bytes from a list of strict ByteStrings {-# INLINE ldrop #-} ldrop :: Int -> [S.ByteString] -> [S.ByteString] ldrop _ [] = [] ldrop k (!h : t) | k < l = S.drop k h : t | otherwise = ldrop (k - l) t where !l = S.length h -- take k bytes from a list of strict ByteStrings {-# INLINE ltake #-} ltake :: Int -> [S.ByteString] -> [S.ByteString] ltake _ [] = [] ltake !k (!h : t) | l < k = h : ltake (k - l) t | otherwise = [S.take k h] where !l = S.length h -- split a list of strict ByteStrings at byte k {-# INLINE lsplit #-} lsplit :: Int -> [S.ByteString] -> ([S.ByteString], [S.ByteString]) lsplit _ [] = ([],[]) lsplit !k (!h : t) = case compare k l of LT -> ([S.take k h], S.drop k h : t) EQ -> ([h], t) GT -> let (u, v) = lsplit (k - l) t in (h : u, v) where !l = S.length h -- release is used to keep the zipper in lazySearcher from remembering -- the leading part of the searched string. The deep parameter is the -- number of characters that the past needs to hold. This ensures -- lazy streaming consumption of the searched string. {-# INLINE release #-} release :: Int -> [S.ByteString] -> [S.ByteString] release !deep _ | deep <= 0 = [] release !deep (!x:xs) = let !rest = release (deep-S.length x) xs in x : rest release _ [] = error "stringsearch.release could not find enough past!" -- keep is like release, only we mustn't forget the part of the past -- we don't need anymore for matching but have to keep it for -- breaking, splitting and replacing. -- The names would be more appropriate the other way round, but that's -- a historical accident, so what? {-# INLINE keep #-} keep :: Int -> [S.ByteString] -> ([S.ByteString],[S.ByteString]) keep !deep xs | deep < 1 = ([],xs) keep deep (!x:xs) = let (!p,d) = keep (deep - S.length x) xs in (x:p,d) keep _ [] = error "Forgot too much" stringsearch-0.3.6.6/Data/ByteString/Lazy/0000755000000000000000000000000012505631536016436 5ustar0000000000000000stringsearch-0.3.6.6/Data/ByteString/Lazy/Search.hs0000644000000000000000000003434612505631536020211 0ustar0000000000000000-- | -- Module : Data.ByteString.Lazy.Search -- Copyright : Daniel Fischer -- Chris Kuklewicz -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast overlapping Boyer-Moore search of lazy -- 'L.ByteString' values. Breaking, splitting and replacing -- using the Boyer-Moore algorithm. -- -- Descriptions of the algorithm can be found at -- -- and -- -- -- Original authors: Daniel Fischer (daniel.is.fischer at googlemail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Lazy.Search( -- * Overview -- $overview -- ** Performance -- $performance -- ** Caution -- $caution -- ** Complexity -- $complexity -- ** Partial application -- $partial -- ** Integer overflow -- $overflow -- * Finding substrings indices , nonOverlappingIndices -- * Breaking on substrings , breakOn , breakAfter , breakFindAfter -- * Replacing , replace -- * Splitting , split , splitKeepEnd , splitKeepFront -- * Convenience , strictify ) where import qualified Data.ByteString.Lazy.Search.Internal.BoyerMoore as BM import Data.ByteString.Search.Substitution import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import Data.Int (Int64) -- $overview -- -- This module provides functions related to searching a substring within -- a string, using the Boyer-Moore algorithm with minor modifications -- to improve the overall performance and ameliorate the worst case -- performance degradation of the original Boyer-Moore algorithm for -- periodic patterns. -- -- Efficiency demands that the pattern be a strict 'S.ByteString', -- to work with a lazy pattern, convert it to a strict 'S.ByteString' -- first via 'strictify' (provided it is not too long). -- If support for long lazy patterns is needed, mail a feature-request. -- -- When searching a pattern in a UTF-8-encoded 'S.ByteString', be aware that -- these functions work on bytes, not characters, so the indices are -- byte-offsets, not character offsets. -- $performance -- -- In general, the Boyer-Moore algorithm is the most efficient method to -- search for a pattern inside a string. The advantage over other algorithms -- (e.g. Naïve, Knuth-Morris-Pratt, Horspool, Sunday) can be made -- arbitrarily large for specially selected patterns and targets, but -- usually, it's a factor of 2–3 versus Knuth-Morris-Pratt and of -- 6–10 versus the naïve algorithm. The Horspool and Sunday -- algorithms, which are simplified variants of the Boyer-Moore algorithm, -- typically have performance between Boyer-Moore and Knuth-Morris-Pratt, -- mostly closer to Boyer-Moore. The advantage of the Boyer-moore variants -- over other algorithms generally becomes larger for longer patterns. For -- very short patterns (or patterns with a very short period), other -- algorithms, e.g. "Data.ByteString.Lazy.Search.DFA" can be faster (my -- tests suggest that \"very short\" means two, maybe three bytes). -- -- In general, searching in a strict 'S.ByteString' is slightly faster -- than searching in a lazy 'L.ByteString', but for long targets the -- smaller memory footprint of lazy 'L.ByteString's can make searching -- those (sometimes much) faster. On the other hand, there are cases -- where searching in a strict target is much faster, even for long targets. -- -- On 32-bit systems, 'Int'-arithmetic is much faster than 'Int64'-arithmetic, -- so when there are many matches, that can make a significant difference. -- -- Also, the modification to ameliorate the case of periodic patterns -- is defeated by chunk-boundaries, so long patterns with a short period -- and many matches exhibit poor behaviour (consider using @indices@ from -- "Data.ByteString.Lazy.Search.DFA" or "Data.ByteString.Lazy.Search.KMP" -- in those cases, the former for medium-length patterns, the latter for -- long patterns; none of the functions except 'indices' suffer from -- this problem, though). -- $caution -- -- When working with a lazy target string, the relation between the pattern -- length and the chunk size can play a big rôle. -- Crossing chunk boundaries is relatively expensive, so when that becomes -- a frequent occurrence, as may happen when the pattern length is close -- to or larger than the chunk size, performance is likely to degrade. -- If it is needed, steps can be taken to ameliorate that effect, but unless -- entirely separate functions are introduced, that would hurt the -- performance for the more common case of patterns much shorter than -- the default chunk size. -- $complexity -- -- Preprocessing the pattern is /O/(@patternLength@ + σ) in time and -- space (σ is the alphabet size, 256 here) for all functions. -- The time complexity of the searching phase for 'indices' -- is /O/(@targetLength@ \/ @patternLength@) in the best case. -- For non-periodic patterns, the worst case complexity is -- /O/(@targetLength@), but for periodic patterns, the worst case complexity -- is /O/(@targetLength@ * @patternLength@) for the original Boyer-Moore -- algorithm. -- -- The searching functions in this module contain a modification which -- drastically improves the performance for periodic patterns, although -- less for lazy targets than for strict ones. -- If I'm not mistaken, the worst case complexity for periodic patterns -- is /O/(@targetLength@ * (1 + @patternLength@ \/ @chunkSize@)). -- -- The other functions don't have to deal with possible overlapping -- patterns, hence the worst case complexity for the processing phase -- is /O/(@targetLength@) (respectively /O/(@firstIndex + patternLength@) -- for the breaking functions if the pattern occurs). -- $partial -- -- All functions can usefully be partially applied. Given only a pattern, -- the pattern is preprocessed only once, allowing efficient re-use. -- $overflow -- -- The current code uses @Int@ to keep track of the locations in the -- target string. If the length of the pattern plus the length of any -- strict chunk of the target string is greater or equal to -- @'maxBound' :: 'Int'@ then this will overflow causing an error. We try -- to detect this and call 'error' before a segfault occurs. ------------------------------------------------------------------------------ -- Exported Functions -- ------------------------------------------------------------------------------ -- | @'indices'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE indices #-} indices :: S.ByteString -- ^ Strict pattern to find -> L.ByteString -- ^ Lazy string to search -> [Int64] -- ^ Offsets of matches indices = BM.matchSL -- | @'nonOverlappingIndices'@ finds the starting indices of all -- non-overlapping occurrences of the pattern in the target string. -- It is more efficient than removing indices from the list produced -- by 'indices'. {-# INLINE nonOverlappingIndices #-} nonOverlappingIndices :: S.ByteString -- ^ Strict pattern to find -> L.ByteString -- ^ Lazy string to search -> [Int64] -- ^ Offsets of matches nonOverlappingIndices = BM.matchNOL -- | @'breakOn' pattern target@ splits @target@ at the first occurrence -- of @pattern@. If the pattern does not occur in the target, the -- second component of the result is empty, otherwise it starts with -- @pattern@. If the pattern is empty, the first component is empty. -- For a non-empty pattern, the first component is generated lazily, -- thus the first parts of it can be available before the pattern has -- been found or determined to be absent. -- -- @ -- 'uncurry' 'L.append' . 'breakOn' pattern = 'id' -- @ {-# INLINE breakOn #-} breakOn :: S.ByteString -- ^ Strict pattern to search for -> L.ByteString -- ^ Lazy string to search in -> (L.ByteString, L.ByteString) -- ^ Head and tail of string broken at substring breakOn = BM.breakSubstringL -- | @'breakAfter' pattern target@ splits @target@ behind the first occurrence -- of @pattern@. An empty second component means that either the pattern -- does not occur in the target or the first occurrence of pattern is at -- the very end of target. If you need to discriminate between those cases, -- use breakFindAfter. -- If the pattern is empty, the first component is empty. -- For a non-empty pattern, the first component is generated lazily, -- thus the first parts of it can be available before the pattern has -- been found or determined to be absent. -- -- @ -- 'uncurry' 'L.append' . 'breakAfter' pattern = 'id' -- @ {-# INLINE breakAfter #-} breakAfter :: S.ByteString -- ^ Strict pattern to search for -> L.ByteString -- ^ Lazy string to search in -> (L.ByteString, L.ByteString) -- ^ Head and tail of string broken after substring breakAfter = BM.breakAfterL -- | @'breakFindAfter'@ does the same as 'breakAfter' but additionally indicates -- whether the pattern is present in the target. -- -- @ -- 'fst' . 'breakFindAfter' pat = 'breakAfter' pat -- @ {-# INLINE breakFindAfter #-} breakFindAfter :: S.ByteString -- ^ Strict pattern to search for -> L.ByteString -- ^ Lazy string to search in -> ((L.ByteString, L.ByteString), Bool) -- ^ Head and tail of string broken after substring -- and presence of pattern breakFindAfter = BM.breakFindAfterL -- | @'replace' pat sub text@ replaces all (non-overlapping) occurrences of -- @pat@ in @text@ with @sub@. If occurrences of @pat@ overlap, the first -- occurrence that does not overlap with a replaced previous occurrence -- is substituted. Occurrences of @pat@ arising from a substitution -- will not be substituted. For example: -- -- @ -- 'replace' \"ana\" \"olog\" \"banana\" = \"bologna\" -- 'replace' \"ana\" \"o\" \"bananana\" = \"bono\" -- 'replace' \"aab\" \"abaa\" \"aaabb\" = \"aabaab\" -- @ -- -- The result is a lazy 'L.ByteString', -- which is lazily produced, without copying. -- Equality of pattern and substitution is not checked, but -- -- @ -- 'replace' pat pat text == text -- @ -- -- holds (the internal structure is generally different). -- If the pattern is empty but not the substitution, the result -- is equivalent to (were they 'String's) @cycle sub@. -- -- For non-empty @pat@ and @sub@ a lazy 'L.ByteString', -- -- @ -- 'L.concat' . 'Data.List.intersperse' sub . 'split' pat = 'replace' pat sub -- @ -- -- and analogous relations hold for other types of @sub@. {-# INLINE replace #-} replace :: Substitution rep => S.ByteString -- ^ Strict pattern to replace -> rep -- ^ Replacement string -> L.ByteString -- ^ Lazy string to modify -> L.ByteString -- ^ Lazy result replace = BM.replaceAllL -- | @'split' pattern target@ splits @target@ at each (non-overlapping) -- occurrence of @pattern@, removing @pattern@. If @pattern@ is empty, -- the result is an infinite list of empty 'L.ByteString's, if @target@ -- is empty but not @pattern@, the result is an empty list, otherwise -- the following relations hold (where @patL@ is the lazy 'L.ByteString' -- corresponding to @pat@): -- -- @ -- 'L.concat' . 'Data.List.intersperse' patL . 'split' pat = 'id', -- 'length' ('split' pattern target) == -- 'length' ('nonOverlappingIndices' pattern target) + 1, -- @ -- -- no fragment in the result contains an occurrence of @pattern@. {-# INLINE split #-} split :: S.ByteString -- ^ Strict pattern to split on -> L.ByteString -- ^ Lazy string to split -> [L.ByteString] -- ^ Fragments of string split = BM.splitDropL -- | @'splitKeepEnd' pattern target@ splits @target@ after each (non-overlapping) -- occurrence of @pattern@. If @pattern@ is empty, the result is an -- infinite list of empty 'L.ByteString's, otherwise the following -- relations hold: -- -- @ -- 'L.concat' . 'splitKeepEnd' pattern = 'id', -- @ -- -- all fragments in the result except possibly the last end with -- @pattern@, no fragment contains more than one occurrence of @pattern@. {-# INLINE splitKeepEnd #-} splitKeepEnd :: S.ByteString -- ^ Strict pattern to split on -> L.ByteString -- ^ Lazy string to split -> [L.ByteString] -- ^ Fragments of string splitKeepEnd = BM.splitKeepEndL -- | @'splitKeepFront'@ is like 'splitKeepEnd', except that @target@ is split -- before each occurrence of @pattern@ and hence all fragments -- with the possible exception of the first begin with @pattern@. -- No fragment contains more than one non-overlapping occurrence -- of @pattern@. {-# INLINE splitKeepFront #-} splitKeepFront :: S.ByteString -- ^ Strict pattern to split on -> L.ByteString -- ^ Lazy string to split -> [L.ByteString] -- ^ Fragments of string splitKeepFront = BM.splitKeepFrontL -- | @'strictify'@ converts a lazy 'L.ByteString' to a strict 'S.ByteString' -- to make it a suitable pattern. strictify :: L.ByteString -> S.ByteString strictify = S.concat . L.toChunks stringsearch-0.3.6.6/Data/ByteString/Lazy/Search/0000755000000000000000000000000012505631536017643 5ustar0000000000000000stringsearch-0.3.6.6/Data/ByteString/Lazy/Search/DFA.hs0000644000000000000000000004313512505631536020577 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} -- | -- Module : Data.ByteString.Lazy.Search.DFA -- Copyright : Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast search of lazy 'L.ByteString' values. Breaking, -- splitting and replacing using a deterministic finite automaton. module Data.ByteString.Lazy.Search.DFA ( -- * Overview -- $overview -- ** Complexity and performance -- $complexity -- ** Partial application -- $partial -- * Finding substrings indices , nonOverlappingIndices -- * Breaking on substrings , breakOn , breakAfter , breakFindAfter -- * Replacing , replace -- * Splitting , split , splitKeepEnd , splitKeepFront ) where import Data.ByteString.Search.Internal.Utils (automaton, keep, ldrop, lsplit) import Data.ByteString.Search.Substitution import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import qualified Data.ByteString.Lazy.Internal as LI import Data.ByteString.Unsafe (unsafeIndex) import Data.Array.Base (unsafeAt) --import Data.Array.Unboxed (UArray) import Data.Bits import Data.Int (Int64) -- $overview -- -- This module provides functions related to searching a substring within -- a string. The searching algorithm uses a deterministic finite automaton -- based on the Knuth-Morris-Pratt algorithm. -- The automaton is implemented as an array of @(patternLength + 1) * σ@ -- state transitions, where σ is the alphabet size (256), so it is only -- suitable for short enough patterns, therefore the patterns in this module -- are required to be strict 'S.ByteString's. -- -- When searching a pattern in a UTF-8-encoded 'L.ByteString', be aware that -- these functions work on bytes, not characters, so the indices are -- byte-offsets, not character offsets. -- $complexity -- -- The time and space complexity of the preprocessing phase is -- /O/(@patternLength * σ@). -- The searching phase is /O/(@targetLength@), each target character is -- inspected only once. -- -- In general the functions in this module have about the same performance as -- the corresponding functions using the Knuth-Morris-Pratt algorithm but -- are considerably slower than the Boyer-Moore functions. For very short -- patterns or, in the case of 'indices', patterns with a short period -- which occur often, however, times are close to or even below the -- Boyer-Moore times. -- $partial -- -- All functions can usefully be partially applied. Given only a pattern, -- the automaton is constructed only once, allowing efficient re-use. ------------------------------------------------------------------------------ -- Exported Functions -- ------------------------------------------------------------------------------ -- | @'indices'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE indices #-} indices :: S.ByteString -- ^ Strict pattern to find -> L.ByteString -- ^ Lazy string to search -> [Int64] -- ^ Offsets of matches indices !pat = lazySearcher True pat . L.toChunks -- | @'nonOverlappingIndices'@ finds the starting indices of all -- non-overlapping occurrences of the pattern in the target string. -- It is more efficient than removing indices from the list produced -- by 'indices'. {-# INLINE nonOverlappingIndices #-} nonOverlappingIndices :: S.ByteString -- ^ Strict pattern to find -> L.ByteString -- ^ Lazy string to search -> [Int64] -- ^ Offsets of matches nonOverlappingIndices !pat = lazySearcher False pat . L.toChunks -- | @'breakOn' pattern target@ splits @target@ at the first occurrence -- of @pattern@. If the pattern does not occur in the target, the -- second component of the result is empty, otherwise it starts with -- @pattern@. If the pattern is empty, the first component is empty. -- For a non-empty pattern, the first component is generated lazily, -- thus the first parts of it can be available before the pattern has -- been found or determined to be absent. -- -- @ -- 'uncurry' 'L.append' . 'breakOn' pattern = 'id' -- @ breakOn :: S.ByteString -- ^ Strict pattern to search for -> L.ByteString -- ^ Lazy string to search in -> (L.ByteString, L.ByteString) -- ^ Head and tail of string broken at substring breakOn pat = breaker . L.toChunks where lbrk = lazyBreaker True pat breaker strs = let (f, b) = lbrk strs in (L.fromChunks f, L.fromChunks b) -- | @'breakAfter' pattern target@ splits @target@ behind the first occurrence -- of @pattern@. An empty second component means that either the pattern -- does not occur in the target or the first occurrence of pattern is at -- the very end of target. If you need to discriminate between those cases, -- use breakFindAfter. -- If the pattern is empty, the first component is empty. -- For a non-empty pattern, the first component is generated lazily, -- thus the first parts of it can be available before the pattern has -- been found or determined to be absent. -- @ -- 'uncurry' 'L.append' . 'breakAfter' pattern = 'id' -- @ breakAfter :: S.ByteString -- ^ Strict pattern to search for -> L.ByteString -- ^ Lazy string to search in -> (L.ByteString, L.ByteString) -- ^ Head and tail of string broken after substring breakAfter pat = breaker . L.toChunks where lbrk = lazyBreaker False pat breaker strs = let (f, b) = lbrk strs in (L.fromChunks f, L.fromChunks b) -- | @'breakFindAfter'@ does the same as 'breakAfter' but additionally indicates -- whether the pattern is present in the target. -- -- @ -- 'fst' . 'breakFindAfter' pat = 'breakAfter' pat -- @ breakFindAfter :: S.ByteString -- ^ Strict pattern to search for -> L.ByteString -- ^ Lazy string to search in -> ((L.ByteString, L.ByteString), Bool) -- ^ Head and tail of string broken after substring -- and presence of pattern breakFindAfter pat | S.null pat = \str -> ((L.empty, str), True) breakFindAfter pat = breaker . L.toChunks where !patLen = S.length pat lbrk = lazyBreaker True pat breaker strs = let (f, b) = lbrk strs (f1, b1) = lsplit patLen b mbpat = L.fromChunks f1 in ((foldr LI.chunk mbpat f, L.fromChunks b1), not (null b)) -- | @'replace' pat sub text@ replaces all (non-overlapping) occurrences of -- @pat@ in @text@ with @sub@. If occurrences of @pat@ overlap, the first -- occurrence that does not overlap with a replaced previous occurrence -- is substituted. Occurrences of @pat@ arising from a substitution -- will not be substituted. For example: -- -- @ -- 'replace' \"ana\" \"olog\" \"banana\" = \"bologna\" -- 'replace' \"ana\" \"o\" \"bananana\" = \"bono\" -- 'replace' \"aab\" \"abaa\" \"aaabb\" = \"aabaab\" -- @ -- -- The result is a lazy 'L.ByteString', -- which is lazily produced, without copying. -- Equality of pattern and substitution is not checked, but -- -- @ -- 'replace' pat pat text == text -- @ -- -- holds (the internal structure is generally different). -- If the pattern is empty but not the substitution, the result -- is equivalent to (were they 'String's) @cycle sub@. -- -- For non-empty @pat@ and @sub@ a lazy 'L.ByteString', -- -- @ -- 'L.concat' . 'Data.List.intersperse' sub . 'split' pat = 'replace' pat sub -- @ -- -- and analogous relations hold for other types of @sub@. replace :: Substitution rep => S.ByteString -- ^ Strict pattern to replace -> rep -- ^ Replacement string -> L.ByteString -- ^ Lazy string to modify -> L.ByteString -- ^ Lazy result replace pat | S.null pat = \sub -> prependCycle sub | otherwise = let !patLen = S.length pat breaker = lazyBreaker True pat repl subst strs | null strs = [] | otherwise = let (pre, mtch) = breaker strs in pre ++ case mtch of [] -> [] _ -> subst (repl subst (ldrop patLen mtch)) in \sub -> let {-# NOINLINE subst #-} !subst = substitution sub repl1 = repl subst in L.fromChunks . repl1 . L.toChunks -- | @'split' pattern target@ splits @target@ at each (non-overlapping) -- occurrence of @pattern@, removing @pattern@. If @pattern@ is empty, -- the result is an infinite list of empty 'L.ByteString's, if @target@ -- is empty but not @pattern@, the result is an empty list, otherwise -- the following relations hold (where @patL@ is the lazy 'L.ByteString' -- corresponding to @pat@): -- -- @ -- 'L.concat' . 'Data.List.intersperse' patL . 'split' pat = 'id', -- 'length' ('split' pattern target) == -- 'length' ('nonOverlappingIndices' pattern target) + 1, -- @ -- -- no fragment in the result contains an occurrence of @pattern@. split :: S.ByteString -- ^ Strict pattern to split on -> L.ByteString -- ^ Lazy string to split -> [L.ByteString] -- ^ Fragments of string split pat | S.null pat = const (repeat L.empty) split pat = map L.fromChunks . splitter . L.toChunks where !patLen = S.length pat breaker = lazyBreaker True pat splitter strs | null strs = [] | otherwise = splitter' strs splitter' strs | null strs = [[]] | otherwise = case breaker strs of (pre, mtch) -> pre : case mtch of [] -> [] _ -> splitter' (ldrop patLen mtch) -- | @'splitKeepEnd' pattern target@ splits @target@ after each (non-overlapping) -- occurrence of @pattern@. If @pattern@ is empty, the result is an -- infinite list of empty 'L.ByteString's, otherwise the following -- relations hold: -- -- @ -- 'L.concat' . 'splitKeepEnd' pattern = 'id,' -- @ -- -- all fragments in the result except possibly the last end with -- @pattern@, no fragment contains more than one occurrence of @pattern@. splitKeepEnd :: S.ByteString -- ^ Strict pattern to split on -> L.ByteString -- ^ Lazy string to split -> [L.ByteString] -- ^ Fragments of string splitKeepEnd pat | S.null pat = const (repeat L.empty) splitKeepEnd pat = map L.fromChunks . splitter . L.toChunks where breaker = lazyBreaker False pat splitter [] = [] splitter strs = case breaker strs of (pre, mtch) -> pre : splitter mtch -- | @'splitKeepFront'@ is like 'splitKeepEnd', except that @target@ is split -- before each occurrence of @pattern@ and hence all fragments -- with the possible exception of the first begin with @pattern@. -- No fragment contains more than one non-overlapping occurrence -- of @pattern@. splitKeepFront :: S.ByteString -- ^ Strict pattern to split on -> L.ByteString -- ^ Lazy string to split -> [L.ByteString] -- ^ Fragments of string splitKeepFront pat | S.null pat = const (repeat L.empty) splitKeepFront pat = map L.fromChunks . splitter . L.toChunks where !patLen = S.length pat breaker = lazyBreaker True pat splitter strs = case splitter' strs of ([] : rst) -> rst other -> other splitter' [] = [] splitter' strs = case breaker strs of (pre, mtch) -> pre : case mtch of [] -> [] _ -> case lsplit patLen mtch of (pt, rst) -> if null rst then [pt] else let (h : t) = splitter' rst in (pt ++ h) : t ------------------------------------------------------------------------------ -- Searching Function -- ------------------------------------------------------------------------------ lazySearcher :: Bool -> S.ByteString -> [S.ByteString] -> [Int64] lazySearcher _ !pat | S.null pat = let zgo _ [] = [] zgo !prior (!str : rest) = let !l = S.length str !prior' = prior + fromIntegral l in [prior + fromIntegral i | i <- [1 .. l]] ++ zgo prior' rest in (0:) . zgo 0 | S.length pat == 1 = let !w = S.head pat ixes = S.elemIndices w go _ [] = [] go !prior (!str : rest) = let !prior' = prior + fromIntegral (S.length str) in map ((+ prior) . fromIntegral) (ixes str) ++ go prior' rest in go 0 lazySearcher !overlap pat = search 0 0 where !patLen = S.length pat !auto = automaton pat !p0 = unsafeIndex pat 0 !ams = if overlap then patLen else 0 search _ _ [] = [] search !prior st (!str:rest) = match st 0 where !strLen = S.length str {-# INLINE strAt #-} strAt :: Int -> Int strAt i = fromIntegral (str `unsafeIndex` i) match 0 !idx | idx == strLen = search (prior + fromIntegral strLen) 0 rest | unsafeIndex str idx == p0 = match 1 (idx + 1) | otherwise = match 0 (idx + 1) match state idx | idx == strLen = search (prior + fromIntegral strLen) state rest | otherwise = let nstate = unsafeAt auto ((state `shiftL` 8) + strAt idx) !nxtIdx = idx + 1 in if nstate == patLen then (prior + fromIntegral (nxtIdx - patLen)) : match ams nxtIdx else match nstate nxtIdx ------------------------------------------------------------------------------ -- Breaking -- ------------------------------------------------------------------------------ -- Code duplication :( -- Needed for reasonable performance. lazyBreaker :: Bool -> S.ByteString -> [S.ByteString] -> ([S.ByteString], [S.ByteString]) lazyBreaker before pat | S.null pat = \strs -> ([], strs) | S.length pat == 1 = let !w = S.head pat !a = if before then 0 else 1 ixes = S.elemIndices w scan [] = ([], []) scan (!str:rest) = let !strLen = S.length str in case ixes str of [] -> let (fr, bk) = scan rest in (str : fr, bk) (i:_) -> let !j = i + a in if j == strLen then ([str],rest) else ([S.take j str], S.drop j str : rest) in scan lazyBreaker !before pat = bscan [] 0 where !patLen = S.length pat !auto = automaton pat !p0 = unsafeIndex pat 0 bscan _ _ [] = ([], []) bscan !past !sta (!str:rest) = match sta 0 where !strLen = S.length str {-# INLINE strAt #-} strAt :: Int -> Int strAt i = fromIntegral (str `unsafeIndex` i) match 0 idx | idx == strLen = let (fr, bk) = bscan [] 0 rest in (foldr (flip (.) . (:)) id past (str:fr), bk) | unsafeIndex str idx == p0 = match 1 (idx + 1) | otherwise = match 0 (idx + 1) match state idx | idx == strLen = let (kp, !rl) = if before then keep state (str:past) else ([], str:past) (fr, bk) = bscan kp state rest in (foldr (flip (.) . (:)) id rl fr, bk) | otherwise = let !nstate = unsafeAt auto ((state `shiftL` 8) + strAt idx) !nxtIdx = idx + 1 in if nstate == patLen then case if before then nxtIdx - patLen else nxtIdx of 0 -> (foldr (flip (.) . (:)) id past [], str:rest) stIx | stIx < 0 -> rgo (-stIx) (str:rest) past | stIx == strLen -> (foldr (flip (.) . (:)) id past [str],rest) | otherwise -> (foldr (flip (.) . (:)) id past [S.take stIx str], S.drop stIx str : rest) else match nstate nxtIdx -- Did I already mention that I suck at finding names? {-# INLINE rgo #-} rgo :: Int -> [S.ByteString] -> [S.ByteString] -> ([S.ByteString], [S.ByteString]) rgo !kp acc (!str:more) | sl == kp = (reverse more, str:acc) | sl < kp = rgo (kp - sl) (str:acc) more | otherwise = case S.splitAt (sl - kp) str of (fr, bk) -> (foldr (flip (.) . (:)) id more [fr], bk:acc) where !sl = S.length str rgo _ _ [] = error "Not enough past!" -- If that error is ever encountered, I screwed up badly. stringsearch-0.3.6.6/Data/ByteString/Lazy/Search/KarpRabin.hs0000644000000000000000000002424112505631536022053 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} -- | -- Module : Data.ByteString.Lazy.Search.KarpRabin -- Copyright : (c) 2010 Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Simultaneous search for multiple patterns in a lazy 'L.ByteString' -- using the Karp-Rabin algorithm. -- -- A description of the algorithm for a single pattern can be found at -- . module Data.ByteString.Lazy.Search.KarpRabin ( -- * Overview -- $overview -- ** Caution -- $caution -- * Function indicesOfAny ) where import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import Data.ByteString.Unsafe (unsafeIndex) import qualified Data.IntMap as IM import Data.Array import Data.Array.Base (unsafeAt) import Data.Word (Word8) import Data.Int (Int64) import Data.Bits import Data.List (foldl') -- $overview -- -- The Karp-Rabin algorithm works by calculating a hash of the pattern and -- comparing that hash with the hash of a slice of the target string with -- the same length as the pattern. If the hashes are equal, the slice of the -- target is compared to the pattern character by character (since the hash -- function generally isn't injective). -- -- For a single pattern, this tends to be more efficient than the naïve -- algorithm, but it cannot compete with algorithms like -- Knuth-Morris-Pratt or Boyer-Moore. -- -- However, the algorithm can be generalised to search for multiple patterns -- simultaneously. If the shortest pattern has length @k@, hash the prefix of -- length @k@ of all patterns and compare the hash of the target's slices of -- length @k@ to them. If there's a match, check whether the slice is part -- of an occurrence of the corresponding pattern. -- -- With a hash-function that -- -- * allows to compute the hash of one slice in constant time from the hash -- of the previous slice, the new and the dropped character, and -- -- * produces few spurious matches, -- -- searching for occurrences of any of @n@ patterns has a best-case complexity -- of /O/(@targetLength@ * @lookup n@). The worst-case complexity is -- /O/(@targetLength@ * @lookup n@ * @sum patternLengths@), the average is -- not much worse than the best case. -- -- The functions in this module store the hashes of the patterns in an -- 'IM.IntMap', so the lookup is /O/(@log n@). Re-hashing is done in constant -- time and spurious matches of the hashes /should be/ sufficiently rare. -- The maximal length of the prefixes to be hashed is 32. -- $caution -- -- Unfortunately, the constant factors are high, so these functions are slow. -- Unless the number of patterns to search for is high (larger than 50 at -- least), repeated search for single patterns using Boyer-Moore or DFA and -- manual merging of the indices is faster. /Much/ faster for less than 40 -- or so patterns. -- -- 'indicesOfAny' has the advantage over multiple single-pattern searches that -- it doesn't hold on to large parts of the string (which is likely to happen -- for multiple searches), however, so in contrast to the strict version, it -- may be useful for relatively few patterns already. -- -- Nevertheless, this module seems more of an interesting curiosity than -- anything else. -- | @'indicesOfAny'@ finds all occurrences of any of several non-empty strict -- patterns in a lazy target string. If no non-empty patterns are given, -- the result is an empty list. Otherwise the result list contains -- the pairs of all indices where any of the (non-empty) patterns start -- and the list of all patterns starting at that index, the patterns being -- represented by their (zero-based) position in the pattern list. -- Empty patterns are filtered out before processing begins. {-# INLINE indicesOfAny #-} indicesOfAny :: [S.ByteString] -- ^ List of non-empty patterns -> L.ByteString -- ^ String to search -> [(Int64,[Int])] -- ^ List of matches indicesOfAny pats | null nepats = const [] | otherwise = lazyMatcher nepats . L.toChunks where nepats = filter (not . S.null) pats ------------------------------------------------------------------------------ -- Workers -- ------------------------------------------------------------------------------ {-# INLINE rehash1 #-} rehash1 :: Int -> Int -> Word8 -> Word8 -> Int rehash1 out h o n = (h `shiftL` 1 - (fromIntegral o `shiftL` out)) + fromIntegral n {-# INLINE rehash2 #-} rehash2 :: Int -> Int -> Word8 -> Word8 -> Int rehash2 out h o n = (h `shiftL` 2 - (fromIntegral o `shiftL` out)) + fromIntegral n {-# INLINE rehash3 #-} rehash3 :: Int -> Int -> Word8 -> Word8 -> Int rehash3 out h o n = (h `shiftL` 3 - (fromIntegral o `shiftL` out)) + fromIntegral n {-# INLINE rehash4 #-} rehash4 :: Int -> Int -> Word8 -> Word8 -> Int rehash4 out h o n = (h `shiftL` 4 - (fromIntegral o `shiftL` out)) + fromIntegral n lazyMatcher :: [S.ByteString] -> [S.ByteString] -> [(Int64,[Int])] lazyMatcher pats = search 0 hLen S.empty where !hLen = minimum (32 : map S.length pats) !shDi = case 32 `quot` hLen of q | q < 4 -> q | otherwise -> 4 !outS = shDi*hLen !patNum = length pats !patArr = listArray (0, patNum - 1) pats {-# INLINE rehash #-} rehash :: Int -> Word8 -> Word8 -> Int rehash = case shDi of 1 -> rehash1 hLen 2 -> rehash2 outS 3 -> rehash3 outS _ -> rehash4 outS hash :: S.ByteString -> Int hash = S.foldl' (\h w -> (h `shiftL` shDi) + fromIntegral w) 0 . S.take hLen !hashMap = foldl' (\mp (h,i) -> IM.insertWith (flip (++)) h [i] mp) IM.empty $ zip (map hash pats) [0 :: Int .. ] search _ _ _ [] = [] search !h !rm !prev (!str : rest) | strLen < rm = let !h' = S.foldl' (\o w -> (o `shiftL` 1) + fromIntegral w) h str !prev' = S.append prev str in search h' (rm - strLen) prev' rest | otherwise = let !h' = S.foldl' (\o w -> (o `shiftL` 1) + fromIntegral w) h (S.take rm str) in if S.null prev then noPast 0 rest str h' else past 0 rest prev 0 str rm h' where !strLen = S.length str noPast !prior rest !str hsh = go hsh 0 where !strLen = S.length str !maxIdx = strLen - hLen {-# INLINE strAt #-} strAt !i = unsafeIndex str i go !h sI = case IM.lookup h hashMap of Nothing -> if sI == maxIdx then case rest of [] -> [] (nxt : more) -> let !h' = rehash h (strAt sI) (unsafeIndex nxt 0) !prior' = prior + fromIntegral strLen !prev = S.drop (sI + 1) str in if hLen == 1 then noPast prior' more nxt h' else past prior' more prev 0 nxt 1 h' else go (rehash h (strAt sI) (strAt (sI + hLen))) (sI + 1) Just ps -> let !rst = S.drop sI str !rLen = strLen - sI {-# INLINE hd #-} hd = strAt sI {-# INLINE more #-} more = if sI == maxIdx then case rest of [] -> [] (nxt : fut) -> let !h' = rehash h hd (unsafeIndex nxt 0) !prior' = prior + fromIntegral strLen in if hLen == 1 then noPast prior' fut nxt h' else past prior' fut rst 1 nxt 1 h' else go (rehash h hd (strAt (sI + hLen))) (sI + 1) okay bs | rLen < S.length bs = S.isPrefixOf rst bs && checkFut (S.drop rLen bs) rest | otherwise = S.isPrefixOf bs rst in case filter (okay . (patArr `unsafeAt`)) ps of [] -> more qs -> seq (length qs) $ (prior + fromIntegral sI,qs) : more past !prior rest !prev !pI !str !sI !hsh | strLen < 4040 = let !prior' = prior - 1 + fromIntegral (sI - hLen) !curr = S.append (S.drop pI prev) str in noPast prior' rest curr hsh | otherwise = go hsh pI sI where !strLen = S.length str {-# INLINE strAt #-} strAt !i = unsafeIndex str i {-# INLINE prevAt #-} prevAt !i = unsafeIndex prev i go !h !p !s | s == hLen = noPast prior rest str h | otherwise = case IM.lookup h hashMap of Nothing -> let {-# INLINE h' #-} h' = rehash h (prevAt p) (strAt s) in go h' (p + 1) (s + 1) Just ps -> let !prst = S.drop p prev {-# INLINE more #-} more = go (rehash h (prevAt p) (strAt s)) (p + 1) (s + 1) okay bs = checkFut bs (prst : str : rest) in case filter (okay . (unsafeAt patArr)) ps of [] -> more qs -> seq (length qs) $ (prior + fromIntegral (s - hLen), qs) : more {-# INLINE checkFut #-} checkFut :: S.ByteString -> [S.ByteString] -> Bool checkFut _ [] = False checkFut !bs (!h : t) | hLen < S.length bs = S.isPrefixOf h bs && checkFut (S.drop hLen bs) t | otherwise = S.isPrefixOf bs h where !hLen = S.length h stringsearch-0.3.6.6/Data/ByteString/Lazy/Search/KMP.hs0000644000000000000000000000753112505631536020634 0ustar0000000000000000-- | -- Module : Data.ByteString.Lazy.Search.KMP -- Copyright : Justin Bailey -- Chris Kuklewicz -- Daniel Fischer -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast search of lazy 'L.ByteString' values using the -- Knuth-Morris-Pratt algorithm. -- -- A description of the algorithm can be found at -- . -- -- Original authors: Justin Bailey (jgbailey at gmail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Lazy.Search.KMP (-- * Overview -- $overview -- ** Complexity and Performance -- $complexity -- ** Partial application -- $partial -- * Functions indices , nonOverlappingIndices -- ** Convenience , strictify ) where import Data.ByteString.Search.Internal.KnuthMorrisPratt (matchSL, indicesL) import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import Data.Int (Int64) -- $overview -- -- This module provides two functions for finding the occurrences of a -- pattern in a target string using the Knuth-Morris-Pratt algorithm. -- It exists mostly for systematic reasons, the functions from -- "Data.ByteString.Lazy.Search" are much faster, except for very short -- patterns or long patterns with a short period if overlap is allowed. -- In the latter case, 'indices' from this module may be the best choice -- since the Boyer-Moore function's performance degrades if there are many -- matches and the DFA function's automaton needs much space for long -- patterns. -- In the former case, for some pattern\/target combinations DFA has better -- performance, for others KMP, usually the difference is small. -- $complexity -- -- The preprocessing of the pattern is /O/(@patternLength@) in time and space. -- The time complexity of the searching phase is /O/(@targetLength@) for both -- functions. -- -- In most cases, these functions are considerably slower than the -- Boyer-Moore variants, performance is close to that of those from -- "Data.ByteString.Search.DFA". -- $partial -- -- Both functions can be usefully partially applied. Given only a -- pattern, the auxiliary data will be computed only once, allowing for -- efficient re-use. -- | @'indices'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE indices #-} indices :: S.ByteString -- ^ Strict pattern to find -> L.ByteString -- ^ Lazy string to search -> [Int64] -- ^ Offsets of matches indices = indicesL -- | @'nonOverlappingIndices'@ finds the starting indices of all -- non-overlapping occurrences of the pattern in the target string. -- It is more efficient than removing indices from the list produced -- by 'indices'. {-# INLINE nonOverlappingIndices #-} nonOverlappingIndices :: S.ByteString -- ^ Strict pattern to find -> L.ByteString -- ^ Lazy string to search -> [Int64] -- ^ Offsets of matches nonOverlappingIndices = matchSL -- | @'strictify'@ transforms a lazy 'L.ByteString' into a strict -- 'S.ByteString', to make it a suitable pattern for the searching -- functions. strictify :: L.ByteString -> S.ByteString strictify = S.concat . L.toChunks stringsearch-0.3.6.6/Data/ByteString/Lazy/Search/Internal/0000755000000000000000000000000012505631536021417 5ustar0000000000000000stringsearch-0.3.6.6/Data/ByteString/Lazy/Search/Internal/BoyerMoore.hs0000644000000000000000000011170412505631536024041 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} {-# OPTIONS_HADDOCK hide, prune #-} -- | -- Module : Data.ByteString.Lazy.Search.Internal.BoyerMoore -- Copyright : Daniel Fischer -- Chris Kuklewicz -- Licence : BSD3 -- Maintainer : Daniel Fischer -- Stability : Provisional -- Portability : non-portable (BangPatterns) -- -- Fast overlapping Boyer-Moore search of both strict and lazy -- 'S.ByteString' values. Breaking, splitting and replacing -- using the Boyer-Moore algorithm. -- -- Descriptions of the algorithm can be found at -- -- and -- -- -- Original authors: Daniel Fischer (daniel.is.fischer at googlemail.com) and -- Chris Kuklewicz (haskell at list.mightyreason.com). module Data.ByteString.Lazy.Search.Internal.BoyerMoore ( matchLL , matchSL -- Non-overlapping , matchNOL -- Replacing substrings -- replacing , replaceAllL -- Breaking on substrings -- breaking , breakSubstringL , breakAfterL , breakFindAfterL -- Splitting on substrings -- splitting , splitKeepEndL , splitKeepFrontL , splitDropL ) where import Data.ByteString.Search.Internal.Utils (occurs, suffShifts, ldrop, lsplit, keep, release, strictify) import Data.ByteString.Search.Substitution import qualified Data.ByteString as S import qualified Data.ByteString.Lazy as L import Data.ByteString.Unsafe (unsafeIndex) import Data.Array.Base (unsafeAt) import Data.Word (Word8) import Data.Int (Int64) -- overview -- -- This module exports three search functions for searching in lazy -- ByteSrings, one for searching non-overlapping occurrences of a strict -- pattern, and one each for searchin overlapping occurrences of a strict -- resp. lazy pattern. The common base name is @match@, the suffix -- indicates the type of search. These functions -- return (for a non-empty pattern) a list of all the indices of the target -- string where an occurrence of the pattern begins, if some occurrences -- overlap, all starting indices are reported. The list is produced lazily, -- so not necessarily the entire target string is searched. -- -- The behaviour of these functions when given an empty pattern has changed. -- Formerly, the @matchXY@ functions returned an empty list then, now it's -- @[0 .. 'length' target]@. -- -- Newly added are functions to replace all (non-overlapping) occurrences -- of a pattern within a string, functions to break ByteStrings at the first -- occurrence of a pattern and functions to split ByteStrings at each -- occurrence of a pattern. None of these functions does copying, so they -- don't introduce large memory overhead. -- -- Internally, a lazy pattern is always converted to a strict ByteString, -- which is necessary for an efficient implementation of the algorithm. -- The limit this imposes on the length of the pattern is probably -- irrelevant in practice, but perhaps it should be mentioned. -- This also means that the @matchL*@ functions are mere convenience wrappers. -- Except for the initial 'strictify'ing, there's no difference between lazy -- and strict patterns, they call the same workers. There is, however, a -- difference between strict and lazy target strings. -- For the new functions, no such wrappers are provided, you have to -- 'strictify' lazy patterns yourself. -- caution -- -- When working with a lazy target string, the relation between the pattern -- length and the chunk size can play a big rôle. -- Crossing chunk boundaries is relatively expensive, so when that becomes -- a frequent occurrence, as may happen when the pattern length is close -- to or larger than the chunk size, performance is likely to degrade. -- If it is needed, steps can be taken to ameliorate that effect, but unless -- entirely separate functions are introduced, that would hurt the -- performance for the more common case of patterns much shorter than -- the default chunk size. -- performance -- -- In general, the Boyer-Moore algorithm is the most efficient method to -- search for a pattern inside a string, so most of the time, you'll want -- to use the functions of this module, hence this is where the most work -- has gone. Very short patterns are an exception to this, for those you -- should consider using a finite automaton -- ("Data.ByteString.Search.DFA.Array"). That is also often the better -- choice for searching longer periodic patterns in a lazy ByteString -- with many matches. -- -- Operating on a strict target string is mostly faster than on a lazy -- target string, but the difference is usually small (according to my -- tests). -- -- The known exceptions to this rule of thumb are -- -- [long targets] Then the smaller memory footprint of a lazy target often -- gives (much) better performance. -- -- [high number of matches] When there are very many matches, strict target -- strings are much faster, especially if the pattern is periodic. -- -- If both conditions hold, either may outweigh the other. -- complexity -- -- Preprocessing the pattern is /O/(@patternLength@ + σ) in time and -- space (σ is the alphabet size, 256 here) for all functions. -- The time complexity of the searching phase for @matchXY@ -- is /O/(@targetLength@ \/ @patternLength@) in the best case. -- For non-periodic patterns, the worst case complexity is -- /O/(@targetLength@), but for periodic patterns, the worst case complexity -- is /O/(@targetLength@ * @patternLength@) for the original Boyer-Moore -- algorithm. -- -- The searching functions in this module now contain a modification which -- drastically improves the performance for periodic patterns. -- I believe that for strict target strings, the worst case is now -- /O/(@targetLength@) also for periodic patterns and for lazy target strings, -- my semi-educated guess is -- /O/(@targetLength@ * (1 + @patternLength@ \/ @chunkSize@)). -- I may be very wrong, though. -- -- The other functions don't have to deal with possible overlapping -- patterns, hence the worst case complexity for the processing phase -- is /O/(@targetLength@) (respectively /O/(@firstIndex + patternLength@) -- for the breaking functions if the pattern occurs). -- currying -- -- These functions can all be usefully curried. Given only a pattern -- the curried version will compute the supporting lookup tables only -- once, allowing for efficient re-use. Similarly, the curried -- 'matchLL' and 'matchLS' will compute the concatenated pattern only -- once. -- overflow -- -- The current code uses @Int@ to keep track of the locations in the -- target string. If the length of the pattern plus the length of any -- strict chunk of the target string is greater than -- @'maxBound' :: 'Int'@ then this will overflow causing an error. We -- try to detect this and call 'error' before a segfault occurs. ------------------------------------------------------------------------------ -- Wrappers -- ------------------------------------------------------------------------------ -- matching -- -- These functions find the indices of all (possibly overlapping) -- occurrences of a pattern in a target string. -- If the pattern is empty, the result is @[0 .. length target]@. -- If the pattern is much shorter than the target string -- and the pattern does not occur very near the beginning of the target, -- -- > not . null $ matchSS pattern target -- -- is a much more efficient version of 'S.isInfixOf'. -- | @'matchLL'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- It is a simple wrapper for 'Data.ByteString.Lazy.Search.indices'. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE matchLL #-} matchLL :: L.ByteString -- ^ Lazy pattern -> L.ByteString -- ^ Lazy target string -> [Int64] -- ^ Offsets of matches matchLL pat = search . L.toChunks where search = lazySearcher True (strictify pat) -- | @'matchSL'@ finds the starting indices of all possibly overlapping -- occurrences of the pattern in the target string. -- It is an alias for 'Data.ByteString.Lazy.Search.indices'. -- If the pattern is empty, the result is @[0 .. 'length' target]@. {-# INLINE matchSL #-} matchSL :: S.ByteString -- ^ Strict pattern -> L.ByteString -- ^ Lazy target string -> [Int64] -- ^ Offsets of matches matchSL pat = search . L.toChunks where search = lazySearcher True pat -- | @'matchNOL'@ finds the indices of all non-overlapping occurrences -- of the pattern in the lazy target string. {-# INLINE matchNOL #-} matchNOL :: S.ByteString -- ^ Strict pattern -> L.ByteString -- ^ Lazy target string -> [Int64] -- ^ Offsets of matches matchNOL pat = search . L.toChunks where search = lazySearcher False pat -- replacing -- -- These functions replace all (non-overlapping) occurrences of a pattern -- in the target string. If some occurrences overlap, the earliest is -- replaced and replacing continues at the index after the replaced -- occurrence, for example -- -- > replaceAllL \"ana\" \"olog\" \"banana\" == \"bologna\", -- > replaceAllS \"abacab\" \"u\" \"abacabacabacab\" == \"uacu\", -- > replaceAllS \"aa\" \"aaa\" \"aaaa\" == \"aaaaaa\". -- -- Equality of pattern and substitution is not checked, but -- -- > pat == sub => 'strictify' (replaceAllS pat sub str) == str, -- > pat == sub => replaceAllL pat sub str == str. -- -- The result is a lazily generated lazy ByteString, the first chunks will -- generally be available before the entire target has been scanned. -- If the pattern is empty, but not the substitution, the result is -- equivalent to @'cycle' sub@. {-# INLINE replaceAllL #-} replaceAllL :: Substitution rep => S.ByteString -- ^ Pattern to replace -> rep -- ^ Substitution string -> L.ByteString -- ^ Target string -> L.ByteString -- ^ Lazy result replaceAllL pat | S.null pat = \sub -> prependCycle sub | S.length pat == 1 = let breaker = lazyBreak pat repl subst strs | null strs = [] | otherwise = case breaker strs of (pre, mtch) -> pre ++ case mtch of [] -> [] _ -> subst (repl subst (ldrop 1 mtch)) in \sub -> let repl1 = repl (substitution sub) in L.fromChunks . repl1 . L.toChunks | otherwise = let repl = lazyRepl pat in \sub -> let repl1 = repl (substitution sub) in L.fromChunks . repl1 . L.toChunks -- breaking -- -- Break a string on a pattern. The first component of the result -- contains the prefix of the string before the first occurrence of the -- pattern, the second component contains the remainder. -- The following relations hold: -- -- > breakSubstringX \"\" str = (\"\", str) -- > not (pat `isInfixOf` str) == null (snd $ breakSunbstringX pat str) -- > True == case breakSubstringX pat str of -- > (x, y) -> not (pat `isInfixOf` x) -- > && (null y || pat `isPrefixOf` y) -- | The analogous function for a lazy target string. -- The first component is generated lazily, so parts of it can be -- available before the pattern is detected (or found to be absent). {-# INLINE breakSubstringL #-} breakSubstringL :: S.ByteString -- ^ Pattern to break on -> L.ByteString -- ^ String to break up -> (L.ByteString, L.ByteString) -- ^ Prefix and remainder of broken string breakSubstringL pat = breaker . L.toChunks where lbrk = lazyBreak pat breaker strs = let (f, b) = lbrk strs in (L.fromChunks f, L.fromChunks b) breakAfterL :: S.ByteString -> L.ByteString -> (L.ByteString, L.ByteString) breakAfterL pat | S.null pat = \str -> (L.empty, str) breakAfterL pat = breaker' . L.toChunks where !patLen = S.length pat breaker = lazyBreak pat breaker' strs = let (pre, mtch) = breaker strs (pl, a) = if null mtch then ([],[]) else lsplit patLen mtch in (L.fromChunks (pre ++ pl), L.fromChunks a) breakFindAfterL :: S.ByteString -> L.ByteString -> ((L.ByteString, L.ByteString), Bool) breakFindAfterL pat | S.null pat = \str -> ((L.empty, str), True) breakFindAfterL pat = breaker' . L.toChunks where !patLen = S.length pat breaker = lazyBreak pat breaker' strs = let (pre, mtch) = breaker strs (pl, a) = if null mtch then ([],[]) else lsplit patLen mtch in ((L.fromChunks (pre ++ pl), L.fromChunks a), not (null mtch)) -- splitting -- -- These functions implement various splitting strategies. -- -- If the pattern to split on is empty, all functions return an -- infinite list of empty ByteStrings. -- Otherwise, the names are rather self-explanatory. -- -- For nonempty patterns, the following relations hold: -- -- > concat (splitKeepXY pat str) == str -- > concat ('Data.List.intersperse' pat (splitDropX pat str)) == str. -- -- All fragments except possibly the last in the result of -- @splitKeepEndX pat@ end with @pat@, none of the fragments contains -- more than one occurrence of @pat@ or is empty. -- -- All fragments except possibly the first in the result of -- @splitKeepFrontX pat@ begin with @pat@, none of the fragments -- contains more than one occurrence of @patq or is empty. -- -- > splitDropX pat str == map dropPat (splitKeepFrontX pat str) -- > where -- > patLen = length pat -- > dropPat frag -- > | pat `isPrefixOf` frag = drop patLen frag -- > | otherwise = frag -- -- but @splitDropX@ is a little more efficient than that. {-# INLINE splitKeepEndL #-} splitKeepEndL :: S.ByteString -- ^ Pattern to split on -> L.ByteString -- ^ String to split -> [L.ByteString] -- ^ List of fragments splitKeepEndL pat | S.null pat = const (repeat L.empty) | otherwise = let splitter = lazySplitKeepEnd pat in map L.fromChunks . splitter . L.toChunks {-# INLINE splitKeepFrontL #-} splitKeepFrontL :: S.ByteString -- ^ Pattern to split on -> L.ByteString -- ^ String to split -> [L.ByteString] -- ^ List of fragments splitKeepFrontL pat | S.null pat = const (repeat L.empty) | otherwise = let splitter = lazySplitKeepFront pat in map L.fromChunks . splitter . L.toChunks {-# INLINE splitDropL #-} splitDropL :: S.ByteString -- ^ Pattern to split on -> L.ByteString -- ^ String to split -> [L.ByteString] -- ^ List of fragments splitDropL pat | S.null pat = const (repeat L.empty) | otherwise = let splitter = lazySplitDrop pat in map L.fromChunks . splitter . L.toChunks ------------------------------------------------------------------------------ -- Search Functions -- ------------------------------------------------------------------------------ lazySearcher :: Bool -> S.ByteString -> [S.ByteString] -> [Int64] lazySearcher _ !pat | S.null pat = let zgo !prior [] = [prior] zgo prior (!str : rest) = let !l = S.length str !prior' = prior + fromIntegral l in [prior + fromIntegral i | i <- [0 .. l-1]] ++ zgo prior' rest in zgo 0 | S.length pat == 1 = let !w = S.head pat ixes = S.elemIndices w go _ [] = [] go !prior (!str : rest) = let !prior' = prior + fromIntegral (S.length str) in map ((+ prior) . fromIntegral) (ixes str) ++ go prior' rest in go 0 lazySearcher !overlap pat = searcher where {-# INLINE patAt #-} patAt :: Int -> Word8 patAt !i = unsafeIndex pat i !patLen = S.length pat !patEnd = patLen - 1 {-# INLINE preEnd #-} preEnd = patEnd - 1 !maxLen = maxBound - patLen !occT = occurs pat -- for bad-character-shift !suffT = suffShifts pat -- for good-suffix-shift !skip = if overlap then unsafeAt suffT 0 else patLen -- shift after a complete match !kept = patLen - skip -- length of known prefix after full match !pe = patAt patEnd -- last pattern byte for fast comparison {-# INLINE occ #-} occ !w = unsafeAt occT (fromIntegral w) {-# INLINE suff #-} suff !i = unsafeAt suffT i searcher lst = case lst of [] -> [] (h : t) -> if maxLen < S.length h then error "Overflow in BoyerMoore.lazySearcher" else seek 0 [] h t 0 patEnd -- seek is used to position the "zipper" of (past, str, future) to the -- correct S.ByteString to search. This is done by ensuring that -- 0 <= strPos < strLen, where strPos = diffPos + patPos. -- Note that future is not a strict parameter. The bytes being compared -- will then be (strAt strPos) and (patAt patPos). -- Splitting this into specialised versions is possible, but it would -- only be useful if the pattern length is close to (or larger than) -- the chunk size. For ordinary patterns of at most a few hundred bytes, -- the overhead of yet more code-paths and larger code size will probably -- outweigh the small gains in the relatively rare calls to seek. seek :: Int64 -> [S.ByteString] -> S.ByteString -> [S.ByteString] -> Int -> Int -> [Int64] seek !prior !past !str future !diffPos !patPos | strPos < 0 = -- need to look at previous chunk case past of (h : t) -> let !hLen = S.length h in seek (prior - fromIntegral hLen) t h (str : future) (diffPos + hLen) patPos [] -> error "seek back too far!" | strEnd < strPos = -- need to look at next chunk if there is case future of (h : t) -> let {-# INLINE prior' #-} prior' = prior + fromIntegral strLen !diffPos' = diffPos - strLen {-# INLINE past' #-} past' = release (-diffPos') (str : past) in if maxLen < S.length h then error "Overflow in BoyerMoore.lazySearcher" else seek prior' past' h t diffPos' patPos [] -> [] | patPos == patEnd = checkEnd strPos | diffPos < 0 = matcherN diffPos patPos | otherwise = matcherP diffPos patPos where !strPos = diffPos + patPos !strLen = S.length str !strEnd = strLen - 1 !maxDiff = strLen - patLen {-# INLINE strAt #-} strAt !i = unsafeIndex str i -- While comparing the last byte of the pattern, the bad- -- character-shift is always at least as large as the good- -- suffix-shift. Eliminating the unnecessary memory reads and -- comparison speeds things up noticeably. checkEnd !sI -- index in string to compare to last of pattern | strEnd < sI = seek prior past str future (sI - patEnd) patEnd | otherwise = case strAt sI of !c | c == pe -> if sI < patEnd then case sI of 0 -> seek prior past str future (-patEnd) preEnd _ -> matcherN (sI - patEnd) preEnd else matcherP (sI - patEnd) preEnd | otherwise -> checkEnd (sI + patEnd + occ c) -- Once the last byte has matched, we enter the full matcher -- diff is the offset of the window, patI the index of the -- pattern byte to compare next. -- matcherN is the tight loop that walks backwards from the end -- of the pattern checking for matching bytes. The offset is -- always negative, so no complete match can occur here. -- When a byte matches, we need to check whether we've reached -- the front of this chunk, otherwise whether we need the next. matcherN !diff !patI = case strAt (diff + patI) of !c | c == patAt patI -> if diff + patI == 0 then seek prior past str future diff (patI - 1) else matcherN diff (patI - 1) | otherwise -> let {-# INLINE badShift #-} badShift = patI + occ c {-# INLINE goodShift #-} goodShift = suff patI !diff' = diff + max badShift goodShift in if maxDiff < diff' then seek prior past str future diff' patEnd else checkEnd (diff' + patEnd) -- matcherP is the tight loop for non-negative offsets. -- When the pattern is shifted, we must check whether we leave -- the current chunk, otherwise we only need to check for a -- complete match. matcherP !diff !patI = case strAt (diff + patI) of !c | c == patAt patI -> if patI == 0 then prior + fromIntegral diff : let !diff' = diff + skip in if maxDiff < diff' then seek prior past str future diff' patEnd else if skip == patLen then checkEnd (diff' + patEnd) else afterMatch diff' patEnd else matcherP diff (patI - 1) | otherwise -> let {-# INLINE badShift #-} badShift = patI + occ c {-# INLINE goodShift #-} goodShift = suff patI !diff' = diff + max badShift goodShift in if maxDiff < diff' then seek prior past str future diff' patEnd else checkEnd (diff' + patEnd) -- After a full match, we know how long a prefix of the pattern -- still matches. Do not re-compare the prefix to prevent O(m*n) -- behaviour for periodic patterns. -- This breaks down at chunk boundaries, but except for long -- patterns with a short period, that shouldn't matter much. afterMatch !diff !patI = case strAt (diff + patI) of !c | c == patAt patI -> if patI == kept then prior + fromIntegral diff : let !diff' = diff + skip in if maxDiff < diff' then seek prior past str future diff' patEnd else afterMatch diff' patEnd else afterMatch diff (patI - 1) | patI == patEnd -> checkEnd (diff + (2*patEnd) + occ c) | otherwise -> let {-# INLINE badShift #-} badShift = patI + occ c {-# INLINE goodShift #-} goodShift = suff patI !diff' = diff + max badShift goodShift in if maxDiff < diff' then seek prior past str future diff' patEnd else checkEnd (diff' + patEnd) ------------------------------------------------------------------------------ -- Breaking Functions -- ------------------------------------------------------------------------------ -- Ugh! Code duplication ahead! -- But we want to get the first component lazily, so it's no good to find -- the first index (if any) and then split. -- Therefore bite the bullet and copy most of the code of lazySearcher. -- No need for afterMatch here, fortunately. lazyBreak ::S.ByteString -> [S.ByteString] -> ([S.ByteString], [S.ByteString]) lazyBreak !pat | S.null pat = \lst -> ([],lst) | S.length pat == 1 = let !w = S.head pat go [] = ([], []) go (!str : rest) = case S.elemIndices w str of [] -> let (pre, post) = go rest in (str : pre, post) (i:_) -> if i == 0 then ([], str : rest) else ([S.take i str], S.drop i str : rest) in go lazyBreak pat = breaker where !patLen = S.length pat !patEnd = patLen - 1 !occT = occurs pat !suffT = suffShifts pat !maxLen = maxBound - patLen !pe = patAt patEnd {-# INLINE patAt #-} patAt !i = unsafeIndex pat i {-# INLINE occ #-} occ !w = unsafeAt occT (fromIntegral w) {-# INLINE suff #-} suff !i = unsafeAt suffT i breaker lst = case lst of [] -> ([],[]) (h:t) -> if maxLen < S.length h then error "Overflow in BoyerMoore.lazyBreak" else seek [] h t 0 patEnd seek :: [S.ByteString] -> S.ByteString -> [S.ByteString] -> Int -> Int -> ([S.ByteString], [S.ByteString]) seek !past !str future !offset !patPos | strPos < 0 = case past of [] -> error "not enough past!" (h : t) -> seek t h (str : future) (offset + S.length h) patPos | strEnd < strPos = case future of [] -> (foldr (flip (.) . (:)) id past [str], []) (h : t) -> let !off' = offset - strLen (past', !discharge) = keep (-off') (str : past) in if maxLen < S.length h then error "Overflow in BoyerMoore.lazyBreak (future)" else let (pre,post) = seek past' h t off' patPos in (foldr (flip (.) . (:)) id discharge pre, post) | patPos == patEnd = checkEnd strPos | offset < 0 = matcherN offset patPos | otherwise = matcherP offset patPos where {-# INLINE strAt #-} strAt !i = unsafeIndex str i !strLen = S.length str !strEnd = strLen - 1 !maxOff = strLen - patLen !strPos = offset + patPos checkEnd !sI | strEnd < sI = seek past str future (sI - patEnd) patEnd | otherwise = case strAt sI of !c | c == pe -> if sI < patEnd then (if sI == 0 then seek past str future (-patEnd) (patEnd - 1) else matcherN (sI - patEnd) (patEnd - 1)) else matcherP (sI - patEnd) (patEnd - 1) | otherwise -> checkEnd (sI + patEnd + occ c) matcherN !off !patI = case strAt (off + patI) of !c | c == patAt patI -> if off + patI == 0 then seek past str future off (patI - 1) else matcherN off (patI - 1) | otherwise -> let !off' = off + max (suff patI) (patI + occ c) in if maxOff < off' then seek past str future off' patEnd else checkEnd (off' + patEnd) matcherP !off !patI = case strAt (off + patI) of !c | c == patAt patI -> if patI == 0 then let !pre = if off == 0 then [] else [S.take off str] !post = S.drop off str in (foldr (flip (.) . (:)) id past pre, post:future) else matcherP off (patI - 1) | otherwise -> let !off' = off + max (suff patI) (patI + occ c) in if maxOff < off' then seek past str future off' patEnd else checkEnd (off' + patEnd) ------------------------------------------------------------------------------ -- Splitting Functions -- ------------------------------------------------------------------------------ -- non-empty pattern lazySplitKeepFront :: S.ByteString -> [S.ByteString] -> [[S.ByteString]] lazySplitKeepFront pat = splitter' where !patLen = S.length pat breaker = lazyBreak pat splitter' strs = case splitter strs of ([]:rest) -> rest other -> other splitter [] = [] splitter strs = case breaker strs of (pre, mtch) -> pre : case mtch of [] -> [] _ -> case lsplit patLen mtch of (pt, rst) -> if null rst then [pt] else let (h : t) = splitter rst in (pt ++ h) : t -- non-empty pattern lazySplitKeepEnd :: S.ByteString -> [S.ByteString] -> [[S.ByteString]] lazySplitKeepEnd pat = splitter where !patLen = S.length pat breaker = lazyBreak pat splitter [] = [] splitter strs = case breaker strs of (pre, mtch) -> let (h : t) = if null mtch then [[]] else case lsplit patLen mtch of (pt, rst) -> pt : splitter rst in (pre ++ h) : t lazySplitDrop :: S.ByteString -> [S.ByteString] -> [[S.ByteString]] lazySplitDrop pat = splitter where !patLen = S.length pat breaker = lazyBreak pat splitter [] = [] splitter strs = splitter' strs splitter' [] = [[]] splitter' strs = case breaker strs of (pre,mtch) -> pre : case mtch of [] -> [] _ -> splitter' (ldrop patLen mtch) ------------------------------------------------------------------------------ -- Replacing Functions -- ------------------------------------------------------------------------------ {- These would be really nice. Unfortunately they're too slow, so instead, there's another instance of almost the same code as in lazySearcher below. -- variant of below lazyFRepl :: S.ByteString -> ([S.ByteString] -> [S.ByteString]) -> [S.ByteString] -> [S.ByteString] lazyFRepl pat = repl where !patLen = S.length pat breaker = lazyBreak pat repl sub = replacer where replacer [] = [] replacer strs = let (pre, mtch) = breaker strs in pre ++ case mtch of [] -> [] _ -> sub (replacer (ldrop patLen mtch)) -- This is nice and short. I really hope it's performing well! lazyBRepl :: S.ByteString -> S.ByteString -> [S.ByteString] -> [S.ByteString] lazyBRepl pat !sub = replacer where !patLen = S.length pat breaker = lazyBreak pat replacer [] = [] replacer strs = let (pre, mtch) = breaker strs in pre ++ case mtch of [] -> [] _ -> sub : replacer (ldrop patLen mtch) -} -- Yet more code duplication. -- -- Benchmark it against an implementation using lazyBreak and, -- unless it's significantly faster, NUKE IT!! -- -- Sigh, it is significantly faster. 10 - 25 %. -- I could live with the 10, but 25 is too much. -- -- Hmm, maybe an implementation via -- replace pat sub = L.intercalate sub . split pat -- would be competitive now. -- TODO: test speed and space usage. -- -- replacing loop for lazy ByteStrings as list of chunks, -- called only for non-empty patterns lazyRepl :: S.ByteString -> ([S.ByteString] -> [S.ByteString]) -> [S.ByteString] -> [S.ByteString] lazyRepl pat = replacer where !patLen = S.length pat !patEnd = patLen - 1 !occT = occurs pat !suffT = suffShifts pat !maxLen = maxBound - patLen !pe = patAt patEnd {-# INLINE patAt #-} patAt !i = unsafeIndex pat i {-# INLINE occ #-} occ !w = unsafeAt occT (fromIntegral w) {-# INLINE suff #-} suff !i = unsafeAt suffT i replacer sub lst = case lst of [] -> [] (h:t) -> if maxLen < S.length h then error "Overflow in BoyerMoore.lazyRepl" else seek [] h t 0 patEnd where chop _ [] = [] chop !k (!str : rest) | k < s = if maxLen < (s - k) then error "Overflow in BoyerMoore.lazyRepl (chop)" else seek [] (S.drop k str) rest 0 patEnd | otherwise = chop (k-s) rest where !s = S.length str seek :: [S.ByteString] -> S.ByteString -> [S.ByteString] -> Int -> Int -> [S.ByteString] seek !past !str fut !offset !patPos | strPos < 0 = case past of [] -> error "not enough past!" (h : t) -> seek t h (str : fut) (offset + S.length h) patPos | strEnd < strPos = case fut of [] -> foldr (flip (.) . (:)) id past [str] (h : t) -> let !off' = offset - strLen (past', !discharge) = keep (-off') (str : past) in if maxLen < S.length h then error "Overflow in BoyerMoore.lazyRepl (future)" else foldr (flip (.) . (:)) id discharge $ seek past' h t off' patPos | patPos == patEnd = checkEnd strPos | offset < 0 = matcherN offset patPos | otherwise = matcherP offset patPos where {-# INLINE strAt #-} strAt !i = unsafeIndex str i !strLen = S.length str !strEnd = strLen - 1 !maxOff = strLen - patLen !strPos = offset + patPos checkEnd !sI | strEnd < sI = seek past str fut (sI - patEnd) patEnd | otherwise = case strAt sI of !c | c == pe -> if sI < patEnd then (if sI == 0 then seek past str fut (-patEnd) (patEnd - 1) else matcherN (sI - patEnd) (patEnd - 1)) else matcherP (sI - patEnd) (patEnd - 1) | otherwise -> checkEnd (sI + patEnd + occ c) matcherN !off !patI = case strAt (off + patI) of !c | c == patAt patI -> if off + patI == 0 then seek past str fut off (patI - 1) else matcherN off (patI - 1) | otherwise -> let !off' = off + max (suff patI) (patI + occ c) in if maxOff < off' then seek past str fut off' patEnd else checkEnd (off' + patEnd) matcherP !off !patI = case strAt (off + patI) of !c | c == patAt patI -> if patI == 0 then foldr (flip (.) . (:)) id past $ let pre = if off == 0 then id else (S.take off str :) in pre . sub $ let !p = off + patLen in if p < strLen then seek [] (S.drop p str) fut 0 patEnd else chop (p - strLen) fut else matcherP off (patI - 1) | otherwise -> let !off' = off + max (suff patI) (patI + occ c) in if maxOff < off' then seek past str fut off' patEnd else checkEnd (off' + patEnd)