edit-distance-0.2.1.1/000755 000765 000024 00000000000 12033303026 016173 5ustar00mbolingbrokestaff000000 000000 edit-distance-0.2.1.1/edit-distance.cabal000644 000765 000024 00000005342 12033303026 021700 0ustar00mbolingbrokestaff000000 000000 Name: edit-distance Version: 0.2.1.1 Cabal-Version: >= 1.2 Category: Algorithms Synopsis: Levenshtein and restricted Damerau-Levenshtein edit distances Description: Optimized edit distances for fuzzy matching, including Levenshtein and restricted Damerau-Levenshtein algorithms. License: BSD3 License-File: LICENSE Extra-Source-Files: README.textile Author: Max Bolingbroke Maintainer: batterseapower@hotmail.com Homepage: http://github.com/batterseapower/edit-distance Build-Type: Simple Flag Tests Description: Enable building the tests Default: False Flag Benchmark Description: Enable building the benchmark suite Default: False Flag SplitBase Description: Choose the new smaller, split-up base package Default: True Library Exposed-Modules: Text.EditDistance Other-Modules: Text.EditDistance.EditCosts Text.EditDistance.SquareSTUArray Text.EditDistance.STUArray Text.EditDistance.Bits Text.EditDistance.MonadUtilities if flag(splitBase) Build-Depends: base >= 3 && < 5, array >= 0.1, random >= 1.0, containers >= 0.1.0.1 else Build-Depends: base < 3 Ghc-Options: -O2 Executable edit-distance-tests Main-Is: Text/EditDistance/Tests.hs Extensions: PatternGuards, PatternSignatures, ScopedTypeVariables Ghc-Options: -O2 if !flag(tests) Buildable: False else Build-Depends: test-framework >= 0.1.1, QuickCheck >= 1.1 && < 2.0, test-framework-quickcheck if flag(splitBase) Build-Depends: base >= 3 && < 5, array >= 0.1, random >= 1.0 else Build-Depends: base < 3 Executable edit-distance-benchmark Main-Is: Text/EditDistance/Benchmark.hs if !flag(benchmark) Buildable: False else if flag(splitBase) Build-Depends: base >= 3 && < 5, array >= 0.1, random >= 1.0, old-time >= 1.0, process >= 1.0, parallel >= 1.0, unix >= 2.3 else Build-Depends: base < 3, parallel >= 1.0, unix >= 2.3 Ghc-Options: -O2edit-distance-0.2.1.1/LICENSE000644 000765 000024 00000002766 12033303026 017213 0ustar00mbolingbrokestaff000000 000000 Copyright (c) 2008, Maximilian Bolingbroke All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Maximilian Bolingbroke nor the names of other contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.edit-distance-0.2.1.1/README.textile000644 000765 000024 00000003157 12033303026 020536 0ustar00mbolingbrokestaff000000 000000 h1. Edit Distance Algorithms You can help improve this README with extra snippets and advice by using the "GitHub wiki":http://github.com/batterseapower/edit-distance/wikis/readme. h2. Installing To just install the library:
runghc Setup.lhs configure
runghc Setup.lhs build
sudo runghc Setup.lhs install
If you want to build the tests, to check it's all working:
runghc Setup.lhs configure -ftests
runghc Setup.lhs build
dist/build/edit-distance-tests/edit-distance-tests
h2. Description Edit distances algorithms for fuzzy matching. Specifically, this library provides: * "Levenshtein distance":http://en.wikipedia.org/wiki/Levenshtein_distance * "Restricted Damerau-Levenshtein distance":http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance They have been fairly heavily optimized. Indeed, for situations where one of the strings is under 32 characters long I use a rather neat "bit vector" algorithm: see "the authors paper":http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and "the associated errata":[http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for more information. The algorithms __could__ be faster, but they aren't yet slow enough to force me into improving the situation. h2. Example
Text.EditDistance> levenshteinDistance defaultEditCosts "witch" "kitsch"
2
h2. Linkage * "Hackage":http://hackage.haskell.org/cgi-bin/hackage-scripts/package/edit-distance * "Bug Tracker":http://bsp.lighthouseapp.com/projects/14822-hs-edit-distance * "GitHub":http://github.com/batterseapower/edit-distanceedit-distance-0.2.1.1/Setup.lhs000644 000765 000024 00000000115 12033303026 020000 0ustar00mbolingbrokestaff000000 000000 #! /usr/bin/env runhaskell > import Distribution.Simple > main = defaultMainedit-distance-0.2.1.1/Text/000755 000765 000024 00000000000 12033303026 017117 5ustar00mbolingbrokestaff000000 000000 edit-distance-0.2.1.1/Text/EditDistance/000755 000765 000024 00000000000 12033303026 021457 5ustar00mbolingbrokestaff000000 000000 edit-distance-0.2.1.1/Text/EditDistance.hs000644 000765 000024 00000005572 12033303026 022024 0ustar00mbolingbrokestaff000000 000000 {-# LANGUAGE PatternGuards #-} -- | Computing the edit distances between strings module Text.EditDistance ( Costs(..), EditCosts(..), defaultEditCosts, levenshteinDistance, restrictedDamerauLevenshteinDistance ) where import Text.EditDistance.EditCosts import qualified Text.EditDistance.Bits as Bits import qualified Text.EditDistance.STUArray as STUArray import qualified Text.EditDistance.SquareSTUArray as SquareSTUArray -- | Find the Levenshtein edit distance between two strings. That is to say, the number of deletion, -- insertion and substitution operations that are required to make the two strings equal. Note that -- this algorithm therefore does not make use of the 'transpositionCost' field of the costs. See also: -- . levenshteinDistance :: EditCosts -> String -> String -> Int levenshteinDistance costs str1 str2 | isDefaultEditCosts costs , not (betterNotToUseBits str1_len || betterNotToUseBits str2_len) -- The Integer implementation of the Bits algorithm is quite inefficient, but scales better = Bits.levenshteinDistanceWithLengths str1_len str2_len str1 str2 -- than the STUArrays. The Word32 implementation is always better, if it is applicable | otherwise = STUArray.levenshteinDistanceWithLengths costs str1_len str2_len str1 str2 -- STUArray always beat making more allocations with SquareSTUArray for Levenhstein where str1_len = length str1 str2_len = length str2 betterNotToUseBits len = len >= 33 && len <= 82 -- Upper bound determined experimentally -- | Find the "restricted" Damerau-Levenshtein edit distance between two strings. This algorithm calculates the cost of -- the so-called optimal string alignment, which does not always equal the appropriate edit distance. The cost of the optimal -- string alignment is the number of edit operations needed to make the input strings equal under the condition that no substring -- is edited more than once. See also: . restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int restrictedDamerauLevenshteinDistance costs str1 str2 | isDefaultEditCosts costs , not (betterNotToUseBits str1_len || betterNotToUseBits str2_len) -- The Integer implementation of the Bits algorithm is quite inefficient, but scales better = Bits.restrictedDamerauLevenshteinDistanceWithLengths str1_len str2_len str1 str2 -- than the STUArrays. The Word32 implementation is always better, if it is applicable | otherwise = SquareSTUArray.restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 -- SquareSTUArray usually beat making more use of the heap with STUArray for Damerau where str1_len = length str1 str2_len = length str2 betterNotToUseBits len = len >= 33 && len <= 45 -- Upper bound determined experimentallyedit-distance-0.2.1.1/Text/EditDistance/Benchmark.hs000644 000765 000024 00000007067 12033303026 023717 0ustar00mbolingbrokestaff000000 000000 module Main where import Text.EditDistance.EditCosts import qualified Text.EditDistance.Bits as Bits import qualified Text.EditDistance.STUArray as STUArray import qualified Text.EditDistance.SquareSTUArray as SquareSTUArray import System.IO import System.Exit --import System.Posix.IO import System.Time ( ClockTime(..), getClockTime ) import System.Random import System.Process import Data.List import Control.Monad import Control.Exception --import Control.Concurrent ( forkIO, threadDelay ) import Control.Parallel.Strategies ( NFData, rnf ) sTRING_SIZE_STEP, mAX_STRING_SIZE :: Int sTRING_SIZE_STEP = 3 mAX_STRING_SIZE = 108 time :: IO a -> IO Float time action = do TOD s1 ps1 <- getClockTime action TOD s2 ps2 <- getClockTime return $ (fromIntegral (s2 - s1) + (fromIntegral (ps2 - ps1) / 10^(12 :: Int))) augment :: Monad m => (a -> m b) -> [a] -> m [(a, b)] augment fx xs = liftM (zip xs) $ mapM fx xs sample :: NFData a => (String -> String -> a) -> (Int, Int) -> IO Float sample distance bounds@(i, j) = do -- Generate two random strings of length i and j gen <- newStdGen let (string1, string2_long) = splitAt i (randoms gen) string2 = take j string2_long -- Force the two strings to be evaluated so they don't meddle -- with the benchmarking evaluate (rnf string1) evaluate (rnf string2) -- Our sample is the time taken to find the edit distance putStrLn $ "Sampling " ++ show bounds time $ loop 1000 $ evaluate (distance string1 string2) loop :: Monad m => Int -> m a -> m () loop n act = sequence_ (replicate n act) joinOnKey :: Eq a => [(a, b)] -> [(a, c)] -> [(a, (b, c))] joinOnKey xs ys = [(x_a, (x_b, y_c)) | (x_a, x_b) <- xs, (y_a, y_c) <- ys, x_a == y_a] gnuPlotScript :: String gnuPlotScript = "set term postscript eps enhanced color\n\ \set output \"data.ps\"\n\ \#unset key\n\ \set dgrid3d\n\ \set hidden3d\n\ \#set pm3d map\n\ \#splot \"data.plot\" using 1:2:3\n\ \splot \"data.plot\" using 1:2:3 title \"Bits\" with lines, \"data.plot\" using 1:2:4 title \"STUArray\" with lines, \"data.plot\" using 1:2:5 title \"SquareSTUArray\" with lines\n\ \quit\n" toGnuPlotFormat :: (Show a, Show b, Show c) => [((a, b), [c])] -> String toGnuPlotFormat samples = unlines (header : map sampleToGnuPlotFormat samples) where first_cs = snd $ head samples header = "#\tX\tY" ++ concat (replicate (length first_cs) "\tZ") sampleToGnuPlotFormat ((a, b), cs) = concat $ intersperse "\t" $ [show a, show b] ++ map show cs main :: IO () main = do let sample_range = [(i, j) | i <- [0,sTRING_SIZE_STEP..mAX_STRING_SIZE] , j <- [0,sTRING_SIZE_STEP..mAX_STRING_SIZE]] bits_samples <- augment (sample $ Bits.levenshteinDistance) sample_range sqstu_samples <- augment (sample $ SquareSTUArray.levenshteinDistance defaultEditCosts) sample_range stu_samples <- augment (sample $ STUArray.levenshteinDistance defaultEditCosts) sample_range let paired_samples = bits_samples `joinOnKey` (stu_samples `joinOnKey` sqstu_samples) listified_samples = [((i, j), [a, b, c]) | ((i, j), (a, (b, c))) <- paired_samples] writeFile "data.plot" (toGnuPlotFormat listified_samples) writeFile "plot.script" gnuPlotScript (_inp, _outp, _err, gp_pid) <- runInteractiveCommand "(cat plot.script | gnuplot); RETCODE=$?; rm plot.script; exit $RETCODE" gp_exit_code <- waitForProcess gp_pid case gp_exit_code of ExitSuccess -> putStrLn "Plotted at 'data.ps'" ExitFailure err_no -> putStrLn $ "Failed! Error code " ++ show err_noedit-distance-0.2.1.1/Text/EditDistance/Bits.hs000644 000765 000024 00000030460 12033303026 022717 0ustar00mbolingbrokestaff000000 000000 {-# LANGUAGE PatternGuards, PatternSignatures, ScopedTypeVariables, BangPatterns #-} module Text.EditDistance.Bits ( levenshteinDistance, levenshteinDistanceWithLengths, {-levenshteinDistanceCutoff,-} restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths ) where import Data.Bits import Data.Char import Data.Word import Data.List import qualified Data.IntMap as IM --import Debug.Trace --type BitVector = Integer -- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro). -- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation levenshteinDistance :: String -> String -> Int levenshteinDistance str1 str2 = levenshteinDistanceWithLengths m n str1 str2 where m = length str1 n = length str2 levenshteinDistanceWithLengths :: Int -> Int -> String -> String -> Int levenshteinDistanceWithLengths !m !n str1 str2 | m <= n = if n <= 32 -- n must be larger so this check is sufficient then levenshteinDistance' (undefined :: Word32) m n str1 str2 else levenshteinDistance' (undefined :: Integer) m n str1 str2 | otherwise = if m <= 32 -- m must be larger so this check is sufficient then levenshteinDistance' (undefined :: Word32) n m str2 str1 else levenshteinDistance' (undefined :: Integer) n m str2 str1 {-# SPECIALIZE INLINE levenshteinDistance' :: Word32 -> Int -> Int -> String -> String -> Int #-} {-# SPECIALIZE INLINE levenshteinDistance' :: Integer -> Int -> Int -> String -> String -> Int #-} levenshteinDistance' :: (Num bv, Bits bv) => bv -> Int -> Int -> String -> String -> Int levenshteinDistance' (_bv_dummy :: bv) !m !n str1 str2 | [] <- str1 = n | otherwise = extractAnswer $ foldl' (levenshteinDistanceWorker (matchVectors str1) top_bit_mask vector_mask) (m_ones, 0, m) str2 where m_ones@vector_mask = (2 ^ m) - 1 top_bit_mask = 1 `shiftL` (m - 1) :: bv extractAnswer (_, _, distance) = distance {-# SPECIALIZE levenshteinDistanceWorker :: IM.IntMap Word32 -> Word32 -> Word32 -> (Word32, Word32, Int) -> Char -> (Word32, Word32, Int) #-} {-# SPECIALIZE levenshteinDistanceWorker :: IM.IntMap Integer -> Integer -> Integer -> (Integer, Integer, Int) -> Char -> (Integer, Integer, Int) #-} levenshteinDistanceWorker :: (Num bv, Bits bv) => IM.IntMap bv -> bv -> bv -> (bv, bv, Int) -> Char -> (bv, bv, Int) levenshteinDistanceWorker !str1_mvs !top_bit_mask !vector_mask (!vp, !vn, !distance) !char2 = {- trace (unlines ["pm = " ++ show pm' ,"d0 = " ++ show d0' ,"hp = " ++ show hp' ,"hn = " ++ show hn' ,"vp = " ++ show vp' ,"vn = " ++ show vn' ,"distance' = " ++ show distance' ,"distance'' = " ++ show distance'']) -} (vp', vn', distance'') where pm' = IM.findWithDefault 0 (ord char2) str1_mvs d0' = ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn hp' = vn .|. sizedComplement vector_mask (d0' .|. vp) hn' = d0' .&. vp hp'_shift = ((hp' `shiftL` 1) .|. 1) .&. vector_mask hn'_shift = (hn' `shiftL` 1) .&. vector_mask vp' = hn'_shift .|. sizedComplement vector_mask (d0' .|. hp'_shift) vn' = d0' .&. hp'_shift distance' = if hp' .&. top_bit_mask /= 0 then distance + 1 else distance distance'' = if hn' .&. top_bit_mask /= 0 then distance' - 1 else distance' {- -- Just can't get this working! -- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro). -- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation levenshteinDistanceCutoff :: Int -> String -> String -> Int levenshteinDistanceCutoff cutoff str1 str2 | length str1 <= length str2 = levenshteinDistanceCutoff' cutoff str1 str2 | otherwise = levenshteinDistanceCutoff' cutoff str2 str1 levenshteinDistanceCutoff' :: Int -> String -> String -> Int levenshteinDistanceCutoff' cutoff str1 str2 | [] <- str1 = n | otherwise = extractAnswer $ foldl' (levenshteinDistanceCutoffFlatWorker (matchVectors str1)) (foldl' (levenshteinDistanceCutoffDiagWorker (matchVectors str1)) (top_bit_mask, vector_mask, all_ones, 0, initial_pm_offset, initial_dist) str2_diag) str2_flat where m = length str1 n = length str2 vector_length = if testBit bottom_factor 0 then cutoff -- Odd else cutoff + 1 -- Even all_ones@vector_mask = (2 ^ vector_length) - 1 top_bit_mask = trace (show bottom_factor ++ ", " ++ show vector_length) $ 1 `shiftL` (vector_length - 1) extractAnswer (_, _, _, _, _, distance) = distance len_difference = n - m top_factor = cutoff + len_difference bottom_factor = cutoff - len_difference bottom_factor_shift = (bottom_factor `shiftR` 1) initial_dist = bottom_factor_shift -- The distance the virtual first vector ended on initial_pm_offset = (top_factor `shiftR` 1) -- The amount of left shift to apply to the >next< pattern match vector diag_threshold = negate bottom_factor_shift + m -- The index in str2 where we stop going diagonally down and start going across (str2_diag, str2_flat) = splitAt diag_threshold str2 levenshteinDistanceCutoffDiagWorker :: IM.IntMap BitVector -> (BitVector, BitVector, BitVector, BitVector, Int, Int) -> Char -> (BitVector, BitVector, BitVector, BitVector, Int, Int) levenshteinDistanceCutoffDiagWorker !str1_mvs (!top_bit_mask, !vector_mask, !vp, !vn, !pm_offset, !distance) !char2 = trace (unlines ["vp = " ++ show vp ,"vn = " ++ show vn ,"vector_mask = " ++ show vector_mask ,"pm_offset = " ++ show pm_offset ,"unshifted_pm = " ++ show unshifted_pm ,"pm' = " ++ show pm' ,"d0' = " ++ show d0' ,"hp' = " ++ show hp' ,"hn' = " ++ show hn' ,"vp' = " ++ show vp' ,"vn' = " ++ show vn' ,"distance' = " ++ show distance']) (top_bit_mask, vector_mask, vp', vn', pm_offset - 1, distance') where unshifted_pm = IM.findWithDefault 0 (ord char2) str1_mvs pm' = (unshifted_pm `shift` pm_offset) .&. vector_mask d0' = ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn hp' = vn .|. sizedComplement vector_mask (d0' .|. vp) hn' = d0' .&. vp d0'_shift = d0' `shiftR` 1 vp' = hn' .|. sizedComplement vector_mask (d0'_shift .|. hp') vn' = d0'_shift .&. hp' distance' = if d0' .&. top_bit_mask /= 0 then distance else distance + 1 levenshteinDistanceCutoffFlatWorker :: IM.IntMap BitVector -> (BitVector, BitVector, BitVector, BitVector, Int, Int) -> Char -> (BitVector, BitVector, BitVector, BitVector, Int, Int) levenshteinDistanceCutoffFlatWorker !str1_mvs (!top_bit_mask, !vector_mask, !vp, !vn, !pm_offset, !distance) !char2 = trace (unlines ["pm_offset = " ++ show pm_offset ,"top_bit_mask' = " ++ show top_bit_mask' ,"vector_mask' = " ++ show vector_mask' ,"pm = " ++ show pm' ,"d0 = " ++ show d0' ,"hp = " ++ show hp' ,"hn = " ++ show hn' ,"vp = " ++ show vp' ,"vn = " ++ show vn' ,"distance' = " ++ show distance' ,"distance'' = " ++ show distance'']) (top_bit_mask', vector_mask', vp', vn', pm_offset - 1, distance'') where top_bit_mask' = top_bit_mask `shiftR` 1 vector_mask' = vector_mask `shiftR` 1 pm' = (IM.findWithDefault 0 (ord char2) str1_mvs `rotate` pm_offset) .&. vector_mask' d0' = ((((pm' .&. vp) + vp) `xor` vp) .|. pm' .|. vn) .&. vector_mask' hp' = vn .|. sizedComplement vector_mask' (d0' .|. vp) hn' = d0' .&. vp d0'_shift = d0' `shiftR` 1 vp' = hn' .|. sizedComplement vector_mask' (d0'_shift .|. hp') vn' = d0'_shift .&. hp' distance' = if hp' .&. top_bit_mask' /= 0 then distance + 1 else distance distance'' = if hn' .&. top_bit_mask' /= 0 then distance' - 1 else distance' -} -- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro). -- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation restrictedDamerauLevenshteinDistance :: String -> String -> Int restrictedDamerauLevenshteinDistance str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths m n str1 str2 where m = length str1 n = length str2 restrictedDamerauLevenshteinDistanceWithLengths :: Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistanceWithLengths !m !n str1 str2 | m <= n = if n <= 32 -- n must be larger so this check is sufficient then restrictedDamerauLevenshteinDistance' (undefined :: Word32) m n str1 str2 else restrictedDamerauLevenshteinDistance' (undefined :: Integer) m n str1 str2 | otherwise = if m <= 32 -- m must be larger so this check is sufficient then restrictedDamerauLevenshteinDistance' (undefined :: Word32) n m str2 str1 else restrictedDamerauLevenshteinDistance' (undefined :: Integer) n m str2 str1 {-# SPECIALIZE INLINE restrictedDamerauLevenshteinDistance' :: Word32 -> Int -> Int -> String -> String -> Int #-} {-# SPECIALIZE INLINE restrictedDamerauLevenshteinDistance' :: Integer -> Int -> Int -> String -> String -> Int #-} restrictedDamerauLevenshteinDistance' :: (Num bv, Bits bv) => bv -> Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistance' (_bv_dummy :: bv) !m !n str1 str2 | [] <- str1 = n | otherwise = extractAnswer $ foldl' (restrictedDamerauLevenshteinDistanceWorker (matchVectors str1) top_bit_mask vector_mask) (0, 0, m_ones, 0, m) str2 where m_ones@vector_mask = (2 ^ m) - 1 top_bit_mask = 1 `shiftL` (m - 1) :: bv extractAnswer (_, _, _, _, distance) = distance {-# SPECIALIZE restrictedDamerauLevenshteinDistanceWorker :: IM.IntMap Word32 -> Word32 -> Word32 -> (Word32, Word32, Word32, Word32, Int) -> Char -> (Word32, Word32, Word32, Word32, Int) #-} {-# SPECIALIZE restrictedDamerauLevenshteinDistanceWorker :: IM.IntMap Integer -> Integer -> Integer -> (Integer, Integer, Integer, Integer, Int) -> Char -> (Integer, Integer, Integer, Integer, Int) #-} restrictedDamerauLevenshteinDistanceWorker :: (Num bv, Bits bv) => IM.IntMap bv -> bv -> bv -> (bv, bv, bv, bv, Int) -> Char -> (bv, bv, bv, bv, Int) restrictedDamerauLevenshteinDistanceWorker !str1_mvs !top_bit_mask !vector_mask (!pm, !d0, !vp, !vn, !distance) !char2 = (pm', d0', vp', vn', distance'') where pm' = IM.findWithDefault 0 (ord char2) str1_mvs d0' = ((((sizedComplement vector_mask d0) .&. pm') `shiftL` 1) .&. pm) -- No need to mask the shiftL because of the restricted range of pm .|. ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn hp' = vn .|. sizedComplement vector_mask (d0' .|. vp) hn' = d0' .&. vp hp'_shift = ((hp' `shiftL` 1) .|. 1) .&. vector_mask hn'_shift = (hn' `shiftL` 1) .&. vector_mask vp' = hn'_shift .|. sizedComplement vector_mask (d0' .|. hp'_shift) vn' = d0' .&. hp'_shift distance' = if hp' .&. top_bit_mask /= 0 then distance + 1 else distance distance'' = if hn' .&. top_bit_mask /= 0 then distance' - 1 else distance' {-# SPECIALIZE INLINE sizedComplement :: Word32 -> Word32 -> Word32 #-} {-# SPECIALIZE INLINE sizedComplement :: Integer -> Integer -> Integer #-} sizedComplement :: (Num bv, Bits bv) => bv -> bv -> bv sizedComplement vector_mask vect = vector_mask `xor` vect {-# SPECIALIZE matchVectors :: String -> IM.IntMap Word32 #-} {-# SPECIALIZE matchVectors :: String -> IM.IntMap Integer #-} matchVectors :: (Num bv, Bits bv) => String -> IM.IntMap bv matchVectors = snd . foldl' go (0 :: Int, IM.empty) where go (!ix, !im) char = let ix' = ix + 1 im' = IM.insertWith (.|.) (ord char) (2 ^ ix) im in (ix', im')edit-distance-0.2.1.1/Text/EditDistance/EditCosts.hs000644 000765 000024 00000004000 12033303026 023706 0ustar00mbolingbrokestaff000000 000000 {-# OPTIONS_GHC -funbox-strict-fields #-} module Text.EditDistance.EditCosts ( Costs(..), EditCosts(..), deletionCost, insertionCost, substitutionCost, transpositionCost, defaultEditCosts, isDefaultEditCosts ) where data Costs a = ConstantCost !Int | VariableCost (a -> Int) {-# INLINE cost #-} cost :: Costs a -> a -> Int cost (ConstantCost i) _ = i cost (VariableCost f) x = f x data EditCosts = EditCosts { deletionCosts :: Costs Char, -- ^ Cost of deleting the specified character from the left string insertionCosts :: Costs Char, -- ^ Cost of inserting the specified characters into the right string substitutionCosts :: Costs (Char, Char), -- ^ Cost of substituting a character from the left string with one from the right string -- with arguments in that order. transpositionCosts :: Costs (Char, Char) -- ^ Cost of moving one character backwards and the other forwards -- with arguments in that order. } {-# INLINE deletionCost #-} deletionCost :: EditCosts -> Char -> Int deletionCost ec deleted = cost (deletionCosts ec) deleted {-# INLINE insertionCost #-} insertionCost :: EditCosts -> Char -> Int insertionCost ec inserted = cost (insertionCosts ec) inserted {-# INLINE substitutionCost #-} substitutionCost :: EditCosts -> Char -> Char -> Int substitutionCost ec old new = cost (substitutionCosts ec) (old, new) {-# INLINE transpositionCost #-} transpositionCost :: EditCosts -> Char -> Char -> Int transpositionCost ec backwards forwards = cost (transpositionCosts ec) (backwards, forwards) defaultEditCosts :: EditCosts defaultEditCosts = EditCosts { deletionCosts = ConstantCost 1, insertionCosts = ConstantCost 1, substitutionCosts = ConstantCost 1, transpositionCosts = ConstantCost 1 } isDefaultEditCosts :: EditCosts -> Bool isDefaultEditCosts (EditCosts { deletionCosts = ConstantCost 1, insertionCosts = ConstantCost 1, substitutionCosts = ConstantCost 1, transpositionCosts = ConstantCost 1 }) = True isDefaultEditCosts _ = Falseedit-distance-0.2.1.1/Text/EditDistance/MonadUtilities.hs000644 000765 000024 00000000421 12033303026 024742 0ustar00mbolingbrokestaff000000 000000 {-# LANGUAGE BangPatterns #-} module Text.EditDistance.MonadUtilities where {-# INLINE loopM_ #-} loopM_ :: Monad m => Int -> Int -> (Int -> m ()) -> m () loopM_ !from !to action | from > to = return () | otherwise = do action from loopM_ (from + 1) to actionedit-distance-0.2.1.1/Text/EditDistance/SquareSTUArray.hs000644 000765 000024 00000015557 12033303026 024663 0ustar00mbolingbrokestaff000000 000000 {-# LANGUAGE PatternGuards, ScopedTypeVariables, BangPatterns #-} module Text.EditDistance.SquareSTUArray ( levenshteinDistance, levenshteinDistanceWithLengths, restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths ) where import Text.EditDistance.EditCosts import Text.EditDistance.MonadUtilities import Control.Monad import Control.Monad.ST import Data.Array.ST levenshteinDistance :: EditCosts -> String -> String -> Int levenshteinDistance !costs str1 str2 = levenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 levenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int levenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (levenshteinDistanceST costs str1_len str2_len str1 str2) levenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int levenshteinDistanceST !costs !str1_len !str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs. Say we index it by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. cost_array <- newArray_ ((0, 0), (str1_len, str2_len)) :: ST s (STUArray s (Int, Int) Int) -- Fill out the first row (j = 0) _ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (!i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_array (i, 0) deletion_cost' >> return deletion_cost' -- Fill the remaining rows (j >= 1) _ <- (\f -> foldM f 0 [1..str2_len]) $ \insertion_cost (!j) -> do row_char <- readArray str2_array j -- Initialize the first element of the row (i = 0) let insertion_cost' = insertion_cost + insertionCost costs row_char writeArray cost_array (0, j) insertion_cost' -- Fill the remaining elements of the row (i >= 1) loopM_ 1 str1_len $ \(!i) -> do col_char <- readArray str1_array i cost <- standardCosts costs cost_array row_char col_char (i, j) writeArray cost_array (i, j) cost return insertion_cost' -- Return an actual answer readArray cost_array (str1_len, str2_len) restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int restrictedDamerauLevenshteinDistance costs str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 restrictedDamerauLevenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 = runST (restrictedDamerauLevenshteinDistanceST costs str1_len str2_len str1 str2) restrictedDamerauLevenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int restrictedDamerauLevenshteinDistanceST !costs str1_len str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs. Say we index it by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. cost_array <- newArray_ ((0, 0), (str1_len, str2_len)) :: ST s (STUArray s (Int, Int) Int) -- Fill out the first row (j = 0) _ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (!i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_array (i, 0) deletion_cost' >> return deletion_cost' -- Fill out the second row (j = 1) when (str2_len > 0) $ do initial_row_char <- readArray str2_array 1 -- Initialize the first element of the second row (i = 0) writeArray cost_array (0, 1) (insertionCost costs initial_row_char) -- Initialize the remaining elements of the row (i >= 1) loopM_ 1 str1_len $ \(!i) -> do col_char <- readArray str1_array i cost <- standardCosts costs cost_array initial_row_char col_char (i, 1) writeArray cost_array (i, 1) cost -- Fill the remaining rows (j >= 2) loopM_ 2 str2_len (\(!j) -> do row_char <- readArray str2_array j prev_row_char <- readArray str2_array (j - 1) -- Initialize the first element of the row (i = 0) writeArray cost_array (0, j) (insertionCost costs row_char * j) -- Initialize the second element of the row (i = 1) when (str1_len > 0) $ do col_char <- readArray str1_array 1 cost <- standardCosts costs cost_array row_char col_char (1, j) writeArray cost_array (1, j) cost -- Fill the remaining elements of the row (i >= 2) loopM_ 2 str1_len (\(!i) -> do col_char <- readArray str1_array i prev_col_char <- readArray str1_array (i - 1) standard_cost <- standardCosts costs cost_array row_char col_char (i, j) cost <- if prev_row_char == col_char && prev_col_char == row_char then do transpose_cost <- fmap (+ (transpositionCost costs col_char row_char)) $ readArray cost_array (i - 2, j - 2) return (standard_cost `min` transpose_cost) else return standard_cost writeArray cost_array (i, j) cost)) -- Return an actual answer readArray cost_array (str1_len, str2_len) {-# INLINE standardCosts #-} standardCosts :: EditCosts -> STUArray s (Int, Int) Int -> Char -> Char -> (Int, Int) -> ST s Int standardCosts !costs !cost_array !row_char !col_char (!i, !j) = do deletion_cost <- fmap (+ (deletionCost costs col_char)) $ readArray cost_array (i - 1, j) insertion_cost <- fmap (+ (insertionCost costs row_char)) $ readArray cost_array (i, j - 1) subst_cost <- fmap (+ if row_char == col_char then 0 else (substitutionCost costs col_char row_char)) (readArray cost_array (i - 1, j - 1)) return $ deletion_cost `min` insertion_cost `min` subst_cost {-# INLINE stringToArray #-} stringToArray :: String -> Int -> ST s (STUArray s Int Char) stringToArray str !str_len = do array <- newArray_ (1, str_len) forM_ (zip [1..] str) (uncurry (writeArray array)) return array {- showArray :: STUArray s (Int, Int) Int -> ST s String showArray array = do ((il, jl), (iu, ju)) <- getBounds array flip (flip foldM "") [(i, j) | i <- [il..iu], j <- [jl.. ju]] $ \rest (i, j) -> do elt <- readArray array (i, j) return $ rest ++ show (i, j) ++ ": " ++ show elt ++ ", " -} edit-distance-0.2.1.1/Text/EditDistance/STUArray.hs000644 000765 000024 00000021365 12033303026 023474 0ustar00mbolingbrokestaff000000 000000 {-# LANGUAGE PatternGuards, ScopedTypeVariables, BangPatterns #-} module Text.EditDistance.STUArray ( levenshteinDistance, levenshteinDistanceWithLengths, restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths ) where import Text.EditDistance.EditCosts import Text.EditDistance.MonadUtilities import Control.Monad import Control.Monad.ST import Data.Array.ST levenshteinDistance :: EditCosts -> String -> String -> Int levenshteinDistance !costs str1 str2 = levenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 levenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int levenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (levenshteinDistanceST costs str1_len str2_len str1 str2) levenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int levenshteinDistanceST !costs !str1_len !str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs for a single row. Say we index costs by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. We can get away with just storing a single -- row of costs at a time, but we use two because it turns out to be faster cost_row <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) cost_row' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) -- Fill out the first row (j = 0) _ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_row i deletion_cost' >> return deletion_cost' -- Fill out the remaining rows (j >= 1) (_, final_row, _) <- foldM (levenshteinDistanceSTRowWorker costs str1_len str1_array str2_array) (0, cost_row, cost_row') [1..str2_len] -- Return an actual answer readArray final_row str1_len levenshteinDistanceSTRowWorker :: EditCosts -> Int -> STUArray s Int Char -> STUArray s Int Char -> (Int, STUArray s Int Int, STUArray s Int Int) -> Int -> ST s (Int, STUArray s Int Int, STUArray s Int Int) levenshteinDistanceSTRowWorker !costs !str1_len !str1_array !str2_array (!insertion_cost, !cost_row, !cost_row') !j = do row_char <- readArray str2_array j -- Initialize the first element of the row (i = 0) let insertion_cost' = insertion_cost + insertionCost costs row_char writeArray cost_row' 0 insertion_cost' -- Fill the remaining elements of the row (i >= 1) loopM_ 1 str1_len (colWorker row_char) return (insertion_cost', cost_row', cost_row) where colWorker row_char !i = do col_char <- readArray str1_array i left_up <- readArray cost_row (i - 1) left <- readArray cost_row' (i - 1) here_up <- readArray cost_row i let here = standardCosts costs row_char col_char left left_up here_up writeArray cost_row' i here restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int restrictedDamerauLevenshteinDistance !costs str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 restrictedDamerauLevenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (restrictedDamerauLevenshteinDistanceST costs str1_len str2_len str1 str2) restrictedDamerauLevenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int restrictedDamerauLevenshteinDistanceST !costs str1_len str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs for a single row. Say we index costs by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. We can get away with just storing two -- rows of costs at a time, but I use three because it turns out to be faster cost_row <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) -- Fill out the first row (j = 0) _ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (!i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_row i deletion_cost' >> return deletion_cost' if (str2_len == 0) then readArray cost_row str1_len else do -- We defer allocation of these arrays to here because they aren't used in the other branch cost_row' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) cost_row'' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) -- Fill out the second row (j = 1) row_char <- readArray str2_array 1 -- Initialize the first element of the row (i = 0) let zero = insertionCost costs row_char writeArray cost_row' 0 zero -- Fill the remaining elements of the row (i >= 1) loopM_ 1 str1_len (firstRowColWorker str1_array row_char cost_row cost_row') -- Fill out the remaining rows (j >= 2) (_, _, final_row, _, _) <- foldM (restrictedDamerauLevenshteinDistanceSTRowWorker costs str1_len str1_array str2_array) (zero, cost_row, cost_row', cost_row'', row_char) [2..str2_len] -- Return an actual answer readArray final_row str1_len where firstRowColWorker !str1_array !row_char !cost_row !cost_row' !i = do col_char <- readArray str1_array i left_up <- readArray cost_row (i - 1) left <- readArray cost_row' (i - 1) here_up <- readArray cost_row i let here = standardCosts costs row_char col_char left left_up here_up writeArray cost_row' i here restrictedDamerauLevenshteinDistanceSTRowWorker :: EditCosts -> Int -> STUArray s Int Char -> STUArray s Int Char -- String arrays -> (Int, STUArray s Int Int, STUArray s Int Int, STUArray s Int Int, Char) -> Int -- Incoming rows of the matrix in recency order -> ST s (Int, STUArray s Int Int, STUArray s Int Int, STUArray s Int Int, Char) -- Outgoing rows of the matrix in recency order restrictedDamerauLevenshteinDistanceSTRowWorker !costs !str1_len !str1_array !str2_array (!insertion_cost, !cost_row, !cost_row', !cost_row'', !prev_row_char) !j = do row_char <- readArray str2_array j -- Initialize the first element of the row (i = 0) zero_up <- readArray cost_row' 0 let insertion_cost' = insertion_cost + insertionCost costs row_char writeArray cost_row'' 0 insertion_cost' -- Initialize the second element of the row (i = 1) when (str1_len > 0) $ do col_char <- readArray str1_array 1 one_up <- readArray cost_row' 1 let one = standardCosts costs row_char col_char insertion_cost' zero_up one_up writeArray cost_row'' 1 one -- Fill the remaining elements of the row (i >= 2) loopM_ 2 str1_len (colWorker row_char) return (insertion_cost', cost_row', cost_row'', cost_row, row_char) where colWorker !row_char !i = do prev_col_char <- readArray str1_array (i - 1) col_char <- readArray str1_array i left_left_up_up <- readArray cost_row (i - 2) left_up <- readArray cost_row' (i - 1) left <- readArray cost_row'' (i - 1) here_up <- readArray cost_row' i let here_standard_only = standardCosts costs row_char col_char left left_up here_up here = if prev_row_char == col_char && prev_col_char == row_char then here_standard_only `min` (left_left_up_up + transpositionCost costs col_char row_char) else here_standard_only writeArray cost_row'' i here {-# INLINE standardCosts #-} standardCosts :: EditCosts -> Char -> Char -> Int -> Int -> Int -> Int standardCosts !costs !row_char !col_char !cost_left !cost_left_up !cost_up = deletion_cost `min` insertion_cost `min` subst_cost where deletion_cost = cost_left + deletionCost costs col_char insertion_cost = cost_up + insertionCost costs row_char subst_cost = cost_left_up + if row_char == col_char then 0 else substitutionCost costs col_char row_char {-# INLINE stringToArray #-} stringToArray :: String -> Int -> ST s (STUArray s Int Char) stringToArray str !str_length = do array <- newArray_ (1, str_length) forM_ (zip [1..] str) (uncurry (writeArray array)) return arrayedit-distance-0.2.1.1/Text/EditDistance/Tests.hs000644 000765 000024 00000000336 12033303026 023117 0ustar00mbolingbrokestaff000000 000000 module Main where import Text.EditDistance.Tests.Properties import Test.Framework import Data.Monoid main :: IO () main = defaultMain $ map (plusTestOptions $ mempty { topt_maximum_generated_tests = Just 1000 }) tests