edit-distance-0.2.2.1/0000755000000000000000000000000012550525026012557 5ustar0000000000000000edit-distance-0.2.2.1/edit-distance.cabal0000644000000000000000000000435712550525026016271 0ustar0000000000000000name: edit-distance version: 0.2.2.1 cabal-version: >= 1.10 category: Algorithms synopsis: Levenshtein and restricted Damerau-Levenshtein edit distances description: Optimized edit distances for fuzzy matching, including Levenshtein and restricted Damerau-Levenshtein algorithms. license: BSD3 license-File: LICENSE extra-source-files: README.md author: Max Bolingbroke copyright: (c) 2008-2013 Maximilian Bolinbroke maintainer: Oleg Grenrus homepage: http://github.com/phadej/edit-distance build-type: Simple library default-language: Haskell98 exposed-modules: Text.EditDistance other-modules: Text.EditDistance.EditCosts Text.EditDistance.SquareSTUArray Text.EditDistance.STUArray Text.EditDistance.Bits Text.EditDistance.MonadUtilities Text.EditDistance.ArrayUtilities build-depends: base >= 4.5 && < 5, array >= 0.1, random >= 1.0, containers >= 0.1.0.1 ghc-options: -O2 -Wall test-suite edit-distance-tests default-language: Haskell98 main-is: Text/EditDistance/Tests.hs other-modules: Text.EditDistance.Tests.EditOperationOntology Text.EditDistance.Tests.Properties type: exitcode-stdio-1.0 ghc-options: -O2 -Wall build-depends: base >= 4.5 && < 5, array >= 0.1, random >= 1.0, containers >= 0.1.0.1, test-framework >= 0.1.1, QuickCheck >= 2.4 && <2.9, test-framework-quickcheck2 benchmark edit-distance-benchmark default-language: Haskell98 main-is: Text/EditDistance/Benchmark.hs type: exitcode-stdio-1.0 build-depends: base >= 4.5 && < 5, array >= 0.1, random >= 1.0, time >= 1.0, process >= 1.0, deepseq >= 1.2, unix >= 2.3, criterion >= 1.1, containers >= 0.1.0.1 ghc-options: -O2 source-repository head type: git location: https://github.com/phadej/edit-distance.git edit-distance-0.2.2.1/LICENSE0000644000000000000000000000277312550525026013575 0ustar0000000000000000Copyright (c) 2008-2013 Maximilian Bolingbroke All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Maximilian Bolingbroke nor the names of other contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. edit-distance-0.2.2.1/README.md0000644000000000000000000000253112550525026014037 0ustar0000000000000000# Edit Distance Algorithms [![Build Status](https://travis-ci.org/phadej/edit-distance.svg?branch=master)](https://travis-ci.org/phadej/edit-distance) [![Hackage](https://img.shields.io/hackage/v/edit-distance.svg)](http://hackage.haskell.org/package/edit-distance) ## Installing To just install the library: ``` cabal configure cabal build cabal install ``` ## Description Edit distances algorithms for fuzzy matching. Specifically, this library provides: - [Levenshtein distance](http://en.wikipedia.org/wiki/Levenshtein_distance) - [Restricted Damerau-Levenshtein distance](http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance) They have been fairly heavily optimized. Indeed, for situations where one of the strings is under 64 characters long we use a rather neat "bit vector" algorithm: see [the authors paper](http://www.cs.uta.fi/~helmu/pubs/psc02.pdf) and [the associated errata](http://www.cs.uta.fi/~helmu/pubs/PSCerr.html) for more information. The algorithms _could_ be faster, but they aren't yet slow enough to force us into improving the situation. ## Example ```hs Text.EditDistance> levenshteinDistance defaultEditCosts "witch" "kitsch" 2 ``` ## Links - [Hackage](http://hackage.haskell.org/package/edit-distance) - [GitHub](http://github.com/phadej/edit-distance) - [Original gitHub](http://github.com/batterseapower/edit-distance) edit-distance-0.2.2.1/Setup.lhs0000644000000000000000000000011512550525026014364 0ustar0000000000000000#! /usr/bin/env runhaskell > import Distribution.Simple > main = defaultMainedit-distance-0.2.2.1/Text/0000755000000000000000000000000012550525026013503 5ustar0000000000000000edit-distance-0.2.2.1/Text/EditDistance.hs0000644000000000000000000000604312550525026016402 0ustar0000000000000000{-# LANGUAGE PatternGuards #-} {-# LANGUAGE Safe #-} ---------------------------------------------------------------------------- -- | -- Module : Text.EditDistance -- Copyright : (C) 2010-2015 Maximilian Bolingbroke -- License : BSD-3-Clause (see the file LICENSE) -- -- Maintainer : Oleg Grenrus -- -- Computing the edit distances between strings ---------------------------------------------------------------------------- module Text.EditDistance ( Costs(..), EditCosts(..), defaultEditCosts, levenshteinDistance, restrictedDamerauLevenshteinDistance ) where import Text.EditDistance.EditCosts import qualified Text.EditDistance.Bits as Bits import qualified Text.EditDistance.SquareSTUArray as SquareSTUArray -- | Find the Levenshtein edit distance between two strings. That is to say, the number of deletion, -- insertion and substitution operations that are required to make the two strings equal. Note that -- this algorithm therefore does not make use of the @transpositionCost@ field of the costs. See also: -- . levenshteinDistance :: EditCosts -> String -> String -> Int levenshteinDistance costs str1 str2 | isDefaultEditCosts costs , (str1_len <= 64) == (str2_len <= 64) -- The Integer implementation of the Bits algorithm is quite inefficient, but scales better than the = Bits.levenshteinDistanceWithLengths str1_len str2_len str1 str2 -- STUArrays if both string lengths > 64. The Word64 implementation is always better, if it is applicable | otherwise = SquareSTUArray.levenshteinDistanceWithLengths costs str1_len str2_len str1 str2 -- SquareSTUArray usually beat making more use of the heap with STUArray where str1_len = length str1 str2_len = length str2 -- | Find the \"restricted\" Damerau-Levenshtein edit distance between two strings. This algorithm calculates the cost of -- the so-called optimal string alignment, which does not always equal the appropriate edit distance. The cost of the optimal -- string alignment is the number of edit operations needed to make the input strings equal under the condition that no substring -- is edited more than once. See also: . restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int restrictedDamerauLevenshteinDistance costs str1 str2 | isDefaultEditCosts costs , (str1_len <= 64) == (str2_len <= 64) -- The Integer implementation of the Bits algorithm is quite inefficient, but scales better than the = Bits.restrictedDamerauLevenshteinDistanceWithLengths str1_len str2_len str1 str2 -- STUArrays if both string lengths > 64. The Word64 implementation is always better, if it is applicable | otherwise = SquareSTUArray.restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 -- SquareSTUArray usually beat making more use of the heap with STUArray where str1_len = length str1 str2_len = length str2 edit-distance-0.2.2.1/Text/EditDistance/0000755000000000000000000000000012550525026016043 5ustar0000000000000000edit-distance-0.2.2.1/Text/EditDistance/ArrayUtilities.hs0000644000000000000000000000344312550525026021355 0ustar0000000000000000{-# LANGUAGE CPP, Trustworthy #-} module Text.EditDistance.ArrayUtilities ( unsafeReadArray, unsafeWriteArray, unsafeReadArray', unsafeWriteArray', stringToArray ) where import Control.Monad (forM_) import Control.Monad.ST import Data.Array.ST import Data.Array.Base (unsafeRead, unsafeWrite) #ifdef __GLASGOW_HASKELL__ import GHC.Arr (unsafeIndex) #else import Data.Ix (index) {-# INLINE unsafeIndex #-} unsafeIndex :: Ix i => (i, i) -> i -> Int unsafeIndex = index #endif {-# INLINE unsafeReadArray #-} unsafeReadArray :: (MArray a e m, Ix i) => a i e -> i -> m e unsafeReadArray marr i = do f <- unsafeReadArray' marr f i {-# INLINE unsafeWriteArray #-} unsafeWriteArray :: (MArray a e m, Ix i) => a i e -> i -> e -> m () unsafeWriteArray marr i e = do f <- unsafeWriteArray' marr f i e {-# INLINE unsafeReadArray' #-} unsafeReadArray' :: (MArray a e m, Ix i) => a i e -> m (i -> m e) unsafeReadArray' marr = do (l,u) <- getBounds marr return $ \i -> unsafeRead marr (unsafeIndex (l,u) i) {-# INLINE unsafeWriteArray' #-} unsafeWriteArray' :: (MArray a e m, Ix i) => a i e -> m (i -> e -> m ()) unsafeWriteArray' marr = do (l,u) <- getBounds marr return $ \i e -> unsafeWrite marr (unsafeIndex (l,u) i) e {-# INLINE stringToArray #-} stringToArray :: String -> Int -> ST s (STUArray s Int Char) stringToArray str str_len = do array <- newArray_ (1, str_len) write <- unsafeWriteArray' array forM_ (zip [1..] str) (uncurry write) return array {- showArray :: STUArray s (Int, Int) Int -> ST s String showArray array = do ((il, jl), (iu, ju)) <- getBounds array flip (flip foldM "") [(i, j) | i <- [il..iu], j <- [jl.. ju]] $ \rest (i, j) -> do elt <- readArray array (i, j) return $ rest ++ show (i, j) ++ ": " ++ show elt ++ ", " -} edit-distance-0.2.2.1/Text/EditDistance/Benchmark.hs0000644000000000000000000001135112550525026020272 0ustar0000000000000000{-# OPTIONS_GHC -fno-full-laziness #-} module Main where import Text.EditDistance.EditCosts import Text.EditDistance.MonadUtilities import qualified Text.EditDistance as BestEffort import qualified Text.EditDistance.Bits as Bits import qualified Text.EditDistance.STUArray as STUArray import qualified Text.EditDistance.SquareSTUArray as SquareSTUArray import Control.DeepSeq ( NFData, rnf ) import Control.Exception import Control.Monad import Criterion.Main import Data.List import Data.Time.Clock.POSIX (getPOSIXTime) import System.Environment import System.Exit import System.Mem import System.Process import System.Random sTRING_SIZE_STEP, mAX_STRING_SIZE :: Int sTRING_SIZE_STEP = 3 mAX_STRING_SIZE = 108 getTime :: IO Double getTime = realToFrac `fmap` getPOSIXTime time :: IO a -> IO Double time action = do ts1 <- getTime action ts2 <- getTime return $ ts2 - ts1 augment :: Monad m => (a -> m b) -> [a] -> m [(a, [b])] augment fx xs = liftM (zip xs) $ mapM (liftM (\b -> [b]) . fx) xs sample :: NFData a => (String -> String -> a) -> (Int, Int) -> IO Double sample distance bounds@(i, j) = do -- Generate two random strings of length i and j gen <- newStdGen let (string1, string2_long) = splitAt i (randoms gen) string2 = take j string2_long -- Force the two strings to be evaluated so they don't meddle -- with the benchmarking evaluate (rnf string1) evaluate (rnf string2) -- Don't want junk from previous runs causing a GC during the test performGC -- Our sample is the time taken to find the edit distance putStrLn $ "Sampling " ++ show bounds time $ loop (100000 `div` (1 + i + j)) $ evaluate (distance string1 string2) >> return () loop :: Monad m => Int -> m () -> m () loop n act = loopM_ 1 n (const act) joinOnKey :: Eq a => [(a, [b])] -> [(a, [b])] -> [(a, [b])] joinOnKey xs ys = [(x_a, (x_b ++ y_c)) | (x_a, x_b) <- xs, (y_a, y_c) <- ys, x_a == y_a] gnuPlotScript :: [String] -> String gnuPlotScript titles = "set term postscript eps enhanced color\n\ \set output \"data.ps\"\n\ \#unset key\n\ \set dgrid3d\n\ \set hidden3d\n\ \#set pm3d map\n\ \#splot \"data.plot\" using 1:2:3\n\ \splot " ++ splot_script ++ "\n\ \quit\n" where --splot_script = "\"data.plot\" using 1:2:3 title \"Bits\" with lines, \"data.plot\" using 1:2:4 title \"STUArray\" with lines, \"data.plot\" using 1:2:5 title \"SquareSTUArray\" with lines" splot_script = intercalate ", " ["\"data.plot\" using 1:2:" ++ show i ++ " title " ++ show title ++ " with lines" | (i, title) <- [3..] `zip` titles] toGnuPlotFormat :: (Show a, Show b, Show c) => [((a, b), [c])] -> String toGnuPlotFormat samples = unlines (header : map sampleToGnuPlotFormat samples) where first_cs = snd $ head samples header = "#\tX\tY" ++ concat (replicate (length first_cs) "\tZ") sampleToGnuPlotFormat ((a, b), cs) = concat $ intersperse "\t" $ [show a, show b] ++ map show cs main :: IO () main = do args <- getArgs let sample_titles = ["Bits", "SquareSTUArray", "STUArray", "Best effort"] sample_fns = [Bits.levenshteinDistance, SquareSTUArray.levenshteinDistance defaultEditCosts, STUArray.levenshteinDistance defaultEditCosts, BestEffort.levenshteinDistance defaultEditCosts] case args of ["plot"] -> do let sample_range = [(i, j) | i <- [0,sTRING_SIZE_STEP..mAX_STRING_SIZE] , j <- [0,sTRING_SIZE_STEP..mAX_STRING_SIZE]] --sample_fns = [Bits.restrictedDamerauLevenshteinDistance, SquareSTUArray.restrictedDamerauLevenshteinDistance defaultEditCosts, STUArray.restrictedDamerauLevenshteinDistance defaultEditCosts, BestEffort.restrictedDamerauLevenshteinDistance defaultEditCosts] sampless <- forM sample_fns $ \sample_fn -> augment (sample sample_fn) sample_range let listified_samples = foldr1 joinOnKey sampless writeFile "data.plot" (toGnuPlotFormat listified_samples) writeFile "plot.script" (gnuPlotScript sample_titles) (_inp, _outp, _err, gp_pid) <- runInteractiveCommand "(cat plot.script | gnuplot); RETCODE=$?; rm plot.script; exit $RETCODE" gp_exit_code <- waitForProcess gp_pid case gp_exit_code of ExitSuccess -> putStrLn "Plotted at 'data.ps'" ExitFailure err_no -> putStrLn $ "Failed! Error code " ++ show err_no _ -> do let mkBench n m name f = bench name $ whnf (uncurry f) (replicate n 'a', replicate m 'b') defaultMain [ bgroup (show (n, m)) (zipWith (mkBench n m) sample_titles sample_fns) | (n, m) <- [(32, 32), (32, mAX_STRING_SIZE), (mAX_STRING_SIZE, 32), (mAX_STRING_SIZE, mAX_STRING_SIZE)]] edit-distance-0.2.2.1/Text/EditDistance/Bits.hs0000644000000000000000000003237112550525026017306 0ustar0000000000000000{-# LANGUAGE PatternGuards, ScopedTypeVariables, BangPatterns, Rank2Types #-} module Text.EditDistance.Bits ( levenshteinDistance, levenshteinDistanceWithLengths, {-levenshteinDistanceCutoff,-} restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths ) where import Data.Bits import Data.Char import Data.Word import Data.List import qualified Data.IntMap as IM --import Debug.Trace --type BitVector = Integer -- Continuation-passing foldl's to work around the lack of recursive CPR optimisation in GHC {-# INLINE foldl'3k #-} foldl'3k :: (forall res. (a, b, c) -> x -> ((a, b, c) -> res) -> res) -> (a, b, c) -> [x] -> (a, b, c) foldl'3k f = go where go (!_, !_, !_) _ | False = undefined go ( a, b, c) [] = (a, b, c) go ( a, b, c) (x:xs) = f (a, b, c) x $ \abc -> go abc xs {-# INLINE foldl'5k #-} foldl'5k :: (forall res. (a, b, c, d, e) -> x -> ((a, b, c, d, e) -> res) -> res) -> (a, b, c, d, e) -> [x] -> (a, b, c, d, e) foldl'5k f = go where go (!_, !_, !_, !_, !_) _ | False = undefined go ( a, b, c, d, e) [] = (a, b, c, d, e) go ( a, b, c, d, e) (x:xs) = f (a, b, c, d, e) x $ \abcde -> go abcde xs -- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro). -- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation levenshteinDistance :: String -> String -> Int levenshteinDistance str1 str2 = levenshteinDistanceWithLengths m n str1 str2 where m = length str1 n = length str2 levenshteinDistanceWithLengths :: Int -> Int -> String -> String -> Int levenshteinDistanceWithLengths !m !n str1 str2 | m <= n = if n <= 64 -- n must be larger so this check is sufficient then levenshteinDistance' (undefined :: Word64) m n str1 str2 else levenshteinDistance' (undefined :: Integer) m n str1 str2 | otherwise = if m <= 64 -- m must be larger so this check is sufficient then levenshteinDistance' (undefined :: Word64) n m str2 str1 else levenshteinDistance' (undefined :: Integer) n m str2 str1 {-# SPECIALIZE levenshteinDistance' :: Word64 -> Int -> Int -> String -> String -> Int #-} {-# SPECIALIZE levenshteinDistance' :: Integer -> Int -> Int -> String -> String -> Int #-} levenshteinDistance' :: (Num bv, Bits bv) => bv -> Int -> Int -> String -> String -> Int levenshteinDistance' (_bv_dummy :: bv) !m !n str1 str2 | [] <- str1 = n | otherwise = extractAnswer $ foldl'3k (levenshteinDistanceWorker (matchVectors str1) top_bit_mask vector_mask) (m_ones, 0, m) str2 where m_ones@vector_mask = (2 ^ m) - 1 top_bit_mask = 1 `shiftL` (m - 1) :: bv extractAnswer (_, _, distance) = distance {-# SPECIALIZE INLINE levenshteinDistanceWorker :: IM.IntMap Word64 -> Word64 -> Word64 -> (Word64, Word64, Int) -> Char -> ((Word64, Word64, Int) -> res) -> res #-} {-# SPECIALIZE INLINE levenshteinDistanceWorker :: IM.IntMap Integer -> Integer -> Integer -> (Integer, Integer, Int) -> Char -> ((Integer, Integer, Int) -> res) -> res #-} levenshteinDistanceWorker :: (Num bv, Bits bv) => IM.IntMap bv -> bv -> bv -> (bv, bv, Int) -> Char -> ((bv, bv, Int) -> res) -> res levenshteinDistanceWorker !str1_mvs !top_bit_mask !vector_mask (!vp, !vn, !distance) !char2 k = {- trace (unlines ["pm = " ++ show pm' ,"d0 = " ++ show d0' ,"hp = " ++ show hp' ,"hn = " ++ show hn' ,"vp = " ++ show vp' ,"vn = " ++ show vn' ,"distance' = " ++ show distance' ,"distance'' = " ++ show distance'']) -} vp' `seq` vn' `seq` distance'' `seq` k (vp', vn', distance'') where pm' = IM.findWithDefault 0 (ord char2) str1_mvs d0' = ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn hp' = vn .|. sizedComplement vector_mask (d0' .|. vp) hn' = d0' .&. vp hp'_shift = ((hp' `shiftL` 1) .|. 1) .&. vector_mask hn'_shift = (hn' `shiftL` 1) .&. vector_mask vp' = hn'_shift .|. sizedComplement vector_mask (d0' .|. hp'_shift) vn' = d0' .&. hp'_shift distance' = if hp' .&. top_bit_mask /= 0 then distance + 1 else distance distance'' = if hn' .&. top_bit_mask /= 0 then distance' - 1 else distance' {- -- Just can't get this working! -- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro). -- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation levenshteinDistanceCutoff :: Int -> String -> String -> Int levenshteinDistanceCutoff cutoff str1 str2 | length str1 <= length str2 = levenshteinDistanceCutoff' cutoff str1 str2 | otherwise = levenshteinDistanceCutoff' cutoff str2 str1 levenshteinDistanceCutoff' :: Int -> String -> String -> Int levenshteinDistanceCutoff' cutoff str1 str2 | [] <- str1 = n | otherwise = extractAnswer $ foldl' (levenshteinDistanceCutoffFlatWorker (matchVectors str1)) (foldl' (levenshteinDistanceCutoffDiagWorker (matchVectors str1)) (top_bit_mask, vector_mask, all_ones, 0, initial_pm_offset, initial_dist) str2_diag) str2_flat where m = length str1 n = length str2 vector_length = if testBit bottom_factor 0 then cutoff -- Odd else cutoff + 1 -- Even all_ones@vector_mask = (2 ^ vector_length) - 1 top_bit_mask = trace (show bottom_factor ++ ", " ++ show vector_length) $ 1 `shiftL` (vector_length - 1) extractAnswer (_, _, _, _, _, distance) = distance len_difference = n - m top_factor = cutoff + len_difference bottom_factor = cutoff - len_difference bottom_factor_shift = (bottom_factor `shiftR` 1) initial_dist = bottom_factor_shift -- The distance the virtual first vector ended on initial_pm_offset = (top_factor `shiftR` 1) -- The amount of left shift to apply to the >next< pattern match vector diag_threshold = negate bottom_factor_shift + m -- The index in str2 where we stop going diagonally down and start going across (str2_diag, str2_flat) = splitAt diag_threshold str2 levenshteinDistanceCutoffDiagWorker :: IM.IntMap BitVector -> (BitVector, BitVector, BitVector, BitVector, Int, Int) -> Char -> (BitVector, BitVector, BitVector, BitVector, Int, Int) levenshteinDistanceCutoffDiagWorker !str1_mvs (!top_bit_mask, !vector_mask, !vp, !vn, !pm_offset, !distance) !char2 = trace (unlines ["vp = " ++ show vp ,"vn = " ++ show vn ,"vector_mask = " ++ show vector_mask ,"pm_offset = " ++ show pm_offset ,"unshifted_pm = " ++ show unshifted_pm ,"pm' = " ++ show pm' ,"d0' = " ++ show d0' ,"hp' = " ++ show hp' ,"hn' = " ++ show hn' ,"vp' = " ++ show vp' ,"vn' = " ++ show vn' ,"distance' = " ++ show distance']) (top_bit_mask, vector_mask, vp', vn', pm_offset - 1, distance') where unshifted_pm = IM.findWithDefault 0 (ord char2) str1_mvs pm' = (unshifted_pm `shift` pm_offset) .&. vector_mask d0' = ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn hp' = vn .|. sizedComplement vector_mask (d0' .|. vp) hn' = d0' .&. vp d0'_shift = d0' `shiftR` 1 vp' = hn' .|. sizedComplement vector_mask (d0'_shift .|. hp') vn' = d0'_shift .&. hp' distance' = if d0' .&. top_bit_mask /= 0 then distance else distance + 1 levenshteinDistanceCutoffFlatWorker :: IM.IntMap BitVector -> (BitVector, BitVector, BitVector, BitVector, Int, Int) -> Char -> (BitVector, BitVector, BitVector, BitVector, Int, Int) levenshteinDistanceCutoffFlatWorker !str1_mvs (!top_bit_mask, !vector_mask, !vp, !vn, !pm_offset, !distance) !char2 = trace (unlines ["pm_offset = " ++ show pm_offset ,"top_bit_mask' = " ++ show top_bit_mask' ,"vector_mask' = " ++ show vector_mask' ,"pm = " ++ show pm' ,"d0 = " ++ show d0' ,"hp = " ++ show hp' ,"hn = " ++ show hn' ,"vp = " ++ show vp' ,"vn = " ++ show vn' ,"distance' = " ++ show distance' ,"distance'' = " ++ show distance'']) (top_bit_mask', vector_mask', vp', vn', pm_offset - 1, distance'') where top_bit_mask' = top_bit_mask `shiftR` 1 vector_mask' = vector_mask `shiftR` 1 pm' = (IM.findWithDefault 0 (ord char2) str1_mvs `rotate` pm_offset) .&. vector_mask' d0' = ((((pm' .&. vp) + vp) `xor` vp) .|. pm' .|. vn) .&. vector_mask' hp' = vn .|. sizedComplement vector_mask' (d0' .|. vp) hn' = d0' .&. vp d0'_shift = d0' `shiftR` 1 vp' = hn' .|. sizedComplement vector_mask' (d0'_shift .|. hp') vn' = d0'_shift .&. hp' distance' = if hp' .&. top_bit_mask' /= 0 then distance + 1 else distance distance'' = if hn' .&. top_bit_mask' /= 0 then distance' - 1 else distance' -} -- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro). -- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation restrictedDamerauLevenshteinDistance :: String -> String -> Int restrictedDamerauLevenshteinDistance str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths m n str1 str2 where m = length str1 n = length str2 restrictedDamerauLevenshteinDistanceWithLengths :: Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistanceWithLengths !m !n str1 str2 | m <= n = if n <= 64 -- n must be larger so this check is sufficient then restrictedDamerauLevenshteinDistance' (undefined :: Word64) m n str1 str2 else restrictedDamerauLevenshteinDistance' (undefined :: Integer) m n str1 str2 | otherwise = if m <= 64 -- m must be larger so this check is sufficient then restrictedDamerauLevenshteinDistance' (undefined :: Word64) n m str2 str1 else restrictedDamerauLevenshteinDistance' (undefined :: Integer) n m str2 str1 {-# SPECIALIZE restrictedDamerauLevenshteinDistance' :: Word64 -> Int -> Int -> String -> String -> Int #-} {-# SPECIALIZE restrictedDamerauLevenshteinDistance' :: Integer -> Int -> Int -> String -> String -> Int #-} restrictedDamerauLevenshteinDistance' :: (Num bv, Bits bv) => bv -> Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistance' (_bv_dummy :: bv) !m !n str1 str2 | [] <- str1 = n | otherwise = extractAnswer $ foldl'5k (restrictedDamerauLevenshteinDistanceWorker (matchVectors str1) top_bit_mask vector_mask) (0, 0, m_ones, 0, m) str2 where m_ones@vector_mask = (2 ^ m) - 1 top_bit_mask = 1 `shiftL` (m - 1) :: bv extractAnswer (_, _, _, _, distance) = distance {-# SPECIALIZE INLINE restrictedDamerauLevenshteinDistanceWorker :: IM.IntMap Word64 -> Word64 -> Word64 -> (Word64, Word64, Word64, Word64, Int) -> Char -> ((Word64, Word64, Word64, Word64, Int) -> res) -> res #-} {-# SPECIALIZE INLINE restrictedDamerauLevenshteinDistanceWorker :: IM.IntMap Integer -> Integer -> Integer -> (Integer, Integer, Integer, Integer, Int) -> Char -> ((Integer, Integer, Integer, Integer, Int) -> res) -> res #-} restrictedDamerauLevenshteinDistanceWorker :: (Num bv, Bits bv) => IM.IntMap bv -> bv -> bv -> (bv, bv, bv, bv, Int) -> Char -> ((bv, bv, bv, bv, Int) -> res) -> res restrictedDamerauLevenshteinDistanceWorker !str1_mvs !top_bit_mask !vector_mask (!pm, !d0, !vp, !vn, !distance) !char2 k = pm' `seq` d0' `seq` vp' `seq` vn' `seq` distance'' `seq` k (pm', d0', vp', vn', distance'') where pm' = IM.findWithDefault 0 (ord char2) str1_mvs d0' = ((((sizedComplement vector_mask d0) .&. pm') `shiftL` 1) .&. pm) -- No need to mask the shiftL because of the restricted range of pm .|. ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn hp' = vn .|. sizedComplement vector_mask (d0' .|. vp) hn' = d0' .&. vp hp'_shift = ((hp' `shiftL` 1) .|. 1) .&. vector_mask hn'_shift = (hn' `shiftL` 1) .&. vector_mask vp' = hn'_shift .|. sizedComplement vector_mask (d0' .|. hp'_shift) vn' = d0' .&. hp'_shift distance' = if hp' .&. top_bit_mask /= 0 then distance + 1 else distance distance'' = if hn' .&. top_bit_mask /= 0 then distance' - 1 else distance' {-# SPECIALIZE INLINE sizedComplement :: Word64 -> Word64 -> Word64 #-} {-# SPECIALIZE INLINE sizedComplement :: Integer -> Integer -> Integer #-} sizedComplement :: (Num bv, Bits bv) => bv -> bv -> bv sizedComplement vector_mask vect = vector_mask `xor` vect {-# SPECIALIZE matchVectors :: String -> IM.IntMap Word64 #-} {-# SPECIALIZE matchVectors :: String -> IM.IntMap Integer #-} matchVectors :: (Num bv, Bits bv) => String -> IM.IntMap bv matchVectors = snd . foldl' go (0 :: Int, IM.empty) where go (!ix, !im) char = let ix' = ix + 1 im' = IM.insertWith (.|.) (ord char) (2 ^ ix) im in (ix', im') edit-distance-0.2.2.1/Text/EditDistance/EditCosts.hs0000644000000000000000000000400112550525026020273 0ustar0000000000000000{-# OPTIONS_GHC -funbox-strict-fields #-} module Text.EditDistance.EditCosts ( Costs(..), EditCosts(..), deletionCost, insertionCost, substitutionCost, transpositionCost, defaultEditCosts, isDefaultEditCosts ) where data Costs a = ConstantCost !Int | VariableCost (a -> Int) {-# INLINE cost #-} cost :: Costs a -> a -> Int cost (ConstantCost i) _ = i cost (VariableCost f) x = f x data EditCosts = EditCosts { deletionCosts :: Costs Char, -- ^ Cost of deleting the specified character from the left string insertionCosts :: Costs Char, -- ^ Cost of inserting the specified characters into the right string substitutionCosts :: Costs (Char, Char), -- ^ Cost of substituting a character from the left string with one from the right string -- with arguments in that order. transpositionCosts :: Costs (Char, Char) -- ^ Cost of moving one character backwards and the other forwards -- with arguments in that order. } {-# INLINE deletionCost #-} deletionCost :: EditCosts -> Char -> Int deletionCost ec deleted = cost (deletionCosts ec) deleted {-# INLINE insertionCost #-} insertionCost :: EditCosts -> Char -> Int insertionCost ec inserted = cost (insertionCosts ec) inserted {-# INLINE substitutionCost #-} substitutionCost :: EditCosts -> Char -> Char -> Int substitutionCost ec old new = cost (substitutionCosts ec) (old, new) {-# INLINE transpositionCost #-} transpositionCost :: EditCosts -> Char -> Char -> Int transpositionCost ec backwards forwards = cost (transpositionCosts ec) (backwards, forwards) defaultEditCosts :: EditCosts defaultEditCosts = EditCosts { deletionCosts = ConstantCost 1, insertionCosts = ConstantCost 1, substitutionCosts = ConstantCost 1, transpositionCosts = ConstantCost 1 } isDefaultEditCosts :: EditCosts -> Bool isDefaultEditCosts (EditCosts { deletionCosts = ConstantCost 1, insertionCosts = ConstantCost 1, substitutionCosts = ConstantCost 1, transpositionCosts = ConstantCost 1 }) = True isDefaultEditCosts _ = False edit-distance-0.2.2.1/Text/EditDistance/MonadUtilities.hs0000644000000000000000000000227212550525026021334 0ustar0000000000000000{-# LANGUAGE BangPatterns #-} module Text.EditDistance.MonadUtilities where {-# INLINE loopM_ #-} loopM_ :: Monad m => Int -> Int -> (Int -> m ()) -> m () loopM_ xfrom xto action = go xfrom xto where go from to | from > to = return () | otherwise = do action from go (from + 1) to -- foldM in Control.Monad is not defined using SAT style so optimises very poorly {-# INLINE foldM #-} foldM :: (Monad m) => (a -> b -> m a) -> a -> [b] -> m a foldM f x xs = foldr (\y rest a -> f a y >>= rest) return xs x {- -- If we define it like this, then we aren't able to deforest wrt. a "build" in xs, which would be sad :( foldM f = go where go a (x:xs) = f a x >>= \fax -> go fax xs go a [] = return a -} -- If we just use a standard foldM then our loops often box stuff up to return from the loop which is then immediately discarded -- TODO: using this instead of foldM improves our benchmarks by about 2% but makes the code quite ugly.. figure out what to do {-# INLINE foldMK #-} foldMK :: (Monad m) => (a -> b -> m a) -> a -> [b] -> (a -> m res) -> m res foldMK f x xs k = foldr (\y rest a -> f a y >>= rest) k xs x edit-distance-0.2.2.1/Text/EditDistance/SquareSTUArray.hs0000644000000000000000000001453012550525026021235 0ustar0000000000000000{-# LANGUAGE PatternGuards, ScopedTypeVariables, BangPatterns, Trustworthy #-} module Text.EditDistance.SquareSTUArray ( levenshteinDistance, levenshteinDistanceWithLengths, restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths ) where import Text.EditDistance.EditCosts import Text.EditDistance.MonadUtilities import Text.EditDistance.ArrayUtilities import Control.Monad hiding (foldM) import Control.Monad.ST import Data.Array.ST levenshteinDistance :: EditCosts -> String -> String -> Int levenshteinDistance !costs str1 str2 = levenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 levenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int levenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (levenshteinDistanceST costs str1_len str2_len str1 str2) levenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int levenshteinDistanceST !costs !str1_len !str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs. Say we index it by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. cost_array <- newArray_ ((0, 0), (str1_len, str2_len)) :: ST s (STUArray s (Int, Int) Int) read_str1 <- unsafeReadArray' str1_array read_str2 <- unsafeReadArray' str2_array read_cost <- unsafeReadArray' cost_array write_cost <- unsafeWriteArray' cost_array -- Fill out the first row (j = 0) _ <- (\f -> foldM f (1, 0) str1) $ \(i, deletion_cost) col_char -> let deletion_cost' = deletion_cost + deletionCost costs col_char in write_cost (i, 0) deletion_cost' >> return (i + 1, deletion_cost') -- Fill the remaining rows (j >= 1) _ <- (\f -> foldM f 0 [1..str2_len]) $ \insertion_cost (!j) -> do row_char <- read_str2 j -- Initialize the first element of the row (i = 0) let insertion_cost' = insertion_cost + insertionCost costs row_char write_cost (0, j) insertion_cost' -- Fill the remaining elements of the row (i >= 1) loopM_ 1 str1_len $ \(!i) -> do col_char <- read_str1 i cost <- standardCosts costs read_cost row_char col_char (i, j) write_cost (i, j) cost return insertion_cost' -- Return an actual answer read_cost (str1_len, str2_len) restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int restrictedDamerauLevenshteinDistance costs str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 restrictedDamerauLevenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 = runST (restrictedDamerauLevenshteinDistanceST costs str1_len str2_len str1 str2) restrictedDamerauLevenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int restrictedDamerauLevenshteinDistanceST !costs str1_len str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs. Say we index it by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. cost_array <- newArray_ ((0, 0), (str1_len, str2_len)) :: ST s (STUArray s (Int, Int) Int) read_str1 <- unsafeReadArray' str1_array read_str2 <- unsafeReadArray' str2_array read_cost <- unsafeReadArray' cost_array write_cost <- unsafeWriteArray' cost_array -- Fill out the first row (j = 0) _ <- (\f -> foldM f (1, 0) str1) $ \(i, deletion_cost) col_char -> let deletion_cost' = deletion_cost + deletionCost costs col_char in write_cost (i, 0) deletion_cost' >> return (i + 1, deletion_cost') -- Fill out the second row (j = 1) when (str2_len > 0) $ do initial_row_char <- read_str2 1 -- Initialize the first element of the second row (i = 0) write_cost (0, 1) (insertionCost costs initial_row_char) -- Initialize the remaining elements of the row (i >= 1) loopM_ 1 str1_len $ \(!i) -> do col_char <- read_str1 i cost <- standardCosts costs read_cost initial_row_char col_char (i, 1) write_cost (i, 1) cost -- Fill the remaining rows (j >= 2) loopM_ 2 str2_len (\(!j) -> do row_char <- read_str2 j prev_row_char <- read_str2 (j - 1) -- Initialize the first element of the row (i = 0) write_cost (0, j) (insertionCost costs row_char * j) -- Initialize the second element of the row (i = 1) when (str1_len > 0) $ do col_char <- read_str1 1 cost <- standardCosts costs read_cost row_char col_char (1, j) write_cost (1, j) cost -- Fill the remaining elements of the row (i >= 2) loopM_ 2 str1_len (\(!i) -> do col_char <- read_str1 i prev_col_char <- read_str1 (i - 1) standard_cost <- standardCosts costs read_cost row_char col_char (i, j) cost <- if prev_row_char == col_char && prev_col_char == row_char then do transpose_cost <- fmap (+ (transpositionCost costs col_char row_char)) $ read_cost (i - 2, j - 2) return (standard_cost `min` transpose_cost) else return standard_cost write_cost (i, j) cost)) -- Return an actual answer read_cost (str1_len, str2_len) {-# INLINE standardCosts #-} standardCosts :: EditCosts -> ((Int, Int) -> ST s Int) -> Char -> Char -> (Int, Int) -> ST s Int standardCosts !costs read_cost !row_char !col_char (!i, !j) = do deletion_cost <- fmap (+ (deletionCost costs col_char)) $ read_cost (i - 1, j) insertion_cost <- fmap (+ (insertionCost costs row_char)) $ read_cost (i, j - 1) subst_cost <- fmap (+ if row_char == col_char then 0 else (substitutionCost costs col_char row_char)) (read_cost (i - 1, j - 1)) return $ deletion_cost `min` insertion_cost `min` subst_cost edit-distance-0.2.2.1/Text/EditDistance/STUArray.hs0000644000000000000000000002077412550525026020063 0ustar0000000000000000{-# LANGUAGE PatternGuards, ScopedTypeVariables, BangPatterns, FlexibleContexts #-} module Text.EditDistance.STUArray ( levenshteinDistance, levenshteinDistanceWithLengths, restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths ) where import Text.EditDistance.EditCosts import Text.EditDistance.MonadUtilities import Text.EditDistance.ArrayUtilities import Control.Monad hiding (foldM) import Control.Monad.ST import Data.Array.ST levenshteinDistance :: EditCosts -> String -> String -> Int levenshteinDistance !costs str1 str2 = levenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 levenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int levenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (levenshteinDistanceST costs str1_len str2_len str1 str2) levenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int levenshteinDistanceST !costs !str1_len !str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs for a single row. Say we index costs by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. We can get away with just storing a single -- row of costs at a time, but we use two because it turns out to be faster start_cost_row <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) start_cost_row' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) read_str1 <- unsafeReadArray' str1_array read_str2 <- unsafeReadArray' str2_array -- Fill out the first row (j = 0) _ <- (\f -> foldM f (1, 0) str1) $ \(i, deletion_cost) col_char -> let deletion_cost' = deletion_cost + deletionCost costs col_char in unsafeWriteArray start_cost_row i deletion_cost' >> return (i + 1, deletion_cost') -- Fill out the remaining rows (j >= 1) (_, final_row, _) <- (\f -> foldM f (0, start_cost_row, start_cost_row') [1..str2_len]) $ \(!insertion_cost, !cost_row, !cost_row') !j -> do row_char <- read_str2 j -- Initialize the first element of the row (i = 0) let insertion_cost' = insertion_cost + insertionCost costs row_char unsafeWriteArray cost_row' 0 insertion_cost' -- Fill the remaining elements of the row (i >= 1) loopM_ 1 str1_len $ \(!i) -> do col_char <- read_str1 i left_up <- unsafeReadArray cost_row (i - 1) left <- unsafeReadArray cost_row' (i - 1) here_up <- unsafeReadArray cost_row i let here = standardCosts costs row_char col_char left left_up here_up unsafeWriteArray cost_row' i here return (insertion_cost', cost_row', cost_row) -- Return an actual answer unsafeReadArray final_row str1_len restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int restrictedDamerauLevenshteinDistance !costs str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 where str1_len = length str1 str2_len = length str2 restrictedDamerauLevenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int restrictedDamerauLevenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (restrictedDamerauLevenshteinDistanceST costs str1_len str2_len str1 str2) restrictedDamerauLevenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int restrictedDamerauLevenshteinDistanceST !costs str1_len str2_len str1 str2 = do -- Create string arrays str1_array <- stringToArray str1 str1_len str2_array <- stringToArray str2 str2_len -- Create array of costs for a single row. Say we index costs by (i, j) where i is the column index and j the row index. -- Rows correspond to characters of str2 and columns to characters of str1. We can get away with just storing two -- rows of costs at a time, but I use three because it turns out to be faster cost_row <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) read_str1 <- unsafeReadArray' str1_array read_str2 <- unsafeReadArray' str2_array -- Fill out the first row (j = 0) _ <- (\f -> foldM f (1, 0) str1) $ \(i, deletion_cost) col_char -> let deletion_cost' = deletion_cost + deletionCost costs col_char in unsafeWriteArray cost_row i deletion_cost' >> return (i + 1, deletion_cost') if (str2_len == 0) then unsafeReadArray cost_row str1_len else do -- We defer allocation of these arrays to here because they aren't used in the other branch cost_row' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) cost_row'' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int) -- Fill out the second row (j = 1) row_char <- read_str2 1 -- Initialize the first element of the row (i = 0) let zero = insertionCost costs row_char unsafeWriteArray cost_row' 0 zero -- Fill the remaining elements of the row (i >= 1) loopM_ 1 str1_len (firstRowColWorker read_str1 row_char cost_row cost_row') -- Fill out the remaining rows (j >= 2) (_, _, final_row, _, _) <- foldM (restrictedDamerauLevenshteinDistanceSTRowWorker costs str1_len read_str1 read_str2) (zero, cost_row, cost_row', cost_row'', row_char) [2..str2_len] -- Return an actual answer unsafeReadArray final_row str1_len where {-# INLINE firstRowColWorker #-} firstRowColWorker read_str1 !row_char !cost_row !cost_row' !i = do col_char <- read_str1 i left_up <- unsafeReadArray cost_row (i - 1) left <- unsafeReadArray cost_row' (i - 1) here_up <- unsafeReadArray cost_row i let here = standardCosts costs row_char col_char left left_up here_up unsafeWriteArray cost_row' i here {-# INLINE restrictedDamerauLevenshteinDistanceSTRowWorker #-} restrictedDamerauLevenshteinDistanceSTRowWorker :: EditCosts -> Int -> (Int -> ST s Char) -> (Int -> ST s Char) -- String array accessors -> (Int, STUArray s Int Int, STUArray s Int Int, STUArray s Int Int, Char) -> Int -- Incoming rows of the matrix in recency order -> ST s (Int, STUArray s Int Int, STUArray s Int Int, STUArray s Int Int, Char) -- Outgoing rows of the matrix in recency order restrictedDamerauLevenshteinDistanceSTRowWorker !costs !str1_len read_str1 read_str2 (!insertion_cost, !cost_row, !cost_row', !cost_row'', !prev_row_char) !j = do row_char <- read_str2 j -- Initialize the first element of the row (i = 0) zero_up <- unsafeReadArray cost_row' 0 let insertion_cost' = insertion_cost + insertionCost costs row_char unsafeWriteArray cost_row'' 0 insertion_cost' -- Initialize the second element of the row (i = 1) when (str1_len > 0) $ do col_char <- read_str1 1 one_up <- unsafeReadArray cost_row' 1 let one = standardCosts costs row_char col_char insertion_cost' zero_up one_up unsafeWriteArray cost_row'' 1 one -- Fill the remaining elements of the row (i >= 2) loopM_ 2 str1_len (colWorker row_char) return (insertion_cost', cost_row', cost_row'', cost_row, row_char) where colWorker !row_char !i = do prev_col_char <- read_str1 (i - 1) col_char <- read_str1 i left_left_up_up <- unsafeReadArray cost_row (i - 2) left_up <- unsafeReadArray cost_row' (i - 1) left <- unsafeReadArray cost_row'' (i - 1) here_up <- unsafeReadArray cost_row' i let here_standard_only = standardCosts costs row_char col_char left left_up here_up here = if prev_row_char == col_char && prev_col_char == row_char then here_standard_only `min` (left_left_up_up + transpositionCost costs col_char row_char) else here_standard_only unsafeWriteArray cost_row'' i here {-# INLINE standardCosts #-} standardCosts :: EditCosts -> Char -> Char -> Int -> Int -> Int -> Int standardCosts !costs !row_char !col_char !cost_left !cost_left_up !cost_up = deletion_cost `min` insertion_cost `min` subst_cost where deletion_cost = cost_left + deletionCost costs col_char insertion_cost = cost_up + insertionCost costs row_char subst_cost = cost_left_up + if row_char == col_char then 0 else substitutionCost costs col_char row_char edit-distance-0.2.2.1/Text/EditDistance/Tests.hs0000644000000000000000000000033712550525026017504 0ustar0000000000000000module Main where import Text.EditDistance.Tests.Properties import Test.Framework import Data.Monoid main :: IO () main = defaultMain $ map (plusTestOptions $ mempty { topt_maximum_generated_tests = Just 1000 }) tests edit-distance-0.2.2.1/Text/EditDistance/Tests/0000755000000000000000000000000012550525026017145 5ustar0000000000000000edit-distance-0.2.2.1/Text/EditDistance/Tests/EditOperationOntology.hs0000644000000000000000000000712612550525026024010 0ustar0000000000000000{-# LANGUAGE PatternGuards #-} module Text.EditDistance.Tests.EditOperationOntology where import Text.EditDistance.EditCosts import Test.QuickCheck import Control.Monad class Arbitrary ops => EditOperation ops where edit :: String -> ops -> Gen (String, EditCosts -> Int) containsTransposition :: ops -> Bool instance EditOperation op => EditOperation [op] where edit ys ops = foldM (\(xs, c) op -> fmap (\(xs', cost') -> (xs', \ecs -> c ecs + cost' ecs)) $ edit xs op) (ys, const 0) ops containsTransposition = any containsTransposition data EditedString ops = MkEditedString { oldString :: String, newString :: String, operations :: ops, esCost :: EditCosts -> Int } instance Show ops => Show (EditedString ops) where show (MkEditedString old_string new_string ops _cost) = show old_string ++ " ==> " ++ show new_string ++ " (by " ++ show ops ++ ")" instance EditOperation ops => Arbitrary (EditedString ops) where arbitrary = do old_string <- arbitrary edit_operations <- arbitrary (new_string, cost) <- edit old_string edit_operations return $ MkEditedString { oldString = old_string, newString = new_string, operations = edit_operations, esCost = cost } data ExtendedEditOperation = Deletion | Insertion Char | Substitution Char | Transposition deriving (Show) instance Arbitrary ExtendedEditOperation where arbitrary = oneof [return Deletion, fmap Insertion arbitrary, fmap Substitution arbitrary, return Transposition] instance EditOperation ExtendedEditOperation where edit str op = do let max_split_ix | Transposition <- op = length str - 1 | otherwise = length str split_ix <- choose (1, max_split_ix) let (str_l, str_r) = splitAt split_ix str non_null = not $ null str transposable = length str > 1 case op of Deletion | non_null -> do let old_ch = last str_l return (init str_l ++ str_r, \ec -> deletionCost ec old_ch) Insertion new_ch | non_null -> do return (str_l ++ new_ch : str_r, \ec -> insertionCost ec new_ch) Insertion new_ch | otherwise -> return ([new_ch], \ec -> insertionCost ec new_ch) -- Need special case because randomR (1, 0) is undefined Substitution new_ch | non_null -> do let old_ch = last str_l return (init str_l ++ new_ch : str_r, \ec -> substitutionCost ec old_ch new_ch) Transposition | transposable -> do -- Need transposable rather than non_null because randomR (1, 0) is undefined let backwards_ch = head str_r forwards_ch = last str_l return (init str_l ++ backwards_ch : forwards_ch : tail str_r, \ec -> transpositionCost ec backwards_ch forwards_ch) _ -> return (str, const 0) containsTransposition Transposition = True containsTransposition _ = False -- This all really sucks but I can't think of something better right now newtype BasicEditOperation = MkBasic ExtendedEditOperation instance Show BasicEditOperation where show (MkBasic x) = show x instance Arbitrary BasicEditOperation where arbitrary = fmap MkBasic $ oneof [return Deletion, fmap Insertion arbitrary, fmap Substitution arbitrary] instance EditOperation BasicEditOperation where edit str (MkBasic op) = edit str op containsTransposition _ = False edit-distance-0.2.2.1/Text/EditDistance/Tests/Properties.hs0000644000000000000000000001412612550525026021641 0ustar0000000000000000{-# LANGUAGE PatternGuards, ScopedTypeVariables #-} module Text.EditDistance.Tests.Properties ( tests ) where import Text.EditDistance.EditCosts import qualified Text.EditDistance.SquareSTUArray as SquareSTUArray import qualified Text.EditDistance.STUArray as STUArray import qualified Text.EditDistance.Bits as Bits import Text.EditDistance.Tests.EditOperationOntology import Test.Framework import Test.Framework.Providers.QuickCheck2 import Test.QuickCheck tests :: [Test] tests = [ testGroup "Levenshtein Distance (SquareSTUArray)" sqstu_levenshteinDistanceTests , testGroup "Restricted Damerau-Levenshtein Distance (SquareSTUArray)" sqstu_restrictedDamerauLevenshteinDistanceTests , testGroup "Levenshtein Distance (STUArray)" stu_levenshteinDistanceTests , testGroup "Restricted Damerau-Levenshtein Distance (STUArray)" stu_restrictedDamerauLevenshteinDistanceTests , testGroup "Levenshtein Distance (Bits)" bits_levenshteinDistanceTests , testGroup "Restricted Damerau-Levenshtein Distance (Bits)" bits_restrictedDamerauLevenshteinDistanceTests , testGroup "Levenshtein Distance Crosschecks" levenshteinDistanceCrosscheckTests , testGroup "Restricted Damerau-Levenshtein Distance Crosschecks" restrictedDamerauLevenshteinDistanceCrosscheckTests --, testGroup "Levenshtein Distance Cutoff (Bits)" bits_levenshteinDistanceCutoffTests ] where sqstu_levenshteinDistanceTests = standardDistanceTests SquareSTUArray.levenshteinDistance interestingCosts (undefined :: BasicEditOperation) sqstu_restrictedDamerauLevenshteinDistanceTests = standardDistanceTests SquareSTUArray.restrictedDamerauLevenshteinDistance interestingCosts (undefined :: ExtendedEditOperation) stu_levenshteinDistanceTests = standardDistanceTests STUArray.levenshteinDistance interestingCosts (undefined :: BasicEditOperation) stu_restrictedDamerauLevenshteinDistanceTests = standardDistanceTests STUArray.restrictedDamerauLevenshteinDistance interestingCosts (undefined :: ExtendedEditOperation) bits_levenshteinDistanceTests = standardDistanceTests (const Bits.levenshteinDistance) defaultEditCosts (undefined :: BasicEditOperation) bits_restrictedDamerauLevenshteinDistanceTests = standardDistanceTests (const Bits.restrictedDamerauLevenshteinDistance) defaultEditCosts (undefined :: ExtendedEditOperation) --bits_levenshteinDistanceCutoffTests = [ testProperty "Cutoff vs. Non-Cutoff" (forAll arbitrary (\cutoff -> distanceEqIfBelowProperty cutoff (Bits.levenshteinDistanceCutoff cutoff) Bits.levenshteinDistance defaultEditCosts (undefined :: BasicEditOperation))) ] levenshteinDistanceCrosscheckTests = crossCheckTests [ ("SquareSTUArray", SquareSTUArray.levenshteinDistance defaultEditCosts) , ("STUArray", STUArray.levenshteinDistance defaultEditCosts) , ("Bits", Bits.levenshteinDistance) ] (undefined :: BasicEditOperation) restrictedDamerauLevenshteinDistanceCrosscheckTests = crossCheckTests [ ("SquareSTUArray", SquareSTUArray.restrictedDamerauLevenshteinDistance defaultEditCosts) , ("STUArray", STUArray.restrictedDamerauLevenshteinDistance defaultEditCosts) , ("Bits", Bits.restrictedDamerauLevenshteinDistance) ] (undefined :: ExtendedEditOperation) interestingCosts :: EditCosts interestingCosts = EditCosts { deletionCosts = ConstantCost 1, insertionCosts = ConstantCost 2, substitutionCosts = ConstantCost 3, -- Can't be higher than deletion + insertion transpositionCosts = ConstantCost 3 -- Can't be higher than deletion + insertion } crossCheckTests :: forall op. (EditOperation op, Show op) => [(String, String -> String -> Int)] -> op -> [Test] crossCheckTests named_distances _op_dummy = [ testProperty (name1 ++ " vs. " ++ name2) (distanceEqProperty distance1 distance2 _op_dummy) | (ix1, (name1, distance1)) <- enumerated_named_distances, (ix2, (name2, distance2)) <- enumerated_named_distances, ix2 > ix1 ] where enumerated_named_distances = [(1 :: Int)..] `zip` named_distances distanceEqProperty :: (String -> String -> Int) -> (String -> String -> Int) -> op -> EditedString op -> Bool distanceEqProperty distance1 distance2 _op_dummy (MkEditedString old new _ _) = distance1 old new == distance2 old new --distanceEqIfBelowProperty :: (EditOperation op) => Int -> (String -> String -> Int) -> (String -> String -> Int) -> EditCosts -> op -> EditedString op -> Property --distanceEqIfBelowProperty cutoff distance1 distance2 costs _op_dummy (MkEditedString old new ops) = (editCost costs ops <= cutoff) ==> distance1 old new == distance2 old new standardDistanceTests :: forall op. (EditOperation op, Show op) => (EditCosts -> String -> String -> Int) -> EditCosts -> op -> [Test] standardDistanceTests distance costs _op_dummy = [ testProperty "Self distance is zero" prop_self_distance_zero , testProperty "Pure deletion has the right cost" prop_pure_deletion_cost_correct , testProperty "Pure insertion has the right cost" prop_pure_insertion_cost_correct , testProperty "Single operations have the right cost" prop_single_op_cost_is_distance , testProperty "Cost bound is respected" prop_combined_op_cost_at_least_distance ] where testableDistance = distance costs prop_self_distance_zero str = testableDistance str str == 0 prop_pure_deletion_cost_correct str = testableDistance str "" == sum [deletionCost costs c | c <- str] prop_pure_insertion_cost_correct str = testableDistance "" str == sum [insertionCost costs c | c <- str] prop_single_op_cost_is_distance (MkEditedString old new _ops cost :: EditedString op) = (length old > 2) ==> testableDistance old new == cost costs || old == new prop_combined_op_cost_at_least_distance (MkEditedString old new ops cost :: EditedString [op]) = not (containsTransposition ops) ==> testableDistance old new <= cost costs