edit-distance-0.2.1.1/ 000755 000765 000024 00000000000 12033303026 016173 5 ustar 00mbolingbroke staff 000000 000000 edit-distance-0.2.1.1/edit-distance.cabal 000644 000765 000024 00000005342 12033303026 021700 0 ustar 00mbolingbroke staff 000000 000000 Name: edit-distance
Version: 0.2.1.1
Cabal-Version: >= 1.2
Category: Algorithms
Synopsis: Levenshtein and restricted Damerau-Levenshtein edit distances
Description: Optimized edit distances for fuzzy matching, including Levenshtein and restricted Damerau-Levenshtein algorithms.
License: BSD3
License-File: LICENSE
Extra-Source-Files: README.textile
Author: Max Bolingbroke
Maintainer: batterseapower@hotmail.com
Homepage: http://github.com/batterseapower/edit-distance
Build-Type: Simple
Flag Tests
Description: Enable building the tests
Default: False
Flag Benchmark
Description: Enable building the benchmark suite
Default: False
Flag SplitBase
Description: Choose the new smaller, split-up base package
Default: True
Library
Exposed-Modules: Text.EditDistance
Other-Modules: Text.EditDistance.EditCosts
Text.EditDistance.SquareSTUArray
Text.EditDistance.STUArray
Text.EditDistance.Bits
Text.EditDistance.MonadUtilities
if flag(splitBase)
Build-Depends: base >= 3 && < 5, array >= 0.1, random >= 1.0, containers >= 0.1.0.1
else
Build-Depends: base < 3
Ghc-Options: -O2
Executable edit-distance-tests
Main-Is: Text/EditDistance/Tests.hs
Extensions: PatternGuards, PatternSignatures,
ScopedTypeVariables
Ghc-Options: -O2
if !flag(tests)
Buildable: False
else
Build-Depends: test-framework >= 0.1.1, QuickCheck >= 1.1 && < 2.0, test-framework-quickcheck
if flag(splitBase)
Build-Depends: base >= 3 && < 5, array >= 0.1, random >= 1.0
else
Build-Depends: base < 3
Executable edit-distance-benchmark
Main-Is: Text/EditDistance/Benchmark.hs
if !flag(benchmark)
Buildable: False
else
if flag(splitBase)
Build-Depends: base >= 3 && < 5, array >= 0.1, random >= 1.0, old-time >= 1.0, process >= 1.0,
parallel >= 1.0, unix >= 2.3
else
Build-Depends: base < 3,
parallel >= 1.0, unix >= 2.3
Ghc-Options: -O2 edit-distance-0.2.1.1/LICENSE 000644 000765 000024 00000002766 12033303026 017213 0 ustar 00mbolingbroke staff 000000 000000 Copyright (c) 2008, Maximilian Bolingbroke
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted
provided that the following conditions are met:
* Redistributions of source code must retain the above copyright notice, this list of
conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of
conditions and the following disclaimer in the documentation and/or other materials
provided with the distribution.
* Neither the name of Maximilian Bolingbroke nor the names of other contributors may be used to
endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR
IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND
FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR
CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER
IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT
OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. edit-distance-0.2.1.1/README.textile 000644 000765 000024 00000003157 12033303026 020536 0 ustar 00mbolingbroke staff 000000 000000 h1. Edit Distance Algorithms
You can help improve this README with extra snippets and advice by using the "GitHub wiki":http://github.com/batterseapower/edit-distance/wikis/readme.
h2. Installing
To just install the library:
runghc Setup.lhs configure
runghc Setup.lhs build
sudo runghc Setup.lhs install
If you want to build the tests, to check it's all working:
runghc Setup.lhs configure -ftests
runghc Setup.lhs build
dist/build/edit-distance-tests/edit-distance-tests
h2. Description
Edit distances algorithms for fuzzy matching. Specifically, this library provides:
* "Levenshtein distance":http://en.wikipedia.org/wiki/Levenshtein_distance
* "Restricted Damerau-Levenshtein distance":http://en.wikipedia.org/wiki/Damerau-Levenshtein_distance
They have been fairly heavily optimized. Indeed, for situations where one of the strings is under 32 characters long I use a rather neat "bit vector" algorithm: see "the authors paper":http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and "the associated errata":[http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for more information. The algorithms __could__ be faster, but they aren't yet slow enough to force me into improving the situation.
h2. Example
Text.EditDistance> levenshteinDistance defaultEditCosts "witch" "kitsch"
2
h2. Linkage
* "Hackage":http://hackage.haskell.org/cgi-bin/hackage-scripts/package/edit-distance
* "Bug Tracker":http://bsp.lighthouseapp.com/projects/14822-hs-edit-distance
* "GitHub":http://github.com/batterseapower/edit-distance edit-distance-0.2.1.1/Setup.lhs 000644 000765 000024 00000000115 12033303026 020000 0 ustar 00mbolingbroke staff 000000 000000 #! /usr/bin/env runhaskell
> import Distribution.Simple
> main = defaultMain edit-distance-0.2.1.1/Text/ 000755 000765 000024 00000000000 12033303026 017117 5 ustar 00mbolingbroke staff 000000 000000 edit-distance-0.2.1.1/Text/EditDistance/ 000755 000765 000024 00000000000 12033303026 021457 5 ustar 00mbolingbroke staff 000000 000000 edit-distance-0.2.1.1/Text/EditDistance.hs 000644 000765 000024 00000005572 12033303026 022024 0 ustar 00mbolingbroke staff 000000 000000 {-# LANGUAGE PatternGuards #-}
-- | Computing the edit distances between strings
module Text.EditDistance (
Costs(..), EditCosts(..), defaultEditCosts,
levenshteinDistance, restrictedDamerauLevenshteinDistance
) where
import Text.EditDistance.EditCosts
import qualified Text.EditDistance.Bits as Bits
import qualified Text.EditDistance.STUArray as STUArray
import qualified Text.EditDistance.SquareSTUArray as SquareSTUArray
-- | Find the Levenshtein edit distance between two strings. That is to say, the number of deletion,
-- insertion and substitution operations that are required to make the two strings equal. Note that
-- this algorithm therefore does not make use of the 'transpositionCost' field of the costs. See also:
-- .
levenshteinDistance :: EditCosts -> String -> String -> Int
levenshteinDistance costs str1 str2
| isDefaultEditCosts costs
, not (betterNotToUseBits str1_len || betterNotToUseBits str2_len) -- The Integer implementation of the Bits algorithm is quite inefficient, but scales better
= Bits.levenshteinDistanceWithLengths str1_len str2_len str1 str2 -- than the STUArrays. The Word32 implementation is always better, if it is applicable
| otherwise
= STUArray.levenshteinDistanceWithLengths costs str1_len str2_len str1 str2 -- STUArray always beat making more allocations with SquareSTUArray for Levenhstein
where
str1_len = length str1
str2_len = length str2
betterNotToUseBits len = len >= 33 && len <= 82 -- Upper bound determined experimentally
-- | Find the "restricted" Damerau-Levenshtein edit distance between two strings. This algorithm calculates the cost of
-- the so-called optimal string alignment, which does not always equal the appropriate edit distance. The cost of the optimal
-- string alignment is the number of edit operations needed to make the input strings equal under the condition that no substring
-- is edited more than once. See also: .
restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int
restrictedDamerauLevenshteinDistance costs str1 str2
| isDefaultEditCosts costs
, not (betterNotToUseBits str1_len || betterNotToUseBits str2_len) -- The Integer implementation of the Bits algorithm is quite inefficient, but scales better
= Bits.restrictedDamerauLevenshteinDistanceWithLengths str1_len str2_len str1 str2 -- than the STUArrays. The Word32 implementation is always better, if it is applicable
| otherwise
= SquareSTUArray.restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 -- SquareSTUArray usually beat making more use of the heap with STUArray for Damerau
where
str1_len = length str1
str2_len = length str2
betterNotToUseBits len = len >= 33 && len <= 45 -- Upper bound determined experimentally edit-distance-0.2.1.1/Text/EditDistance/Benchmark.hs 000644 000765 000024 00000007067 12033303026 023717 0 ustar 00mbolingbroke staff 000000 000000 module Main where
import Text.EditDistance.EditCosts
import qualified Text.EditDistance.Bits as Bits
import qualified Text.EditDistance.STUArray as STUArray
import qualified Text.EditDistance.SquareSTUArray as SquareSTUArray
import System.IO
import System.Exit
--import System.Posix.IO
import System.Time ( ClockTime(..), getClockTime )
import System.Random
import System.Process
import Data.List
import Control.Monad
import Control.Exception
--import Control.Concurrent ( forkIO, threadDelay )
import Control.Parallel.Strategies ( NFData, rnf )
sTRING_SIZE_STEP, mAX_STRING_SIZE :: Int
sTRING_SIZE_STEP = 3
mAX_STRING_SIZE = 108
time :: IO a -> IO Float
time action = do
TOD s1 ps1 <- getClockTime
action
TOD s2 ps2 <- getClockTime
return $ (fromIntegral (s2 - s1) + (fromIntegral (ps2 - ps1) / 10^(12 :: Int)))
augment :: Monad m => (a -> m b) -> [a] -> m [(a, b)]
augment fx xs = liftM (zip xs) $ mapM fx xs
sample :: NFData a => (String -> String -> a) -> (Int, Int) -> IO Float
sample distance bounds@(i, j) = do
-- Generate two random strings of length i and j
gen <- newStdGen
let (string1, string2_long) = splitAt i (randoms gen)
string2 = take j string2_long
-- Force the two strings to be evaluated so they don't meddle
-- with the benchmarking
evaluate (rnf string1)
evaluate (rnf string2)
-- Our sample is the time taken to find the edit distance
putStrLn $ "Sampling " ++ show bounds
time $ loop 1000 $ evaluate (distance string1 string2)
loop :: Monad m => Int -> m a -> m ()
loop n act = sequence_ (replicate n act)
joinOnKey :: Eq a => [(a, b)] -> [(a, c)] -> [(a, (b, c))]
joinOnKey xs ys = [(x_a, (x_b, y_c)) | (x_a, x_b) <- xs, (y_a, y_c) <- ys, x_a == y_a]
gnuPlotScript :: String
gnuPlotScript = "set term postscript eps enhanced color\n\
\set output \"data.ps\"\n\
\#unset key\n\
\set dgrid3d\n\
\set hidden3d\n\
\#set pm3d map\n\
\#splot \"data.plot\" using 1:2:3\n\
\splot \"data.plot\" using 1:2:3 title \"Bits\" with lines, \"data.plot\" using 1:2:4 title \"STUArray\" with lines, \"data.plot\" using 1:2:5 title \"SquareSTUArray\" with lines\n\
\quit\n"
toGnuPlotFormat :: (Show a, Show b, Show c) => [((a, b), [c])] -> String
toGnuPlotFormat samples = unlines (header : map sampleToGnuPlotFormat samples)
where
first_cs = snd $ head samples
header = "#\tX\tY" ++ concat (replicate (length first_cs) "\tZ")
sampleToGnuPlotFormat ((a, b), cs) = concat $ intersperse "\t" $ [show a, show b] ++ map show cs
main :: IO ()
main = do
let sample_range = [(i, j) | i <- [0,sTRING_SIZE_STEP..mAX_STRING_SIZE]
, j <- [0,sTRING_SIZE_STEP..mAX_STRING_SIZE]]
bits_samples <- augment (sample $ Bits.levenshteinDistance) sample_range
sqstu_samples <- augment (sample $ SquareSTUArray.levenshteinDistance defaultEditCosts) sample_range
stu_samples <- augment (sample $ STUArray.levenshteinDistance defaultEditCosts) sample_range
let paired_samples = bits_samples `joinOnKey` (stu_samples `joinOnKey` sqstu_samples)
listified_samples = [((i, j), [a, b, c]) | ((i, j), (a, (b, c))) <- paired_samples]
writeFile "data.plot" (toGnuPlotFormat listified_samples)
writeFile "plot.script" gnuPlotScript
(_inp, _outp, _err, gp_pid) <- runInteractiveCommand "(cat plot.script | gnuplot); RETCODE=$?; rm plot.script; exit $RETCODE"
gp_exit_code <- waitForProcess gp_pid
case gp_exit_code of
ExitSuccess -> putStrLn "Plotted at 'data.ps'"
ExitFailure err_no -> putStrLn $ "Failed! Error code " ++ show err_no edit-distance-0.2.1.1/Text/EditDistance/Bits.hs 000644 000765 000024 00000030460 12033303026 022717 0 ustar 00mbolingbroke staff 000000 000000 {-# LANGUAGE PatternGuards, PatternSignatures, ScopedTypeVariables, BangPatterns #-}
module Text.EditDistance.Bits (
levenshteinDistance, levenshteinDistanceWithLengths, {-levenshteinDistanceCutoff,-} restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths
) where
import Data.Bits
import Data.Char
import Data.Word
import Data.List
import qualified Data.IntMap as IM
--import Debug.Trace
--type BitVector = Integer
-- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro).
-- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation
levenshteinDistance :: String -> String -> Int
levenshteinDistance str1 str2 = levenshteinDistanceWithLengths m n str1 str2
where
m = length str1
n = length str2
levenshteinDistanceWithLengths :: Int -> Int -> String -> String -> Int
levenshteinDistanceWithLengths !m !n str1 str2
| m <= n = if n <= 32 -- n must be larger so this check is sufficient
then levenshteinDistance' (undefined :: Word32) m n str1 str2
else levenshteinDistance' (undefined :: Integer) m n str1 str2
| otherwise = if m <= 32 -- m must be larger so this check is sufficient
then levenshteinDistance' (undefined :: Word32) n m str2 str1
else levenshteinDistance' (undefined :: Integer) n m str2 str1
{-# SPECIALIZE INLINE levenshteinDistance' :: Word32 -> Int -> Int -> String -> String -> Int #-}
{-# SPECIALIZE INLINE levenshteinDistance' :: Integer -> Int -> Int -> String -> String -> Int #-}
levenshteinDistance' :: (Num bv, Bits bv) => bv -> Int -> Int -> String -> String -> Int
levenshteinDistance' (_bv_dummy :: bv) !m !n str1 str2
| [] <- str1 = n
| otherwise = extractAnswer $ foldl' (levenshteinDistanceWorker (matchVectors str1) top_bit_mask vector_mask) (m_ones, 0, m) str2
where m_ones@vector_mask = (2 ^ m) - 1
top_bit_mask = 1 `shiftL` (m - 1) :: bv
extractAnswer (_, _, distance) = distance
{-# SPECIALIZE levenshteinDistanceWorker :: IM.IntMap Word32 -> Word32 -> Word32 -> (Word32, Word32, Int) -> Char -> (Word32, Word32, Int) #-}
{-# SPECIALIZE levenshteinDistanceWorker :: IM.IntMap Integer -> Integer -> Integer -> (Integer, Integer, Int) -> Char -> (Integer, Integer, Int) #-}
levenshteinDistanceWorker :: (Num bv, Bits bv) => IM.IntMap bv -> bv -> bv -> (bv, bv, Int) -> Char -> (bv, bv, Int)
levenshteinDistanceWorker !str1_mvs !top_bit_mask !vector_mask (!vp, !vn, !distance) !char2
= {- trace (unlines ["pm = " ++ show pm'
,"d0 = " ++ show d0'
,"hp = " ++ show hp'
,"hn = " ++ show hn'
,"vp = " ++ show vp'
,"vn = " ++ show vn'
,"distance' = " ++ show distance'
,"distance'' = " ++ show distance'']) -} (vp', vn', distance'')
where
pm' = IM.findWithDefault 0 (ord char2) str1_mvs
d0' = ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn
hp' = vn .|. sizedComplement vector_mask (d0' .|. vp)
hn' = d0' .&. vp
hp'_shift = ((hp' `shiftL` 1) .|. 1) .&. vector_mask
hn'_shift = (hn' `shiftL` 1) .&. vector_mask
vp' = hn'_shift .|. sizedComplement vector_mask (d0' .|. hp'_shift)
vn' = d0' .&. hp'_shift
distance' = if hp' .&. top_bit_mask /= 0 then distance + 1 else distance
distance'' = if hn' .&. top_bit_mask /= 0 then distance' - 1 else distance'
{-
-- Just can't get this working!
-- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro).
-- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation
levenshteinDistanceCutoff :: Int -> String -> String -> Int
levenshteinDistanceCutoff cutoff str1 str2
| length str1 <= length str2 = levenshteinDistanceCutoff' cutoff str1 str2
| otherwise = levenshteinDistanceCutoff' cutoff str2 str1
levenshteinDistanceCutoff' :: Int -> String -> String -> Int
levenshteinDistanceCutoff' cutoff str1 str2
| [] <- str1 = n
| otherwise = extractAnswer $ foldl' (levenshteinDistanceCutoffFlatWorker (matchVectors str1))
(foldl' (levenshteinDistanceCutoffDiagWorker (matchVectors str1)) (top_bit_mask, vector_mask, all_ones, 0, initial_pm_offset, initial_dist) str2_diag)
str2_flat
where m = length str1
n = length str2
vector_length = if testBit bottom_factor 0
then cutoff -- Odd
else cutoff + 1 -- Even
all_ones@vector_mask = (2 ^ vector_length) - 1
top_bit_mask = trace (show bottom_factor ++ ", " ++ show vector_length) $ 1 `shiftL` (vector_length - 1)
extractAnswer (_, _, _, _, _, distance) = distance
len_difference = n - m
top_factor = cutoff + len_difference
bottom_factor = cutoff - len_difference
bottom_factor_shift = (bottom_factor `shiftR` 1)
initial_dist = bottom_factor_shift -- The distance the virtual first vector ended on
initial_pm_offset = (top_factor `shiftR` 1) -- The amount of left shift to apply to the >next< pattern match vector
diag_threshold = negate bottom_factor_shift + m -- The index in str2 where we stop going diagonally down and start going across
(str2_diag, str2_flat) = splitAt diag_threshold str2
levenshteinDistanceCutoffDiagWorker :: IM.IntMap BitVector -> (BitVector, BitVector, BitVector, BitVector, Int, Int) -> Char -> (BitVector, BitVector, BitVector, BitVector, Int, Int)
levenshteinDistanceCutoffDiagWorker !str1_mvs (!top_bit_mask, !vector_mask, !vp, !vn, !pm_offset, !distance) !char2
= trace (unlines ["vp = " ++ show vp
,"vn = " ++ show vn
,"vector_mask = " ++ show vector_mask
,"pm_offset = " ++ show pm_offset
,"unshifted_pm = " ++ show unshifted_pm
,"pm' = " ++ show pm'
,"d0' = " ++ show d0'
,"hp' = " ++ show hp'
,"hn' = " ++ show hn'
,"vp' = " ++ show vp'
,"vn' = " ++ show vn'
,"distance' = " ++ show distance']) (top_bit_mask, vector_mask, vp', vn', pm_offset - 1, distance')
where
unshifted_pm = IM.findWithDefault 0 (ord char2) str1_mvs
pm' = (unshifted_pm `shift` pm_offset) .&. vector_mask
d0' = ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn
hp' = vn .|. sizedComplement vector_mask (d0' .|. vp)
hn' = d0' .&. vp
d0'_shift = d0' `shiftR` 1
vp' = hn' .|. sizedComplement vector_mask (d0'_shift .|. hp')
vn' = d0'_shift .&. hp'
distance' = if d0' .&. top_bit_mask /= 0 then distance else distance + 1
levenshteinDistanceCutoffFlatWorker :: IM.IntMap BitVector -> (BitVector, BitVector, BitVector, BitVector, Int, Int) -> Char -> (BitVector, BitVector, BitVector, BitVector, Int, Int)
levenshteinDistanceCutoffFlatWorker !str1_mvs (!top_bit_mask, !vector_mask, !vp, !vn, !pm_offset, !distance) !char2
= trace (unlines ["pm_offset = " ++ show pm_offset
,"top_bit_mask' = " ++ show top_bit_mask'
,"vector_mask' = " ++ show vector_mask'
,"pm = " ++ show pm'
,"d0 = " ++ show d0'
,"hp = " ++ show hp'
,"hn = " ++ show hn'
,"vp = " ++ show vp'
,"vn = " ++ show vn'
,"distance' = " ++ show distance'
,"distance'' = " ++ show distance'']) (top_bit_mask', vector_mask', vp', vn', pm_offset - 1, distance'')
where
top_bit_mask' = top_bit_mask `shiftR` 1
vector_mask' = vector_mask `shiftR` 1
pm' = (IM.findWithDefault 0 (ord char2) str1_mvs `rotate` pm_offset) .&. vector_mask'
d0' = ((((pm' .&. vp) + vp) `xor` vp) .|. pm' .|. vn) .&. vector_mask'
hp' = vn .|. sizedComplement vector_mask' (d0' .|. vp)
hn' = d0' .&. vp
d0'_shift = d0' `shiftR` 1
vp' = hn' .|. sizedComplement vector_mask' (d0'_shift .|. hp')
vn' = d0'_shift .&. hp'
distance' = if hp' .&. top_bit_mask' /= 0 then distance + 1 else distance
distance'' = if hn' .&. top_bit_mask' /= 0 then distance' - 1 else distance'
-}
-- Based on the algorithm presented in "A Bit-Vector Algorithm for Computing Levenshtein and Damerau Edit Distances" in PSC'02 (Heikki Hyyro).
-- See http://www.cs.uta.fi/~helmu/pubs/psc02.pdf and http://www.cs.uta.fi/~helmu/pubs/PSCerr.html for an explanation
restrictedDamerauLevenshteinDistance :: String -> String -> Int
restrictedDamerauLevenshteinDistance str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths m n str1 str2
where
m = length str1
n = length str2
restrictedDamerauLevenshteinDistanceWithLengths :: Int -> Int -> String -> String -> Int
restrictedDamerauLevenshteinDistanceWithLengths !m !n str1 str2
| m <= n = if n <= 32 -- n must be larger so this check is sufficient
then restrictedDamerauLevenshteinDistance' (undefined :: Word32) m n str1 str2
else restrictedDamerauLevenshteinDistance' (undefined :: Integer) m n str1 str2
| otherwise = if m <= 32 -- m must be larger so this check is sufficient
then restrictedDamerauLevenshteinDistance' (undefined :: Word32) n m str2 str1
else restrictedDamerauLevenshteinDistance' (undefined :: Integer) n m str2 str1
{-# SPECIALIZE INLINE restrictedDamerauLevenshteinDistance' :: Word32 -> Int -> Int -> String -> String -> Int #-}
{-# SPECIALIZE INLINE restrictedDamerauLevenshteinDistance' :: Integer -> Int -> Int -> String -> String -> Int #-}
restrictedDamerauLevenshteinDistance' :: (Num bv, Bits bv) => bv -> Int -> Int -> String -> String -> Int
restrictedDamerauLevenshteinDistance' (_bv_dummy :: bv) !m !n str1 str2
| [] <- str1 = n
| otherwise = extractAnswer $ foldl' (restrictedDamerauLevenshteinDistanceWorker (matchVectors str1) top_bit_mask vector_mask) (0, 0, m_ones, 0, m) str2
where m_ones@vector_mask = (2 ^ m) - 1
top_bit_mask = 1 `shiftL` (m - 1) :: bv
extractAnswer (_, _, _, _, distance) = distance
{-# SPECIALIZE restrictedDamerauLevenshteinDistanceWorker :: IM.IntMap Word32 -> Word32 -> Word32 -> (Word32, Word32, Word32, Word32, Int) -> Char -> (Word32, Word32, Word32, Word32, Int) #-}
{-# SPECIALIZE restrictedDamerauLevenshteinDistanceWorker :: IM.IntMap Integer -> Integer -> Integer -> (Integer, Integer, Integer, Integer, Int) -> Char -> (Integer, Integer, Integer, Integer, Int) #-}
restrictedDamerauLevenshteinDistanceWorker :: (Num bv, Bits bv) => IM.IntMap bv -> bv -> bv -> (bv, bv, bv, bv, Int) -> Char -> (bv, bv, bv, bv, Int)
restrictedDamerauLevenshteinDistanceWorker !str1_mvs !top_bit_mask !vector_mask (!pm, !d0, !vp, !vn, !distance) !char2
= (pm', d0', vp', vn', distance'')
where
pm' = IM.findWithDefault 0 (ord char2) str1_mvs
d0' = ((((sizedComplement vector_mask d0) .&. pm') `shiftL` 1) .&. pm) -- No need to mask the shiftL because of the restricted range of pm
.|. ((((pm' .&. vp) + vp) .&. vector_mask) `xor` vp) .|. pm' .|. vn
hp' = vn .|. sizedComplement vector_mask (d0' .|. vp)
hn' = d0' .&. vp
hp'_shift = ((hp' `shiftL` 1) .|. 1) .&. vector_mask
hn'_shift = (hn' `shiftL` 1) .&. vector_mask
vp' = hn'_shift .|. sizedComplement vector_mask (d0' .|. hp'_shift)
vn' = d0' .&. hp'_shift
distance' = if hp' .&. top_bit_mask /= 0 then distance + 1 else distance
distance'' = if hn' .&. top_bit_mask /= 0 then distance' - 1 else distance'
{-# SPECIALIZE INLINE sizedComplement :: Word32 -> Word32 -> Word32 #-}
{-# SPECIALIZE INLINE sizedComplement :: Integer -> Integer -> Integer #-}
sizedComplement :: (Num bv, Bits bv) => bv -> bv -> bv
sizedComplement vector_mask vect = vector_mask `xor` vect
{-# SPECIALIZE matchVectors :: String -> IM.IntMap Word32 #-}
{-# SPECIALIZE matchVectors :: String -> IM.IntMap Integer #-}
matchVectors :: (Num bv, Bits bv) => String -> IM.IntMap bv
matchVectors = snd . foldl' go (0 :: Int, IM.empty)
where
go (!ix, !im) char = let ix' = ix + 1
im' = IM.insertWith (.|.) (ord char) (2 ^ ix) im
in (ix', im') edit-distance-0.2.1.1/Text/EditDistance/EditCosts.hs 000644 000765 000024 00000004000 12033303026 023706 0 ustar 00mbolingbroke staff 000000 000000 {-# OPTIONS_GHC -funbox-strict-fields #-}
module Text.EditDistance.EditCosts (
Costs(..),
EditCosts(..), deletionCost, insertionCost, substitutionCost, transpositionCost,
defaultEditCosts, isDefaultEditCosts
) where
data Costs a = ConstantCost !Int
| VariableCost (a -> Int)
{-# INLINE cost #-}
cost :: Costs a -> a -> Int
cost (ConstantCost i) _ = i
cost (VariableCost f) x = f x
data EditCosts = EditCosts {
deletionCosts :: Costs Char, -- ^ Cost of deleting the specified character from the left string
insertionCosts :: Costs Char, -- ^ Cost of inserting the specified characters into the right string
substitutionCosts :: Costs (Char, Char), -- ^ Cost of substituting a character from the left string with one from the right string -- with arguments in that order.
transpositionCosts :: Costs (Char, Char) -- ^ Cost of moving one character backwards and the other forwards -- with arguments in that order.
}
{-# INLINE deletionCost #-}
deletionCost :: EditCosts -> Char -> Int
deletionCost ec deleted = cost (deletionCosts ec) deleted
{-# INLINE insertionCost #-}
insertionCost :: EditCosts -> Char -> Int
insertionCost ec inserted = cost (insertionCosts ec) inserted
{-# INLINE substitutionCost #-}
substitutionCost :: EditCosts -> Char -> Char -> Int
substitutionCost ec old new = cost (substitutionCosts ec) (old, new)
{-# INLINE transpositionCost #-}
transpositionCost :: EditCosts -> Char -> Char -> Int
transpositionCost ec backwards forwards = cost (transpositionCosts ec) (backwards, forwards)
defaultEditCosts :: EditCosts
defaultEditCosts = EditCosts {
deletionCosts = ConstantCost 1,
insertionCosts = ConstantCost 1,
substitutionCosts = ConstantCost 1,
transpositionCosts = ConstantCost 1
}
isDefaultEditCosts :: EditCosts -> Bool
isDefaultEditCosts (EditCosts { deletionCosts = ConstantCost 1, insertionCosts = ConstantCost 1, substitutionCosts = ConstantCost 1, transpositionCosts = ConstantCost 1 }) = True
isDefaultEditCosts _ = Falseedit-distance-0.2.1.1/Text/EditDistance/MonadUtilities.hs 000644 000765 000024 00000000421 12033303026 024742 0 ustar 00mbolingbroke staff 000000 000000 {-# LANGUAGE BangPatterns #-}
module Text.EditDistance.MonadUtilities where
{-# INLINE loopM_ #-}
loopM_ :: Monad m => Int -> Int -> (Int -> m ()) -> m ()
loopM_ !from !to action
| from > to = return ()
| otherwise = do
action from
loopM_ (from + 1) to action edit-distance-0.2.1.1/Text/EditDistance/SquareSTUArray.hs 000644 000765 000024 00000015557 12033303026 024663 0 ustar 00mbolingbroke staff 000000 000000 {-# LANGUAGE PatternGuards, ScopedTypeVariables, BangPatterns #-}
module Text.EditDistance.SquareSTUArray (
levenshteinDistance, levenshteinDistanceWithLengths, restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths
) where
import Text.EditDistance.EditCosts
import Text.EditDistance.MonadUtilities
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
levenshteinDistance :: EditCosts -> String -> String -> Int
levenshteinDistance !costs str1 str2 = levenshteinDistanceWithLengths costs str1_len str2_len str1 str2
where
str1_len = length str1
str2_len = length str2
levenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int
levenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (levenshteinDistanceST costs str1_len str2_len str1 str2)
levenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int
levenshteinDistanceST !costs !str1_len !str2_len str1 str2 = do
-- Create string arrays
str1_array <- stringToArray str1 str1_len
str2_array <- stringToArray str2 str2_len
-- Create array of costs. Say we index it by (i, j) where i is the column index and j the row index.
-- Rows correspond to characters of str2 and columns to characters of str1.
cost_array <- newArray_ ((0, 0), (str1_len, str2_len)) :: ST s (STUArray s (Int, Int) Int)
-- Fill out the first row (j = 0)
_ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (!i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_array (i, 0) deletion_cost' >> return deletion_cost'
-- Fill the remaining rows (j >= 1)
_ <- (\f -> foldM f 0 [1..str2_len]) $ \insertion_cost (!j) -> do
row_char <- readArray str2_array j
-- Initialize the first element of the row (i = 0)
let insertion_cost' = insertion_cost + insertionCost costs row_char
writeArray cost_array (0, j) insertion_cost'
-- Fill the remaining elements of the row (i >= 1)
loopM_ 1 str1_len $ \(!i) -> do
col_char <- readArray str1_array i
cost <- standardCosts costs cost_array row_char col_char (i, j)
writeArray cost_array (i, j) cost
return insertion_cost'
-- Return an actual answer
readArray cost_array (str1_len, str2_len)
restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int
restrictedDamerauLevenshteinDistance costs str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2
where
str1_len = length str1
str2_len = length str2
restrictedDamerauLevenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int
restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2 = runST (restrictedDamerauLevenshteinDistanceST costs str1_len str2_len str1 str2)
restrictedDamerauLevenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int
restrictedDamerauLevenshteinDistanceST !costs str1_len str2_len str1 str2 = do
-- Create string arrays
str1_array <- stringToArray str1 str1_len
str2_array <- stringToArray str2 str2_len
-- Create array of costs. Say we index it by (i, j) where i is the column index and j the row index.
-- Rows correspond to characters of str2 and columns to characters of str1.
cost_array <- newArray_ ((0, 0), (str1_len, str2_len)) :: ST s (STUArray s (Int, Int) Int)
-- Fill out the first row (j = 0)
_ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (!i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_array (i, 0) deletion_cost' >> return deletion_cost'
-- Fill out the second row (j = 1)
when (str2_len > 0) $ do
initial_row_char <- readArray str2_array 1
-- Initialize the first element of the second row (i = 0)
writeArray cost_array (0, 1) (insertionCost costs initial_row_char)
-- Initialize the remaining elements of the row (i >= 1)
loopM_ 1 str1_len $ \(!i) -> do
col_char <- readArray str1_array i
cost <- standardCosts costs cost_array initial_row_char col_char (i, 1)
writeArray cost_array (i, 1) cost
-- Fill the remaining rows (j >= 2)
loopM_ 2 str2_len (\(!j) -> do
row_char <- readArray str2_array j
prev_row_char <- readArray str2_array (j - 1)
-- Initialize the first element of the row (i = 0)
writeArray cost_array (0, j) (insertionCost costs row_char * j)
-- Initialize the second element of the row (i = 1)
when (str1_len > 0) $ do
col_char <- readArray str1_array 1
cost <- standardCosts costs cost_array row_char col_char (1, j)
writeArray cost_array (1, j) cost
-- Fill the remaining elements of the row (i >= 2)
loopM_ 2 str1_len (\(!i) -> do
col_char <- readArray str1_array i
prev_col_char <- readArray str1_array (i - 1)
standard_cost <- standardCosts costs cost_array row_char col_char (i, j)
cost <- if prev_row_char == col_char && prev_col_char == row_char
then do transpose_cost <- fmap (+ (transpositionCost costs col_char row_char)) $ readArray cost_array (i - 2, j - 2)
return (standard_cost `min` transpose_cost)
else return standard_cost
writeArray cost_array (i, j) cost))
-- Return an actual answer
readArray cost_array (str1_len, str2_len)
{-# INLINE standardCosts #-}
standardCosts :: EditCosts -> STUArray s (Int, Int) Int -> Char -> Char -> (Int, Int) -> ST s Int
standardCosts !costs !cost_array !row_char !col_char (!i, !j) = do
deletion_cost <- fmap (+ (deletionCost costs col_char)) $ readArray cost_array (i - 1, j)
insertion_cost <- fmap (+ (insertionCost costs row_char)) $ readArray cost_array (i, j - 1)
subst_cost <- fmap (+ if row_char == col_char
then 0
else (substitutionCost costs col_char row_char))
(readArray cost_array (i - 1, j - 1))
return $ deletion_cost `min` insertion_cost `min` subst_cost
{-# INLINE stringToArray #-}
stringToArray :: String -> Int -> ST s (STUArray s Int Char)
stringToArray str !str_len = do
array <- newArray_ (1, str_len)
forM_ (zip [1..] str) (uncurry (writeArray array))
return array
{-
showArray :: STUArray s (Int, Int) Int -> ST s String
showArray array = do
((il, jl), (iu, ju)) <- getBounds array
flip (flip foldM "") [(i, j) | i <- [il..iu], j <- [jl.. ju]] $ \rest (i, j) -> do
elt <- readArray array (i, j)
return $ rest ++ show (i, j) ++ ": " ++ show elt ++ ", "
-}
edit-distance-0.2.1.1/Text/EditDistance/STUArray.hs 000644 000765 000024 00000021365 12033303026 023474 0 ustar 00mbolingbroke staff 000000 000000 {-# LANGUAGE PatternGuards, ScopedTypeVariables, BangPatterns #-}
module Text.EditDistance.STUArray (
levenshteinDistance, levenshteinDistanceWithLengths, restrictedDamerauLevenshteinDistance, restrictedDamerauLevenshteinDistanceWithLengths
) where
import Text.EditDistance.EditCosts
import Text.EditDistance.MonadUtilities
import Control.Monad
import Control.Monad.ST
import Data.Array.ST
levenshteinDistance :: EditCosts -> String -> String -> Int
levenshteinDistance !costs str1 str2 = levenshteinDistanceWithLengths costs str1_len str2_len str1 str2
where
str1_len = length str1
str2_len = length str2
levenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int
levenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (levenshteinDistanceST costs str1_len str2_len str1 str2)
levenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int
levenshteinDistanceST !costs !str1_len !str2_len str1 str2 = do
-- Create string arrays
str1_array <- stringToArray str1 str1_len
str2_array <- stringToArray str2 str2_len
-- Create array of costs for a single row. Say we index costs by (i, j) where i is the column index and j the row index.
-- Rows correspond to characters of str2 and columns to characters of str1. We can get away with just storing a single
-- row of costs at a time, but we use two because it turns out to be faster
cost_row <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int)
cost_row' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int)
-- Fill out the first row (j = 0)
_ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_row i deletion_cost' >> return deletion_cost'
-- Fill out the remaining rows (j >= 1)
(_, final_row, _) <- foldM (levenshteinDistanceSTRowWorker costs str1_len str1_array str2_array) (0, cost_row, cost_row') [1..str2_len]
-- Return an actual answer
readArray final_row str1_len
levenshteinDistanceSTRowWorker :: EditCosts -> Int -> STUArray s Int Char -> STUArray s Int Char -> (Int, STUArray s Int Int, STUArray s Int Int) -> Int -> ST s (Int, STUArray s Int Int, STUArray s Int Int)
levenshteinDistanceSTRowWorker !costs !str1_len !str1_array !str2_array (!insertion_cost, !cost_row, !cost_row') !j = do
row_char <- readArray str2_array j
-- Initialize the first element of the row (i = 0)
let insertion_cost' = insertion_cost + insertionCost costs row_char
writeArray cost_row' 0 insertion_cost'
-- Fill the remaining elements of the row (i >= 1)
loopM_ 1 str1_len (colWorker row_char)
return (insertion_cost', cost_row', cost_row)
where
colWorker row_char !i = do
col_char <- readArray str1_array i
left_up <- readArray cost_row (i - 1)
left <- readArray cost_row' (i - 1)
here_up <- readArray cost_row i
let here = standardCosts costs row_char col_char left left_up here_up
writeArray cost_row' i here
restrictedDamerauLevenshteinDistance :: EditCosts -> String -> String -> Int
restrictedDamerauLevenshteinDistance !costs str1 str2 = restrictedDamerauLevenshteinDistanceWithLengths costs str1_len str2_len str1 str2
where
str1_len = length str1
str2_len = length str2
restrictedDamerauLevenshteinDistanceWithLengths :: EditCosts -> Int -> Int -> String -> String -> Int
restrictedDamerauLevenshteinDistanceWithLengths !costs !str1_len !str2_len str1 str2 = runST (restrictedDamerauLevenshteinDistanceST costs str1_len str2_len str1 str2)
restrictedDamerauLevenshteinDistanceST :: EditCosts -> Int -> Int -> String -> String -> ST s Int
restrictedDamerauLevenshteinDistanceST !costs str1_len str2_len str1 str2 = do
-- Create string arrays
str1_array <- stringToArray str1 str1_len
str2_array <- stringToArray str2 str2_len
-- Create array of costs for a single row. Say we index costs by (i, j) where i is the column index and j the row index.
-- Rows correspond to characters of str2 and columns to characters of str1. We can get away with just storing two
-- rows of costs at a time, but I use three because it turns out to be faster
cost_row <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int)
-- Fill out the first row (j = 0)
_ <- (\f -> foldM f 0 ([1..] `zip` str1)) $ \deletion_cost (!i, col_char) -> let deletion_cost' = deletion_cost + deletionCost costs col_char in writeArray cost_row i deletion_cost' >> return deletion_cost'
if (str2_len == 0)
then readArray cost_row str1_len
else do
-- We defer allocation of these arrays to here because they aren't used in the other branch
cost_row' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int)
cost_row'' <- newArray_ (0, str1_len) :: ST s (STUArray s Int Int)
-- Fill out the second row (j = 1)
row_char <- readArray str2_array 1
-- Initialize the first element of the row (i = 0)
let zero = insertionCost costs row_char
writeArray cost_row' 0 zero
-- Fill the remaining elements of the row (i >= 1)
loopM_ 1 str1_len (firstRowColWorker str1_array row_char cost_row cost_row')
-- Fill out the remaining rows (j >= 2)
(_, _, final_row, _, _) <- foldM (restrictedDamerauLevenshteinDistanceSTRowWorker costs str1_len str1_array str2_array) (zero, cost_row, cost_row', cost_row'', row_char) [2..str2_len]
-- Return an actual answer
readArray final_row str1_len
where
firstRowColWorker !str1_array !row_char !cost_row !cost_row' !i = do
col_char <- readArray str1_array i
left_up <- readArray cost_row (i - 1)
left <- readArray cost_row' (i - 1)
here_up <- readArray cost_row i
let here = standardCosts costs row_char col_char left left_up here_up
writeArray cost_row' i here
restrictedDamerauLevenshteinDistanceSTRowWorker :: EditCosts -> Int
-> STUArray s Int Char -> STUArray s Int Char -- String arrays
-> (Int, STUArray s Int Int, STUArray s Int Int, STUArray s Int Int, Char) -> Int -- Incoming rows of the matrix in recency order
-> ST s (Int, STUArray s Int Int, STUArray s Int Int, STUArray s Int Int, Char) -- Outgoing rows of the matrix in recency order
restrictedDamerauLevenshteinDistanceSTRowWorker !costs !str1_len !str1_array !str2_array (!insertion_cost, !cost_row, !cost_row', !cost_row'', !prev_row_char) !j = do
row_char <- readArray str2_array j
-- Initialize the first element of the row (i = 0)
zero_up <- readArray cost_row' 0
let insertion_cost' = insertion_cost + insertionCost costs row_char
writeArray cost_row'' 0 insertion_cost'
-- Initialize the second element of the row (i = 1)
when (str1_len > 0) $ do
col_char <- readArray str1_array 1
one_up <- readArray cost_row' 1
let one = standardCosts costs row_char col_char insertion_cost' zero_up one_up
writeArray cost_row'' 1 one
-- Fill the remaining elements of the row (i >= 2)
loopM_ 2 str1_len (colWorker row_char)
return (insertion_cost', cost_row', cost_row'', cost_row, row_char)
where
colWorker !row_char !i = do
prev_col_char <- readArray str1_array (i - 1)
col_char <- readArray str1_array i
left_left_up_up <- readArray cost_row (i - 2)
left_up <- readArray cost_row' (i - 1)
left <- readArray cost_row'' (i - 1)
here_up <- readArray cost_row' i
let here_standard_only = standardCosts costs row_char col_char left left_up here_up
here = if prev_row_char == col_char && prev_col_char == row_char
then here_standard_only `min` (left_left_up_up + transpositionCost costs col_char row_char)
else here_standard_only
writeArray cost_row'' i here
{-# INLINE standardCosts #-}
standardCosts :: EditCosts -> Char -> Char -> Int -> Int -> Int -> Int
standardCosts !costs !row_char !col_char !cost_left !cost_left_up !cost_up = deletion_cost `min` insertion_cost `min` subst_cost
where
deletion_cost = cost_left + deletionCost costs col_char
insertion_cost = cost_up + insertionCost costs row_char
subst_cost = cost_left_up + if row_char == col_char then 0 else substitutionCost costs col_char row_char
{-# INLINE stringToArray #-}
stringToArray :: String -> Int -> ST s (STUArray s Int Char)
stringToArray str !str_length = do
array <- newArray_ (1, str_length)
forM_ (zip [1..] str) (uncurry (writeArray array))
return array edit-distance-0.2.1.1/Text/EditDistance/Tests.hs 000644 000765 000024 00000000336 12033303026 023117 0 ustar 00mbolingbroke staff 000000 000000 module Main where
import Text.EditDistance.Tests.Properties
import Test.Framework
import Data.Monoid
main :: IO ()
main = defaultMain $ map (plusTestOptions $ mempty { topt_maximum_generated_tests = Just 1000 }) tests