hierarchical-clustering-0.4.6/hierarchical-clustering.cabal

Name:                hierarchical-clustering
Version:             0.4.6
Synopsis:            Fast algorithms for single, average/UPGMA and complete linkage clustering.
License:             BSD3
License-file:        LICENSE
Author:              Felipe Almeida Lessa
Maintainer:          felipe.lessa@gmail.com
Category:            Clustering
Build-type:          Simple
Cabal-version:       >= 1.8
Description:
  This package provides a function to create a dendrogram from a
  list of items and a distance function between them.  Initially a
  singleton cluster is created for each item, and then new, bigger
  clusters are created by merging the two clusters with the least
  distance between them.  The distance between two clusters is
  calculated according to the linkage type.  The dendrogram
  represents not only the clusters but also the order in which they
  were created.
  .
  This package has several implementations with different
  performance characteristics.  There are SLINK and CLINK algorithm
  implementations that are optimal in both space and time.  There
  are also naive implementations using a distance matrix.  The
  @dendrogram@ function from @Data.Clustering.Hierarchical@
  automatically chooses the best implementation we have.
  .
  Changes in version 0.4:
  .
  * Specialize the distance type to Double for efficiency reasons.
  It's uncommon to use distances other than Double.
  .
  * Implement SLINK and CLINK.  These are optimal algorithms in both
  space and time for single and complete linkage, respectively,
  running in /O(n^2)/ time and /O(n)/ space.
  .
  * Reorganized internal implementation.
  .
  * Some performance improvements for the naive implementation.
  .
  * Better test coverage.  Also, performance improvements for the
  test suite, now running in 3 seconds (instead of one minute).
  .
  Changes in version 0.3.1.2 (version 0.3.1.1 was skipped):
  .
  * Added tests for many things.  Use @cabal test@ =).
  .
  Changes in version 0.3.1:
  .
  * Works with containers 0.4 (thanks, Doug Beardsley).
  .
  * Removed some unnecessary internal overheads and added some
  strictness.
  .
  Changes in version 0.3.0.1:
  .
  * Listed changes of unreleased version 0.2.
  .
  Changes in version 0.3:
  .
  * Added function @cutAt@.
  .
  * Fixed complexity in Haddock comments.
  .
  Changes in version 0.2:
  .
  * Added function @elements@.
  .
  * Added separate functions for each linkage type.  This may be
  useful if you want to create a dendrogram and your distance data
  type isn't an instance of @Floating@.

Extra-source-files:
  tests/runtests.hs

Source-repository head
  type:     git
  location: https://github.com/meteficha/hierarchical-clustering

Library
  Hs-source-dirs:  src
  Exposed-modules: Data.Clustering.Hierarchical,
                   Data.Clustering.Hierarchical.Internal.DistanceMatrix,
                   Data.Clustering.Hierarchical.Internal.Optimal,
                   Data.Clustering.Hierarchical.Internal.Types
  Build-depends: base       == 4.*
               , array      >= 0.3 && < 0.6
               , containers >= 0.3 && < 0.6
  GHC-options: -Wall

Test-suite runtests
  Type:           exitcode-stdio-1.0
  Hs-source-dirs: tests
  Main-is:        runtests.hs
  Build-depends: base       == 4.*
               , hspec      >= 2.1
               , HUnit      >= 1.2
               , QuickCheck >= 2.4
               , hierarchical-clustering
  GHC-options: -Wall

hierarchical-clustering-0.4.6/Setup.hs

import Distribution.Simple
main = defaultMain
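As a quick, hedged illustration of the API described in the Description
field above (this example is not part of the original tarball; the point
list and the absDist helper are made up for illustration):

import Data.Clustering.Hierarchical

-- Distance between two points on the real line.
absDist :: Double -> Double -> Distance
absDist x y = abs (x - y)

main :: IO ()
main = do
  let dendro = dendrogram SingleLinkage [1, 2, 10, 11, 30] absDist
  -- Cut away every branch longer than 5 and print the resulting clusters.
  mapM_ (print . elements) (dendro `cutAt` 5)

With these points, single linkage at threshold 5 should leave the clusters
[1.0,2.0], [10.0,11.0] and [30.0], possibly printed in a different order.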
hierarchical-clustering-0.4.6/LICENSE

Copyright (c) 2010, Felipe Almeida Lessa

All rights reserved.

Redistribution and use in source and binary forms, with or without
modification, are permitted provided that the following conditions are met:

    * Redistributions of source code must retain the above copyright
      notice, this list of conditions and the following disclaimer.

    * Redistributions in binary form must reproduce the above copyright
      notice, this list of conditions and the following disclaimer in the
      documentation and/or other materials provided with the distribution.

    * Neither the name of Felipe Almeida Lessa nor the names of other
      contributors may be used to endorse or promote products derived from
      this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS"
AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE
LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
POSSIBILITY OF SUCH DAMAGE.

hierarchical-clustering-0.4.6/tests/runtests.hs

{-# LANGUAGE Rank2Types #-}

-- from base
import qualified Control.Exception as E
import Control.Monad (when, liftM2)
import Data.List (delete, sort, nub)
import Text.Printf (printf)
import Text.Show.Functions ()

-- from hspec
import Test.Hspec (hspec, describe, it, pendingWith, shouldBe, Spec)
import Test.Hspec.QuickCheck (prop)

-- from HUnit
import Test.HUnit (Assertion, assertFailure)

-- from QuickCheck
import Test.QuickCheck (Property, Arbitrary(..), Gen, forAll)

-- from this package
import Data.Clustering.Hierarchical
import qualified Data.Clustering.Hierarchical.Internal.DistanceMatrix as DM
import qualified Data.Clustering.Hierarchical.Internal.Optimal as O

main :: IO ()
main = hspec $ do
  test_cutAt
  test_dendrogram

test_cutAt :: Spec
test_cutAt =
  describe "cutAt" $ do
    let dendro :: Dendrogram Char
        dendro      = Branch 0.8 d_0_8_left d_0_8_right
        d_0_8_left  = Branch 0.5 d_0_5_left d_0_5_right
        d_0_5_left  = Branch 0.2 d_0_2_left d_0_2_right
        d_0_2_left  = Leaf 'A'
        d_0_2_right = Leaf 'B'
        d_0_5_right = Leaf 'C'
        d_0_8_right = Leaf 'D'

    let testFor threshold expected =
          it (printf "works for 'dendro' with threshold %0.1f" threshold) $
            dendro `cutAt` threshold `shouldBe` expected

    testFor 0.9 [dendro]
    testFor 0.8 [dendro]
    testFor 0.7 [d_0_8_left, d_0_8_right]
    testFor 0.5 [d_0_8_left, d_0_8_right]
    testFor 0.4 [d_0_5_left, d_0_5_right, d_0_8_right]
    testFor 0.2 [d_0_5_left, d_0_5_right, d_0_8_right]
    testFor 0.1 [d_0_2_left, d_0_2_right, d_0_5_right, d_0_8_right]
test_dendrogram :: Spec
test_dendrogram = do
  describe "Optimal's singleLinkage" $ do
    basicDendrogramTests O.singleLinkage
    prop "really is single linkage" $
      propCorrectLinkage O.singleLinkage singleLink
  describe "Optimal's completeLinkage" $ do
    basicDendrogramTests O.completeLinkage
    prop "really is complete linkage" $
      propCorrectLinkage O.completeLinkage completeLink
  describe "DistanceMatrix's singleLinkage" $ do
    basicDendrogramTests DM.singleLinkage
    prop "really is single linkage" $
      propCorrectLinkage DM.singleLinkage singleLink
  describe "DistanceMatrix's completeLinkage" $ do
    basicDendrogramTests DM.completeLinkage
    prop "really is complete linkage" $
      propCorrectLinkage DM.completeLinkage completeLink
  describe "DistanceMatrix's upgma" $ do
    basicDendrogramTests DM.upgma
    prop "really is UPGMA" $
      propCorrectLinkage DM.upgma upgma
  describe "DistanceMatrix's fakeAverageLinkage" $ do
    basicDendrogramTests DM.fakeAverageLinkage
  describe "Optimal and DistanceMatrix" $ do
    let test f1 f2 = forAll nonNullLists $ \ps ->
                     f1 ps euclideanDist ==== f2 ps euclideanDist
    prop "agree on singleLinkage" $ test O.singleLinkage DM.singleLinkage
    it "agree on completeLinkage" $
      pendingWith "This doesn't work because CLINK doesn't \
                  \always give the best complete linkage."

basicDendrogramTests :: (forall a. [a] -> (a -> a -> Distance) -> Dendrogram a) -> Spec
basicDendrogramTests f = do
  it "fails for an empty input" $ assertErrors (f [] (\_ _ -> zero))
  it "works for one element" $ Leaf () == f [()] undefined
  prop "always returns the elements we gave" $
    forAll nonNullLists $ \points ->
      elements (f points euclideanDist) `isPermutationOf` points
  prop "works for examples where all elements have the same distance" $
    \fixedDist ->
    forAll nonNullLists $ \xs' ->
      let xs = nub xs'
          okay :: Dendrogram Char -> [Char] -> Maybe [Char]
          okay (Leaf z)       ys | z `elem` ys    = Just (delete z ys)
          okay (Branch d l r) ys | d ~= fixedDist = okay l ys >>= okay r
          okay _ _                                = Nothing
          dist x y | x == y    = error "shouldn't calculate (dist x x)"
                   | otherwise = fixedDist
      in okay (f xs dist) xs == Just []

----------------------------------------------------------------------

type P = (Double, Double)

propCorrectLinkage :: ([P] -> (P -> P -> Distance) -> Dendrogram P)
                   -> (D P -> [P] -> [P] -> Distance)
                   -> Property
propCorrectLinkage f link = forAll nonNullLists $ \xs ->
                            correctLinkage link d (f xs d)
    where d = euclideanDist

type D a = a -> a -> Distance

correctLinkage :: (D a -> [a] -> [a] -> Distance) -> D a -> Dendrogram a -> Bool
correctLinkage link dist = go
    where
      go (Branch d l r) = go l && go r &&
                          link dist (elements l) (elements r) ~= d
      go (Leaf _)       = True

singleLink, completeLink, upgma :: D a -> [a] -> [a] -> Distance
singleLink   dist xs ys = minimum [x `dist` y | x <- xs, y <- ys]
completeLink dist xs ys = maximum [x `dist` y | x <- xs, y <- ys]
upgma        dist xs ys = sum [x `dist` y | x <- xs, y <- ys]
                          / fromIntegral (length xs * length ys)
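-- An added worked example (not part of the original test suite) of the
-- three reference linkage functions above.  Take the points p1 = (0,0),
-- p2 = (3,4) and p3 = (6,8), so that euclideanDist p1 p2 = 5,
-- euclideanDist p1 p3 = 10 and euclideanDist p2 p3 = 5.  Between the
-- clusters [p1] and [p2,p3]:
--
-- > singleLink   euclideanDist [p1] [p2,p3] == 5    -- minimum of 5 and 10
-- > completeLink euclideanDist [p1] [p2,p3] == 10   -- maximum of 5 and 10
-- > upgma        euclideanDist [p1] [p2,p3] == 7.5  -- (5 + 10) / 2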
----------------------------------------------------------------------

nonNullLists :: Arbitrary a => Gen [a]
nonNullLists = liftM2 (:) arbitrary arbitrary

isPermutationOf :: Ord a => [a] -> [a] -> Bool
isPermutationOf xs ys = sort xs == sort ys

euclideanDist :: P -> P -> Double
euclideanDist (x1,y1) (x2,y2) = sqrt $ sq (x1-x2) + sq (y1-y2)
    where sq x = x * x

(~=) :: Double -> Double -> Bool
a ~= b = abs (a - b) < 1e-5

zero :: Double
zero = 0

assertErrors :: a -> Assertion
assertErrors x = do
  b <- E.catch (E.evaluate x >> return True)
               (\(E.ErrorCall _) -> return False {- Ok -})
  when b $ assertFailure "Didn't raise an 'error'."

-- | Compare two dendrograms without being concerned about
-- permutations.
(====) :: Eq a => Dendrogram a -> Dendrogram a -> Bool
Leaf x1         ==== Leaf x2         = x1 == x2
Branch d1 l1 r1 ==== Branch d2 l2 r2 = d1 ~= d2 &&
                                       ((l1 ==== l2 && r1 ==== r2) ||
                                        (l1 ==== r2 && r1 ==== l2))
_               ==== _               = False

hierarchical-clustering-0.4.6/src/Data/Clustering/Hierarchical.hs

module Data.Clustering.Hierarchical
    ( -- * Dendrogram data type
      Dendrogram(..)
    , Distance
    , elements
    , cutAt
      -- * Linkage data type
    , Linkage(..)
      -- * Clustering function
    , dendrogram
    ) where

import Data.Clustering.Hierarchical.Internal.Types (Dendrogram(..), Linkage(..), Distance)
import qualified Data.Clustering.Hierarchical.Internal.DistanceMatrix as DM
import qualified Data.Clustering.Hierarchical.Internal.Optimal as O

-- | List of elements in a dendrogram.
elements :: Dendrogram a -> [a]
elements = go []
    where
      go acc (Leaf x)       = x : acc
      go acc (Branch _ l r) = go (go acc r) l

-- | @dendro \`cutAt\` threshold@ cuts the dendrogram @dendro@ at
-- all branches which have distances strictly greater than
-- @threshold@.
--
-- For example, suppose we have
--
-- @
-- dendro = Branch 0.8
--            (Branch 0.5
--               (Branch 0.2
--                  (Leaf \'A\')
--                  (Leaf \'B\'))
--               (Leaf \'C\'))
--            (Leaf \'D\')
-- @
--
-- Then:
--
-- @
-- dendro \`cutAt\` 0.9 == dendro \`cutAt\` 0.8 == [dendro] -- no changes
-- dendro \`cutAt\` 0.7 == dendro \`cutAt\` 0.5 == [Branch 0.5 (Branch 0.2 (Leaf \'A\') (Leaf \'B\')) (Leaf \'C\'), Leaf \'D\']
-- dendro \`cutAt\` 0.4 == dendro \`cutAt\` 0.2 == [Branch 0.2 (Leaf \'A\') (Leaf \'B\'), Leaf \'C\', Leaf \'D\']
-- dendro \`cutAt\` 0.1 == [Leaf \'A\', Leaf \'B\', Leaf \'C\', Leaf \'D\'] -- no branches at all
-- @
cutAt :: Dendrogram a -> Distance -> [Dendrogram a]
cutAt dendro threshold = go [] dendro
    where
      go acc x@(Leaf _)     = x : acc
      go acc x@(Branch d l r)
        | d <= threshold    = x : acc
        | otherwise         = go (go acc r) l -- cut!
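-- A small added illustration (not in the original source), using the
-- example dendrogram from the 'cutAt' documentation above:
--
-- > elements dendro                     == "ABCD"
-- > map elements (dendro `cutAt` 0.4)   == ["AB", "C", "D"]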
-- | Calculates a complete, rooted dendrogram for a list of items
-- and a linkage type.  The following are the time and space
-- complexities for each linkage:
--
-- ['SingleLinkage'] /O(n^2)/ time and /O(n)/ space, using the
--   SLINK algorithm.  This algorithm is optimal in both space and
--   time and gives the same answer as the naive algorithm using a
--   distance matrix.
--
-- ['CompleteLinkage'] /O(n^3)/ time and /O(n^2)/ space, using the
--   naive algorithm with a distance matrix.  Use 'CLINK' if you
--   need more performance.
--
-- [Complete linkage with 'CLINK'] /O(n^2)/ time and /O(n)/ space,
--   using the CLINK algorithm.  Note that this algorithm doesn't
--   always give the same answer as the naive algorithm using a
--   distance matrix, but it's much faster.
--
-- ['UPGMA'] /O(n^3)/ time and /O(n^2)/ space, using the naive
--   algorithm with a distance matrix.
--
-- ['FakeAverageLinkage'] /O(n^3)/ time and /O(n^2)/ space, using
--   the naive algorithm with a distance matrix.
dendrogram :: Linkage
           -- ^ Linkage type to be used.
           -> [a]
           -- ^ Items to be clustered.
           -> (a -> a -> Distance)
           -- ^ Distance function between items.
           -> Dendrogram a
           -- ^ Complete dendrogram.
dendrogram SingleLinkage      = O.singleLinkage
dendrogram CompleteLinkage    = DM.completeLinkage
dendrogram CLINK              = O.completeLinkage
dendrogram UPGMA              = DM.upgma
dendrogram FakeAverageLinkage = DM.fakeAverageLinkage
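-- An added usage sketch (not in the original source; the numbers and
-- the distance function are made up).  Clustering three points on the
-- real line by absolute difference, single linkage first joins 1 and 2
-- at distance 1 and then joins that cluster with 10 at distance 8, so
-- the result should be equivalent, up to the order of each branch's
-- children, to:
--
-- > dendrogram SingleLinkage [1, 2, 10] (\x y -> abs (x - y))
-- >   == Branch 8.0 (Branch 1.0 (Leaf 1) (Leaf 2)) (Leaf 10)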
hierarchical-clustering-0.4.6/src/Data/Clustering/Hierarchical/Internal/Types.hs

module Data.Clustering.Hierarchical.Internal.Types
    ( Dendrogram(..)
    , Linkage(..)
    , Distance
    ) where

-- from base
import Control.Applicative ((<$>), (<*>))
import Data.Foldable (Foldable(..))
import Data.Monoid (mappend)
import Data.Traversable (Traversable(..))

-- | Data structure for storing hierarchical clusters.  The
-- distance between clusters is stored on the branches.  Distances
-- between leaves are the distances between the elements in those
-- leaves, while distances between branches are defined by the
-- linkage used (see 'Linkage').
data Dendrogram a =
    Leaf a
    -- ^ The leaf contains the item @a@ itself.
  | Branch {-# UNPACK #-} !Distance (Dendrogram a) (Dendrogram a)
    -- ^ Each branch connects two clusters/dendrograms that are
    -- @d@ distance apart.
  deriving (Eq, Ord, Show)

-- | A distance is simply a synonym of 'Double' for efficiency.
type Distance = Double

-- | Does not recalculate the distances!
instance Functor Dendrogram where
    fmap f (Leaf d)         = Leaf (f d)
    fmap f (Branch s c1 c2) = Branch s (fmap f c1) (fmap f c2)

instance Foldable Dendrogram where
    foldMap f (Leaf d)         = f d
    foldMap f (Branch _ c1 c2) = foldMap f c1 `mappend` foldMap f c2

instance Traversable Dendrogram where
    traverse f (Leaf d)         = Leaf <$> f d
    traverse f (Branch s c1 c2) = Branch s <$> traverse f c1 <*> traverse f c2

-- | The linkage type determines how the distance between clusters
-- will be calculated.  These are the linkage types currently
-- available in this library.
data Linkage =
    SingleLinkage
    -- ^ The distance between two clusters @a@ and @b@ is the
    -- /minimum/ distance between an element of @a@ and an element
    -- of @b@.
  | CompleteLinkage
    -- ^ The distance between two clusters @a@ and @b@ is the
    -- /maximum/ distance between an element of @a@ and an element
    -- of @b@.
  | CLINK
    -- ^ The same as 'CompleteLinkage', but using the CLINK
    -- algorithm.  It's much faster, but it doesn't always give the
    -- best complete linkage dendrogram.
  | UPGMA
    -- ^ Unweighted Pair Group Method with Arithmetic mean, also
    -- called \"average linkage\".  The distance between two
    -- clusters @a@ and @b@ is the /arithmetic average/ of the
    -- distances from all elements of @a@ to all elements of @b@.
  | FakeAverageLinkage
    -- ^ This method is usually wrongly called \"average
    -- linkage\".  The distance between cluster @a = a1 U a2@
    -- (that is, cluster @a@ was formed by the linkage of clusters
    -- @a1@ and @a2@) and an old cluster @b@ is
    -- @(d(a1,b) + d(a2,b)) / 2@.  So when clustering two elements
    -- to create a cluster, this method is the same as UPGMA.
    -- However, in general when joining two clusters this method
    -- assigns equal weights to @a1@ and @a2@, while UPGMA assigns
    -- weights proportional to the number of elements in each
    -- cluster.  See, for example, the definition of the real
    -- UPGMA, which gives the equation to calculate the distance
    -- between an old and a new cluster, and the \"average
    -- linkage\" code of the ai4r library, which implements what we
    -- call here @FakeAverageLinkage@ and not UPGMA.
  deriving (Eq, Ord, Show, Enum)
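-- An added worked example (not in the original source) of the
-- difference described in the 'FakeAverageLinkage' documentation
-- above.  Suppose cluster @a1@ has 3 elements, cluster @a2@ has 1
-- element, @d(a1,b) = 2@ and @d(a2,b) = 6@, where both distances are
-- already average (UPGMA) distances.  Then, for @a = a1 U a2@:
--
-- > FakeAverageLinkage: (2 + 6) / 2           == 4
-- > UPGMA:              (3*2 + 1*6) / (3 + 1) == 3
--
-- UPGMA weights each pairwise distance equally, while
-- FakeAverageLinkage gives the single element of @a2@ as much weight
-- as the three elements of @a1@ together.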
hierarchical-clustering-0.4.6/src/Data/Clustering/Hierarchical/Internal/Optimal.hs

{-# LANGUAGE BangPatterns, FlexibleContexts #-}
-- | Implementations that are optimal in space and time.
module Data.Clustering.Hierarchical.Internal.Optimal
    ( singleLinkage
    , completeLinkage
    ) where

-- from base
import Prelude hiding (pi)
import Control.Applicative ((<$>))
import Control.Arrow (first)
import Control.Monad (forM_, liftM3, when)
import Control.Monad.ST (ST, runST)
import Data.Array (Array, listArray, (!))
import Data.Array.ST (STUArray, newArray_, newListArray, readArray,
                      writeArray, getElems, getBounds) -- getAssocs
import Data.List (sortBy)
import Data.Maybe (fromMaybe)

-- from containers
import qualified Data.IntMap as IM

-- from this package
import Data.Clustering.Hierarchical.Internal.Types

mkErr :: String -> a
mkErr = error . ("Data.Clustering.Hierarchical.Internal.Optimal." ++)

type Index = Int

-- | Pointer representation of a dendrogram, used by both SLINK and
-- CLINK: @pi@ and @lambda@ encode the hierarchy, @em@ is
-- per-iteration working storage of distances, and @elm@ holds the
-- original items.
data PointerRepresentation s a =
    PR { pi     :: {-# UNPACK #-} !(STUArray s Index Index)
       , lambda :: {-# UNPACK #-} !(STUArray s Index Distance)
       , em     :: {-# UNPACK #-} !(STUArray s Index Distance)
       , elm    :: {-# UNPACK #-} !(Array Index a)
       }

-- debugPR :: Show a => PointerRepresentation s a -> ST s String
-- debugPR pr = do
--   pis     <- getAssocs (pi pr)
--   lambdas <- getAssocs (lambda pr)
--   ems     <- getAssocs (em pr)
--   return $ unlines [ "pi     = " ++ show pis
--                    , "lambda = " ++ show lambdas
--                    , "em     = " ++ show ems
--                    , "elm    = " ++ show (elm pr)
--                    ]

initPR :: Index -> Array Index a -> ST s (PointerRepresentation s a)
initPR n xs' = ($ xs') <$> liftM3 PR (newArray_ (1, n))
                                     (newArray_ (1, n))
                                     (newArray_ (1, n))

indexDistance :: [a] -> (a -> a -> Distance)
              -> (Index, Array Index a, Index -> Index -> Distance)
indexDistance xs dist = (n, xs', dist')
    where
      !n   = length xs
      !xs' = listArray (1, n) xs
      dist' i j = dist (xs' ! i) (xs' ! j)

infinity :: Distance
infinity = 1 / 0

-- | /O(n^2)/ time and /O(n)/ space.  See 'singleLinkage' in this module.
slink :: [a] -> (a -> a -> Distance) -> ST s (PointerRepresentation s a)
slink xs dist = initPR n xs' >>= go 1
    where
      (n, xs', dist') = indexDistance xs dist

      go !i !pr
        | i == n + 1 = return pr
        | otherwise  = do
            writeArray (pi pr) i i
            writeArray (lambda pr) i infinity
            forM_ [1..i-1] $ \j -> writeArray (em pr) j (dist' j i)
            forM_ [1..i-1] $ \j -> do
              lambda_j <- readArray (lambda pr) j
              em_j     <- readArray (em pr) j
              pi_j     <- readArray (pi pr) j
              em_pi_j  <- readArray (em pr) pi_j
              if lambda_j >= em_j
                then do
                  writeArray (em pr) pi_j (em_pi_j `min` lambda_j)
                  writeArray (lambda pr) j em_j
                  writeArray (pi pr) j i
                else
                  writeArray (em pr) pi_j (em_pi_j `min` em_j)
            forM_ [1..i-1] $ \j -> do
              pi_j        <- readArray (pi pr) j
              lambda_j    <- readArray (lambda pr) j
              lambda_pi_j <- readArray (lambda pr) pi_j
              when (lambda_j >= lambda_pi_j) $
                writeArray (pi pr) j i
            go (i+1) pr
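-- An added note (not in the original source): for just two items at
-- distance @d@, tracing the loop above by hand gives the pointer
-- representation @pi = [2, 2]@ and @lambda = [d, infinity]@: item 1
-- joins item 2 at level @d@, and the last item points to itself at
-- level infinity, as the representation requires.  'buildDendrogram'
-- below turns this into @Branch d (Leaf x1) (Leaf x2)@.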
-- | /O(n^2)/ time and /O(n)/ space.  See 'completeLinkage' in this module.
clink :: [a] -> (a -> a -> Distance) -> ST s (PointerRepresentation s a)
clink xs dist = initPR n xs' >>= go 1
    where
      (n, xs', dist') = indexDistance xs dist

      go !i !pr
        | i == n + 1 = return pr
        | i == 1     = do
            writeArray (pi pr) 1 1
            writeArray (lambda pr) 1 infinity
            go 2 pr
        | otherwise  = do
            -- First part
            writeArray (pi pr) i i
            writeArray (lambda pr) i infinity
            forM_ [1..i-1] $ \j -> writeArray (em pr) j (dist' j i)
            forM_ [1..i-1] $ \j -> do
              lambda_j <- readArray (lambda pr) j
              em_j     <- readArray (em pr) j
              when (lambda_j < em_j) $ do
                pi_j    <- readArray (pi pr) j
                em_pi_j <- readArray (em pr) pi_j
                writeArray (em pr) pi_j (em_pi_j `max` em_j)
                writeArray (em pr) j infinity

            -- Loop a
            a <- readArray (em pr) (i-1) >>= go_a_loop (i-1) pr (i-1)

            -- Loop b
            b <- readArray (pi pr) a
            c <- readArray (lambda pr) a
            writeArray (pi pr) a i
            writeArray (lambda pr) a =<< readArray (em pr) a
            go_b_loop i pr a b c

            -- Final part
            forM_ [1..i-1] $ \j -> do
              pi_j    <- readArray (pi pr) j
              pi_pi_j <- readArray (pi pr) pi_j
              when (pi_pi_j == i) $ do
                lambda_j    <- readArray (lambda pr) j
                lambda_pi_j <- readArray (lambda pr) pi_j
                when (lambda_j >= lambda_pi_j) $
                  writeArray (pi pr) j i

            -- Recurse
            go (i+1) pr

      -- Loop a's core
      go_a_loop 0  _   a _      = return a
      go_a_loop !j !pr !a !em_a = do
        pi_j     <- readArray (pi pr) j
        lambda_j <- readArray (lambda pr) j
        em_pi_j  <- readArray (em pr) pi_j
        if lambda_j >= em_pi_j
          then do
            em_j <- readArray (em pr) j
            if em_j < em_a
              then go_a_loop (j-1) pr j em_j
              else go_a_loop (j-1) pr a em_a
          else do
            writeArray (em pr) j infinity
            go_a_loop (j-1) pr a em_a

      -- Loop b's core
      go_b_loop !i !pr !a !b !c
        | a >= i - 1 = return ()
        | b < i - 1  = do
            pi_b     <- readArray (pi pr) b
            lambda_b <- readArray (lambda pr) b
            writeArray (pi pr) b i
            writeArray (lambda pr) b c
            go_b_loop i pr a pi_b lambda_b
        | otherwise  = do
            writeArray (pi pr) b i
            writeArray (lambda pr) b c
            return ()

-- | /O(n log n)/ time and /O(n)/ space.  Construct a 'Dendrogram'
-- from a 'PointerRepresentation'.
buildDendrogram :: PointerRepresentation s a -> ST s (Dendrogram a)
buildDendrogram pr = do
  (1,n)   <- getBounds (lambda pr)
  lambdas <- getElems (lambda pr)
  pis     <- getElems (pi pr)
  let sorted = sortBy (\(_,l1,_) (_,l2,_) -> l1 `compare` l2) $
               zip3 [1..] lambdas pis
  index <- newListArray (1,n) [1..]
  let go im [] = case IM.toList im of
                   [(_,x)] -> return x
                   _       -> mkErr "buildDendrogram: final never here"
      go im ((i, (j, lambda_j, pi_j)):rest) = do
        left_i  <- readArray index j
        right_i <- readArray index pi_j
        writeArray (index `asTypeOf` pi pr) pi_j (negate i)
        let (left, im')
              | left_i > 0 = (Leaf $ elm pr ! left_i, im)
              | otherwise  = first (fromMaybe e1) $
                             IM.updateLookupWithKey (\_ _ -> Nothing) ix im
              where ix = negate left_i
            (right, im'')
              | right_i > 0 = (Leaf $ elm pr ! right_i, im')
              | otherwise   = first (fromMaybe e2) $
                              IM.updateLookupWithKey (\_ _ -> Nothing) ix im'
              where ix = negate right_i
            im''' = IM.insert i (Branch lambda_j left right) im''
            e1 = mkErr "buildDendrogram: never here 1"
            e2 = mkErr "buildDendrogram: never here 2"
        go im''' rest
  go IM.empty (zip [1..n-1] sorted)

-- | /O(n^2)/ time and /O(n)/ space.  Calculates a complete, rooted
-- dendrogram for a list of items using single linkage with the
-- SLINK algorithm.  This algorithm is optimal in space and time.
--
-- [Reference] R. Sibson (1973).  \"SLINK: an optimally efficient
--   algorithm for the single-link cluster method\".  /The/
--   /Computer Journal/ (British Computer Society) 16 (1): 30-34.
singleLinkage :: [a] -> (a -> a -> Distance) -> Dendrogram a
singleLinkage []  _    = mkErr "singleLinkage: empty input"
singleLinkage [x] _    = Leaf x
singleLinkage xs  dist = runST (slink xs dist >>= buildDendrogram)

-- | /O(n^2)/ time and /O(n)/ space.  Calculates a complete, rooted
-- dendrogram for a list of items using complete linkage with the
-- CLINK algorithm.  This algorithm is optimal in space and time.
--
-- [Reference] D. Defays (1977).  \"An efficient algorithm for a
--   complete link method\".  /The Computer Journal/ (British
--   Computer Society) 20 (4): 364-366.
completeLinkage :: [a] -> (a -> a -> Distance) -> Dendrogram a
completeLinkage []  _    = mkErr "completeLinkage: empty input"
completeLinkage [x] _    = Leaf x
completeLinkage xs  dist = runST (clink xs dist >>= buildDendrogram)
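-- An added, hedged note (not in the original source).  As the test
-- suite points out, 'completeLinkage' above (CLINK) does not always
-- produce the same dendrogram as the naive complete linkage in
-- "Data.Clustering.Hierarchical.Internal.DistanceMatrix".  One way to
-- look for a disagreement in GHCi (the points below are made up):
--
-- > import qualified Data.Clustering.Hierarchical.Internal.Optimal as O
-- > import qualified Data.Clustering.Hierarchical.Internal.DistanceMatrix as DM
-- > let dist (x1,y1) (x2,y2) = sqrt ((x1-x2)^2 + (y1-y2)^2) :: Double
-- > let ps = [(0,0), (1,0), (2.5,0), (4.5,0)]
-- > print (O.completeLinkage ps dist)
-- > print (DM.completeLinkage ps dist)
--
-- The two trees may or may not coincide for a given input; the test
-- suite compares them with a permutation-insensitive equality and
-- marks the complete-linkage comparison as pending.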
hierarchical-clustering-0.4.6/src/Data/Clustering/Hierarchical/Internal/DistanceMatrix.hs

{-# LANGUAGE BangPatterns, FlexibleContexts #-}
module Data.Clustering.Hierarchical.Internal.DistanceMatrix
    ( singleLinkage
    , completeLinkage
    , upgma
    , fakeAverageLinkage
    ) where

-- from base
import Control.Monad (forM_)
import Control.Monad.ST (ST, runST)
import Data.Array (listArray, (!))
import Data.Array.ST (STArray, STUArray, newArray_, newListArray,
                      readArray, writeArray)
import Data.Function (on)
import Data.List (delete, tails, (\\))
import Data.STRef (STRef, newSTRef, readSTRef, writeSTRef)

-- from containers
import qualified Data.IntMap as IM

-- from this package
import Data.Clustering.Hierarchical.Internal.Types

mkErr :: String -> a
mkErr = error . ("Data.Clustering.Hierarchical.Internal.DistanceMatrix." ++)

-- | Internal (to this package) type used to represent a cluster
-- (of possibly just one element).  The @key@ should be less than
-- or equal to all elements of the cluster.
data Cluster = Cluster { key  :: {-# UNPACK #-} !Item -- ^ Element used as key.
                       , size :: {-# UNPACK #-} !Int  -- ^ At least one, the @key@.
                       }
             deriving (Eq, Ord, Show)

-- | An element of a cluster.
type Item = IM.Key

-- | Creates a singleton cluster.
singleton :: Item -> Cluster
singleton k = Cluster {key = k, size = 1}

-- | /O(1)/.  Joins two clusters, returning the 'key' that didn't
-- become 'key' of the new cluster as well.  Clusters do not form a
-- 'Monoid' because we don't have 'mempty'.
merge :: Cluster -> Cluster -> (Cluster, Item)
merge c1 c2 = let (kl, km) = if key c1 < key c2
                             then (key c1, key c2)
                             else (key c2, key c1)
              in (Cluster {key  = kl
                          ,size = size c1 + size c2}
                 ,km)

-- | A distance matrix.
data DistMatrix s = DM { matrix   :: {-# UNPACK #-} !(STUArray s (Item, Item) Distance)
                       , active   :: {-# UNPACK #-} !(STRef s [Item])
                       , clusters :: {-# UNPACK #-} !(STArray s Item Cluster)
                       }

-- | /O(n^2)/.  Creates a list of possible combinations between
-- the given elements.
combinations :: [a] -> [(a,a)]
combinations xs = [(a,b) | (a:as) <- tails xs, b <- as]

-- | /O(n^2)/.  Constructs a new distance matrix from a distance
-- function and a number @n@ of elements.  Elements will be drawn
-- from @[1..n]@.
fromDistance :: (Item -> Item -> Distance) -> Item -> ST s (DistMatrix s)
fromDistance _ n | n < 2 = mkErr "fromDistance: n < 2 is meaningless"
fromDistance dist n = do
  matrix_ <- newArray_ ((1,2), (n-1,n))
  active_ <- newSTRef [1..n]
  forM_ (combinations [1..n]) $ \x -> writeArray matrix_ x (uncurry dist x)
  clusters_ <- newListArray (1,n) (map singleton [1..n])
  return $ DM {matrix   = matrix_
              ,active   = active_
              ,clusters = clusters_}

-- | /O(n^2)/.  Returns the minimum distance of the distance
-- matrix.  The first key given is less than the second key.
findMin :: DistMatrix s -> ST s ((Cluster, Cluster), Distance)
findMin dm = readSTRef (active dm) >>= go1
    where
      matrix_ = matrix dm
      choose b i m' = if m' < snd b then (i, m') else b

      go1 is@(i1:i2:_) = do
        di <- readArray matrix_ (i1, i2) -- initial
        ((b1, b2), d) <- go2 is ((i1, i2), di)
        c1 <- readArray (clusters dm) b1
        c2 <- readArray (clusters dm) b2
        return ((c1, c2), d)
      go1 _ = mkErr "findMin: empty DistMatrix"

      go2 (i1:is@(_:_)) !b = go3 i1 is b >>= go2 is
      go2 _             b  = return b

      go3 i1 (i2:is) !b = readArray matrix_ (i1,i2) >>= go3 i1 is . choose b (i1,i2)
      go3 _  []      b  = return b

-- | Type for functions that calculate distances between clusters.
type ClusterDistance =
       (Cluster, Distance)  -- ^ Cluster B1 and distance from A to B1.
    -> (Cluster, Distance)  -- ^ Cluster B2 and distance from A to B2.
    -> Distance             -- ^ Distance from A to (B1 U B2).

-- Some cluster distances
cdistSingleLinkage :: ClusterDistance
cdistSingleLinkage = \(_, d1) (_, d2) -> d1 `min` d2

cdistCompleteLinkage :: ClusterDistance
cdistCompleteLinkage = \(_, d1) (_, d2) -> d1 `max` d2

cdistUPGMA :: ClusterDistance
cdistUPGMA = \(b1,d1) (b2,d2) ->
             let n1 = fromIntegral (size b1)
                 n2 = fromIntegral (size b2)
             in (n1 * d1 + n2 * d2) / (n1 + n2)

cdistFakeAverageLinkage :: ClusterDistance
cdistFakeAverageLinkage = \(_, d1) (_, d2) -> (d1 + d2) / 2

-- | /O(n)/.  Merges two clusters, returning the new cluster and
-- updating the distance matrix in place.
mergeClusters :: ClusterDistance
              -> DistMatrix s
              -> (Cluster, Cluster)
              -> ST s Cluster
mergeClusters cdist (DM matrix_ active_ clusters_) (b1, b2) = do
  let (bu, kl) = b1 `merge` b2
      b1k = key b1
      b2k = key b2
      km  = key bu
      ix i j | i < j     = (i,j)
             | otherwise = (j,i)

  -- Calculate new distances
  activeV <- readSTRef active_
  forM_ (activeV \\ [b1k, b2k]) $ \k -> do
    -- a <- readArray clusters_ k
    d_a_b1 <- readArray matrix_ $ ix k b1k
    d_a_b2 <- readArray matrix_ $ ix k b2k
    let d = cdist (b1, d_a_b1) (b2, d_a_b2)
    writeArray matrix_ (ix k km) $! d

  -- Save new cluster, invalidate old one
  writeArray clusters_ km bu
  writeArray clusters_ kl $ mkErr "mergeClusters: invalidated"
  writeSTRef active_ $ delete kl activeV

  -- Return new cluster.
  return bu

-- | Worker function to create dendrograms based on a
-- 'ClusterDistance'.
dendrogram' :: ClusterDistance -> [a] -> (a -> a -> Distance) -> Dendrogram a
dendrogram' _ []  _ = mkErr "dendrogram': empty input list"
dendrogram' _ [x] _ = Leaf x
dendrogram' cdist items dist = runST (act ())
    where
      n = length items

      act _noMonomorphismRestrictionPlease = do
        let xs = listArray (1, n) items
            im = IM.fromDistinctAscList $ zip [1..] $ map Leaf items
        fromDistance (dist `on` (xs !)) n >>= go (n-1) im

      go !i !ds !dm = do
        ((c1,c2), distance) <- findMin dm
        cu <- mergeClusters cdist dm (c1,c2)
        let dendro c = IM.updateLookupWithKey (\_ _ -> Nothing) (key c)
            (Just d1, !ds')  = dendro c1 ds
            (Just d2, !ds'') = dendro c2 ds'
            du = Branch distance d1 d2
        case i of
          1 -> return du
          _ -> let !ds''' = IM.insert (key cu) du ds''
               in du `seq` go (i-1) ds''' dm
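-- An added note (not in the original source): for exactly two items
-- the worker above performs a single 'findMin' / 'mergeClusters'
-- round, so for any 'ClusterDistance' @cdist@ we should get
--
-- > dendrogram' cdist [x, y] dist  ==  Branch (dist x y) (Leaf x) (Leaf y)
--
-- which is why the four public functions below only start to differ
-- once three or more items are involved.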
-- | /O(n^3)/ time and /O(n^2)/ space.  Calculates a complete,
-- rooted dendrogram for a list of items using single linkage with
-- the naïve algorithm using a distance matrix.
singleLinkage :: [a] -> (a -> a -> Distance) -> Dendrogram a
singleLinkage = dendrogram' cdistSingleLinkage

-- | /O(n^3)/ time and /O(n^2)/ space.  Calculates a complete,
-- rooted dendrogram for a list of items using complete linkage
-- with the naïve algorithm using a distance matrix.
completeLinkage :: [a] -> (a -> a -> Distance) -> Dendrogram a
completeLinkage = dendrogram' cdistCompleteLinkage

-- | /O(n^3)/ time and /O(n^2)/ space.  Calculates a complete,
-- rooted dendrogram for a list of items using UPGMA with the
-- naïve algorithm using a distance matrix.
upgma :: [a] -> (a -> a -> Distance) -> Dendrogram a
upgma = dendrogram' cdistUPGMA

-- | /O(n^3)/ time and /O(n^2)/ space.  Calculates a complete,
-- rooted dendrogram for a list of items using fake average
-- linkage with the naïve algorithm using a distance matrix.
fakeAverageLinkage :: [a] -> (a -> a -> Distance) -> Dendrogram a
fakeAverageLinkage = dendrogram' cdistFakeAverageLinkage