pax_global_header00006660000000000000000000000064146147532410014521gustar00rootroot0000000000000052 comment=397ad064212aea5fb6bcb59579f2e1e86d9f5367 pyxDamerauLevenshtein-1.8.0/000077500000000000000000000000001461475324100160535ustar00rootroot00000000000000pyxDamerauLevenshtein-1.8.0/.gitignore000066400000000000000000000014231461475324100200430ustar00rootroot00000000000000# Byte-compiled / optimized / DLL files __pycache__/ *.py[cod] # C extensions *.so # Distribution / packaging .Python env/ build/ develop-eggs/ dist/ downloads/ eggs/ .eggs/ lib/ lib64/ parts/ sdist/ var/ *.egg-info/ .installed.cfg *.egg # PyInstaller # Usually these files are written by a python script from a template # before PyInstaller builds the exe, so as to inject date/other infos into it. *.manifest *.spec # Installer logs pip-log.txt pip-delete-this-directory.txt # Unit test / coverage reports htmlcov/ .tox/ .coverage .coverage.* .cache nosetests.xml coverage.xml *,cover # Translations *.mo *.pot # Django stuff: *.log # Sphinx documentation docs/_build/ # PyBuilder target/ # PyCharm .idea/ # MacOS files .DS_Store # cython output pyxdameraulevenshtein/*.cpyxDamerauLevenshtein-1.8.0/.travis.yml000066400000000000000000000003331461475324100201630ustar00rootroot00000000000000dist: focal language: python python: - 3.8 - 3.9 - 3.10 - 3.11 - 3.12 - 3.13 before_install: - pip install --upgrade pip setuptools wheel install: - pip install . 
script: python tests/test_pyxdl.py pyxDamerauLevenshtein-1.8.0/AUTHORS.md000066400000000000000000000013321461475324100175210ustar00rootroot00000000000000# Authors ## Maintainer Geoffrey Fairchild * [https://www.gfairchild.com/](https://www.gfairchild.com/) * [https://github.com/gfairchild](https://github.com/gfairchild) * [https://www.linkedin.com/in/gfairchild/](https://www.linkedin.com/in/gfairchild/) ## Contributors Sorted by date of first contribution: * [mittagessen](https://github.com/mittagessen) * [Anirudha Bose](https://github.com/onyb) * [Markus Konrad](https://github.com/internaut) * [Simone Basso](https://github.com/simobasso) * [Andrew Lensen](https://github.com/AndLen) * [Sergiusz Bleja](https://github.com/svenski) * [Max Bachmann](https://github.com/maxbachmann) * [Seth Sims](https://github.com/xzy3) * [Thomas A Caswell](https://github.com/tacaswell) pyxDamerauLevenshtein-1.8.0/CHANGES.md000066400000000000000000000124561461475324100174550ustar00rootroot00000000000000# Changes ## 1.8.0 (2024-05-02) * Add Cython to the build process to reduce the likelihood of incompatibilities between the Cython-generated C code and CPython (#38). (courtesy @tacaswell) * Drop Python 3.7 support. * Add Python 3.11-3.13 support. ## 1.7.1 (2022-08-01) * Drop Python 3.6 support (EOL). * Add Python 3.10 support. * Compiled with Cython 0.29.32. * Updating project URL (moved from [@gfairchild](https://github.com/gfairchild)'s personal namespace to [@lanl](https://github.com/lanl)'s namespace). ## 1.7.0 (2021-02-09) * Remove NumPy dependency to simplify build process. Rather than relying on `np.ndarray`, we'll now use native iterables like `list` or `tuple`. 
* **This is a breaking change if you currently rely on either of the `*_ndarray` methods.** * `damerau_levenshtein_distance_ndarray` refactored to `damerau_levenshtein_distance_seqs`, and the return value is now a `list` rather than `np.array` * `normalized_damerau_levenshtein_distance_ndarray` refactored to `normalized_damerau_levenshtein_distance_seqs`, and the return value is now a `list` rather than `np.array` * The simplest way to migrate to these new methods is to switch to using a native Python `list`. For example: * `damerau_levenshtein_distance_ndarray('test', np.array(['test1', 't1', 'test']))` is now `damerau_levenshtein_distance_seqs('test', ['test1', 't1', 'test'])` * `normalized_damerau_levenshtein_distance_ndarray('test', np.array(['test1', 't1', 'test']))` is now `normalized_damerau_levenshtein_distance_seqs('test', ['test1', 't1', 'test'])` * If you need the return value to be an `np.array`, then you can simply wrap the return value (a `list`) with `np.array` like so: * `np.array(damerau_levenshtein_distance_seqs('test', ['test1', 't1', 'test']))` * Compiled with Cython 0.29.21. ## 1.6.2 (2021-02-08) * Remove Python 2 and 3.5 support (they are EOL). * Bump minimum NumPy version to 1.19.5. * Add Python 3.9 support in `setup.py`. * Compiled with Cython 0.29.21. ## 1.6.1 (2020-07-27) * Fixed bug when first string is longer than the second string (#22). (courtesy @svenski) * Compiled with Cython 0.29.21. * Dropping Python 3.4 support from Travis. ## 1.6.0 (2020-05-01) * Allow `np.ndarrays` as input. * Add support for Python 3.8 to `setup.py`. * Compiled with Cython 0.29.17. ## 1.5.3 (2019-02-25) * Specifying minimum version numbers in `pyproject.toml` and `setup.py`. * Compiled with Cython 0.29.5. ## 1.5.2 (2019-01-07) * Using the `pyproject.toml` standard set forth in [PEP 518](https://www.python.org/dev/peps/pep-0518/), NumPy will now be correctly installed as a dependency prior to running `setup.py`. 
## 1.5.1 (2019-01-04) * Fixing NumPy-related install error. (courtesy @simobasso) * Enabling Python 3.7 unit tests in Travis. * Compiled with Cython 0.29.2. ## 1.5.0 (2018-02-04) * Allow tuples and lists as input. (courtesy @internaut) * Dropped support of EOL Python versions (2.6, 3.2, and 3.3). (courtesy @internaut) * Fixed a possible division-by-zero exception. (courtesy @internaut) * Fixed a formatting error in an exception message. (courtesy @internaut) * Compiled with Cython 0.27.3. ## 1.4.1 (2016-07-18) * Clarified that this implementation is of the [optimal string alignment distance algorithm](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance) (see [this issue](https://github.com/gfairchild/pyxDamerauLevenshtein/issues/6) for more information). * Renamed `damerau_levenshtein_distance_withNPArray` to `damerau_levenshtein_distance_ndarray` and `normalized_damerau_levenshtein_distance_withNPArray` to `normalized_damerau_levenshtein_distance_ndarray`. * Cleaned up `np.ndarray` type and dimension checks. * Simplified NumPy functions using [`np.vectorize`](https://docs.scipy.org/doc/numpy/reference/generated/numpy.vectorize.html). * Hardened unicode conversion using [Cython's recommendations](http://docs.cython.org/src/tutorial/strings.html#accepting-strings-from-python-code). * Compiled with Cython 0.24.1. ## 1.3.2 (2015-05-19) * [@mittagessen](https://github.com/mittagessen) fixed a bug in `setup.py` that assumed NumPy was installed in [this PR](https://github.com/gfairchild/pyxDamerauLevenshtein/pull/5). ## 1.3.1 (2015-04-07) * [@ovarene](https://github.com/ovarene) added the ability to compute the edit distance between a string and each string in a [NumPy](http://www.numpy.org/) array in [this PR](https://github.com/gfairchild/pyxDamerauLevenshtein/pull/3). * Compiled with Cython 0.22. ## 1.2.0 (2014-05-06) * Changed `xrange` to `range` in pyx code. * Compiled with Cython 0.20.1. 
## 1.1.0 (2013-10-04) * Moving to setuptools (using [ez_setup.py](https://bitbucket.org/pypa/setuptools/downloads/ez_setup.py) to manage it). ## 1.0.2 (2013-09-23) * Performance improvement for short-circuit. * Changed `unsigned int` to `Py_ssize_t` (for 64-bit compatibility). * Improved readability (defined offset indices for `storage`). ## 1.0.1 (2013-09-23) * Fixed Python 3 unicode issue (thanks to Stefan Behnel - https://groups.google.com/d/msg/cython-users/ofT3fo48ohs/rrf3dtbHkm4J). * Fixed a possible memory leak (thanks to Stefan Behnel). * Examples are now Python 3-compatible. ## 1.0.0 (2013-07-03) * Initial release. pyxDamerauLevenshtein-1.8.0/LICENSE000066400000000000000000000027441461475324100170670ustar00rootroot00000000000000Copyright (c) 2013, Triad National Security, LLC All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Triad National Security, LLC nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. pyxDamerauLevenshtein-1.8.0/MANIFEST.in000066400000000000000000000001541461475324100176110ustar00rootroot00000000000000include pyproject.toml include pyxdameraulevenshtein/py.typed recursive-include pyxdameraulevenshtein *.pyi pyxDamerauLevenshtein-1.8.0/README.md000066400000000000000000000131411461475324100173320ustar00rootroot00000000000000# pyxDamerauLevenshtein [![Build Status](https://app.travis-ci.com/lanl/pyxDamerauLevenshtein.svg?branch=master)](https://app.travis-ci.com/lanl/pyxDamerauLevenshtein) ## LICENSE This software is licensed under the [BSD 3-Clause License](http://opensource.org/licenses/BSD-3-Clause). Please refer to the separate [LICENSE](LICENSE) file for the exact text of the license. You are obligated to give attribution if you use this code. ## ABOUT pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) edit distance algorithm for Python in Cython for high performance. Courtesy [Wikipedia](http://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance): > In information theory and computer science, the Damerau-Levenshtein distance (named after Frederick J. Damerau and Vladimir I. Levenshtein) is a "distance" (string metric) between two strings, i.e., finite sequence of symbols, given by counting the minimum number of operations needed to transform one string into the other, where an operation is defined as an insertion, deletion, or substitution of a single character, or a transposition of two adjacent characters. 
This implementation is based on [Michael Homer's pure Python implementation](https://web.archive.org/web/20150909134357/http://mwh.geek.nz:80/2009/04/26/python-damerau-levenshtein-distance/), which implements the [optimal string alignment distance algorithm](https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance). It runs in `O(N*M)` time using `O(M)` space. It supports unicode characters. ## REQUIREMENTS This code requires Python 3.8+, a C compiler such as GCC, and Cython. ## INSTALL pyxDamerauLevenshtein is available on PyPI at https://pypi.org/project/pyxDamerauLevenshtein/. Install using [pip](https://pypi.org/project/pip/): pip install pyxDamerauLevenshtein Install from source: pip install . ## USING THIS CODE The following methods are available: * **Edit distance** (`damerau_levenshtein_distance`) * Compute the raw distance between two strings (i.e., the minimum number of operations necessary to transform one string into the other). * The distance between lists and tuples can also be computed. * **Normalized edit distance** (`normalized_damerau_levenshtein_distance`) * Compute the ratio of the edit distance to the length of the longer sequence, i.e., `max(len(string1), len(string2))`. 0.0 means that the sequences are identical, while 1.0 means that they have nothing in common. Note that this definition is the exact opposite of [`difflib.SequenceMatcher.ratio()`](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.ratio). * **Edit distance against a sequence of sequences** (`damerau_levenshtein_distance_seqs`) * Compute the raw distances between a sequence and each sequence within another sequence (e.g., `list`, `tuple`). * **Normalized edit distance against a sequence of sequences** (`normalized_damerau_levenshtein_distance_seqs`) * Compute the normalized distances between a sequence and each sequence within another sequence (e.g., `list`, `tuple`). 
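For intuition, the optimal string alignment (OSA) recurrence behind these functions can be sketched in pure Python. This is an illustrative reference only, not the Cython code shipped in this package: the names `osa_distance` and `normalized_osa_distance` are hypothetical and exist only in this sketch. It keeps three rows of the dynamic-programming table at a time, which is where the `O(M)` space bound comes from.

```python
def osa_distance(seq1, seq2):
    """Optimal string alignment distance (illustrative sketch, hypothetical name)."""
    two_ago = None                        # row i - 2, needed for transpositions
    one_ago = list(range(len(seq2) + 1))  # row i - 1; starts as distances from an empty prefix
    for i in range(1, len(seq1) + 1):
        this_row = [i] + [0] * len(seq2)
        for j in range(1, len(seq2) + 1):
            cost = seq1[i - 1] != seq2[j - 1]
            this_row[j] = min(
                one_ago[j] + 1,         # deletion
                this_row[j - 1] + 1,    # insertion
                one_ago[j - 1] + cost,  # substitution (or match when cost is 0)
            )
            # transposition of two adjacent elements
            if (i > 1 and j > 1
                    and seq1[i - 1] == seq2[j - 2]
                    and seq1[i - 2] == seq2[j - 1]):
                this_row[j] = min(this_row[j], two_ago[j - 2] + 1)
        two_ago, one_ago = one_ago, this_row
    return one_ago[len(seq2)]


def normalized_osa_distance(seq1, seq2):
    """Distance divided by the longer length; 0.0 for two empty inputs."""
    n = max(len(seq1), len(seq2))
    return osa_distance(seq1, seq2) / max(n, 1)


print(osa_distance('smtih', 'smith'))  # 1
print(osa_distance('ca', 'abc'))       # 3
```

The second call shows where OSA differs from the unrestricted Damerau-Levenshtein distance: OSA never edits a substring more than once, so `'ca'` to `'abc'` costs 3 rather than 2.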
Basic use: ```python from pyxdameraulevenshtein import damerau_levenshtein_distance, normalized_damerau_levenshtein_distance damerau_levenshtein_distance('smtih', 'smith') # expected result: 1 normalized_damerau_levenshtein_distance('smtih', 'smith') # expected result: 0.2 damerau_levenshtein_distance([1, 2, 3, 4, 5, 6], [7, 8, 9, 7, 10, 11, 4]) # expected result: 7 from pyxdameraulevenshtein import damerau_levenshtein_distance_seqs, normalized_damerau_levenshtein_distance_seqs array = ['test1', 'test12', 'test123'] damerau_levenshtein_distance_seqs('test', array) # expected result: [1, 2, 3] normalized_damerau_levenshtein_distance_seqs('test', array) # expected result: [0.2, 0.33333334, 0.42857143] ``` ## DIFFERENCES Other Python DL implementations: * [Michael Homer's native Python code](https://web.archive.org/web/20150909134357/http://mwh.geek.nz:80/2009/04/26/python-damerau-levenshtein-distance/) * [jellyfish](https://github.com/sunlightlabs/jellyfish) pyxDamerauLevenshtein differs from other Python implementations in that it is both fast via Cython *and* supports unicode. Michael Homer's implementation is fast for Python, but it is *two orders of magnitude* slower than this Cython implementation. jellyfish provides C implementations for a variety of string comparison metrics and is sometimes faster than pyxDamerauLevenshtein. Python's built-in [`difflib.SequenceMatcher.ratio()`](https://docs.python.org/3/library/difflib.html#difflib.SequenceMatcher.ratio) performs about an order of magnitude faster than Michael Homer's implementation but is still one order of magnitude slower than this DL implementation. difflib, however, uses a different algorithm (difflib uses the [Ratcliff/Obershelp algorithm](http://www.drdobbs.com/database/pattern-matching-the-gestalt-approach/184407970)). Performance differences (on Intel i7-2600 running at 3.4Ghz): >>> import timeit >>> #this implementation: ... 
timeit.timeit("damerau_levenshtein_distance('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from pyxdameraulevenshtein import damerau_levenshtein_distance', number=500000) 7.417556047439575 >>> #Michael Homer's native Python implementation: ... timeit.timeit("dameraulevenshtein('e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7')", 'from dameraulevenshtein import dameraulevenshtein', number=500000) 667.0276439189911 >>> #difflib ... timeit.timeit("difflib.SequenceMatcher(None, 'e0zdvfb840174ut74j2v7gabx1 5bs', 'qpk5vei 4tzo0bglx8rl7e 2h4uei7').ratio()", 'import difflib', number=500000) 135.41051697731018 pyxDamerauLevenshtein-1.8.0/pyproject.toml000066400000000000000000000001141461475324100207630ustar00rootroot00000000000000[build-system] requires = ["setuptools>=40.8.0", "wheel>=0.33.1", "cython"] pyxDamerauLevenshtein-1.8.0/pyxdameraulevenshtein/000077500000000000000000000000001461475324100224775ustar00rootroot00000000000000pyxDamerauLevenshtein-1.8.0/pyxdameraulevenshtein/__init__.py000066400000000000000000000000601461475324100246040ustar00rootroot00000000000000from pyxdameraulevenshtein._initialize import * pyxDamerauLevenshtein-1.8.0/pyxdameraulevenshtein/__init__.pyi000066400000000000000000000007061461475324100247640ustar00rootroot00000000000000from typing import Sequence, Any, List def damerau_levenshtein_distance(seq1: Sequence[Any], seq2: Sequence[Any]) -> int: ... def normalized_damerau_levenshtein_distance(seq1: Sequence[Any], seq2: Sequence[Any]) -> float: ... def damerau_levenshtein_distance_seqs(seq: Sequence[Any], seqs: Sequence[Sequence[Any]]) -> List[int]: ... def normalized_damerau_levenshtein_distance_seqs(seq: Sequence[Any], seqs: Sequence[Sequence[Any]]) -> List[float]: ... 
pyxDamerauLevenshtein-1.8.0/pyxdameraulevenshtein/_initialize.pyx000066400000000000000000000176621461475324100255550ustar00rootroot00000000000000# cython: language_level=3 """ Copyright (c) 2013, Triad National Security, LLC All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Triad National Security, LLC nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. """ from libc.stdlib cimport calloc, free # these guys are used to index into storage inside damerau_levenshtein_distance() cdef Py_ssize_t TWO_AGO = 0 cdef Py_ssize_t ONE_AGO = 1 cdef Py_ssize_t THIS_ROW = 2 cpdef unsigned long damerau_levenshtein_distance(seq1, seq2): """ Return the edit distance. 
This implementation is based on Michael Homer's implementation (https://web.archive.org/web/20150909134357/http://mwh.geek.nz:80/2009/04/26/python-damerau-levenshtein-distance/) and runs in O(N*M) time using O(M) space. This code implements the "optimal string alignment distance" algorithm, as described in Wikipedia here: https://en.wikipedia.org/wiki/Damerau%E2%80%93Levenshtein_distance#Optimal_string_alignment_distance Note that `seq1` and `seq2` can be any sequence type. This not only includes `str` but also includes `list`, `tuple`, `range`, and more. Examples: >>> damerau_levenshtein_distance('smtih', 'smith') 1 >>> damerau_levenshtein_distance('saturday', 'sunday') 3 >>> damerau_levenshtein_distance('orange', 'pumpkin') 7 >>> damerau_levenshtein_distance([1, 2, 3, 4, 5, 6], [7, 8, 9, 7, 10, 11, 4]) 7 """ # possible short-circuit if sequences have a lot in common at the beginning (or are identical) cdef Py_ssize_t first_differing_index = 0 while first_differing_index < len(seq1) and \ first_differing_index < len(seq2) and \ seq1[first_differing_index] == seq2[first_differing_index]: first_differing_index += 1 seq1 = seq1[first_differing_index:] seq2 = seq2[first_differing_index:] if seq1 is None: return len(seq2) if seq2 is None: return len(seq1) # Fix bug where the second sequence is one shorter than the first (#22). 
if len(seq2) < len(seq1): seq1, seq2 = seq2, seq1 # Py_ssize_t should be used wherever we're dealing with an array index or length cdef Py_ssize_t i, j cdef Py_ssize_t offset = len(seq2) + 1 cdef unsigned long delete_cost, add_cost, subtract_cost, edit_distance # storage is a 3 x (len(seq2) + 1) array that stores TWO_AGO, ONE_AGO, and THIS_ROW cdef unsigned long * storage = <unsigned long *>calloc(3 * offset, sizeof(unsigned long)) if not storage: raise MemoryError() try: # initialize THIS_ROW for i in range(1, offset): storage[THIS_ROW * offset + (i - 1)] = i for i in range(len(seq1)): # swap/initialize vectors for j in range(offset): storage[TWO_AGO * offset + j] = storage[ONE_AGO * offset + j] storage[ONE_AGO * offset + j] = storage[THIS_ROW * offset + j] for j in range(len(seq2)): storage[THIS_ROW * offset + j] = 0 storage[THIS_ROW * offset + len(seq2)] = i + 1 # now compute costs for j in range(len(seq2)): delete_cost = storage[ONE_AGO * offset + j] + 1 add_cost = storage[THIS_ROW * offset + (j - 1 if j > 0 else len(seq2))] + 1 subtract_cost = storage[ONE_AGO * offset + (j - 1 if j > 0 else len(seq2))] + (seq1[i] != seq2[j]) storage[THIS_ROW * offset + j] = min(delete_cost, add_cost, subtract_cost) # deal with transpositions if i > 0 and j > 0 and seq1[i] == seq2[j - 1] and seq1[i - 1] == seq2[j] and seq1[i] != seq2[j]: storage[THIS_ROW * offset + j] = min(storage[THIS_ROW * offset + j], storage[TWO_AGO * offset + (j - 2 if j > 1 else len(seq2))] + 1) # compute and return the final edit distance return storage[THIS_ROW * offset + (len(seq2) - 1)] finally: # free dynamically-allocated memory free(storage) cpdef float normalized_damerau_levenshtein_distance(seq1, seq2): """ Return a real number between 0.0 and 1.0, indicating the edit distance as a fraction of the length of the longer sequence. 0.0 means that the sequences are identical, while 1.0 means they have nothing in common. Note that this definition is the exact opposite of `difflib.SequenceMatcher.ratio()`. 
`difflib` outputs 1.0 for identical sequences and 0.0 for unlike sequences. Examples: >>> normalized_damerau_levenshtein_distance('smtih', 'smith') 0.2 >>> normalized_damerau_levenshtein_distance('saturday', 'sunday') 0.375 >>> normalized_damerau_levenshtein_distance('orange', 'pumpkin') 1.0 >>> normalized_damerau_levenshtein_distance([1, 2, 3, 4, 5, 6], [7, 8, 9, 7, 10, 11, 4]) 1.0 """ # prevent division by zero for empty inputs n = max(len(seq1), len(seq2)) return float(damerau_levenshtein_distance(seq1, seq2)) / max(n, 1) cpdef list damerau_levenshtein_distance_seqs(seq, seqs): """ For each sequence in `seqs`, compute the DL distance between it and `seq`. A list of distances will be returned, one for each element in `seqs`. Because this code generates a list of distances, where each element's position corresponds to the index of the element we encounter as we iterate through `seqs`, `seqs` must be ordered. That is, do not use a data structure like a `set` because order is not guaranteed. Example: >>> damerau_levenshtein_distance_seqs('Sjöstedt', ['Sjöstedt', 'Sjostedt', 'Söstedt', 'Sjöedt']) [0, 1, 1, 2] """ return [damerau_levenshtein_distance(seq, x) for x in seqs] cpdef list normalized_damerau_levenshtein_distance_seqs(seq, seqs): """ For each sequence in `seqs`, compute the normalized DL distance between it and `seq`. A list of normalized distances will be returned, one for each element in `seqs`. Because this code generates a list of normalized distances, where each element's position corresponds to the index of the element we encounter as we iterate through `seqs`, `seqs` must be ordered. That is, do not use a data structure like a `set` because order is not guaranteed. 
Example: >>> normalized_damerau_levenshtein_distance_seqs('Sjöstedt', ['Sjöstedt', 'Sjostedt', 'Söstedt', 'Sjöedt']) [0.0, 0.125, 0.125, 0.25] """ return [normalized_damerau_levenshtein_distance(seq, x) for x in seqs] pyxDamerauLevenshtein-1.8.0/pyxdameraulevenshtein/py.typed000066400000000000000000000000001461475324100241640ustar00rootroot00000000000000pyxDamerauLevenshtein-1.8.0/requirements.txt000066400000000000000000000000071461475324100213340ustar00rootroot00000000000000Cython pyxDamerauLevenshtein-1.8.0/setup.py000066400000000000000000000115201461475324100175640ustar00rootroot00000000000000""" Copyright (c) 2013, Triad National Security, LLC All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Triad National Security, LLC nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. """ from setuptools import setup, Extension from Cython.Build import cythonize metadata = dict( name='pyxDamerauLevenshtein', version='1.8.0', description='pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) edit ' 'distance algorithm for Python in Cython for high performance.', long_description='pyxDamerauLevenshtein implements the Damerau-Levenshtein (DL) ' 'edit distance algorithm for Python in Cython for high performance. ' 'Courtesy `Wikipedia `_: ' 'In information theory and computer science, the ' 'Damerau-Levenshtein distance (named after Frederick J. Damerau and ' 'Vladimir I. Levenshtein) is a "distance" (string metric) between ' 'two strings, i.e., finite sequence of symbols, given by counting ' 'the minimum number of operations needed to transform one string ' 'into the other, where an operation is defined as an insertion, ' 'deletion, or substitution of a single character, or a ' 'transposition of two adjacent characters. This implementation is ' 'based on `Michael Homer\'s pure Python implementation ' '`_, ' 'which implements the `optimal string alignment distance algorithm ' '`_. ' 'It runs in ``O(N*M)`` time using ``O(M)`` space. It supports ' 'unicode characters. 
For more information on pyxDamerauLevenshtein, ' 'visit the `GitHub project page `_.', author='Geoffrey Fairchild', author_email='mail@gfairchild.com', maintainer='Geoffrey Fairchild', maintainer_email='mail@gfairchild.com', url='https://github.com/lanl/pyxDamerauLevenshtein', license='BSD 3-Clause License', classifiers=[ 'Development Status :: 5 - Production/Stable', 'Intended Audience :: Developers', 'Intended Audience :: Education', 'Intended Audience :: Science/Research', 'License :: OSI Approved :: BSD License', 'Operating System :: OS Independent', 'Programming Language :: Cython', 'Programming Language :: Python :: 3', 'Programming Language :: Python :: 3.8', 'Programming Language :: Python :: 3.9', 'Programming Language :: Python :: 3.10', 'Programming Language :: Python :: 3.11', 'Programming Language :: Python :: 3.12', 'Programming Language :: Python :: 3.13', 'Topic :: Scientific/Engineering :: Bio-Informatics', 'Topic :: Scientific/Engineering :: Information Analysis', 'Topic :: Text Processing :: Linguistic', ], packages=["pyxdameraulevenshtein"], package_data={ "pyxdameraulevenshtein": ["*.pyi", "py.typed", "*.pyx"] } ) setup( ext_modules=cythonize("pyxdameraulevenshtein/_initialize.pyx"), **metadata ) pyxDamerauLevenshtein-1.8.0/tests/000077500000000000000000000000001461475324100172155ustar00rootroot00000000000000pyxDamerauLevenshtein-1.8.0/tests/test_pyxdl.py000066400000000000000000000132331461475324100217700ustar00rootroot00000000000000# -*- coding: utf-8 -*- """ Copyright (c) 2013, Triad National Security, LLC All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. 
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of Triad National Security, LLC nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
""" import unittest import math from pyxdameraulevenshtein import damerau_levenshtein_distance from pyxdameraulevenshtein import damerau_levenshtein_distance_seqs from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance from pyxdameraulevenshtein import normalized_damerau_levenshtein_distance_seqs class TestDamerauLevenshtein(unittest.TestCase): def test_damerau_levenshtein_distance(self): assert damerau_levenshtein_distance('smtih', 'smith') == 1 assert damerau_levenshtein_distance('snapple', 'apple') == 2 assert damerau_levenshtein_distance('testing', 'testtn') == 2 assert damerau_levenshtein_distance('saturday', 'sunday') == 3 assert damerau_levenshtein_distance('Saturday', 'saturday') == 1 assert damerau_levenshtein_distance('orange', 'pumpkin') == 7 assert damerau_levenshtein_distance('gifts', 'profit') == 5 assert damerau_levenshtein_distance('Sjöstedt', 'Sjostedt') == 1 assert damerau_levenshtein_distance('tt', 't') == 1 assert damerau_levenshtein_distance([1, 2, 3], [1, 3, 2]) == 1 assert damerau_levenshtein_distance((1, 2, 3), (1, 3, 2)) == 1 assert damerau_levenshtein_distance((1, 2, 3), [1, 3, 2]) == 1 assert damerau_levenshtein_distance([], []) == 0 assert damerau_levenshtein_distance(range(10), range(1, 11)) == 2 assert damerau_levenshtein_distance([1, 2, 3, 4, 5, 6], [7, 8, 9, 7, 10, 11, 4]) == 7 def test_normalized_damerau_levenshtein_distance(self): assert normalized_damerau_levenshtein_distance('smtih', 'smith') == 0.20000000298023224 assert normalized_damerau_levenshtein_distance('', '') == 0 assert 
normalized_damerau_levenshtein_distance('snapple', 'apple') == 0.2857142984867096 assert normalized_damerau_levenshtein_distance('testing', 'testtn') == 0.2857142984867096 assert normalized_damerau_levenshtein_distance('saturday', 'sunday') == 0.375 assert normalized_damerau_levenshtein_distance('Saturday', 'saturday') == 0.125 assert normalized_damerau_levenshtein_distance('orange', 'pumpkin') == 1.0 assert normalized_damerau_levenshtein_distance('gifts', 'profit') == 0.8333333134651184 assert normalized_damerau_levenshtein_distance('Sjöstedt', 'Sjostedt') == 0.125 assert normalized_damerau_levenshtein_distance('tt', 't') == 0.5 assert math.isclose(normalized_damerau_levenshtein_distance([1, 2, 3], [1, 3, 2]), 1.0 / 3.0, rel_tol=1e-05) assert normalized_damerau_levenshtein_distance([], []) == 0.0 assert math.isclose(normalized_damerau_levenshtein_distance(range(10), range(1, 11)), 0.2, rel_tol=1e-05) assert normalized_damerau_levenshtein_distance([1, 2, 3, 4, 5, 6], [7, 8, 9, 7, 10, 11, 4]) == 1.0 def test_damerau_levenshtein_distance_seqs(self): assert damerau_levenshtein_distance_seqs( 'Saturday', ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'] ) == [3, 5, 5, 6, 4, 5, 0] assert damerau_levenshtein_distance_seqs( 'Sjöstedt', ['Sjöstedt', 'Sjostedt', 'Söstedt', 'Sjöedt'] ) == [0, 1, 1, 2] def test_normalized_damerau_levenshtein_distance_seqs(self): assert normalized_damerau_levenshtein_distance_seqs( 'Saturday', ['Sunday', 'Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday'] ) == [0.375, 0.625, 0.625, 0.6666666865348816, 0.5, 0.625, 0.0] assert normalized_damerau_levenshtein_distance_seqs( 'Sjöstedt', ['Sjöstedt', 'Sjostedt', 'Söstedt', 'Sjöedt'] ) == [0.0, 0.125, 0.125, 0.25] if __name__ == '__main__': unittest.main()