pax_global_header00006660000000000000000000000064127670770650014533gustar00rootroot0000000000000052 comment=be1a8b08c8716e59b89982557da9ea68cdf868c5 pg-similarity/000077500000000000000000000000001276707706500136715ustar00rootroot00000000000000pg-similarity/.gitignore000066400000000000000000000001021276707706500156520ustar00rootroot00000000000000*.so *.so.0 *.so.0.0 *.o pg_similarity.sql *.diffs *.out results/ pg-similarity/COPYRIGHT000066400000000000000000000030231276707706500151620ustar00rootroot00000000000000PostgreSQL Similarity Functions Copyright (c) 2008-2012 Euler Taveira de Oliveira All rights reserved. Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: * Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. * Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. * Neither the name of the Euler Taveira de Oliveira nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. 
IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. pg-similarity/Makefile000066400000000000000000000013331276707706500153310ustar00rootroot00000000000000# pg_similarity extension EXTENSION = pg_similarity MODULE_big = pg_similarity OBJS = tokenizer.o similarity.o similarity_gin.o \ block.o cosine.o dice.o euclidean.o hamming.o jaccard.o \ jaro.o levenshtein.o matching.o mongeelkan.o needlemanwunsch.o \ overlap.o qgram.o smithwaterman.o smithwatermangotoh.o soundex.o DATA_built = pg_similarity.sql DATA = pg_similarity--1.0.sql pg_similarity--unpackaged--1.0.sql REGRESS = test1 test2 test3 test4 #DOCS = README.md ifdef USE_PGXS PG_CONFIG = pg_config PGXS := $(shell $(PG_CONFIG) --pgxs) include $(PGXS) else subdir = contrib/pg_similarity top_builddir = ../.. include $(top_builddir)/src/Makefile.global include $(top_srcdir)/contrib/contrib-global.mk endif pg-similarity/README.md000066400000000000000000000452251276707706500151600ustar00rootroot00000000000000[![Coverity Scan Build Status](https://scan.coverity.com/projects/4830/badge.svg)](https://scan.coverity.com/projects/pg_similarity) Introduction ============ **pg\_similarity** is an extension to support similarity queries on [PostgreSQL](http://www.postgresql.org/). The implementation is tightly integrated in the RDBMS in the sense that it defines operators so instead of the traditional operators (= and <>) you can use ~~~ and ~!~ (any of these operators represents a similarity function). 
**pg\_similarity** has three main components:

- **Functions**: a set of functions that implement similarity algorithms available in the literature. These functions can be used as UDFs and are the basis for the similarity operators;
- **Operators**: a set of operators defined on top of the similarity functions. They use the similarity functions to obtain a similarity value and compare it to a user-defined threshold to decide whether it is a match;
- **Session Variables**: a set of variables that store similarity function parameters. These variables can be set at run time.

Installation
============

**pg\_similarity** is distributed as a source package and can be downloaded at [PGFoundry](http://pgfoundry.org/projects/pgsimilarity/). This extension is supported on [the same platforms](http://www.postgresql.org/docs/current/static/supported-platforms.html) as PostgreSQL. The installation steps depend on your operating system.

You can also keep up with the latest fixes and features by cloning the Git repository.

```
$ git clone https://github.com/eulerto/pg_similarity.git
```

UNIX based Operating Systems
----------------------------

Before you can use the extension, you should build it and load it into the desired database.

The new way (9.1 or later):

```
$ tar -zxf pg_similarity-0.0.19.tgz
$ cd pg_similarity-0.0.19
$ $EDITOR Makefile # edit PG_CONFIG if necessary
$ USE_PGXS=1 make
$ USE_PGXS=1 make install
$ psql mydb
psql (9.3.5)
Type "help" for help.
mydb=# CREATE EXTENSION pg_similarity;
CREATE EXTENSION
```

And the old way:

```
$ tar -zxf pg_similarity-0.0.19.tgz
$ cd pg_similarity-0.0.19
$ $EDITOR Makefile # edit PG_CONFIG if necessary
$ USE_PGXS=1 make
$ USE_PGXS=1 make install
$ psql -f SHAREDIR/contrib/pg_similarity.sql mydb # SHAREDIR is pg_config --sharedir
```

The typical usage is to copy the sample file from the tarball (*pg_similarity.conf.sample*) to PGDATA (as *pg_similarity.conf*) and include the following line in *postgresql.conf*:

```
include 'pg_similarity.conf'
```

Windows
-------

Sorry, never tried^H^H^H^H^H Actually I tried that but it is not as easy as on UNIX. :(

There are two ways to build PostgreSQL on Windows: (i) MinGW and (ii) MSVC. The former is supported but not widely used; the latter is popular because the officially distributed Windows binaries are built using MSVC. If you choose MinGW, just follow the UNIX instructions above to build pg_similarity. Otherwise, the MSVC steps are below:

- Download and untar the *same* PostgreSQL version you are using;
- Download and untar pg_similarity under the PostgreSQL contrib directory;
- Edit contrib/Makefile and add *pg_similarity* to the SUBDIRS variable;
- Follow [Installation from Source Code on Windows](http://www.postgresql.org/docs/current/static/install-windows.html) for building but do not install it;
- Instead of executing install (if you already have Windows binaries installed), just copy pg_similarity.dll to LIBDIR (get it by executing `pg_config --libdir`) and pg_similarity.control and `*--1.0.sql` to SHAREDIR/extension (get it by executing `pg_config --sharedir`).
- That is it! Do not forget to follow the instructions above to load the library and CREATE EXTENSION.

Functions and Operators
=======================

This extension supports a set of similarity algorithms. The best-known algorithms are covered by this extension. Be aware that each algorithm is suited to a specific domain. The following algorithms are provided.

- L1 Distance (also known as City Block or Manhattan Distance);
- Cosine Distance;
- Dice Coefficient;
- Euclidean Distance;
- Hamming Distance;
- Jaccard Coefficient;
- Jaro Distance;
- Jaro-Winkler Distance;
- Levenshtein Distance;
- Matching Coefficient;
- Monge-Elkan Coefficient;
- Needleman-Wunsch Coefficient;
- Overlap Coefficient;
- Q-Gram Distance;
- Smith-Waterman Coefficient;
- Smith-Waterman-Gotoh Coefficient;
- Soundex Distance.

| Algorithm | Function | Operator | Uses index? | Parameters |
|---|---|---|---|---|
| L1 Distance | `block(text, text) returns float8` | `~++` | yes | `pg_similarity.block_tokenizer` (enum)<br>`pg_similarity.block_threshold` (float8)<br>`pg_similarity.block_is_normalized` (bool) |
| Cosine Distance | `cosine(text, text) returns float8` | `~##` | yes | `pg_similarity.cosine_tokenizer` (enum)<br>`pg_similarity.cosine_threshold` (float8)<br>`pg_similarity.cosine_is_normalized` (bool) |
| Dice Coefficient | `dice(text, text) returns float8` | `~-~` | yes | `pg_similarity.dice_tokenizer` (enum)<br>`pg_similarity.dice_threshold` (float8)<br>`pg_similarity.dice_is_normalized` (bool) |
| Euclidean Distance | `euclidean(text, text) returns float8` | `~!!` | yes | `pg_similarity.euclidean_tokenizer` (enum)<br>`pg_similarity.euclidean_threshold` (float8)<br>`pg_similarity.euclidean_is_normalized` (bool) |
| Hamming Distance | `hamming(bit varying, bit varying) returns float8`<br>`hamming_text(text, text) returns float8` | `~@~` | no | `pg_similarity.hamming_threshold` (float8)<br>`pg_similarity.hamming_is_normalized` (bool) |
| Jaccard Coefficient | `jaccard(text, text) returns float8` | `~??` | yes | `pg_similarity.jaccard_tokenizer` (enum)<br>`pg_similarity.jaccard_threshold` (float8)<br>`pg_similarity.jaccard_is_normalized` (bool) |
| Jaro Distance | `jaro(text, text) returns float8` | `~%%` | no | `pg_similarity.jaro_threshold` (float8)<br>`pg_similarity.jaro_is_normalized` (bool) |
| Jaro-Winkler Distance | `jarowinkler(text, text) returns float8` | `~@@` | no | `pg_similarity.jarowinkler_threshold` (float8)<br>`pg_similarity.jarowinkler_is_normalized` (bool) |
| Levenshtein Distance | `lev(text, text) returns float8` | `~==` | no | `pg_similarity.levenshtein_threshold` (float8)<br>`pg_similarity.levenshtein_is_normalized` (bool) |
| Matching Coefficient | `matchingcoefficient(text, text) returns float8` | `~^^` | yes | `pg_similarity.matching_tokenizer` (enum)<br>`pg_similarity.matching_threshold` (float8)<br>`pg_similarity.matching_is_normalized` (bool) |
| Monge-Elkan Coefficient | `mongeelkan(text, text) returns float8` | `~\|\|` | no | `pg_similarity.mongeelkan_tokenizer` (enum)<br>`pg_similarity.mongeelkan_threshold` (float8)<br>`pg_similarity.mongeelkan_is_normalized` (bool) |
| Needleman-Wunsch Coefficient | `needlemanwunsch(text, text) returns float8` | `~#~` | no | `pg_similarity.nw_threshold` (float8)<br>`pg_similarity.nw_is_normalized` (bool) |
| Overlap Coefficient | `overlapcoefficient(text, text) returns float8` | `~**` | yes | `pg_similarity.overlap_tokenizer` (enum)<br>`pg_similarity.overlap_threshold` (float8)<br>`pg_similarity.overlap_is_normalized` (bool) |
| Q-Gram Distance | `qgram(text, text) returns float8` | `~~~` | yes | `pg_similarity.qgram_threshold` (float8)<br>`pg_similarity.qgram_is_normalized` (bool) |
| Smith-Waterman Coefficient | `smithwaterman(text, text) returns float8` | `~=~` | no | `pg_similarity.sw_threshold` (float8)<br>`pg_similarity.sw_is_normalized` (bool) |
| Smith-Waterman-Gotoh Coefficient | `smithwatermangotoh(text, text) returns float8` | `~!~` | no | `pg_similarity.swg_threshold` (float8)<br>`pg_similarity.swg_is_normalized` (bool) |
| Soundex Distance | `soundex(text, text) returns float8` | `~*~` | no | |

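
To make the relationship between a function and its operator more concrete, here is a minimal standalone C sketch of the L1 (block) distance. It is not the extension's code. It assumes plain 2-character grams, as in the worked example in the header comment of `block.c`, rather than the configurable tokenizers, and the names `bigrams`, `occurrences`, `l1`, `block_distance`, `block_similarity`, `block_match`, and `MAX_GRAMS` are hypothetical helpers made up for this illustration.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define MAX_GRAMS 256   /* sketch limit, not a pg_similarity constant */

/* Split s into overlapping 2-character grams; returns how many were stored. */
static int bigrams(const char *s, char out[][3])
{
    size_t len = strlen(s);
    int n = 0;

    for (size_t i = 0; i + 1 < len && n < MAX_GRAMS; i++, n++)
    {
        out[n][0] = s[i];
        out[n][1] = s[i + 1];
        out[n][2] = '\0';
    }
    return n;
}

/* Number of times gram g occurs among the first n entries of list. */
static int occurrences(const char *g, char list[][3], int n)
{
    int c = 0;

    for (int i = 0; i < n; i++)
        if (strcmp(g, list[i]) == 0)
            c++;
    return c;
}

/* Raw L1 distance; *total receives the combined number of grams. */
static int l1(const char *a, const char *b, int *total)
{
    char ga[MAX_GRAMS][3], gb[MAX_GRAMS][3];
    int na, nb, dist = 0;

    assert(a != NULL && b != NULL);
    na = bigrams(a, ga);
    nb = bigrams(b, gb);

    /* every distinct gram of a: |count in a - count in b| */
    for (int i = 0; i < na; i++)
    {
        if (occurrences(ga[i], ga, i) > 0)
            continue;           /* gram already handled */
        dist += abs(occurrences(ga[i], ga, na) - occurrences(ga[i], gb, nb));
    }
    /* distinct grams that appear only in b */
    for (int i = 0; i < nb; i++)
    {
        if (occurrences(gb[i], ga, na) > 0 || occurrences(gb[i], gb, i) > 0)
            continue;
        dist += occurrences(gb[i], gb, nb);
    }
    *total = na + nb;
    return dist;
}

/* Unnormalized value: the distance itself. */
static int block_distance(const char *a, const char *b)
{
    int total;

    return l1(a, b, &total);
}

/* Normalized value in [0, 1]; 1.0 for two gram-less inputs (a sketch choice). */
static double block_similarity(const char *a, const char *b)
{
    int total;
    int dist = l1(a, b, &total);

    return total == 0 ? 1.0 : (double) (total - dist) / total;
}

/* What an operator does conceptually: normalized value >= threshold. */
static int block_match(const char *a, const char *b, double threshold)
{
    return block_similarity(a, b) >= threshold;
}
```

With these assumptions, `block_distance("euler", "heuser")` is 5, the normalized value is 4/9, and `block_match("euler", "heuser", 0.7)` is false. Results from the extension itself can differ, since they depend on the tokenizer selected and on compile-time options.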

Several parameters control the behavior of the pg\_similarity functions and operators. I don't explain each parameter in detail because they can be classified into three classes: **tokenizer**, **threshold**, and **normalized**.

- **tokenizer**: controls how the strings are tokenized. The valid values are **alnum**, **gram**, **word**, and **camelcase**. All tokens are lowercase (this option can be set at compile time; see PGS\_IGNORE\_CASE at source code). Default is **alnum**;
    - **alnum**: delimiters are any non-alphanumeric characters. That means that only alphabetic characters in the standard C locale and digits (0-9) are accepted in tokens. For example, the string "Euler\_Taveira\_de\_Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", "22", "02", "2011";
    - **gram**: an n-gram is a subsequence of length n. Extracting n-grams from a string can be done by using the sliding-by-one technique, that is, sliding a window of length n throughout the string by one character. For example, the string "euler taveira" (using n = 3) is tokenized as "eul", "ule", "ler", "er ", "r t", " ta", "tav", "ave", "vei", "eir", and "ira". Some authors also add " e", " eu", "ra ", and "a " to the set of tokens; these are called full n-grams (this option can be set at compile time; see PGS\_FULL\_NGRAM at source code);
    - **word**: delimiters are white space characters (space, form-feed, newline, carriage return, horizontal tab, and vertical tab). For example, the string "Euler Taveira de Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", and "22/02/2011";
    - **camelcase**: delimiters are capitalized characters, but they are also included as the first character of the token. For example, the string "EulerTaveira de Oliveira" is tokenized as "Euler", "Taveira de ", and "Oliveira".
- **threshold**: controls how flexible the result set will be. These values are used by operators to match strings.
For each pair of strings, if the calculated value (using the corresponding similarity function) is greater or equal the threshold value, there is a match. The values range from **0.0** to **1.0**. Default is **0.7**; - **normalized**: controls whether the similarity coefficient/distance is normalized (between 0.0 and 1.0) or not. Normalized values are used automatically by operators to match strings, that is, this parameter only makes sense if you are using similarity functions. Default is **true**. Examples ======== Set parameters at run time. ``` mydb=# show pg_similarity.levenshtein_threshold; pg_similarity.levenshtein_threshold ------------------------------------- 0.7 (1 row) mydb=# set pg_similarity.levenshtein_threshold to 0.5; SET mydb=# show pg_similarity.levenshtein_threshold; pg_similarity.levenshtein_threshold ------------------------------------- 0.5 (1 row) mydb=# set pg_similarity.cosine_tokenizer to camelcase; SET mydb=# set pg_similarity.euclidean_is_normalized to false; SET ``` Simple tables for examples. ``` mydb=# create table foo (a text); CREATE TABLE mydb=# insert into foo values('Euler'),('Oiler'),('Euler Taveira de Oliveira'),('Maria Taveira dos Santos'),('Carlos Santos Silva'); INSERT 0 5 mydb=# create table bar (b text); CREATE TABLE mydb=# insert into bar values('Euler T. de Oliveira'),('Euller'),('Oliveira, Euler Taveira'),('Sr. Oliveira'); INSERT 0 4 ``` *Example 1*: Using similarity functions **cosine**, **jaro**, and **euclidean**. ``` mydb=# select a, b, cosine(a,b), jaro(a, b), euclidean(a, b) from foo, bar; a | b | cosine | jaro | euclidean ---------------------------+-------------------------+----------+----------+----------- Euler | Euler T. de Oliveira | 0.5 | 0.75 | 0.579916 Euler | Euller | 0 | 0.944444 | 0 Euler | Oliveira, Euler Taveira | 0.57735 | 0.605797 | 0.552786 Euler | Sr. Oliveira | 0 | 0.505556 | 0.225403 Oiler | Euler T. 
de Oliveira | 0 | 0.472222 | 0.457674 Oiler | Euller | 0 | 0.7 | 0 Oiler | Oliveira, Euler Taveira | 0 | 0.672464 | 0.367544 Oiler | Sr. Oliveira | 0 | 0.672222 | 0.225403 Euler Taveira de Oliveira | Euler T. de Oliveira | 0.75 | 0.79807 | 0.75 Euler Taveira de Oliveira | Euller | 0 | 0.677778 | 0.457674 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.866025 | 0.773188 | 0.8 Euler Taveira de Oliveira | Sr. Oliveira | 0.353553 | 0.592222 | 0.552786 Maria Taveira dos Santos | Euler T. de Oliveira | 0 | 0.60235 | 0.5 Maria Taveira dos Santos | Euller | 0 | 0.305556 | 0.457674 Maria Taveira dos Santos | Oliveira, Euler Taveira | 0.288675 | 0.535024 | 0.552786 Maria Taveira dos Santos | Sr. Oliveira | 0 | 0.634259 | 0.452277 Carlos Santos Silva | Euler T. de Oliveira | 0 | 0.542105 | 0.47085 Carlos Santos Silva | Euller | 0 | 0.312865 | 0.367544 Carlos Santos Silva | Oliveira, Euler Taveira | 0 | 0.606662 | 0.42265 Carlos Santos Silva | Sr. Oliveira | 0 | 0.507728 | 0.379826 (20 rows) ``` *Example 2*: Using operator **levenshtein** (~==) and changing its threshold at run time. ``` mydb=# show pg_similarity.levenshtein_threshold; pg_similarity.levenshtein_threshold ------------------------------------- 0.7 (1 row) mydb=# select a, b, lev(a,b) from foo, bar where a ~== b; a | b | lev ---------------------------+----------------------+---------- Euler | Euller | 0.833333 Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76 (2 rows) mydb=# set pg_similarity.levenshtein_threshold to 0.5; SET mydb=# select a, b, lev(a,b) from foo, bar where a ~== b; a | b | lev ---------------------------+----------------------+---------- Euler | Euller | 0.833333 Oiler | Euller | 0.5 Euler Taveira de Oliveira | Euler T. de Oliveira | 0.76 (3 rows) ``` *Example 3*: Using operator **qgram** (~~~) and changing its threshold at run time. 
``` mydb=# set pg_similarity.qgram_threshold to 0.7; SET mydb=# show pg_similarity.qgram_threshold; pg_similarity.qgram_threshold ------------------------------- 0.7 (1 row) mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b; a | b | qgram ---------------------------+-------------------------+---------- Euler | Euller | 0.8 Euler Taveira de Oliveira | Euler T. de Oliveira | 0.77551 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692 (3 rows) mydb=# set pg_similarity.qgram_threshold to 0.35; SET mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b; a | b | qgram ---------------------------+-------------------------+---------- Euler | Euler T. de Oliveira | 0.413793 Euler | Euller | 0.8 Oiler | Euller | 0.4 Euler Taveira de Oliveira | Euler T. de Oliveira | 0.77551 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692 Euler Taveira de Oliveira | Sr. Oliveira | 0.439024 (6 rows) ``` *Example 4*: Using a set of operators using the same threshold (0.7) to ilustrate that some similarity functions are appropriated to certain data domains. ``` mydb=# select * from bar where b ~@@ 'euler'; -- jaro-winkler operator b ---------------------- Euler T. de Oliveira Euller (2 rows) mydb=# select * from bar where b ~~~ 'euler'; -- qgram operator b --- (0 rows) mydb=# select * from bar where b ~== 'euler'; -- levenshtein operator b -------- Euller (1 row) mydb=# select * from bar where b ~## 'euler'; -- cosine operator b --- (0 rows) ``` License ======= > Copyright © 2008-2012 Euler Taveira de Oliveira > All rights reserved. 
> Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: > Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer; > Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution; > Neither the name of the Euler Taveira de Oliveira nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. > THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. 
pg-similarity/TODO000066400000000000000000000023431276707706500143630ustar00rootroot00000000000000+--------------+ | TODO | +--------------+ + soundex pt_BR + tf/idf * fellegisunter * hellinger * jensenshannon * skew * harmonicmean * variational * confusion * fasta * blasta * ukkonen * taglink * taglinktoken + group base functions at similarity.c - write docs + test all functions +--------------+ | Simmetrics | +--------------+ - BlockDistance + ChapmanLengthDeviation + ChapmanMatchingSoundex + ChapmanMeanLength + ChapmanOrderedNameCompoundSimilarity - CosineSimilarity - DiceSimilarity - EuclideanDistance - JaccardSimilarity - Jaro - JaroWinkler - Levenshtein - MatchingCoefficient - MongeElkan - NeedlemanWunch - OverlapCoefficient - QGramsDistance - SmithWatermanGotoh + SmithWatermanGotohWindowedAffine - SmithWaterman - Soundex + TagLink + TagLinkToken +-----------------+ | Second String | +-----------------+ + ApproxNeedlemanWunsch + DirichletJS - Jaccard - Jaro - JaroWinkler + JaroWinklerTFIDF + JelinekMercerJS + JensenShannonDistance + Level2Jaro + Level2JaroWinkler + Level2 + Level2Levenstein + Level2MongeElkan - Levenstein + Mixture - MongeElkan - NeedlemanWunsch + ScaledLevenstein - SmithWaterman + SoftTFIDF + SoftTokenFelligiSunter + TagLink + TFIDF + TokenFelligiSunter + UnsmoothedJS + WinklerRescorer pg-similarity/block.c000066400000000000000000000102511276707706500151260ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * block.c * * L1 distance or City Block distance or Manhattan distance * * Aims to measure the distance between two strings * * X is a list of n-grams * Y is a list of n-grams * T is a set of n-grams of X and/or Y * * For each n-gram in T we count occorrences in X and Y; we sum the * absolute difference between nx and ny. 
* * For example: * * x: euler = {eu, ul, le, er} * y: heuser = {he, eu, us, se, er} * t: {eu, ul, le, er, he, us, se} * * eu ul le er he us se * s = |1 - 1| + |1 - 0| + |1 - 0| + |1 - 1| + |0 - 1| + |0 - 1| + |0 - 1| = 5 * * PS> we call n-grams: (i) n-sequence of letters (ii) n-sequence of words * * http://en.wikipedia.org/wiki/Block_distance * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" /* GUC variables */ int pgs_block_tokenizer = PGS_UNIT_ALNUM; double pgs_block_threshold = 0.7f; bool pgs_block_is_normalized = true; PG_FUNCTION_INFO_V1(block); Datum block(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t, *u; Token *p, *q, *r; int totpossible; int totdistance; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* lists */ s = initTokenList(0); t = initTokenList(0); /* set list */ u = initTokenList(1); switch (pgs_block_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); /* all tokens in a set */ tokenizeBySpace(u, a); tokenizeBySpace(u, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); /* all tokens in a set */ tokenizeByGram(u, a); tokenizeByGram(u, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); /* all tokens in a set */ tokenizeByCamelCase(u, a); tokenizeByCamelCase(u, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, a); tokenizeByNonAlnum(t, b); /* all tokens in a set */ tokenizeByNonAlnum(u, a); tokenizeByNonAlnum(u, b); break; } elog(DEBUG3, "Token List A"); 
printToken(s); elog(DEBUG3, "Token List B"); printToken(t); elog(DEBUG3, "All Token List"); printToken(u); totpossible = s->size + t->size; totdistance = 0; p = u->head; while (p != NULL) { int acnt = 0; int bcnt = 0; q = s->head; while (q != NULL) { elog(DEBUG4, "p: %s; q: %s", p->data, q->data); if (strcmp(p->data, q->data) == 0) acnt++; q = q->next; } r = t->head; while (r != NULL) { elog(DEBUG4, "p: %s; r: %s", p->data, r->data); if (strcmp(p->data, r->data) == 0) bcnt++; r = r->next; } if (acnt > bcnt) totdistance += (acnt - bcnt); else totdistance += (bcnt - acnt); elog(DEBUG2, "\"%s\" => acnt(%d); bcnt(%d); totdistance(%d)", p->data, acnt, bcnt, totdistance); p = p->next; } elog(DEBUG1, "is normalized: %d", pgs_block_is_normalized); elog(DEBUG1, "total possible: %d", totpossible); elog(DEBUG1, "total distance: %d", totdistance); destroyTokenList(s); destroyTokenList(t); destroyTokenList(u); if (pgs_block_is_normalized) res = (float8) (totpossible - totdistance) / totpossible; else res = totdistance; PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(block_op); Datum block_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_block_is_normalized; pgs_block_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( block, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_block_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_block_threshold); } pg-similarity/cosine.c000066400000000000000000000057431276707706500153260ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * cosine.c * * Cosine Distance * * * http://en.wikipedia.org/wiki/Dot_product * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" #include /* GUC variables */ int 
pgs_cosine_tokenizer = PGS_UNIT_ALNUM; double pgs_cosine_threshold = 0.7f; bool pgs_cosine_is_normalized = true; PG_FUNCTION_INFO_V1(cosine); Datum cosine(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t; int atok, btok, comtok, alltok; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* sets */ s = initTokenList(1); t = initTokenList(1); switch (pgs_cosine_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, a); tokenizeByNonAlnum(t, b); break; } elog(DEBUG3, "Token List A"); printToken(s); elog(DEBUG3, "Token List B"); printToken(t); atok = s->size; btok = t->size; /* combine the sets */ switch (pgs_cosine_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, b); break; } elog(DEBUG3, "All Token List"); printToken(s); alltok = s->size; destroyTokenList(s); destroyTokenList(t); comtok = atok + btok - alltok; elog(DEBUG1, "is normalized: %d", pgs_cosine_is_normalized); elog(DEBUG1, "token list A size: %d", atok); elog(DEBUG1, "token list B size: %d", btok); elog(DEBUG1, "all tokens size: %d", alltok); elog(DEBUG1, "common tokens size: %d", comtok); /* normalized and unnormalized version are the same */ res = (float8) comtok / (sqrt(atok) * sqrt(btok)); PG_RETURN_FLOAT8(res); } 
PG_FUNCTION_INFO_V1(cosine_op); Datum cosine_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_cosine_is_normalized; pgs_cosine_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( cosine, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_cosine_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_cosine_threshold); } pg-similarity/data/000077500000000000000000000000001276707706500146025ustar00rootroot00000000000000pg-similarity/data/similarity.data000066400000000000000000001227001276707706500176250ustar00rootroot00000000000000d'arcy passmore stirling rothe alexis irons darcie schuster griffin pontifex trinity urmonas ethan newport julia mcsorley hayden richmond lara nguyen kale joel gillard layla keatch dylan dichiera sarah green am ber herbert shane wynne xavier jolly troy webb kane tossell john heidrich marcus morrison julia nicoll luke noble daniel mccarthy sophie redmond caitlin wylder emiily gianakis lachlan bradshaw summer mckirdy tori frammartino charlotte white branna widdup paris gibb madeline xie tommi-lee ryan joshua nguyen jordan hand maya kumarasamy nathan nguyen edward koerbin charlotte ryan jasmine green hawes emiily procter kira robson amy rosa grosser blake green sophie reid bayley dinon alisa coulson kelsey rudd caitlin millar freya beattie zoe clarke kiana grosser sascha shepherd juliana pasalic coby reid rachel simmonds adela mcphail james michalatos jaime shadbolt charlie kerslake natalia trooper michael reid levi whishaw tiffany bristow carly mason holly voivodich benjamin matthews zane costain jasmine heidrich adrian miles emiily coleman koula matson leditschke carmen madelyn crouch bianca nicolaides harrison sinton elli hammer jordan whiteley corie white kyra wilcockson kierra worsley cooper matthw brianna basselot-hall eliza mason benjamin simmonds alexandra green helena thick kierra mason 
charlie tiller jade pickel cade vangils mitchell clarke laklynn webb joel blake alex hefford paris olivarws ebony nguyen kayla hick jacob broadhead lily haren braiden clarke matthew webb nicholas shepherdson timothy ryan jed mcgregor nicolas laverty fiona palecek caitlin thrift aidan mccosker harrison branston jed parken scott clarker alicia ryan jack tuckwell brooke fitzpatrick clain burford alana bloomfield joshua minto deakin white emiily block tiana faulhaber georgia horsley jett coulson callum hoffman thomas shepherd connor wiseman lachlan clarke carlin georgia egan domenique cenin isabella berry olivia mcphail jack britten robert ryan daniel white emiily lamprey ethan kowald kane obersteller emiily matthews benjamin marveggio jack brackett cooper mewett sophie ryan zachary wotton chantelle blake zarlia gully victoria starega jarod kasinathan abbey ngo le zane berry zachary schembri kaitlin lock emiily kaljobic arki beattie hugo reid danny verco ella aspinall ryan pignalosa sarah mccarthy amber herbert matthew wheatley emma hawes hana hawes kale coleman bradley leishman cooper burford angelica jacob wasley jesse luvisi rebecca needham hannah ryan abbey kranz imogen chuck sara-louise traforti lauren haskett rhiannon white kai cardnell olivia jesser sophie wogandt arabella mccarthy jack leslie hamish michelmore harrison voglino liam petersen zali stronach sam arganese makayla norton-baker chloe blake chevonne nicholas nguyen zachariah lademan sarah white luke everill alyssa mason harrison siggins julia george ayla sobczak oscar matthews riley renfrey katharine noble michael gannah robson jasper mchenry mitchell clarke chelsea nguyen mitchell neville xanthe longden roy culbertson jemma larocca benjamin webb finlay grosser dylan wilkins madeline mazzone shane stanley jacqueline hague andrew glass taylor werneburg sean white phoebe mcdaid cooper condo brianna santi phoebe reid charles marris reece sheldon taylah hand patrick lathouras georgia steers taylah fabbro 
natalie berry sophie matthews alexandra clarke rebecca vincent bailley mccarthy lochlan kusel jordan linnell kyle mcgregor william gronow cooper clarke james white harry fogliano taliah eileen ruby mccarthy declrn white madeoimu raquel charman jack hendricks jack oleszcuk morgan mckane georgia rees talia kranz claire coulson breony coulson emelibe bellchambers thmosdd campbell zachary walch lachlan ryan luss waller jacob tschirpig zane denholm georgina campbell daniel campbell bailey mildren jake cooper mason liam neumann christian blackwell trent campton harley mcneill caleb fusco alessandra chegwidden benjamin green jade miels clarke ridley billy joel pasternak georgia hosgood luke khom samantha wilczek oliver musolino mason scerri lachlan hemmings georgia noble luke jolly amber lowra edward gerhardt isabelle boyle finley reid andrew carmody joshua obersteller zachary brain nevin logan ryan desi greeh annabelle ryan sarah doody danielle clarke arki beattie peter daish mia webb caleb garcia ayla azimi bella menzies ryleh crouch oliver chandler taliah huxley rachael abrahamson jett wilkey ella paterson jack piliczky jazz morrison aikaterina campbell ella stephenson ethan clelland lewis isola ebonie burford kelsey berry gianni clarke jake rees amy grewar timothy kalogeras thomas gwyn sophie blake victoria loosmore alia ronan jack hand michael godfrey xanthe hoang caitlin spate troy ben roseboom harriet mason cooper edhouse kenneth feeney sophie longden silas white georgia almeida lachlan pragt ethan longo ella kirra clarke emiily raftopoulos jye rau alissa stanley lily dobney sothman ellorah brooklyn gavel joel green liam rees nicholas rees kylee modystach douglas bloomfield mitchell leggio carla curry peta blackwell alana sherriff zachary maynard amber riddell kane ottens david nguyen hattie donaldson benjamin dukeson ruby reid madison plumb ewan grierson isabella vorrasi april mitton mia graue william croker sophie apse sam haggett isaac hazell jack everett marley 
spaander julia stephexon bailey simmonds troy lock michael gillard michael lackner david clarke alia tarn imogen robson emiily rees kaela hartland jesse tognella callum millar evan filipovic michaela fitzpatrick dylan gasparin madeline maddison hursey grace beeche christian mccarthy oliver hoffman benjamin hilton carly parken brianna tsang felicity david lara hyakutake jordan matthews tom laing nicholas bahouche hayden needham lily bradshaw ellie hearn emiily hindley nicholas dahlquist harry pascoe carlin reid zoe white giaan dudgeon dabinet blake eisenhut erin montandon liam jervis solomon webb jacob rovira jacqueline dixon noah gillard riley shepherd saara barisic keely hilton cameron musolino jade pickel tanyshah drechsler benjamin ryan tansy white kelsea wang emiily dolan haylee zheng mila george lucy etherington eliza ritzau hannah purdon olivia fauser tara bishop alexandra beams archie bento james robison steven white kyle grafton thomas bellchambers danielle elizabeth hoajvzg anika spicer brooke white asha eyles benjamin godfrey madalyn white lewis green gianni matthews robbie perfetto stephanie tidey lachlan mason benjamin white jacqueline gibb scott clarke jazz belperio benjamin reid amy lock cooper miles kyle bendyk matilda trenerry chelsea cochrane alia teague teneille brock michael morrison dakota white kyle barisic aiden white patrick rocca lucas reece bianca ryan antonio green patrick dixon rhiannon rees hannah everett alison erin white larissa hoang alexandra reid ky howie jessica rankine olivyer schumann olivia snelling isabella shevchenko jasmin alderson cheree soulsby ryan clarke hannah hoger stephanie xepapas jacob butt george flatman matilda wheldale lee bloomfield ryan crosswell jordan butt reegan ladiges malakai nguyen oliver morrison nicholas clarke sam zordan amy warnock sophie holderhead robbie tardrew bodhi koral emerson rosato sophie lindholm kyle crofts blakeston le messurier reeve tschirpig deakin ryan catherine renfrey ryley 
stefanovska hanna salt moody finlay white william donaldson lewis cameirao thomas sreckovic jasmine strenge jessie webb sean treherne maddison matthews lauren harrington hayley mason zane matthews locvan kusrl oakleigh tuendemann sophie boyle catherine edmundson louise maksim benjamin filgate kieran soulsby amy yost oliver reid nicholas dent mia green zane mccausland zeberah de bono sienna tsialafos christopher goode hannah nguyen george fimeri jack hanigan lee hill-smith mitchell blunden olipva plesa alice green emiily clarke petreece bishop jaiden dakin mitchell raftopoulos isabelle coppock victoria coulson chloe campbell jaxin white roisin carmody madison picot trey peckett cooper mccarthy rhett walls chiara coulson claire hinchey amy crouch brayden hyland rosie millar emiily tomkin ella green tara staggard reuben mcgregor ainsley stanley stephanie fiorenza chloe hearn madison millar aidyn kowald samuel gorjan miles archxwer jonathon halikos jai webb noah kropinyeri isabella paterson breeanne apostoloff jack white jack teuma ajeks blake chloe deakin liam meaney jacob gade lombardi alexander creber juliana lowe heath kamp brianna uniya alannah stanley libby cantini tiana karpathios abbey dixon sarah pulford lewis colquhoun samuel miles heath rudd tahlia hope nicholas excell grace sikansari talia hope chelsea cremasco ryan mearns jake exell michael linnell jasper hyland reuben burford ella nicolle oscar spratt andrew degeorge liam mccarthy charlie kluske erin salas joe winfield eloise rickett lily collinson nicholas capio hamish seddon adam gehling kailey mason sophie pascale christian grierson david lavender amy hebberman giuliana cother james niklaus olivia rosa ella ryan liam roepcke jasmine hammerton robert tuckwell danielle haeusler hugo wynne elana campbell lachlan stanbury declen white logan drysdale jacinta lock abbey clarke joel white zack severan schkirra crain jasmine coulson matilda vidakovic jasmyn burman samuel simmonds colenan sarag brianna berryman 
jack coleman breana mason samantha soden tahlia van der kolk jack zimmermann joshua grimm elly southwood jayden eglinotn stephanie soden silas liam nowers thomas dixon mathilde laburn benjamin haar eanna sumsion steven ngai emiily theissen jett stanbury christian macgowan elizabeth arhdntoulis holly hobson kiara miles andie robson sarah white harrison golden kate santi emiily miles john finlay jasmine rohrle riley ho phillip harrington melanie mawlai olivia brock kaitlyn everett blake riddell ruby matthews ella dabimust jsh bildbcki taylah smallacombe eliza hingston monique coleman michael lomman cade bradshaw michael moorby jackson weatherhill alexander white kyle bruhn caitlin averis matthew felmingham alexander donaldson tara burntsl sean vimpany judah gondzioulis liam sherriff takara weaver jessica coleman ella garcia samantha meaney riley cradock hugo farah zachariah clarke lachlan-john hayball jessica nguyen danielle jolly chloe green nikki ryai mathilde kelley shenae white chelsea crouch matthew ryan cade clarke claire nykamp annalise sunderland michael soulemezis jaxson white jacqueline extremera emiily buckell alexander herbert eden joannou annabelle viart chelsea pascoe benjamin dixon seth godfrey clement foulger braiden clarke jacobie kyriacou joshua thredgold hannah mccosker shai teague alexandra jeffries gillian longo erin moscatt madison reid adela nguyen emiily terek brooke kanellos matthew rigley alyssa southgate alysha byers ella pascoe alexandra jolly ebonie ryan darcy laing freya coffey david green jacob rawlings paige kelley ash ahmed katelyn schorr helliar scott shepherd kobe green paris olivares samuel edson luke benveniste oscar lund marcus van hees ethan hilton daniel mahony adam nobeh sam tuckew ell benjamin green victoria roche isabelle baillie kyle badger henry hammer liam herbert pascale green joel rawlings holly nguyen heather wilkey bridget priest maya mcilvar richard willsford alicia lodge matthew noack taylah gree zoe reid jessica 
slack-smith emma simonetti shanaye green bryce hartzenberg tiarna broders liana jessen jenna saly oscar weavell joel shadbolt tess spratt thomas whitsitt thomas green chelsea symmons mason borgmeyer drew talmet andrew mason jacob lo-sapio timothy shaun dinnison robert wooley lachlan white rya joel travis white rusling rosue benjamin durnin jack blackwell zachary dinh alice heilmair james le lievre jordan mccarthy prudence mccaffrey elizra rosa lushia gonzales nicholas carbone oliver morrison sian tremellen tenille trapnell alec webb shai teague tabitha mcgregor zoe royans elizabelth osfixrld chloe binns nicholas noyce jairus colquhoun kaitlin van geel michael griffen darcy van hoof emiily rees brigette coleman chelsea horsley jackson bettens richard binns michael haren nicolas quill breeanne byrt joel maynard zachary sok jesse clarke nathan gillis alissa vavic amy bellchambers caitlin havriluk bayden white ciaran butt sienna benger keegan keatch natassia campbell keziah nguyen timothy mcneill olivia culph claudia mccarthy matilda shepherd kieran hefford tiarna campbell daniella boyle ruby siggins michael glass maddison mccarthy sophie low joel lowe nicholas beaty wil braithwaite jack bishop benjamin fairnington cameron kammermann oliver weller lucy hipkiss shana pronk emiily wheatley david mccarthy natalia white emiily eglinton kiera miles taylah dent casey godfrey emma leslie alex green frahn ellie lazaraki flynn ryan katelyn vavic jacob walkley joshua milburn kelsye aspinall daniel white hayley berry juliana hill-smith ellie mccarthy kane camp tyler morcom jaxson green tyler campbell joel mawlai jack tweedie emiily shambrook kaela bastin jack botteoff isabella green harraon thredgold xavier clarke isaiah fenoughty jasmine bishop shelby matthews molly weaver william michelmore jessica speight mystique gulesserian montanna hand trevor lenartowicz isabella grillett emiily nguyen ruby blacklow xavier ryan connor matthews georgia leslie james dichiera amber webb amber 
coleman finlay maynard lauren jessup kane grooby micnazl griffen alexander estcourt cameron monteleone max hirota alexandra haeusler lachlan heron georgia georgia green jake pulford alex milburn dante woodbury benjamin lodge kai goode alysha berry william kingsley jaxin schuster matilda crouch bayden dreckow dillon white zali reid adam kalitsounakis melinda almeida jye lambertus madeline nguytxn jaslyn millar brooke desilva sarah monis patrick matthew vidakovic jye matson carmen leditschke joshua green lily ma timothy veror mila baj amelia mccarthy hamish mackinlay liam sumsion christian rankine isabelle badger jacqueline dolliver cameron berry erin simmonds lucas scalamera alexandra wehr hayley teav cory wigt gemma ho sophie jolly emma saccardo tristan bellchambers rhys fitzpatrick tyron waller jacinta simmonds cooper gibb dillon mcgurk joshua desilva erin coleman sophie brain marcus jobbins rosie rusling alexander mcphail liam tuckwell danjel colantoni benjamin sydenham jed walkley isabella paterson thomas coffey webb lauren bodman adela everett oscar reid sophie crighton arki bizjak courtney denholm brydee bedding grace mccarthy harry longden ajay kaddatz casey chandler ella white shona scantlebury olivia mcgregor jessica manson tabitha bishop georgia campbell jasmine ranson dylan kusnezow travis allard grayson bergamaschi laura szalai charlie brooker joel gillard charlotte matthews stephanie cochrane joel clarke koben webb sebastian brydee clarke hannah berry kyle waller olivia white alexandra bridgland harry willing liam bristow bloomgield tahlia megan zorin zachary green hawes aidan galbraith james faull jessica rayski imogen mcgurk timothy chandler jasper paine reid logan sybella matthews harry felmingham macey crook henry ryan trevor webb joel scattini shannon dougals gabriel reid lachlan allsopp emiily kiss benjamin huebner bailey mcmullen hunter reid joel hope rory campbell natasha bowerman keely everett matilda meadows wilson vincent david herbert 
charlie bresvansin desi lyden leon clisby amy fauser jessica herbert lucas handberg tayla koulizos stephanie shady olivia gershon kane hislop joshua scripps jasmine fretwell sam purdon antonio grezn alisa khammash bianca blake bianca varcoe braedon marcola natalia mafopust katelyn boyle seamus clarke wil mason tahni clarke kenneth rankine brock sprenglewski chelsea spicer tai haskett kesle bermy jayden katnich trey kelley braedon lowe olivia papageorgiou kayla bishop keely white tyler mary rees holly pallot ruby charlie morrison ella denne heath bisazza maya yoo j juskevisc olivia jack blake william clarke eleanor carbone ajay menzies bradley hemmings daniel jolly samuel chappel imogen bridgland corey whr sienna webb tess campbell william keatch erin gibbett alice cowell adam novak nikita leslie james lorraway emiily allcroft evan scheel harry longden zachary winterton ethan macgowan blake hanna marley hope cojrney wight kiara block les mia lymberis amy jessup lewis lowe jackson green luke cannell william millar kazuki blackwell lachlan clarke alana white samuel dooley alexander mason rhys webb elly green jayme negrin alexandra bevis harrison thorpe sarsha pipan briley bishop tara schuster isabella boyle mitchell lock mia matthews zara millar sophie white alana zdanowicz zac carioti james white madeleine lowe thomas craske grace hage bethany trainor alicia brain noble timothy clarke ethan white william lambertus tara blake flynn mccarthy campbell hazell charlotte boyle tia paterson hannah ryan james dichiera thomas harrington wen mitchell zali gilbertson holly thomas green karlee kazuki marilyn brock zachary neville montana minter cain webb harrison white riley coulson jenna genner ruby denduyn cameeln ryan aloysius nguyen alexandra brooker flynn meaney lachlan-john lohmann benjamin webb kieran plumb lily rasic brigitte dobrzanski sophie campbell isabella drayson riley garcia elizabeth varcoe tiana atsalas samsntha hoffman caitlin pennell emiilh clarke brayden 
hingston timothy green ashlee lisowski madison ranson tenille burford daniel jaden wisby lochlan pronk sophie clarke lara hingston trinity white jesse campbell nicholas finlay carlin dallos mason reid samuel huxley heath green noah dixon benjamin burleigh naomi hilton georgia aiston caitlin ryan tahlia mcgregor kaysey feneley bree coleman caitlin pringle jasper green joshua wardle joshua baltagie ryan hawes madison rees sophie babos lachlan coulson zachary macarlino fiona trent braedon campbell nicholas stephenson lacey hilton teaha white joshua bishop carlin dancey emiily kranz james minorchio james linnell phoenix priest lukas bansemer lachlan domarecki kristen reid cooper mcgregor isabelle daniel dixon natassia white brooke hislop andrew boyle madeline allanby logan green andrew clarke nathan green declen knowling hayley de angelis sophie huttlestone megan martinson jamie clarke sascha flatman peter korbut monteleone connor robson blakeston webb oscar mesecke elise donaldson taylah manson jolly chelsea spratt benjamin hingston riley creagh stephanie lodge joshua morrison jasxer garnett alannah butt quinn mason georgia cassidy robson jayden ballantyne patrick lucas manson imogen feast beth fuda noah tsakiridis vincent blewett lachlan hyland danny magnusdottir brooke coleman lachlan priebe haylee miles barkly mahony kristin blake brianna leicester ryan kallis matthew drysdale thomas mulquiney benjamin baillie aleisha webb zachary campbell callum reimers mason pizzey lachlan rosa callum blaize singleton adam berry finnbar mccarthy lilly thomas ryan caitlin mahony alessandria white isabella grillas isabella hauser georgia chandler tommi-lee priest chantelle mason ajay narracott green jasper benveniste elysse bishop daniel death matthew hilton lily wardle emiily friebel sarah coleman tristan caine andrew white kirra meaney braedon coppock michael fleet christopher dolby rhiannon sheldon hamish roling weaver jonah ryan flynn southgate shae meaney paris leante tara 
balasubramaniam abby priest jackson george timothy verco emiily koziol zoe needham nathan vincent emiily sotomayor matthew eglinton keeley stephenson hayden penno lachlan dixon tanar kiripolszki olivia shepherd connor ewald dante furze emma dorree white ethan flynn spears samuel statton crystal lowe holly heinlaid kathleen vodden jack fenlon madeline seen taliah bristow herbert allegra wynne connor reid kyle dalrymple amy dixon annabelle van lingen jackson paterson leah ryan karissa lowe connor stephenson tidna sarkissian alex ciciksza domenique paterson amy mortlock callum campbell chloe wimmer lachlan georgia marring logan reid evan reid madison white victoria matzka annalise browne elizabeth hoang connor hingston elijah white katie leslie alexandra gillard mya mildren jackson cheshire isabelle ryan georgia ngun kiara dent kirra green joshua beaty miller ryanh annika lindholm shane white tarshya harrison green phoebe lamborn tafa townsedtn-ebans samantha shepherdson jack david ella egan liana gamlin gemma boxhall caleb mcgregor sophie millar william belling jett tschirpig magnus clarke imogen chittleborough alissa musolino kylie brooker gabriella white jordan mcneill zachary white sophie benger lachlan hand lewis maynard mitchell whillas rees shantal mcgregor quinn brock william apolianr kyah dixon georgia campbell gemma clsrke jordan hefford josephine lavender sachin burford asha geraghty courtney spratt douglas marshal talia wooley rhys clarke mitchell sherriff hannah ryan sana capurso michael clarke olivia green jai braniff reuben wiseman taylah kibukamusoke christopher due lushia ikeda lachlan hand william crook shakirah howie max leishman meg hartzenberg lara hingston courtney white shakira tonge zachary herbert kayne everett jaime castelluzzo nathan nguyen chloe linnell lucy taliah galich taalia kantamessa emiily berry jade hare william campbell madison jean taalia deakin jack mcgregor jamie zimmermann holly staude joshua clarke isabella bullock dylan 
mercorella timothy siviour mitchell lionis marco gillick edward bloomfield tiana holna georgia jolly zebediah de b onp lauren william mccarthy zach trenery holly pulford seth hobson cain kinter kelsy merenda nicholas brain katelin cetinich robson jordan thorpe jackson ploughman emma ledwell phoenix zbierski bailley leung kylee matthews emiily bradshaw timara bizjak jack sadauskas antonio green jamie stoker michael shepherd james szepessy timothy sterpin luke kousiandas cassandra beros jai pascoe fraser fenwick christopher elphick phillip seppelt jacob jafff holly lachlan clarke james lehane joshua riddell ruben bradshaw makenzi miar benjamin hassall mitchell campbell benjamin verco jared hearn brody chandler lauren crossvell benedict binns kelsy hawes cooper astley bailey base caitlin humberdross zack clarke natalia calipari kai weaver tarshya flatman jade wheatley jessica isola julia rivers ellie ho gianni hassall joanna bishop blackwell eliza le cooke bodhi noack mackenzi reid daniel reid josuha baltagie abbie boothroyd jade bellchambers joel webb jamie bruinewoud jaxson rebmann hayley kecskes matthew browne sxott clarke lucas burford emiily reid tate nickolai nathan kavals michael abraha daniella birkin vanessa zimmermann indiana mason lachlan hope cooper nguyen owen berryman jessica white joel wgire alana webb tayah haythorpe jojn reid holly ho matthew lemke madeleine weidenhofer coby reif marcus donaldson paige campbell thomas nguyen mitchell van groesen rachel cheel jay mislov diamond ang blade pascsle lily crouch snadie marous ella weller vanessa manousos tess morrison william murton mathilde vojvodic monique lowe henry green alexandra clarke isaac vick nicholas egan taylah greeb desi dichiera ayla reid karlee sheldon victoria gold olivia campbell teaha wintulich ashton green chelsea gilkes kieren matthews ethan gajda lucas godfrey edward cmielewski lara marson gregory meadows cooper stephenson macey thorold reece goode ruby matthews thomas clarke michael 
rosa marco webb sophie ziersch harry mcvilly aurora stanbury gabrielle spicer lauren shorrock shaun hazell madeline byers james reid devan creagh maddison mccarthy lillian white timara lowe simone white william murton william mccarthy zac jayaraman anika petersen sarah ebert kyl ee coffy caitlin padbury sarah mccarthy matthew glennon kirra clarke kylie lanzotti indiana webb alexander karakoulakis joel white mia browne mitchell westerholm kalli mearns kirra webb simon bizjak lauren kyriacou claire madadi logqn yost corie pendergrast brianna brydon tegan collier elisha uhr lachlan grillas jessica dent louise blackney emalene kluske matthew clarke ryan lowe isabella leishman lachlan ronan alexandra shepherd jasmine kaleb mia clarke gabrielle jinks catherine oatey tahlia ferrar oliver eyre mitchell longo isabella baddeley rohan gilbertson joshua ho mia lissner dean bruhn lewis blake brooke gielen caleb ryan jack bottroff pace bishop eliza lock dylan reid emiily miles olibmr schumann bianca rook kane matthews paige canavan megan wilkins zali reid riley hohn amelie pascoe harry white mikaela grosser cooper o'shannessy cooper kammermann madelyn browne nicholas schembri stirling meg haberfield white jasmine strengq harrison zingarelli brodie green lushia reid mystique bodman ella reid lauren wilkins rachel arabilla sebastian creber lachlan pascoe jordan nguyen kate white amelia reid lauren harrington dylan schuster jarryd goode georgia savas stella bradshaw jordyn white drysdale taylah blunden logan matthews finn mccarthy jasmine strenge ada noble jack reid ethan wilde ashlie reid jaden wisby jacob baucks tabitha noble giaan lock lucy carnicelli noah hinchcliff archer ballantyne arabella brain jemima hebberman isabella lolias sascha priest jason ryan jacob hyland zac needham kynan boxhall alex klemm dylan carmody phoebe wang chase cenin jared kiss georgia stanley talia roles kyle nikolai alexandra fullgrabe dante blackwell emiily ryan noah mchenry riley collier charlotte 
bradshaw finn curry anastasia hislop mitchell emerton xanthe dixon jack lowry jack paine hoffman beau reid abbey fitzpatrick anthony white chelsea dolan renee shadbolt ajay niznik darcy moody sam nguyen luke brain kenneth nguyen jessica rosada laura ebrahim riley kinnane charli lombardi thomas reid mark chandler blade pascale kayne morrison isabella george jordan alexandra coleman isabella friedman desi colquhoun pia dietrich reegan reid harriet blake garth garcia deakin killingbeck matthew meixner holly bishop kate nguyen tara huddy tori addo vanessa zheng jack o'neall sam purtell flamey ellen kowald matteus white charlotte crlok thomas highet benjamin gilley lauren maynard adam doman james natalie rigl charles matthews shantal mccarthy riley reid noah lock jack bowerman joshua engst carmen mchenry cooper hersey nellie clarke leah morrison liam hamley lucy kullmann benjamin balmain emiily blows jasmine vimpsny tyler verco jadyen willing noah fitzpatrick giuliana robnik sean white ruben bishop macormack renfrey ebonie hoffman robert dixon zac stegemann xavier sherriff georgia bibby amy coppola kobe morrison nellie pringle catherine heidrich madison lowe jacqueline pinzon kyle burford george lock harrison o'nieil andrew sherriff daniel lodge tara jacob trimmer ashlie george georgia green michael maczkowiack joshua craswell anita white rourke fazis mia matthews jai rees william ryan liam cameron-hill ronan halbe natasha matthews logan neville dylan zilm clain rundle jonah dakin christian bartel thomas green gabriel white nathan hingston hannah vincent catherine cure aidan kluge-vass corey white sarah doody tara reid ssakra grn matthew collo abby gold georgia jordan sheldon lachlan-john mcnicholl charlize mercorella gabrille kennealy benjamin hage montanna spicer joshua coleman luke white heather gowling sonja leslie jason fitzpatrick jarrod george emiily etchell benjamin wallent lucinda tweedie gregory parr elton clarke joshua yre michaela clarke roisin leicester 
brianna heyes zachary teague emalene coleman erin campbell emiily green bradley prak lewis moshos irene crighton mia herbert laura lamprey jasper bullock emma waller amy campbell hayden galbraith jaiden bartley-steele macey morrison thomas clarke chloe artis ella theodore amber rawlings jacinta berry jessica reid hudson janet tahlia bloomfield sara lund carla minter nasyah camens reegan hemmings jacob artis ethan innis madeleine berryman brydon benjamin justin clarke timothy britten riley deblasio taylor fitzpatrick sophie matthews tristan kane green april wynne kenneth ryan alyssa minner fergus george emiily edson patrick marinos joshua wynne crystal webb thomas jenko heath butt felicity roxburgh hayley caine jack madigan taalia mccrohan georgia zbierski arren hilton ned white harrison conrick arki bizuka lushia herbert elton saviska portia pattie alexandra barra kody clarke alessia pendle jack falcone george ryan caitlin willmoth zachary harry wilkins nicholas luchetti brodee paterson benjamin purtell amy sheldon jack fokeas andrew trent jack stubbs braedon stanley rebecca bradshaw benedict tiller campbell royle james westbrook bridget braithwaite lachlan gillard kane ogareff bryce collier lacan paine jessica groenewege amelie tinraw jessica averis aimee wilczek caitlin fullerton paige wynne kirra ryan joel fenwick alessandria redi jack stanley lachla-njohn lohmann william serrels horsley indiana brock connor mortlock daniel tschirn taliah heidrich ebony bastin jesse bartone summer blackwell kyle ciotti shane nikiforova jessica reid william harrington indiana green connor goode isabella george desi green tommy matthews meg ayres joel lillycrapp edward southwood benjamin george dylan coffey emiily tennent louise drechsler jake laundy sachin hamon brianna browne zachary de totth george joel noble benjamin purtell jackson gromball isabella casaretto luke langrehr gracie grosser emiily caire kailey cochrane tara matthews abii horsley emma alderson oliver ryan bailey 
noble robert ryan kyle robson magnus waller eves ryan lohmann darcy banham georgia gilbertson joshua capell benjamin drechsler holly froud chloe fuda alexander crossman robin robson laura harrington george putrananda alexwndra verner liam stephenson liam voulgaridis ambdr lungwitz millar emiily beattie charlotte george schkira wightm isabella manson jordan steers chloe smeaton william claydon zac byrt emiily maingot emiily white michael shepherd karla mormina finn crook ethan siviour courtney grosser sarah hiscox lani reid jade southgate thomas geue alexandra jeffries chloe curry jackson leanse bayley satterthwaite thomas thrift holly heyes garth ottens aidyn boyle brayden hyvad lauren noble scott babic xavier higgin fergus schiansky sam clissold alice hah teresa gilkes mitchell hazell danjel lock tori white tahlia jessen cooper gibb taylah caine amy victoria widdowson james dalkin tayla hilton matteps crellin harrison ringe amner manos robin clarke adele coulson jasper dunstone churkin sylvie gibb joel roan sophie thredgold brooke matuschik serena tristan held joshua hissan jessica noble ava clarke nathan white levi crouch barnaby fletcher-jones samuel gaskin alexis maier freya goldsworthy jasmin brain stephanie nguyen erin fizio joel fauser isabella warde steven nguyen chloe cort jackson kontopoulos jayden mcgregor amber beelitz rourke whishaw indyana dedovic clement thorpe shae crighton zachary kerley samantha rehr jacob nguyen sebastian tbe connor hoffman archer miles karla luchetti annabel kayser danielle millar emiily coleman henry roles latham campbell alexandra purtell lachlan mckerlie lara longhurst alicia crosswell kai webb hari mason vanessa neville tynan lock riley asher kieren matthews emiily bullock george dylan green cassidy wiseman sean lowe annabelle langer ashlee maniotto ellie brock benjamin hignett madeline nguyen joshua bitmead callum zimmermann joshua yee coffey cooper sotnik riley clarke annabelle vavic mitchell chandler emma mason katelyn 
miles ella la palombara chloe morrison savannah howie benjamin matson riley ho jackson white erin hope brodie walkley caitlin hacy erin green cain heron kira matthews dylan herbert jake lavis kiara sydenham benjamin purtell joshua mason joshua smithen caitlin coleman riley horriat alexander appletree seth joannou harrison gastarini daniel prideaux tarshya neumann bailey denholm william rosendale lauren ikonomou tess rawlings gabrielle miari teaha sherriff bailey glouftsis kailey pacey jeremy michielan kate stanley clodagh millar thomas woodbury ashleigh embrey matthew pasclie jemma wahlstedt boxhall joshua makenzie shepherd archie ryan jake robson brinley roche jessica mccourtie michaela fondum eliza mcphail jackson campain zarran virag cameron hemerka chelsea brokenshire cooper ho lachlan wohltmann samuel sideris luke epari joshua cicchini matthew berry kalinowski brodie bowerman holly mcmurdo marleigh webb emiily dinon dylan zimmermann chloe clarke marco maynard kate ma william rawlings mikaela bradshaw pia cuming henry browne bryce mezzomo holly corbin brooke miles nathan green emiily butt jake matthews kierra dinh samantha mccully indyana degree kelsey berry sophie dakin lara denholm james pendle kane stanwick jordan reid olivia campbell jasmine hawes charlie kambouras jessup oliver campbell zac ryan monique mcneill christopher shelley isabella mccarthy caitlin hand alexander leslie kate hope tayah negerman ryan berryman gabrielle menke tiarna raftopoulos elizabeth varcoe sarah hyland brianna rocca lucas parken james drechsler logan campbell bailey pascoe cameron cure simone baugh steven bishop carmen wilkins thomas stanley jye aquila kenneth nguyen mason stella nakahara kira seddon connor malpas joel elphick zac tuendemann joel white emiily matthews kieren astley michael hindley luke burford indiana goldsworthy kyle wehner bailee berry joshua beaty imogen webb dylan george jackson siller tristan dixon keziah white max bandick harley moody hannah giefdercht 
rosa clarke varcoe bianca sean greenland maddison quill jasmine zilm blakeston priesley matilda mcsporran sean everett adrian hauser lauren weaver ruby mccarthy patrick egglestone benjamin petersen william carofano adrian wheatley rosie hantke eleanor howie charlie campbell oliver thorpe sajfa matthews shannon nabi mikayla nguyen krystin tarshya neumann molly miles emiily subramaniam riley white coleman leicwsyer bethany vignoni-squire andrew huddy renai alderson jasmine lowe madison etheridge cicconi joshua cooper wilde jacob ryan luke clarke gemma clarke beth chandler logan white destynii fullerton joshua easley eliza green kiara waller alisha scherf jaime wojciak chelsea fogg emiily pascale blake benjamin whiteley melissa artym joe blake zachariah artis michael white jack henshaw samuel wilkins xavier braunack meg grainger sachin novak harrison lowe amalia bristow bronte pulvirenti esther caryjdl luke elrick lara vincent oliver wohltmann ellen spaninks naomi berry bridget titmarsh hamish abbatista michael coulson joel caitlin stanwick mitchell dixon connor webb thomas frew giorgia weaver hari bennier alana white katelyn crooke matthew everett jack balasubramaniam finn bilecki chloe green stephanie quilliam ashleigh hingston jordan clarke luke easlea cooper groenveld tenille warneke tiffany nguyen samantha reni taylor-saige wurm georgia worsley brigitte aitken-moss joshua noble holly grainger shae goy jessica mccarthy pia dixon henry roche lachlan clarke jaykob monteleone tanar bishop zachary millar raquel charmsn rachel george jack white matteus liapis emiily kallis giaan thurlow kane crouch caitlin pennell blake noble ella dixon ysobel merkoureas ellorah noble pakita nguyen tara green blayke toothill taylah crook claire coulson gianni kerley breony maytum liam lebrique matteus clarke natalie rigley alexandra record hugo grainger callum heenan ridley clarke vanesfsa manousos michaela noble ruby millar kailey berry josephine dantiochia benjamin reid edward 
culliton joel stanley zane george sienna shepherd callum rees calvin reid erin pitrans zachary boyes harry leaver oliver everett blake liam garcia ella fitzpatrick kobe ryan garth godfrey benjamin trenery matthew tsiounis kaitlin kalinowski alana dolliver jol ryan emma egan meggie fair jack clarke robert mckinnell alysha rauseo jacob coulson hannah boxhall grace reid riley boxhall joshua baillie jasper cookson rysn charlotte hazell chelsea brokenshire luke wilkins edward clarke luke elphick taylor braithwaite anika berry james anns dale baynes bodhi clarke dedicoat brigette hebberman hugo corbin madison costigan tyler needham chloe mason bronte salcedo tenille lonie ka coppola kyle kingsley georgia burleigh olivia segar ryan shepherd imogen bradshaw gracie nguyen clodagh godfrey isabella mackinlay james hilton bailey damien peddie jacob mcclelland adam ryan breana flaus benjamin nye anthony purtell aleisha matthews georgia liapis samuel pegg jesse green jakob reid zane shepherd henry dooley ellie wilde fenkck riley rapisarda abby griffen jasmyn clapham hannah valdivia jacob hilton kieren mattherws erin white liam mitton chela beckwgth douglas caire lewis samulewicz mccarthy brayden herbert samuel green lily warnock jessica newport adam rizzo andrew green erin colquhoun samuel campbell caleb stanfield ashlee white kelsea lowe joshua hope isobel pascoe lily peglidis benjamin godfrey alexandra tromp miller lodge sophie tidey jaykob wyllie taylah green charlotye mason olivia soden renee irving-rodgers lachlan richmond michael griffeun sophie white john white mia kaleb julia bruhn naomi dorsey abigail petersen isabella blackwell alexandra dignam sienna tsialafos jett holyhrim olivia jesser samantha hoffban robert philps joshua dixon ruby capurso logan cameron gaynes tynan cresp taylah ezis joshua green alyssa berry joshua novak tommi-lee gerick zoe clarke finn sikansari dylan reid shona green ethan blake jarred cochrane tiahana norgate keaton demarco jasper neumann 
schkirra beattie michael blackwell tansy lock logan dolan dylan versteeg isabella clarke ben sumsion erin gripton benjamin elphick lar mccarthy harrison juric aaron di cesare claire coulson dante shepherd sarah clarke michael green ella matthews harley eilander david minorchio caitlin satterley cade rawlings taylah obod keely fullgrabe kane havu alessia jayden chee alexa-rose daviess ella shepherdson aidan fenu adrian wah cooper gillard lawson ryan isabella snell jacqueline feeney nicholas bradshaw sophie kaufman keelin sritharan jade ovens kira ardwlkh jenna salt sophie moulton james darcy-clarke bethanie zaimis lucy ung hang skye nguyen sophie gaskin niamh wiseman tarshya spratt james coleman harry purdey kristen britten ela dabinet jarvis radovic jade feneey toby mccaffrey william hassel patrick parr isabella everett harry betrum troy blake robbie ryan elijah lewan rosie pascoe isabelle nguyen louis ryan jackson ronan jessica blake aidyn blewett joshua weller katelyn parremore riley kao gus huggins tenille longman alana bishop isabella proks lara curry kimberly thiessen laura green abbey pulford jack bishop caitlin ryan heath vorrasi jade baust alicia hobson katie bunt isabelle menache renee ryan kieren wiley-smith joshua miles vendula white brooklyn white oliver campbell noah jolly zachary boyes crystal ryan jessica goldsworthy nicholas gloster isabelle warrior luke lowe isaac nimnualsan robbie miels liam neumann bradley seen lauren morrison bethanie tsakiridis menzies seamus maczkowiack darcy ryan shelbey kai witty sam green samuel white kyle raner connor banful charlotte hazell jacob bonthuis flynn mao jasmyn ballantyne nathan ryan jade hemus blake compton olivia capurso jack webb madeline dixon jayden harrington blake hiras ryan bibby gabrielle averis mitchell clarke matthew zappulla teegan ramiro emiily miles zoe trainor keira white jessica costain georgia mccartmer cameron crellin heather tuckwell nicholas morcom cameron beeche matthew stephenson jesse 
lowe mackenzi ballantyne brew chander hannah feeney jack green sybella kiwk charlotte eglinton imogen ryan stanley joel wilkins nicholas de angelis ned white mia bartel andrew dixon jacob skarmoupsos lochlan bodhi lowe april hiscox brayden hylqbd leohj priest caitlin eayrs ethan stagll zara eisele joshua howie samara flannery alysha rees tiarna clarke rebecca rivers hannah jeffries corey white kyle waller joe donaldson ldct kullmann cooper gulia madeline wardle tyler crook georgia churches coby matthews kalli hohn casey van tuil michael hilton niamh mildren annabelle rees bayden lydon brayden ryan izaac ryan emiily goode callum bottroff cassandra wilkey emiily coxon ryan grooby brooklyn rivers connor rafanelli emiily bishop holly froud dominic cradock mia donaldson hayden lomman sian lowe kate ryan indiana berryman zachary matthews stephanie badger emiily koulizos bridget matthews lauren adair jasmin crossman bayley hearn samuel ryan blake dixon raquel ryan william browne kaysey scarce pattichis sam rump connor thalmann-mckay nicholas coleman leon panagaris matthew webb kayden green peter dunstone georgia yu joshua gohl talia kran z teagan webb jessica green cooper arbuckle jessie ganibegovic nikita green chelsea rosendale liam neville kody bosonworth lucas fitzpadrick tegan dockray tarnia kreibig max campbell-smith abbey wiseman henry roebuck lawson lodge jacob hope chelsea sitepa abby caruana cooper sedunary ella campbell daniel hyland reuben burford anna arganese niamh nguyen jack mason zachary bartel harrison covino brigitte coleman darcy vanhoppf kaitlyn white jacob tomney cain green eva pizanias natasha dolan talan campbell zachary arganese zali fuda isabella barbuto natalee donaldson simone matthews annabelle reece edward clarke oscar dreyer hayley bristow magnus verco sophie madigan tara townsendt-dvans thomas tao xavier purdon william coleman hugo kuhndt isabella green emiily ranson joshua paterson carly galinec andrew ho april carmody emiily blake taylah 
green stephanie taladira joshua cowlam isobel petrunic georgia morrison kyra singleton olivia reuter liam white emiily campbell alexander roundhill nathan berry benjamin ilczak trinity nalpantidis holly cutcliffe courtney frahn sophie matthews natalie rigley joel longo grace nuh jett dagley joel ryan tarshya tremellen james hodby isabella everett jasmine burkyevcs oliver bellamy erin nguyen ashleigh brazzalotto jessica treplin jessica barram yasmin ryan annabel lomman ridley clarke isabella shepherd tess yen ella darcy-clarke amalia grosser damien swynczuk emiily lazaroff anari clarke creber eleni reid chelsea newbould ryan mcelwee nikita stephenson matthew mccarthy sophie webb clodagh garnsey quitnn setlhong connor paterson holly summer cammell spargo alexander dietmayer hamish render tara egan nathan everett campbell campbell taliah leong isabella nguyen madison kyriacou declen ho james nguyen heath gail georgia lowe izaac neville olivia evison danielle roche emma roynz jaxin pawsey rohan dixon alessandra kilmartin saara matilda clarke emiily seddon bailey oaks finn cannell hersey hayley ryan jack grubb geofrgia hermans charlie breedt logan srt tayla masol noah coulton pg-similarity/dice.c000066400000000000000000000065301276707706500147450ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * dice.c * * Dice's coefficient is a similarity measure * * 2 * nt * s = --------- * nx + ny * * where nt is the number of n-grams found in both strings, nx is the * number of n-grams in x and, ny is the number of n-grams in y. * * For example: * * x: euler = {eu, ul, le, er} * y: heuser = {he, eu, us, se, er} * * 2*nt 2 * 2 * s = --------- = ------- = 0.4444... 
* nx + ny 4 + 5 * * PS> we call n-grams: (i) n-sequence of letters (ii) n-sequence of words * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" /* GUC variables */ int pgs_dice_tokenizer = PGS_UNIT_ALNUM; double pgs_dice_threshold = 0.7f; bool pgs_dice_is_normalized = true; PG_FUNCTION_INFO_V1(dice); Datum dice(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t; int atok, btok, comtok, alltok; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* sets */ s = initTokenList(1); t = initTokenList(1); switch (pgs_dice_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, a); tokenizeByNonAlnum(t, b); break; } elog(DEBUG3, "Token List A"); printToken(s); elog(DEBUG3, "Token List B"); printToken(t); atok = s->size; btok = t->size; /* combine the sets */ switch (pgs_dice_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, b); break; } elog(DEBUG3, "All Token List"); printToken(s); alltok = s->size; destroyTokenList(s); destroyTokenList(t); comtok = atok + btok - alltok; elog(DEBUG1, "is normalized: %d", pgs_dice_is_normalized); elog(DEBUG1, "token list A size: %d", atok); 
elog(DEBUG1, "token list B size: %d", btok); elog(DEBUG1, "all tokens size: %d", alltok); elog(DEBUG1, "common tokens size: %d", comtok); /* normalized and unnormalized version are the same */ res = (float8) (2.0 * comtok) / (atok + btok); PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(dice_op); Datum dice_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_dice_is_normalized; pgs_dice_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( dice, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_dice_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_dice_threshold); } pg-similarity/doc/000077500000000000000000000000001276707706500144365ustar00rootroot00000000000000pg-similarity/doc/index.html000066400000000000000000000527701276707706500164460ustar00rootroot00000000000000 pg_similarity Documentation

Introduction

pg_similarity is an extension to support similarity queries on PostgreSQL. The implementation is tightly integrated in the RDBMS in the sense that it defines operators so instead of the traditional operators (= and <>) you can use ~~~ and ~!~ (any of these operators represents a similarity function).

pg_similarity has three main components:

  • Functions: a set of functions that implement similarity algorithms available in the literature. These functions can be used as UDFs and are the basis for the similarity operators;
  • Operators: a set of operators defined on top of the similarity functions. Each operator calls its similarity function to obtain the similarity value and compares that value to a user-defined threshold to decide whether it is a match;
  • Session Variables: a set of variables that store similarity function parameters. These variables can be set at run time.

Installation

pg_similarity is distributed as a source package and can be downloaded at PGFoundry. This extension is supported on the same platforms as PostgreSQL. The installation steps depend on your operating system.

You can also keep up with the latest fixes and features by cloning the Git repository:

$ git clone https://github.com/eulerto/pg_similarity.git

UNIX based Operating Systems

Before you can use the extension, you must build it and load it into the desired database.

$ tar -zxf pg_similarity-0.0.19.tgz
$ cd pg_similarity-0.0.19
$ $EDITOR Makefile # edit PG_CONFIG if necessary
$ USE_PGXS=1 make
$ USE_PGXS=1 make install
$ psql -f SHAREDIR/contrib/pg_similarity.sql mydb # SHAREDIR is pg_config --sharedir

To use it, simply load it into the server. You can load it into an individual session:

$ psql mydb
psql (9.0.3)
Type "help" for help.

mydb=# load 'pg_similarity';
LOAD

The typical usage, however, is to preload it into all sessions by including pg_similarity in shared_preload_libraries in postgresql.conf. Keep in mind that this adds an overhead to each new connection.

Windows

Sorry, never tried that!

Functions and Operators

This extension supports a set of similarity algorithms and covers the best-known ones. Be aware that each algorithm is suited to a specific domain. The following algorithms are provided.
  • L1 Distance (also known as City Block or Manhattan Distance);
  • Cosine Distance;
  • Dice Coefficient;
  • Euclidean Distance;
  • Hamming Distance;
  • Jaccard Coefficient;
  • Jaro Distance;
  • Jaro-Winkler Distance;
  • Levenshtein Distance;
  • Matching Coefficient;
  • Monge-Elkan Coefficient;
  • Needleman-Wunsch Coefficient;
  • Overlap Coefficient;
  • Q-Gram Distance;
  • Smith-Waterman Coefficient;
  • Smith-Waterman-Gotoh Coefficient.
Algorithm | Function | Operator | Parameters
L1 Distance | block(text, text) returns float4 | text ~++ text | pg_similarity.block_tokenizer (enum), pg_similarity.block_threshold (float4), pg_similarity.block_is_normalized (bool)
Cosine Distance | cosine(text, text) returns float4 | text ~## text | pg_similarity.cosine_tokenizer (enum), pg_similarity.cosine_threshold (float4), pg_similarity.cosine_is_normalized (bool)
Dice Coefficient | dice(text, text) returns float4 | text ~-~ text | pg_similarity.dice_tokenizer (enum), pg_similarity.dice_threshold (float4), pg_similarity.dice_is_normalized (bool)
Euclidean Distance | euclidean(text, text) returns float4 | text ~!! text | pg_similarity.euclidean_tokenizer (enum), pg_similarity.euclidean_threshold (float4), pg_similarity.euclidean_is_normalized (bool)
Hamming Distance | hamming(bit varying, bit varying) returns float4 |   | pg_similarity.hamming_threshold (float4), pg_similarity.hamming_is_normalized (bool)
Jaccard Coefficient | jaccard(text, text) returns float4 | text ~?? text | pg_similarity.jaccard_tokenizer (enum), pg_similarity.jaccard_threshold (float4), pg_similarity.jaccard_is_normalized (bool)
Jaro Distance | jaro(text, text) returns float4 | text ~%% text | pg_similarity.jaro_threshold (float4), pg_similarity.jaro_is_normalized (bool)
Jaro-Winkler Distance | jarowinkler(text, text) returns float4 | text ~@@ text | pg_similarity.jarowinkler_threshold (float4), pg_similarity.jarowinkler_is_normalized (bool)
Levenshtein Distance | lev(text, text) returns float4 | text ~== text | pg_similarity.levenshtein_threshold (float4), pg_similarity.levenshtein_is_normalized (bool)
Matching Coefficient | matchingcoefficient(text, text) returns float4 | text ~^^ text | pg_similarity.matching_tokenizer (enum), pg_similarity.matching_threshold (float4), pg_similarity.matching_is_normalized (bool)
Monge-Elkan Coefficient | mongeelkan(text, text) returns float4 | text ~|| text | pg_similarity.mongeelkan_tokenizer (enum), pg_similarity.mongeelkan_threshold (float4), pg_similarity.mongeelkan_is_normalized (bool)
Needleman-Wunsch Coefficient | needlemanwunsch(text, text) returns float4 | text ~#~ text | pg_similarity.needlemanwunsch_threshold (float4), pg_similarity.needlemanwunsch_is_normalized (bool)
Overlap Coefficient | overlapcoefficient(text, text) returns float4 | text ~** text | pg_similarity.overlap_tokenizer (enum), pg_similarity.overlap_threshold (float4), pg_similarity.overlap_is_normalized (bool)
Q-Gram Distance | qgram(text, text) returns float4 | text ~~~ text | pg_similarity.qgram_threshold (float4), pg_similarity.qgram_is_normalized (bool)
Smith-Waterman Coefficient | smithwaterman(text, text) returns float4 | text ~=~ text | pg_similarity.smithwaterman_threshold (float4), pg_similarity.smithwaterman_is_normalized (bool)
Smith-Waterman-Gotoh Coefficient | smithwatermangotoh(text, text) returns float4 | text ~!~ text | pg_similarity.smithwatermangotoh_threshold (float4), pg_similarity.smithwatermangotoh_is_normalized (bool)

Several parameters control the behavior of the pg_similarity functions and operators. The parameters are not described one by one because each of them falls into one of three classes: tokenizer, threshold, and normalized.

  • tokenizer: controls how the strings are tokenized. The valid values are alnum, gram, word, and camelcase. All tokens are lowercased (this option can be set at compile time; see PGS_IGNORE_CASE in the source code). Default is alnum;
    • alnum: delimiters are any non-alphanumeric characters. That is, only alphabetic characters in the standard C locale and digits (0-9) are accepted in tokens. For example, the string "Euler_Taveira_de_Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", "22", "02", and "2011";
    • gram: an n-gram is a subsequence of length n. Extracting n-grams from a string can be done by using the sliding-by-one technique, that is, sliding a window of length n throughout the string by one character. For example, the string "euler taveira" (using n = 3) is tokenized as "eul", "ule", "ler", "er ", "r t", " ta", "tav", "ave", "vei", "eir", and "ira". Some authors also add "  e", " eu", "ra ", and "a  " to the set of tokens; these are called full n-grams (this option can be set at compile time; see PGS_FULL_NGRAM in the source code);
    • word: delimiters are white space characters (space, form-feed, newline, carriage return, horizontal tab, and vertical tab). For example, the string "Euler Taveira de Oliveira 22/02/2011" is tokenized as "Euler", "Taveira", "de", "Oliveira", and "22/02/2011";
    • camelcase: delimiters are capitalized characters but they are also included as first token characters. For example, the string "EulerTaveira de Oliveira" is tokenized as "Euler", "Taveira de ", and "Oliveira".
  • threshold: controls how flexible the result set is. These values are used by operators to match strings. For each pair of strings, if the calculated value (using the corresponding similarity function) is greater than or equal to the threshold value, there is a match. The values range from 0.0 to 1.0. Default is 0.7;
  • normalized: controls whether the similarity coefficient/distance is normalized (between 0.0 and 1.0) or not. Normalized values are used automatically by operators to match strings, that is, this parameter only makes sense if you are using similarity functions. Default is true.
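The sliding-by-one technique described for the gram tokenizer can be sketched in a few lines of C. This is only an illustration, not the extension's tokenizer: the function name make_ngrams, the MAX_N limit, and the fixed-size output array are assumptions made for the example, and neither lowercasing nor full n-grams are handled.

```c
#include <assert.h>
#include <string.h>

#define MAX_N 7					/* hypothetical upper bound for n in this sketch */

/*
 * Hypothetical helper (not part of pg_similarity): copies every n-gram of
 * s into out, sliding a window of length n one character at a time.
 * Returns the number of n-grams written (at most max_out).
 */
size_t
make_ngrams(const char *s, size_t n, char out[][MAX_N + 1], size_t max_out)
{
	size_t		len = strlen(s);
	size_t		count = 0;
	size_t		i;

	assert(n > 0 && n <= MAX_N);

	/* a string shorter than n has no n-grams in this sketch */
	if (len < n)
		return 0;

	for (i = 0; i + n <= len && count < max_out; i++)
	{
		memcpy(out[count], s + i, n);
		out[count][n] = '\0';
		count++;
	}

	return count;
}
```

Calling make_ngrams("euler taveira", 3, grams, 32) with grams declared as char grams[32][MAX_N + 1] fills the array with the eleven tokens listed for the gram tokenizer, from "eul" to "ira".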

Examples

Set parameters at run time.

mydb=# show pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold 
-------------------------------------
 0.7
(1 row)

mydb=# set pg_similarity.levenshtein_threshold to 0.5;
SET
mydb=# show pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold 
-------------------------------------
 0.5
(1 row)

mydb=# set pg_similarity.cosine_tokenizer to camelcase;
SET
mydb=# set pg_similarity.euclidean_is_normalized to false;
SET
					

Simple tables for examples.

mydb=# create table foo (a text);
CREATE TABLE
mydb=# insert into foo values('Euler'),('Oiler'),('Euler Taveira de Oliveira'),('Maria Taveira dos Santos'),('Carlos Santos Silva');
INSERT 0 5
mydb=# create table bar (b text);
CREATE TABLE
mydb=# insert into bar values('Euler T. de Oliveira'),('Euller'),('Oliveira, Euler Taveira'),('Sr. Oliveira');
INSERT 0 4
					

Example 1: Using similarity functions cosine, jaro, and euclidean.

mydb=# select a, b, cosine(a,b), jaro(a, b), euclidean(a, b) from foo, bar;
             a             |            b            |  cosine  |   jaro   | euclidean 
---------------------------+-------------------------+----------+----------+-----------
 Euler                     | Euler T. de Oliveira    |      0.5 |     0.75 |  0.579916
 Euler                     | Euller                  |        0 | 0.944444 |         0
 Euler                     | Oliveira, Euler Taveira |  0.57735 | 0.605797 |  0.552786
 Euler                     | Sr. Oliveira            |        0 | 0.505556 |  0.225403
 Oiler                     | Euler T. de Oliveira    |        0 | 0.472222 |  0.457674
 Oiler                     | Euller                  |        0 |      0.7 |         0
 Oiler                     | Oliveira, Euler Taveira |        0 | 0.672464 |  0.367544
 Oiler                     | Sr. Oliveira            |        0 | 0.672222 |  0.225403
 Euler Taveira de Oliveira | Euler T. de Oliveira    |     0.75 |  0.79807 |      0.75
 Euler Taveira de Oliveira | Euller                  |        0 | 0.677778 |  0.457674
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.866025 | 0.773188 |       0.8
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.353553 | 0.592222 |  0.552786
 Maria Taveira dos Santos  | Euler T. de Oliveira    |        0 |  0.60235 |       0.5
 Maria Taveira dos Santos  | Euller                  |        0 | 0.305556 |  0.457674
 Maria Taveira dos Santos  | Oliveira, Euler Taveira | 0.288675 | 0.535024 |  0.552786
 Maria Taveira dos Santos  | Sr. Oliveira            |        0 | 0.634259 |  0.452277
 Carlos Santos Silva       | Euler T. de Oliveira    |        0 | 0.542105 |   0.47085
 Carlos Santos Silva       | Euller                  |        0 | 0.312865 |  0.367544
 Carlos Santos Silva       | Oliveira, Euler Taveira |        0 | 0.606662 |   0.42265
 Carlos Santos Silva       | Sr. Oliveira            |        0 | 0.507728 |  0.379826
(20 rows)
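To make the jaro column easier to interpret, here is a minimal C sketch of the textbook Jaro similarity. It is not the extension's implementation: the name jaro_sketch is hypothetical, and two details are assumptions chosen because they reproduce the rows checked by hand (for example 0.944444 for "Euler" / "Euller" and 0.505556 for "Euler" / "Sr. Oliveira"): characters are compared case-insensitively, and the transposition count is halved with integer division.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

static int
lower(char c)
{
	return tolower((unsigned char) c);
}

/* Hypothetical sketch of the Jaro similarity; returns a value in [0, 1]. */
double
jaro_sketch(const char *a, const char *b)
{
	size_t		la = strlen(a);
	size_t		lb = strlen(b);
	size_t		maxlen = (la > lb) ? la : lb;
	size_t		window, i, j, k = 0;
	size_t		matches = 0, transpositions = 0;
	char	   *amatch, *bmatch;
	double		m;

	if (la == 0 && lb == 0)
		return 1.0;
	if (la == 0 || lb == 0)
		return 0.0;

	/* characters match only if they are at most `window` positions apart */
	window = maxlen / 2;
	if (window > 0)
		window--;

	amatch = calloc(la, 1);
	bmatch = calloc(lb, 1);
	assert(amatch != NULL && bmatch != NULL);

	for (i = 0; i < la; i++)
	{
		size_t		lo = (i > window) ? i - window : 0;
		size_t		hi = (i + window + 1 < lb) ? i + window + 1 : lb;

		for (j = lo; j < hi; j++)
		{
			if (!bmatch[j] && lower(a[i]) == lower(b[j]))
			{
				amatch[i] = bmatch[j] = 1;
				matches++;
				break;
			}
		}
	}

	/* matched characters that appear in a different order are transpositions */
	for (i = 0; i < la && matches > 0; i++)
	{
		if (!amatch[i])
			continue;
		while (!bmatch[k])
			k++;
		if (lower(a[i]) != lower(b[k]))
			transpositions++;
		k++;
	}

	free(amatch);
	free(bmatch);

	if (matches == 0)
		return 0.0;

	m = (double) matches;
	return (m / la + m / lb + (m - (double) (transpositions / 2)) / m) / 3.0;
}
```

Under these assumptions, "Oiler" and "Euller" share three characters within the match window and no transpositions, which gives (3/5 + 3/6 + 1) / 3 = 0.7, the value shown in the table above.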
					

Example 2: Using operator levenshtein (~==) and changing its threshold at run time.

mydb=# show pg_similarity.levenshtein_threshold;
 pg_similarity.levenshtein_threshold 
-------------------------------------
 0.7
(1 row)

mydb=# select a, b, lev(a,b) from foo, bar where a ~== b;
             a             |          b           |   lev    
---------------------------+----------------------+----------
 Euler                     | Euller               | 0.833333
 Euler Taveira de Oliveira | Euler T. de Oliveira |     0.76
(2 rows)

mydb=# set pg_similarity.levenshtein_threshold to 0.5;
SET
mydb=# select a, b, lev(a,b) from foo, bar where a ~== b;
             a             |          b           |   lev    
---------------------------+----------------------+----------
 Euler                     | Euller               | 0.833333
 Oiler                     | Euller               |      0.5
 Euler Taveira de Oliveira | Euler T. de Oliveira |     0.76
(3 rows)
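The effect of the threshold in this example can be reproduced with a small C sketch. It assumes, without having been checked against the extension's source, that the normalized value is 1 - d / max(|a|, |b|), where d is the case-insensitive unit-cost edit distance; this assumption is consistent with the three lev values shown above. The names edit_distance, lev_similarity, and lev_match are hypothetical.

```c
#include <assert.h>
#include <ctype.h>
#include <stdlib.h>
#include <string.h>

/* Unit-cost edit distance, case-insensitive, computed with two DP rows. */
static size_t
edit_distance(const char *a, const char *b)
{
	size_t		la = strlen(a);
	size_t		lb = strlen(b);
	size_t	   *prev = malloc((lb + 1) * sizeof(*prev));
	size_t	   *curr = malloc((lb + 1) * sizeof(*curr));
	size_t	   *tmp;
	size_t		i, j, result;

	assert(prev != NULL && curr != NULL);

	/* distance from the empty prefix of a to each prefix of b */
	for (j = 0; j <= lb; j++)
		prev[j] = j;

	for (i = 1; i <= la; i++)
	{
		curr[0] = i;
		for (j = 1; j <= lb; j++)
		{
			int			same = tolower((unsigned char) a[i - 1]) ==
				tolower((unsigned char) b[j - 1]);
			size_t		sub = prev[j - 1] + (same ? 0 : 1);
			size_t		del = prev[j] + 1;
			size_t		ins = curr[j - 1] + 1;
			size_t		best = (sub < del) ? sub : del;

			curr[j] = (best < ins) ? best : ins;
		}
		tmp = prev;
		prev = curr;
		curr = tmp;
	}

	result = prev[lb];
	free(prev);
	free(curr);
	return result;
}

/* Assumed normalization: 1 - distance / length of the longer string. */
double
lev_similarity(const char *a, const char *b)
{
	size_t		la = strlen(a);
	size_t		lb = strlen(b);
	size_t		maxlen = (la > lb) ? la : lb;

	if (maxlen == 0)
		return 1.0;
	return 1.0 - (double) edit_distance(a, b) / (double) maxlen;
}

/* Mimics the operator semantics: a match when similarity >= threshold. */
int
lev_match(const char *a, const char *b, double threshold)
{
	return lev_similarity(a, b) >= threshold;
}
```

With this sketch, "Oiler" and "Euller" are three edits apart over a maximum length of six, so the similarity is 0.5: lev_match rejects the pair at the default threshold of 0.7 and accepts it once the threshold is lowered to 0.5, which is why the second query returns one more row.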
					

Example 3: Using operator qgram (~~~) and changing its threshold at run time.

mydb=# set pg_similarity.qgram_threshold to 0.7;
SET
mydb=# show pg_similarity.qgram_threshold;
 pg_similarity.qgram_threshold 
-------------------------------
 0.7
(1 row)

mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b;
             a             |            b            |  qgram   
---------------------------+-------------------------+----------
 Euler                     | Euller                  |      0.8
 Euler Taveira de Oliveira | Euler T. de Oliveira    |  0.77551
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692
(3 rows)

mydb=# set pg_similarity.qgram_threshold to 0.35;
SET
mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b;
             a             |            b            |  qgram   
---------------------------+-------------------------+----------
 Euler                     | Euler T. de Oliveira    | 0.413793
 Euler                     | Euller                  |      0.8
 Oiler                     | Euller                  |      0.4
 Euler Taveira de Oliveira | Euler T. de Oliveira    |  0.77551
 Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692
 Euler Taveira de Oliveira | Sr. Oliveira            | 0.439024
(6 rows)
					

Example 4: Using a set of operators with the same threshold (0.7) to illustrate that some similarity functions are appropriate to certain data domains.

mydb=# select * from bar where b ~@@ 'euler'; -- jaro-winkler operator
          b           
----------------------
 Euler T. de Oliveira
 Euller
(2 rows)

mydb=# select * from bar where b ~~~ 'euler'; -- qgram operator
 b 
---
(0 rows)

mydb=# select * from bar where b ~== 'euler'; -- levenshtein operator
   b    
--------
 Euller
(1 row)

mydb=# select * from bar where b ~## 'euler'; -- cosine operator
 b 
---
(0 rows)
					

License

Copyright © 2008-2011 Euler Taveira de Oliveira
All rights reserved.
Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:
  • Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer;
  • Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution;
  • Neither the name of the Euler Taveira de Oliveira nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.
THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
pg-similarity/doc/screen.css000066400000000000000000000030301276707706500164230ustar00rootroot00000000000000/* basic classes */ body { margin: 0px; padding: 0px; border: 0px; background-color: #eeeeee; color: #000000; font-family: Georgia, Verdana, Arial; } h1, h2, h3, h4, h5, h6, p, li, td { font-family: Georgia, Verdana, Arial; color: #000000; text-align: justify; } h1 { font-size: 22px; color: #338800; } h2 { font-size: 18px; color: #338800; } p, li, td { font-size: 15px; } blockquote { font-family: Courier, Georgia, "Times New Roman", Verdana, Arial; font-size: 15px; } code, pre { font-size: 13px; } a { font-weight: bold; color: #d4aa00; text-decoration: none; } a:hover { color: #338800; background-color: #ffffff; text-decoration: none; } table { width: 100%; border: 1px solid #000000; padding: 0px; } th { border: 1px solid #000000; font-weight: bold; text-align: center; } td { border: 1px solid #000000; } td.c { text-align: center; } /* custom classes */ #tmenu { position: fixed; width: 100%; padding-top: 10px; padding-bottom: 10px; border: 0px; background: #338000; font-family: Georgia, "Times New Roman", Verdana, Arial; font-size: 20px; color: #ffffff; text-align: center; } #tmenu a { color: #eeeeee; text-decoration: none; } #tmenu a:hover { color: #d4aa00; text-decoration: none; } #tcontent { width: 90%; text-align: justify; padding-left: 40px; } #intro { padding-top: 50px; } .dashedbox { background-color: #ffffff; border: 1px dashed #000000; padding: 10px; } li.licenseblock { font-family: Courier, Georgia, "Times New Roman", Verdana, Arial; font-size: 15px; } pg-similarity/euclidean.c000066400000000000000000000105301276707706500157650ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * euclidean.c * * Euclidean distance * * aims to measure the distance between two strings * * X is a list of n-grams * Y is a list of n-grams * T is a set of n-grams of X and/or Y * * For each n-gram in T we count 
occurrences in X and Y; we sum the * quadratic difference between nx and ny. At the end, we get the * square root of the sum. * * For example: * * x: euler = {eu, ul, le, er} * y: heuser = {he, eu, us, se, er} * t: {eu, ul, le, er, he, us, se} * * eu ul le er he us se * s = sqrt((1 - 1)^2 + (1 - 0)^2 + (1 - 0)^2 + (1 - 1)^2 + (0 - 1)^2 + (0 - 1)^2 + (0 - 1)^2) = * s = sqrt(5) = 2.236067977... * * PS> we call n-grams: (i) n-sequence of letters (ii) n-sequence of words * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" #include <math.h> /* GUC variables */ int pgs_euclidean_tokenizer = PGS_UNIT_ALNUM; double pgs_euclidean_threshold = 0.7f; bool pgs_euclidean_is_normalized = true; PG_FUNCTION_INFO_V1(euclidean); Datum euclidean(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t, *u; Token *p, *q, *r; double totdistance; double totpossible; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* lists */ s = initTokenList(0); t = initTokenList(0); /* set list */ u = initTokenList(1); switch (pgs_euclidean_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); /* all tokens in a set */ tokenizeBySpace(u, a); tokenizeBySpace(u, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); /* all tokens in a set */ tokenizeByGram(u, a); tokenizeByGram(u, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); /* all tokens in a set */ tokenizeByCamelCase(u, a); tokenizeByCamelCase(u, b); break; case PGS_UNIT_ALNUM: /* default */ default: 
tokenizeByNonAlnum(s, a); tokenizeByNonAlnum(t, b); /* all tokens in a set */ tokenizeByNonAlnum(u, a); tokenizeByNonAlnum(u, b); break; } elog(DEBUG3, "Token List A"); printToken(s); elog(DEBUG3, "Token List B"); printToken(t); elog(DEBUG3, "All Token List"); printToken(u); totpossible = sqrt(s->size * s->size + t->size * t->size); totdistance = 0.0; p = u->head; while (p != NULL) { int acnt = 0; int bcnt = 0; q = s->head; while (q != NULL) { elog(DEBUG4, "p: %s; q: %s", p->data, q->data); if (strcmp(p->data, q->data) == 0) { acnt++; break; } q = q->next; } r = t->head; while (r != NULL) { elog(DEBUG4, "p: %s; r: %s", p->data, r->data); if (strcmp(p->data, r->data) == 0) { bcnt++; break; } r = r->next; } totdistance += (acnt - bcnt) * (acnt - bcnt); elog(DEBUG2, "\"%s\" => acnt(%d); bcnt(%d); totdistance(%.2f)", p->data, acnt, bcnt, totdistance); p = p->next; } totdistance = sqrt(totdistance); elog(DEBUG1, "is normalized: %d", pgs_euclidean_is_normalized); elog(DEBUG1, "total possible: %.2f", totpossible); elog(DEBUG1, "total distance: %.2f", totdistance); destroyTokenList(s); destroyTokenList(t); destroyTokenList(u); if (pgs_euclidean_is_normalized) res = (totpossible - totdistance) / totpossible; else res = totdistance; PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(euclidean_op); Datum euclidean_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_euclidean_is_normalized; pgs_euclidean_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( euclidean, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_euclidean_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_euclidean_threshold); } pg-similarity/expected/000077500000000000000000000000001276707706500154725ustar00rootroot00000000000000pg-similarity/expected/test1.out000066400000000000000000000064101276707706500172640ustar00rootroot00000000000000-- -- pg_similarity -- 
testing similarity functions and operators -- -- -- Turn off echoing so that expected file does not depend on contents of -- this file -- SET client_min_messages to warning; \set ECHO none \set a '\'Euler Taveira de Oliveira\'' \set b '\'Euler T Oliveira\'' \set c '\'Oiler Taviera do Oliviera\'' select block(:a, :b), block_op(:a, :b), :a ~++ :b as operator; block | block_op | operator -------------------+----------+---------- 0.571428571428571 | f | f (1 row) select cosine(:a, :b), cosine_op(:a, :b), :a ~## :b as operator; cosine | cosine_op | operator -------------------+-----------+---------- 0.577350269189626 | f | f (1 row) select dice(:a, :b), dice_op(:a, :b), :a ~-~ :b as operator; dice | dice_op | operator -------------------+---------+---------- 0.571428571428571 | f | f (1 row) select euclidean(:a, :b), euclidean_op(:a, :b), :a ~!! :b as operator; euclidean | euclidean_op | operator -------------------+--------------+---------- 0.653589838486225 | f | f (1 row) select hamming_text(:a, :c), hamming_text_op(:a, :c), :a ~@~ :c as operator; hamming_text | hamming_text_op | operator --------------+-----------------+---------- 0.72 | t | t (1 row) select jaccard(:a, :b), jaccard_op(:a, :b), :a ~?? 
:b as operator; jaccard | jaccard_op | operator ---------+------------+---------- 0.4 | f | f (1 row) select jaro(:a, :b), jaro_op(:a, :b), :a ~%% :b as operator; jaro | jaro_op | operator -------------------+---------+---------- 0.796666666666667 | t | t (1 row) select jarowinkler(:a, :b), jarowinkler_op(:a, :b), :a ~@@ :b as operator; jarowinkler | jarowinkler_op | operator -------------+----------------+---------- 0.878 | t | t (1 row) select lev(:a, :b), lev_op(:a, :b), :a ~== :b as operator; lev | lev_op | operator ------+--------+---------- 0.64 | f | f (1 row) select levslow(:a, :b), levslow_op(:a, :b); levslow | levslow_op ---------+------------ 0.64 | f (1 row) select matchingcoefficient(:a, :b), matchingcoefficient_op(:a, :b), :a ~^^ :b as operator; matchingcoefficient | matchingcoefficient_op | operator ---------------------+------------------------+---------- 0.5 | f | f (1 row) --select mongeelkan(:a, :b), mongeelkan_op(:a, :b), :a ~|| :b as operator; --select needlemanwunsch(:a, :b), needlemanwunsch_op(:a, :b), :a ~#~ :b as operator; select overlapcoefficient(:a, :b), overlapcoefficient_op(:a, :b), :a ~** :b as operator; overlapcoefficient | overlapcoefficient_op | operator --------------------+-----------------------+---------- 0.666666666666667 | f | f (1 row) select qgram(:a, :b), qgram_op(:a, :b), :a ~~~ :b as operator; qgram | qgram_op | operator -------------------+----------+---------- 0.711111111111111 | t | t (1 row) --select smithwaterman(:a, :b), smithwaterman_op(:a, :b), :a ~=~ :b as operator; --select smithwatermangotoh(:a, :b), smithwatermangotoh_op(:a, :b), :a ~!~ :b as operator; select soundex(:a, :b), soundex_op(:a, :b), :a ~*~ :b as operator; soundex | soundex_op | operator ---------+------------+---------- 1 | t | t (1 row) pg-similarity/expected/test2.out000066400000000000000000000021221276707706500172610ustar00rootroot00000000000000-- -- pg_similarity -- testing similarity variables -- -- -- Clean up in case a prior regression run 
failed -- RESET client_min_messages; \set ECHO all -- -- errors -- SHOW pg_similarity.foo_tokenizer; ERROR: unrecognized configuration parameter "pg_similarity.foo_tokenizer" SHOW pg_similarity.foo_is_normalized; ERROR: unrecognized configuration parameter "pg_similarity.foo_is_normalized" SET pg_similarity.cosine_threshold to 1.1; ERROR: 1.1 is outside the valid range for parameter "pg_similarity.cosine_threshold" (0 .. 1) SET pg_similarity.qgram_tokenizer to 'alnum'; ERROR: invalid value for parameter "pg_similarity.qgram_tokenizer": "alnum" HINT: Available values: gram. SHOW pg_similarity.jaro_tokenizer; ERROR: unrecognized configuration parameter "pg_similarity.jaro_tokenizer" -- -- valid values -- SET pg_similarity.block_is_normalized to true; SET pg_similarity.cosine_threshold = 0.72; SET pg_similarity.dice_tokenizer to 'alnum'; SET pg_similarity.euclidean_is_normalized to false; SET pg_similarity.jaro_winkler_is_normalized to false; SET pg_similarity.qgram_tokenizer to 'gram'; pg-similarity/expected/test3.out000066400000000000000000000047421276707706500172740ustar00rootroot00000000000000-- -- pg_similarity -- testing similarity functions and operators -- -- -- Clean up in case a prior regression run failed -- RESET client_min_messages; \set ECHO all \set a '\'Euler Taveira de Oliveira\'' CREATE TABLE simtst (a text); INSERT INTO simtst (a) VALUES ('Euler Taveira de Oliveira'), ('EULER TAVEIRA DE OLIVEIRA'), ('Euler T. de Oliveira'), ('Oliveira, Euler T.'), ('Euler Oliveira'), ('Euler Taveira'), ('EULER TAVEIRA OLIVEIRA'), ('Oliveira, Euler'), ('Oliveira, E. T.'), ('ETO'); -- Levenshtein SHOW pg_similarity.levenshtein_threshold; pg_similarity.levenshtein_threshold ------------------------------------- 0.7 (1 row) SELECT a FROM simtst WHERE a ~== :a; a --------------------------- Euler Taveira de Oliveira EULER TAVEIRA DE OLIVEIRA Euler T. 
de Oliveira EULER TAVEIRA OLIVEIRA (4 rows) SET pg_similarity.levenshtein_threshold to 0.4; SHOW pg_similarity.levenshtein_threshold; pg_similarity.levenshtein_threshold ------------------------------------- 0.4 (1 row) SELECT a FROM simtst WHERE a ~== :a; a --------------------------- Euler Taveira de Oliveira EULER TAVEIRA DE OLIVEIRA Euler T. de Oliveira Euler Oliveira Euler Taveira EULER TAVEIRA OLIVEIRA Oliveira, Euler (7 rows) -- Cosine SHOW pg_similarity.cosine_threshold; pg_similarity.cosine_threshold -------------------------------- 0.7 (1 row) SELECT a FROM simtst WHERE a ~## :a; a --------------------------- Euler Taveira de Oliveira Euler T. de Oliveira Euler Oliveira Euler Taveira Oliveira, Euler (5 rows) SET pg_similarity.cosine_threshold to 0.9; SHOW pg_similarity.cosine_threshold; pg_similarity.cosine_threshold -------------------------------- 0.9 (1 row) SELECT a FROM simtst WHERE a ~## :a; a --------------------------- Euler Taveira de Oliveira (1 row) -- Overlap Coefficient SHOW pg_similarity.overlap_tokenizer; pg_similarity.overlap_tokenizer --------------------------------- alnum (1 row) SELECT a FROM simtst WHERE a ~** :a; a --------------------------- Euler Taveira de Oliveira Euler T. de Oliveira Euler Oliveira Euler Taveira Oliveira, Euler (5 rows) SET pg_similarity.overlap_tokenizer to 'gram'; SET pg_similarity.overlap_threshold to 0.8; SELECT a FROM simtst WHERE a ~** :a; a --------------------------- Euler Taveira de Oliveira Euler T. de Oliveira Euler Oliveira Euler Taveira (4 rows) DROP TABLE simtst; pg-similarity/expected/test4.out000066400000000000000000000040141276707706500172650ustar00rootroot00000000000000-- -- pg_similarity -- testing similarity variables -- -- -- Clean up in case a prior regression run failed -- RESET client_min_messages; \set ECHO all \set a '\'Euler Taveira de Oliveira\'' CREATE TABLE simtst (a text); INSERT INTO simtst (a) VALUES ('Euler Taveira de Oliveira'), ('EULER TAVEIRA DE OLIVEIRA'), ('Euler T. 
de Oliveira'), ('Oliveira, Euler T.'), ('Euler Oliveira'), ('Euler Taveira'), ('EULER TAVEIRA OLIVEIRA'), ('Oliveira, Euler'), ('Oliveira, E. T.'), ('ETO'); \copy simtst FROM 'data/similarity.data' SELECT a, block(a, :a) FROM simtst WHERE a ~++ :a; a | block ---------------------------+------- Euler Taveira de Oliveira | 1 Euler T. de Oliveira | 0.75 (2 rows) SELECT a, cosine(a, :a) FROM simtst WHERE a ~## :a; a | cosine ---------------------------+------------------- Euler Taveira de Oliveira | 1 Euler T. de Oliveira | 0.75 Euler Oliveira | 0.707106781186547 Euler Taveira | 0.707106781186547 Oliveira, Euler | 0.707106781186547 (5 rows) CREATE INDEX simtsti ON simtst USING gin (a gin_similarity_ops); SELECT a, block(a, :a) FROM simtst WHERE a ~++ :a; a | block ---------------------------+------- Euler Taveira de Oliveira | 1 Euler T. de Oliveira | 0.75 (2 rows) SET enable_bitmapscan TO OFF; SELECT a, block(a, :a) FROM simtst WHERE a ~++ :a; a | block ---------------------------+------- Euler Taveira de Oliveira | 1 Euler T. de Oliveira | 0.75 (2 rows) SET enable_bitmapscan TO ON; SELECT a, cosine(a, :a) FROM simtst WHERE a ~## :a; a | cosine ---------------------------+------------------- Euler Taveira de Oliveira | 1 Euler T. de Oliveira | 0.75 Euler Oliveira | 0.707106781186547 Euler Taveira | 0.707106781186547 Oliveira, Euler | 0.707106781186547 (5 rows) DROP TABLE simtst; pg-similarity/hamming.c000066400000000000000000000114301276707706500154540ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * hamming.c * * Hamming Distance is a similarity metric * * Hamming distance between two strings of equal length is the number of * positions for which the corresponding symbols are different, i.e., the * number of substitutions required to change one into the other. * * X is a bit string * Y is a bit string * * For each position i we compare X[i] with Y[i]; if it doesn't match we * accumulate those mismatches.
* * For example: * * x: 1010101010 * y: 1010111000 * ^ ^ * mismatches * * s = 2 * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "utils/varbit.h" /* GUC variables */ double pgs_hamming_threshold = 0.7f; bool pgs_hamming_is_normalized = true; PG_FUNCTION_INFO_V1(hamming); Datum hamming(PG_FUNCTION_ARGS) { VarBit *a, *b; int alen, blen; bits8 *pa, *pb; int maxlen; float8 res = 0.0; int i; int n; a = PG_GETARG_VARBIT_P(0); b = PG_GETARG_VARBIT_P(1); alen = VARBITLEN(a); blen = VARBITLEN(b); elog(DEBUG1, "alen: %d; blen: %d", alen, blen); if (alen != blen) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("bit strings must have the same length"))); /* alen and blen have the same length */ maxlen = alen; pa = VARBITS(a); pb = VARBITS(b); for (i = 0; i < VARBITBYTES(a); i++) { n = *pa++ ^ *pb++; while (n) { res += n & 1; n >>= 1; } } elog(DEBUG1, "is normalized: %d", pgs_hamming_is_normalized); elog(DEBUG1, "maximum length: %d", maxlen); /* * FIXME print string of bits elog(DEBUG1, "hammingdistance(%s, %s) = %.3f", DatumGetCString(varbit_out(VarBitPGetDatum(a))), DatumGetCString(varbit_out(VarBitPGetDatum(b))), res); */ /* if one string has zero length then return one */ if (maxlen == 0) { PG_RETURN_FLOAT8(1.0); } else if (pgs_hamming_is_normalized) { res = 1.0 - (res / maxlen); /* * FIXME print string of bits elog(DEBUG1, "hamming(%s, %s) = %.3f", DatumGetCString(varbit_out(VarBitPGetDatum(a))), DatumGetCString(varbit_out(VarBitPGetDatum(b))), res); */ PG_RETURN_FLOAT8(res); } else { PG_RETURN_FLOAT8(res); } } PG_FUNCTION_INFO_V1(hamming_op); Datum hamming_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_hamming_is_normalized; pgs_hamming_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( hamming, 
PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_hamming_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_hamming_threshold); } PG_FUNCTION_INFO_V1(hamming_text); Datum hamming_text(PG_FUNCTION_ARGS) { char *a, *b; int alen, blen; char *pa, *pb; int maxlen; float8 res = 0.0; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); alen = strlen(a); blen = strlen(b); if (alen > PGS_MAX_STR_LEN || blen > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); elog(DEBUG1, "alen: %d; blen: %d", alen, blen); if (alen != blen) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("text strings must have the same length"))); elog(DEBUG1, "a: %s ; b: %s", a, b); /* alen and blen have the same length */ maxlen = alen; pa = a; pb = b; while (*pa != '\0') { elog(DEBUG4, "a: %c ; b: %c", *pa, *pb); if (*pa++ != *pb++) res += 1.0; } elog(DEBUG1, "is normalized: %d", pgs_hamming_is_normalized); elog(DEBUG1, "maximum length: %d", maxlen); elog(DEBUG1, "hammingdistance(%s, %s) = %.3f", DatumGetCString(a), DatumGetCString(b), res); /* if one string has zero length then return one */ if (maxlen == 0) { PG_RETURN_FLOAT8(1.0); } else if (pgs_hamming_is_normalized) { res = 1.0 - (res / maxlen); elog(DEBUG1, "hamming(%s, %s) = %.3f", DatumGetCString(a), DatumGetCString(b), res); PG_RETURN_FLOAT8(res); } else { PG_RETURN_FLOAT8(res); } } PG_FUNCTION_INFO_V1(hamming_text_op); Datum hamming_text_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_hamming_is_normalized; pgs_hamming_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( hamming_text, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to 
the previous value */ pgs_hamming_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_hamming_threshold); } pg-similarity/jaccard.c000066400000000000000000000065641276707706500154370ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * jaccard.c * * Jaccard similarity coefficient is a similarity measure * * It measures similarity between sets, and is defined as the size of the * intersection divided by the size of the union of the sets. * * |A intersection B| * s = -------------------- * |A union B| * * For example: * * x: euler = {eu, ul, le, er} * y: heuser = {he, eu, us, se, er} * * 2 * s = --- = 0.285714286 * 7 * * PS> we call n-grams: (i) n-sequence of letters (ii) n-sequence of words * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" /* GUC variables */ int pgs_jaccard_tokenizer = PGS_UNIT_ALNUM; double pgs_jaccard_threshold = 0.7f; bool pgs_jaccard_is_normalized = true; PG_FUNCTION_INFO_V1(jaccard); Datum jaccard(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t; int atok, btok, comtok, alltok; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* sets */ s = initTokenList(1); t = initTokenList(1); switch (pgs_jaccard_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, a); 
tokenizeByNonAlnum(t, b); break; } elog(DEBUG3, "Token List A"); printToken(s); elog(DEBUG3, "Token List B"); printToken(t); atok = s->size; btok = t->size; /* combine the sets */ switch (pgs_jaccard_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, b); break; } elog(DEBUG3, "All Token List"); printToken(s); alltok = s->size; destroyTokenList(s); destroyTokenList(t); comtok = atok + btok - alltok; elog(DEBUG1, "is normalized: %d", pgs_jaccard_is_normalized); elog(DEBUG1, "token list A size: %d", atok); elog(DEBUG1, "token list B size: %d", btok); elog(DEBUG1, "all tokens size: %d", alltok); elog(DEBUG1, "common tokens size: %d", comtok); /* normalized and unnormalized version are the same */ res = (float8) comtok / alltok; PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(jaccard_op); Datum jaccard_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_jaccard_is_normalized; pgs_jaccard_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( jaccard, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_jaccard_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_jaccard_threshold); } pg-similarity/jaro.c000066400000000000000000000171771276707706500150050ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * jaro.c * * Jaro Distance [1] is a similarity measure * * 1 m 1 m 1 m - t * s = --- * ----- + --- * ----- + --- * ------- * 3 |a| 3 |b| 3 m * * where m is the number of matching characters [2], t is the number of * transpositions [3], |a| is the length of string a and |b| is the length of * string b. 
* * [2] two characters from a and b are considered matching iff they're not * farther than floor(max(|a|, |b|) / 2) - 1. * * [3] number of transpositions is the number of matchings that are in a * different sequence order divided by 2. * * Jaro-Winkler [4] Distance is a similarity measure * * It's an improvement over Jaro's original work. It gives more weight if the * initial characters are the same. So, * * w = s + (l * p * (1 - s)) * * where l is the length of common prefix up to 4 characters, p is a scaling * factor (Winkler's suggestion is 0.1), and s is the Jaro Distance. * * For example: * * x: euler * y: heuser * * 1 4 1 4 1 4 - 0 4 2 1 * s = --- * --- + --- * --- + --- * ------- = ---- + --- + --- = 0.822... * 3 5 3 6 3 4 15 9 3 * * * w = 0.822 + (0 * 0.1 * (1 - 0.822)) = 0.822... * * * [1] Jaro, M. A. (1989). "Advances in record linking methodology as applied * to the 1985 census of Tampa Florida". Journal of the American Statistical * Association 84 (406): 414–20. * * [4] Winkler, W. E. (2006). "Overview of Record Linkage and Current Research * Directions". Research Report Series, RRS. * http://www.census.gov/srd/papers/pdf/rrs2006-02.pdf. * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include <math.h> /* GUC variables */ double pgs_jaro_threshold = 0.7f; bool pgs_jaro_is_normalized = true; double pgs_jarowinkler_threshold = 0.7f; bool pgs_jarowinkler_is_normalized = true; static double _jaro(char *a, char *b) { int alen, blen; int i, j, k; int cd; /* common window distance */ int cc = 0; /* number of common characters */ int tr = 0; /* number of transpositions */ double res; int *amatch; /* matches in string a; match = 1; unmatch = 0 !! USED?
!!*/ int *bmatch; /* matches in string b; match = 1; unmatch = 0 */ int *posa; /* positions of matched characters in a */ int *posb; /* positions of matched characters in b */ alen = strlen(a); blen = strlen(b); elog(DEBUG1, "alen: %d; blen: %d", alen, blen); if (alen > PGS_MAX_STR_LEN || blen > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* if one string has zero length then return zero */ if (alen == 0 || blen == 0) return 0.0; /* * allocate 2 vectors of integers. each position will be 0 or 1 depending * on whether the character in that position is found within the common window distance. */ amatch = palloc(sizeof(int) * alen); bmatch = palloc(sizeof(int) * blen); for (i = 0; i < alen; i++) amatch[i] = 0; for (j = 0; j < blen; j++) bmatch[j] = 0; /* common window distance is floor(max(alen, blen) / 2) - 1 */ cd = (int) floor((alen > blen ? alen : blen) / 2) - 1; /* catch case when alen = blen = 1 */ if (cd < 0) cd = 0; elog(DEBUG1, "common window distance: %d", cd); #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case-sensitive turns off"); for (i = 0; i < alen; i++) a[i] = tolower(a[i]); for (j = 0; j < blen; j++) b[j] = tolower(b[j]); #endif for (i = 0; i < alen; i++) { /* * calculate window test limits.
limit inf to 0 and sup to blen */ int inf = max2(i - cd, 0); int sup = i + cd + 1; sup = min2(sup, blen); /* * no more common characters 'cause we don't have characters in b * to test with characters in a */ if (inf >= sup) break; for (j = inf; j < sup; j++) { /* * if found some match and it's not matched yet: * (i) flag match characters in a and b * (ii) increment cc */ if (bmatch[j] != 1 && a[i] == b[j]) { amatch[i] = 1; bmatch[j] = 1; cc++; break; } } } elog(DEBUG1, "common characters: %d", cc); /* no common characters then return 0 */ if (cc == 0) return 0.0; /* allocate vector of positions */ posa = palloc(sizeof(int) * cc); posb = palloc(sizeof(int) * cc); k = 0; for (i = 0; i < alen; i++) { if (amatch[i] == 1) { posa[k] = i; k++; } } k = 0; for (j = 0; j < blen; j++) { if (bmatch[j] == 1) { posb[k] = j; k++; } } pfree(amatch); pfree(bmatch); /* counting half-transpositions */ for (i = 0; i < cc; i++) if (a[posa[i]] != b[posb[i]]) tr++; pfree(posa); pfree(posb); elog(DEBUG1, "half transpositions: %d", tr); /* real number of transpositions */ tr /= 2; elog(DEBUG1, "real transpositions: %d", tr); res = PGS_JARO_W1 * cc / alen + PGS_JARO_W2 * cc / blen + PGS_JARO_WT * (cc - tr) / cc; elog(DEBUG1, "jaro(%s, %s) = %f * %d / %d + %f * %d / %d + %f * (%d - %d) / %d = %f", a, b, PGS_JARO_W1, cc, alen, PGS_JARO_W2, cc, blen, PGS_JARO_WT, cc, tr, cc, res); return res; } PG_FUNCTION_INFO_V1(jaro); Datum jaro(PG_FUNCTION_ARGS) { char *a, *b; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); res = _jaro(a, b); elog(DEBUG1, "is normalized: %d", pgs_jaro_is_normalized); elog(DEBUG1, "jaro(%s, %s) = %f", a, b, res); /* normalized and unnormalized version are the same */ PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(jaro_op); Datum jaro_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold 
(we're comparing against) is normalized */ bool tmp = pgs_jaro_is_normalized; pgs_jaro_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( jaro, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_jaro_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_jaro_threshold); } PG_FUNCTION_INFO_V1(jarowinkler); Datum jarowinkler(PG_FUNCTION_ARGS) { char *a, *b; float8 resj, res; int i; int plen = 0; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); resj = _jaro(a, b); res = resj; elog(DEBUG1, "jaro(%s, %s) = %f", a, b, resj); if (resj > PGS_JARO_BOOST_THRESHOLD) { for (i = 0; i < strlen(a) && i < strlen(b) && i < PGS_JARO_PREFIX_SIZE; i++) { if (a[i] == b[i]) plen++; else break; } elog(DEBUG1, "prefix length: %d", plen); res += PGS_JARO_SCALING_FACTOR * plen * (1.0 - resj); } elog(DEBUG1, "is normalized: %d", pgs_jarowinkler_is_normalized); elog(DEBUG1, "jarowinkler(%s, %s) = %f + %d * %f * (1.0 - %f) = %f", a, b, resj, plen, PGS_JARO_SCALING_FACTOR, resj, res); /* normalized and unnormalized version are the same */ PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(jarowinkler_op); Datum jarowinkler_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_jarowinkler_is_normalized; pgs_jarowinkler_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( jarowinkler, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_jarowinkler_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_jarowinkler_threshold); } pg-similarity/levenshtein.c000066400000000000000000000167751276707706500164010ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * levenshtein.c * * Levenshtein Distance is one of the most famous similarity 
measures * * It aims to obtain the minimum number of operations (insert, delete, or * substitution) needed to transform one string into the other. * * For example: * * x: euler * y: heuser * all operation costs are 1. * * +---------------------------+ * | | | e | u | l | e | r | * +---------------------------+ * | | 0 | 1 | 2 | 3 | 4 | 5 | * +---------------------------+ * | h | 1 | 1 | 2 | 3 | 4 | 5 | * +---------------------------+ * | e | 2 | 1 | 2 | 3 | 3 | 4 | * +---------------------------+ * | u | 3 | 2 | 1 | 2 | 3 | 4 | * +---------------------------+ * | s | 4 | 3 | 2 | 2 | 3 | 4 | * +---------------------------+ * | e | 5 | 4 | 3 | 3 | 2 | 3 | * +---------------------------+ * | r | 6 | 5 | 4 | 4 | 3 | 2 | <== * +---------------------------+ * * http://en.wikipedia.org/wiki/Levenshtein_distance * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" /* GUC variables */ double pgs_levenshtein_threshold = 0.7f; bool pgs_levenshtein_is_normalized = true; int _lev(char *a, char *b, int icost, int dcost) { int *arow, *brow, *trow; /* above, below, and temp row */ int alen, blen; int i, j; int res; alen = strlen(a); blen = strlen(b); elog(DEBUG2, "alen: %d; blen: %d", alen, blen); if (alen == 0) return blen; if (blen == 0) return alen; arow = (int *) malloc((blen + 1) * sizeof(int)); brow = (int *) malloc((blen + 1) * sizeof(int)); if (arow == NULL) elog(ERROR, "memory exhausted for array size %d", (blen + 1)); if (brow == NULL) elog(ERROR, "memory exhausted for array size %d", (blen + 1)); #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case-sensitive turns off"); for (i = 0; i < alen; i++) a[i] = tolower(a[i]); for (j = 0; j < blen; j++) b[j] = tolower(b[j]); #endif /* initial values */ for (i = 0; i <= blen; i++) arow[i] = i; for (i = 1; i <= alen; i++) { /* first value is 'i' */ brow[0] = i; for (j = 1; j <= blen; j++) { /* TODO change it to a callback function
*/ /* get operation cost */ int scost = levcost(a[i-1], b[j-1]); brow[j] = min3(brow[j-1] + icost, arow[j] + dcost, arow[j-1] + scost); elog(DEBUG2, "(i, j) = (%d, %d); cost(%c, %c): %d; min(top, left, diag) = (%d, %d, %d) = %d", i, j, a[i-1], b[j-1], scost, brow[j-1] + icost, arow[j] + dcost, arow[j-1] + scost, brow[j]); } /* * below row becomes above row * above row is reused as below row */ trow = arow; arow = brow; brow = trow; elog(DEBUG2, "row: "); for (j = 1; j <= blen; j++) elog(DEBUG2, "%d ", arow[j]); } res = arow[blen]; free(arow); free(brow); return res; } /* * wastes more memory and execution time * XXX the purpose of this function is merely academic */ int _lev_slow(char *a, char *b, int icost, int dcost) { int **matrix; /* dynamic programming matrix */ int alen, blen; int i, j; int res; alen = strlen(a); blen = strlen(b); elog(DEBUG2, "alen: %d; blen: %d", alen, blen); if (alen == 0) return blen; if (blen == 0) return alen; matrix = (int **) malloc((alen + 1) * sizeof(int *)); if (matrix == NULL) elog(ERROR, "memory exhausted for array size %d", (alen + 1)); for (i = 0; i <= alen; i++) { matrix[i] = (int *) malloc((blen + 1) * sizeof(int)); if (matrix[i] == NULL) elog(ERROR, "memory exhausted for array size %d", (blen + 1)); } #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case-sensitive turns off"); for (i = 0; i < alen; i++) a[i] = tolower(a[i]); for (j = 0; j < blen; j++) b[j] = tolower(b[j]); #endif /* initial values */ for (i = 0; i <= alen; i++) matrix[i][0] = i; for (j = 0; j <= blen; j++) matrix[0][j] = j; for (i = 1; i <= alen; i++) { for (j = 1; j <= blen; j++) { /* get operation cost */ int scost = levcost(a[i-1], b[j-1]); matrix[i][j] = min3(matrix[i-1][j] + dcost, matrix[i][j-1] + icost, matrix[i-1][j-1] + scost); elog(DEBUG2, "(i, j) = (%d, %d); cost(%c, %c): %d; min(top, left, diag) = (%d, %d, %d) = %d", i, j, a[i-1], b[j-1], scost, matrix[i-1][j] + dcost, matrix[i][j-1] + icost, matrix[i-1][j-1] + scost, matrix[i][j]); } } res =
matrix[alen][blen]; for (i = 0; i <= alen; i++) free(matrix[i]); free(matrix); return res; } PG_FUNCTION_INFO_V1(lev); Datum lev(PG_FUNCTION_ARGS) { char *a, *b; int maxlen; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); maxlen = max2(strlen(a), strlen(b)); res = (float8) _lev(a, b, PGS_LEV_MAX_COST, PGS_LEV_MAX_COST); elog(DEBUG1, "is normalized: %d", pgs_levenshtein_is_normalized); elog(DEBUG1, "maximum length: %d", maxlen); elog(DEBUG1, "levdistance(%s, %s) = %.3f", a, b, res); if (maxlen == 0) { PG_RETURN_FLOAT8(1.0); } else if (pgs_levenshtein_is_normalized) { res = 1.0 - (res / maxlen); elog(DEBUG1, "lev(%s, %s) = %.3f", a, b, res); PG_RETURN_FLOAT8(res); } else { PG_RETURN_FLOAT8(res); } } PG_FUNCTION_INFO_V1(lev_op); Datum lev_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_levenshtein_is_normalized; pgs_levenshtein_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( lev, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_levenshtein_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_levenshtein_threshold); } PG_FUNCTION_INFO_V1(levslow); Datum levslow(PG_FUNCTION_ARGS) { char *a, *b; int maxlen; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", 
PGS_MAX_STR_LEN))); maxlen = max2(strlen(a), strlen(b)); res = (float8) _lev_slow(a, b, PGS_LEV_MAX_COST, PGS_LEV_MAX_COST); elog(DEBUG1, "is normalized: %d", pgs_levenshtein_is_normalized); elog(DEBUG1, "maximum length: %d", maxlen); elog(DEBUG1, "levdistance(%s, %s) = %.3f", a, b, res); if (maxlen == 0) { PG_RETURN_FLOAT8(1.0); } else if (pgs_levenshtein_is_normalized) { res = 1.0 - (res / maxlen); elog(DEBUG1, "lev(%s, %s) = %.3f", a, b, res); PG_RETURN_FLOAT8(res); } else { PG_RETURN_FLOAT8(res); } } PG_FUNCTION_INFO_V1(levslow_op); Datum levslow_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_levenshtein_is_normalized; pgs_levenshtein_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( levslow, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_levenshtein_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_levenshtein_threshold); } pg-similarity/matching.c000066400000000000000000000065741276707706500156430ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * matching.c * * The Matching Coefficient is a simple vector based approach * * nt * s = ------------- * max(nx, ny) * * where nt is the number of common n-grams found in both strings, nx is the * number of n-grams in x and, ny is the number of n-grams in y. * * For example: * * x: euler = {e, u, l, e, r} * y: heuser = {h, e, u, s, e, r} * * 4 * s = --- = 0.666... 
* 6 * * PS> we call n-grams: (i) n-sequence of letters (ii) n-sequence of words * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" /* GUC variables */ int pgs_matching_tokenizer = PGS_UNIT_ALNUM; double pgs_matching_threshold = 0.7f; bool pgs_matching_is_normalized = true; PG_FUNCTION_INFO_V1(matchingcoefficient); Datum matchingcoefficient(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t; Token *p, *q; int atok, btok, comtok, maxtok; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* lists */ s = initTokenList(0); t = initTokenList(0); switch (pgs_matching_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); break; case PGS_UNIT_ALNUM: default: tokenizeByNonAlnum(s, a); tokenizeByNonAlnum(t, b); break; } elog(DEBUG3, "Token List A"); printToken(s); elog(DEBUG3, "Token List B"); printToken(t); atok = s->size; btok = t->size; maxtok = max2(atok, btok); comtok = 0; /* * XXX consider sorting s and t when we're dealing with large lists? 
*/ p = s->head; while (p != NULL) { int found = 0; q = t->head; while (q != NULL) { elog(DEBUG3, "p: %s; q: %s", p->data, q->data); if (strcmp(p->data, q->data) == 0) { found = 1; break; } q = q->next; } if (found) { comtok++; elog(DEBUG2, "\"%s\" found; comtok = %d", p->data, comtok); } p = p->next; } destroyTokenList(s); destroyTokenList(t); elog(DEBUG1, "is normalized: %d", pgs_matching_is_normalized); elog(DEBUG1, "common tokens size: %d", comtok); elog(DEBUG1, "maximum token size: %d", maxtok); if (pgs_matching_is_normalized) res = (float8) comtok / maxtok; else res = comtok; PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(matchingcoefficient_op); Datum matchingcoefficient_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_matching_is_normalized; pgs_matching_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( matchingcoefficient, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_matching_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_matching_threshold); } pg-similarity/mongeelkan.c000066400000000000000000000130301276707706500161520ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * mongeelkan.c * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" /* GUC variables */ int pgs_mongeelkan_tokenizer = PGS_UNIT_ALNUM; double pgs_mongeelkan_threshold = 0.7f; bool pgs_mongeelkan_is_normalized = true; /* * TODO move this function to similarity.c * TODO this function is a smithwatermangotoh() clone */ static double _mongeelkan(char *a, char *b) { float **matrix; /* dynamic programming matrix */ int alen, blen; int i, j; double maxvalue; alen = strlen(a); blen = strlen(b); elog(DEBUG2, "alen: %d; blen: %d", alen, blen); if 
(alen == 0) return blen; if (blen == 0) return alen; matrix = (float **) malloc((alen + 1) * sizeof(float *)); if (matrix == NULL) elog(ERROR, "memory exhausted for array size %d", alen); for (i = 0; i <= alen; i++) { matrix[i] = (float *) malloc((blen + 1) * sizeof(float)); if (matrix[i] == NULL) elog(ERROR, "memory exhausted for array size %d", blen); } #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case-sensitive turns off"); for (i = 0; i < alen; i++) a[i] = tolower(a[i]); for (j = 0; j < blen; j++) b[j] = tolower(b[j]); #endif maxvalue = 0.0; /* initial values */ for (i = 0; i <= alen; i++) { float c = megapcost(a, b, i, 0); if (i == 0) { matrix[0][0] = max2(0.0, c); } else { float maxgapcost = 0.0; int wstart = i - PGS_SWG_WINDOW_SIZE; int k; if (wstart < 1) wstart = 1; for (k = wstart; k < i; k++) maxgapcost = max2(maxgapcost, matrix[i - k][0] - swggapcost(i - k, i)); matrix[i][0] = max3(0.0, maxgapcost, c); } if (matrix[i][0] > maxvalue) maxvalue = matrix[i][0]; } for (j = 0; j <= blen; j++) { float c = megapcost(a, b, 0, j); if (j == 0) { matrix[0][0] = max2(0.0, c); } else { float maxgapcost = 0.0; int wstart = j - PGS_SWG_WINDOW_SIZE; int k; if (wstart < 1) wstart = 1; for (k = wstart; k < j; k++) maxgapcost = max2(maxgapcost, matrix[0][j - k] - swggapcost(j - k, j)); matrix[0][j] = max3(0.0, maxgapcost, c); } if (matrix[0][j] > maxvalue) maxvalue = matrix[0][j]; } for (i = 1; i <= alen; i++) { for (j = 1; j <= blen; j++) { int wstart; int k; float maxgapcost1 = 0.0, maxgapcost2 = 0.0; /* get operation cost */ float c = megapcost(a, b, i, j); wstart = i - PGS_SWG_WINDOW_SIZE; if (wstart < 1) wstart = 1; for (k = wstart; k < i; k++) maxgapcost1 = max2(maxgapcost1, matrix[i - k][0] - swggapcost(i - k, i)); wstart = j - PGS_SWG_WINDOW_SIZE; if (wstart < 1) wstart = 1; for (k = wstart; k < j; k++) maxgapcost2 = max2(maxgapcost2, matrix[0][j - k] - swggapcost(j - k, j)); matrix[i][j] = max4(0.0, maxgapcost1, maxgapcost2, matrix[i-1][j-1] + c); elog(DEBUG2, "(i, j) =
(%d, %d); cost(%c, %c): %.3f; max(zero, top, left, diag) = (0.0, %.3f, %.3f, %.3f) = %.3f", i, j, a[i-1], b[j-1], c, maxgapcost1, maxgapcost2, matrix[i-1][j-1] + c, matrix[i][j]); if (matrix[i][j] > maxvalue) maxvalue = matrix[i][j]; } } for (i = 0; i <= alen; i++) free(matrix[i]); free(matrix); return maxvalue; } PG_FUNCTION_INFO_V1(mongeelkan); Datum mongeelkan(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t; Token *p, *q; double summatches; double maxvalue; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* lists */ s = initTokenList(0); t = initTokenList(0); switch (pgs_mongeelkan_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); break; case PGS_UNIT_ALNUM: default: tokenizeByNonAlnum(s, a); tokenizeByNonAlnum(t, b); break; } summatches = 0.0; p = s->head; while (p != NULL) { maxvalue = 0.0; q = t->head; while (q != NULL) { double val = _mongeelkan(p->data, q->data); elog(DEBUG3, "p: %s; q: %s", p->data, q->data); if (val > maxvalue) maxvalue = val; q = q->next; } summatches += maxvalue; p = p->next; } /* the normalized and unnormalized versions are the same */ res = summatches / s->size; elog(DEBUG1, "is normalized: %d", pgs_mongeelkan_is_normalized); elog(DEBUG1, "sum matches: %.3f", summatches); elog(DEBUG1, "s size: %d", s->size); elog(DEBUG1, "medistance(%s, %s) = %.3f", a, b, res); destroyTokenList(s); destroyTokenList(t); PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(mongeelkan_op); Datum mongeelkan_op(PG_FUNCTION_ARGS) { float8 res; /* * 
store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_mongeelkan_is_normalized; pgs_mongeelkan_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( mongeelkan, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_mongeelkan_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_mongeelkan_threshold); } pg-similarity/needlemanwunsch.c000066400000000000000000000131131276707706500172140ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * needlemanwunsch.c * * Needleman-Wunsch is an algorithm that performs a global alignment on two * sequences. * * It is a dynamic programming algorithm that is used for biological sequence * comparison. The operation costs (scores) are specified by a similarity * matrix. It also uses a linear gap penalty (like Levenshtein). * * For example: * * similarity matrix * * +-----------------------+ * | | A | G | C | T | * +-----------------------+ * | A | 10 | -1 | -3 | -4 | * +-----------------------+ * | G | -1 | 7 | -5 | -3 | * +-----------------------+ * | C | -3 | -5 | 9 | 0 | * +-----------------------+ * | T | -4 | -3 | 0 | 8 | * +-----------------------+ * * x: GACTAG * y: ACCTGAA * gap penalty: -5 * * +---------------------------------------------+ * | | | G | A | C | T | A | G | * +---------------------------------------------+ * | | 0 | -5 | -10 | -15 | -20 | -25 | -30 | * +---------------------------------------------+ * | A | -5 | -1 | 5 | 0 | -5 | -10 | -15 | * +---------------------------------------------+ * | C | -10 | -6 | 0 | 14 | 9 | 4 | -1 | * +---------------------------------------------+ * | C | -15 | -11 | -5 | 9 | 14 | 9 | 4 | * +---------------------------------------------+ * | T | -20 | -16 | -10 | 4 | 17 | 12 | 7 | * +---------------------------------------------+ * | G | -25 | -13 | -15 | -1 | 12 | 16 | 19 | * 
+---------------------------------------------+ * | A | -30 | -18 | -3 | -6 | 7 | 22 | 17 | * +---------------------------------------------+ * | A | -35 | -23 | -8 | -6 | 2 | 17 | 21 | * +---------------------------------------------+ * * http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" /* GUC variables */ double pgs_nw_threshold = 0.7f; bool pgs_nw_is_normalized = true; double pgs_nw_gap_penalty = -5.0f; static int _nwunsch(char *a, char *b, int gap) { int *arow, *brow, *trow; int alen, blen; int i, j; int res; alen = strlen(a); blen = strlen(b); elog(DEBUG2, "alen: %d; blen: %d", alen, blen); if (alen == 0) return blen; if (blen == 0) return alen; arow = (int *) malloc((blen + 1) * sizeof(int)); brow = (int *) malloc((blen + 1) * sizeof(int)); if (arow == NULL) elog(ERROR, "memory exhausted for array size %d", (blen + 1)); if (brow == NULL) elog(ERROR, "memory exhausted for array size %d", (blen + 1)); #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case sensitivity turned off"); for (i = 0; i < alen; i++) a[i] = tolower(a[i]); for (j = 0; j < blen; j++) b[j] = tolower(b[j]); #endif /* initial values */ for (i = 0; i <= blen; i++) arow[i] = gap * i; for (i = 1; i <= alen; i++) { /* first value is 'gap * i' */ brow[0] = gap * i; for (j = 1; j <= blen; j++) { /* TODO change it to a callback function */ /* get operation cost */ int scost = nwcost(a[i-1], b[j-1]); brow[j] = max3(brow[j-1] + gap, arow[j] + gap, arow[j-1] + scost); elog(DEBUG2, "(i, j) = (%d, %d); cost(%c, %c): %d; max(top, left, diag) = (%d, %d, %d) = %d", i, j, a[i-1], b[j-1], scost, brow[j-1] + gap, arow[j] + gap, arow[j-1] + scost, brow[j]); } /* * below row becomes above row * above row is reused as below row */ trow = arow; arow = brow; brow = trow; } res = arow[blen]; free(arow); free(brow); return res; } PG_FUNCTION_INFO_V1(needlemanwunsch); 
Datum needlemanwunsch(PG_FUNCTION_ARGS) { char *a, *b; double minvalue, maxvalue; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); maxvalue = (float8) max2(strlen(a), strlen(b)); res = (float8) _nwunsch(a, b, pgs_nw_gap_penalty); elog(DEBUG1, "is normalized: %d", pgs_nw_is_normalized); elog(DEBUG1, "maximum length: %.3f", maxvalue); elog(DEBUG1, "nwdistance(%s, %s) = %.3f", a, b, res); if (maxvalue == 0.0) { PG_RETURN_FLOAT8(1.0); } else if (pgs_nw_is_normalized) { /* FIXME normalize nw result */ minvalue = maxvalue; if (PGS_LEV_MAX_COST > pgs_nw_gap_penalty) maxvalue *= PGS_LEV_MAX_COST; else maxvalue *= pgs_nw_gap_penalty; if (PGS_LEV_MIN_COST < pgs_nw_gap_penalty) minvalue *= PGS_LEV_MIN_COST; else minvalue *= pgs_nw_gap_penalty; if (minvalue < 0.0) { maxvalue -= minvalue; res -= minvalue; } /* paranoia ? 
*/ if (maxvalue == 0.0) { PG_RETURN_FLOAT8(0.0); } else { res = 1.0 - (res / maxvalue); elog(DEBUG1, "nw(%s, %s) = %.3f", a, b, res); PG_RETURN_FLOAT8(res); } } else { PG_RETURN_FLOAT8(res); } } PG_FUNCTION_INFO_V1(needlemanwunsch_op); Datum needlemanwunsch_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_nw_is_normalized; pgs_nw_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( needlemanwunsch, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_nw_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_nw_threshold); } pg-similarity/overlap.c000066400000000000000000000070501276707706500155070ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * overlap.c * * Overlap Coefficient is a similarity measure * * It computes the overlap between sets, and is defined as the size of the * intersection divided by the minimum size of the sets. 
* * | A intersection B | * s = ---------------------- * min(|A|, |B|) * * For example: * * x: euler = {eu, ul, le, er} * y: heuser = {he, eu, us, se, er} * * 2 2 * s = ----------- = --- = 0.5 * min(4, 5) 4 * * PS> we call n-grams: (i) n-sequence of letters (ii) n-sequence of words * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" /* GUC variables */ int pgs_overlap_tokenizer = PGS_UNIT_ALNUM; double pgs_overlap_threshold = 0.7f; bool pgs_overlap_is_normalized = true; PG_FUNCTION_INFO_V1(overlapcoefficient); Datum overlapcoefficient(PG_FUNCTION_ARGS) { char *a, *b; TokenList *s, *t; int atok, btok, comtok, alltok; int mintok; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); /* sets */ s = initTokenList(1); t = initTokenList(1); switch (pgs_overlap_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, a); tokenizeBySpace(t, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, a); tokenizeByGram(t, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, a); tokenizeByCamelCase(t, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, a); tokenizeByNonAlnum(t, b); break; } elog(DEBUG3, "Token List A"); printToken(s); elog(DEBUG3, "Token List B"); printToken(t); atok = s->size; btok = t->size; /* combine the sets */ switch (pgs_overlap_tokenizer) { case PGS_UNIT_WORD: tokenizeBySpace(s, b); break; case PGS_UNIT_GRAM: tokenizeByGram(s, b); break; case PGS_UNIT_CAMELCASE: tokenizeByCamelCase(s, b); break; case PGS_UNIT_ALNUM: /* default */ default: tokenizeByNonAlnum(s, b); break; } 
elog(DEBUG3, "All Token List"); printToken(s); alltok = s->size; destroyTokenList(s); destroyTokenList(t); comtok = atok + btok - alltok; mintok = min2(atok, btok); elog(DEBUG1, "is normalized: %d", pgs_overlap_is_normalized); elog(DEBUG1, "token list A size: %d", atok); elog(DEBUG1, "token list B size: %d", btok); elog(DEBUG1, "all tokens size: %d", alltok); elog(DEBUG1, "common tokens size: %d", comtok); elog(DEBUG1, "min between A and B sizes: %d", mintok); /* the normalized and unnormalized versions are the same */ res = (float8) comtok / mintok; PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(overlapcoefficient_op); Datum overlapcoefficient_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_overlap_is_normalized; pgs_overlap_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( overlapcoefficient, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_overlap_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_overlap_threshold); } pg-similarity/pg_similarity--1.0.sql000066400000000000000000000222541276707706500176440ustar00rootroot00000000000000-- keep this file in sync with the pg_similarity.sql.in legacy install file -- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pg_similarity" to load this file. 
\quit -- Block CREATE FUNCTION block (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'block' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION block_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'block_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~++ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = block_op, COMMUTATOR = '~++', RESTRICT = contsel, JOIN = contjoinsel ); -- Cosine CREATE FUNCTION cosine (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'cosine' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION cosine_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'cosine_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~## ( LEFTARG = text, RIGHTARG = text, PROCEDURE = cosine_op, COMMUTATOR = '~##', RESTRICT = contsel, JOIN = contjoinsel ); -- Dice CREATE FUNCTION dice (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'dice' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION dice_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'dice_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~-~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = dice_op, COMMUTATOR = '~-~', RESTRICT = contsel, JOIN = contjoinsel ); -- Euclidean CREATE FUNCTION euclidean (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'euclidean' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION euclidean_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'euclidean_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~!! 
( LEFTARG = text, RIGHTARG = text, PROCEDURE = euclidean_op, COMMUTATOR = '~!!', RESTRICT = contsel, JOIN = contjoinsel ); -- Hamming CREATE FUNCTION hamming (varbit, varbit) RETURNS float8 AS 'MODULE_PATHNAME','hamming' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION hamming_op (varbit, varbit) RETURNS bool AS 'MODULE_PATHNAME', 'hamming_op' LANGUAGE C STABLE STRICT; CREATE FUNCTION hamming_text (text, text) RETURNS float8 AS 'MODULE_PATHNAME','hamming_text' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION hamming_text_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'hamming_text_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~@~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = hamming_text_op, COMMUTATOR = '~@~', RESTRICT = contsel, JOIN = contjoinsel ); -- Jaccard CREATE FUNCTION jaccard (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'jaccard' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION jaccard_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'jaccard_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~?? 
( LEFTARG = text, RIGHTARG = text, PROCEDURE = jaccard_op, COMMUTATOR = '~??', RESTRICT = contsel, JOIN = contjoinsel ); -- Jaro CREATE FUNCTION jaro (text, text) RETURNS float8 AS 'MODULE_PATHNAME','jaro' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION jaro_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'jaro_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~%% ( LEFTARG = text, RIGHTARG = text, PROCEDURE = jaro_op, COMMUTATOR = '~%%', RESTRICT = contsel, JOIN = contjoinsel ); -- Jaro-Winkler CREATE FUNCTION jarowinkler (text, text) RETURNS float8 AS 'MODULE_PATHNAME','jarowinkler' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION jarowinkler_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'jarowinkler_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~@@ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = jarowinkler_op, COMMUTATOR = '~@@', RESTRICT = contsel, JOIN = contjoinsel ); -- Levenshtein CREATE FUNCTION lev (text, text) RETURNS float8 AS 'MODULE_PATHNAME','lev' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION lev_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'lev_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~== ( LEFTARG = text, RIGHTARG = text, PROCEDURE = lev_op, COMMUTATOR = '~==', RESTRICT = contsel, JOIN = contjoinsel ); -- These functions are here just for academic purposes --CREATE FUNCTION levslow (text, text) RETURNS float8 --AS 'MODULE_PATHNAME','levslow' --LANGUAGE C IMMUTABLE STRICT; --CREATE FUNCTION levslow_op (text, text) RETURNS bool --AS 'MODULE_PATHNAME', 'levslow_op' --LANGUAGE C STABLE STRICT; --CREATE OPERATOR ~=^ ( -- LEFTARG = text, -- RIGHTARG = text, -- PROCEDURE = levslow_op, -- COMMUTATOR = '~=^', -- RESTRICT = contsel, -- JOIN = contjoinsel --); -- Matching Coefficient CREATE FUNCTION matchingcoefficient (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'matchingcoefficient' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION matchingcoefficient_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'matchingcoefficient_op' LANGUAGE C STABLE 
STRICT; CREATE OPERATOR ~^^ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = matchingcoefficient_op, COMMUTATOR = '~^^', RESTRICT = contsel, JOIN = contjoinsel ); -- Monge-Elkan CREATE FUNCTION mongeelkan (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'mongeelkan' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION mongeelkan_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'mongeelkan_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~|| ( LEFTARG = text, RIGHTARG = text, PROCEDURE = mongeelkan_op, COMMUTATOR = '~||', RESTRICT = contsel, JOIN = contjoinsel ); -- Needleman-Wunsch CREATE FUNCTION needlemanwunsch (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'needlemanwunsch' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION needlemanwunsch_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'needlemanwunsch_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~#~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = needlemanwunsch_op, COMMUTATOR = '~#~', RESTRICT = contsel, JOIN = contjoinsel ); -- Overlap Coefficient CREATE FUNCTION overlapcoefficient (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'overlapcoefficient' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION overlapcoefficient_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'overlapcoefficient_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~** ( LEFTARG = text, RIGHTARG = text, PROCEDURE = overlapcoefficient_op, COMMUTATOR = '~**', RESTRICT = contsel, JOIN = contjoinsel ); -- Q-Gram CREATE FUNCTION qgram (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'qgram' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION qgram_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'qgram_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~~~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = qgram_op, COMMUTATOR = '~~~', RESTRICT = contsel, JOIN = contjoinsel ); -- Smith-Waterman CREATE FUNCTION smithwaterman (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'smithwaterman' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION smithwaterman_op (text, text) RETURNS 
bool AS 'MODULE_PATHNAME', 'smithwaterman_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~=~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = smithwaterman_op, COMMUTATOR = '~=~', RESTRICT = contsel, JOIN = contjoinsel ); -- Smith-Waterman-Gotoh CREATE FUNCTION smithwatermangotoh (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'smithwatermangotoh' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION smithwatermangotoh_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'smithwatermangotoh_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~!~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = smithwatermangotoh_op, COMMUTATOR = '~!~', RESTRICT = contsel, JOIN = contjoinsel ); -- Soundex CREATE FUNCTION soundex (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'soundex' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION soundex_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'soundex_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~*~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = soundex_op, COMMUTATOR = '~*~', RESTRICT = contsel, JOIN = contjoinsel ); -- -- GIN support -- CREATE FUNCTION gin_extract_value_token(internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; CREATE FUNCTION gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; CREATE OPERATOR CLASS gin_similarity_ops FOR TYPE text USING gin AS OPERATOR 1 ~++, -- block OPERATOR 2 ~##, -- cosine OPERATOR 3 ~-~, -- dice OPERATOR 4 ~!!, -- euclidean OPERATOR 5 ~??, -- jaccard -- OPERATOR 6 ~%%, -- jaro -- OPERATOR 7 ~@@, -- jarowinkler -- OPERATOR 8 ~==, -- lev OPERATOR 9 ~^^, -- matchingcoefficient -- OPERATOR 10 ~||, -- mongeelkan -- OPERATOR 11 ~#~, -- needlemanwunsch OPERATOR 12 ~**, -- overlapcoefficient OPERATOR 13 
~~~, -- qgram -- OPERATOR 14 ~=~, -- smithwaterman -- OPERATOR 15 ~!~, -- smithwatermangotoh -- OPERATOR 16 ~*~, -- soundex FUNCTION 1 bttextcmp(text, text), FUNCTION 2 gin_extract_value_token(internal, internal, internal), FUNCTION 3 gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal), FUNCTION 4 gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal), STORAGE text; pg-similarity/pg_similarity--unpackaged--1.0.sql000066400000000000000000000111541276707706500220160ustar00rootroot00000000000000-- complain if script is sourced in psql, rather than via CREATE EXTENSION \echo Use "CREATE EXTENSION pg_similarity" to load this file. \quit -- Block ALTER EXTENSION pg_similarity ADD function block(text, text); ALTER EXTENSION pg_similarity ADD function block_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~++(text, text); -- Cosine ALTER EXTENSION pg_similarity ADD function cosine(text, text); ALTER EXTENSION pg_similarity ADD function cosine_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~##(text, text); -- Dice ALTER EXTENSION pg_similarity ADD function dice(text, text); ALTER EXTENSION pg_similarity ADD function dice_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~-~(text, text); -- Euclidean ALTER EXTENSION pg_similarity ADD function euclidean(text, text); ALTER EXTENSION pg_similarity ADD function euclidean_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~!!(text, text); -- Hamming ALTER EXTENSION pg_similarity ADD function hamming(varbit, varbit); ALTER EXTENSION pg_similarity ADD function hamming_op(varbit, varbit); ALTER EXTENSION pg_similarity ADD function hamming_text(text, text); ALTER EXTENSION pg_similarity ADD function hamming_text_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~@~(text, text); -- Jaccard ALTER EXTENSION pg_similarity ADD function jaccard(text, text); ALTER EXTENSION pg_similarity ADD function jaccard_op(text, 
text); ALTER EXTENSION pg_similarity ADD operator ~??(text, text); -- Jaro ALTER EXTENSION pg_similarity ADD function jaro(text, text); ALTER EXTENSION pg_similarity ADD function jaro_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~%%(text, text); -- Jaro-Winkler ALTER EXTENSION pg_similarity ADD function jarowinkler(text, text); ALTER EXTENSION pg_similarity ADD function jarowinkler_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~@@(text, text); -- Levenshtein ALTER EXTENSION pg_similarity ADD function lev(text, text); ALTER EXTENSION pg_similarity ADD function lev_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~==(text, text); -- just for academic purpose --ALTER EXTENSION pg_similarity ADD function levslow(text, text); --ALTER EXTENSION pg_similarity ADD function levslow_op(text, text); --ALTER EXTENSION pg_similarity ADD operator ~=^(text, text); -- those functions were created in earlier versions but are no longer included DROP FUNCTION levslow(text, text); DROP FUNCTION levslow_op(text, text); -- Matching Coefficient ALTER EXTENSION pg_similarity ADD function matchingcoefficient(text, text); ALTER EXTENSION pg_similarity ADD function matchingcoefficient_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~^^(text, text); -- Monge-Elkan ALTER EXTENSION pg_similarity ADD function mongeelkan(text, text); ALTER EXTENSION pg_similarity ADD function mongeelkan_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~||(text, text); -- Needleman-Wunsch ALTER EXTENSION pg_similarity ADD function needlemanwunsch(text, text); ALTER EXTENSION pg_similarity ADD function needlemanwunsch_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~#~(text, text); -- Overlap ALTER EXTENSION pg_similarity ADD function overlapcoefficient(text, text); ALTER EXTENSION pg_similarity ADD function overlapcoefficient_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~**(text, text); -- Q-Gram ALTER EXTENSION pg_similarity ADD 
function qgram(text, text); ALTER EXTENSION pg_similarity ADD function qgram_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~~~(text, text); -- Smith-Waterman ALTER EXTENSION pg_similarity ADD function smithwaterman(text, text); ALTER EXTENSION pg_similarity ADD function smithwaterman_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~=~(text, text); -- Smith-Waterman-Gotoh ALTER EXTENSION pg_similarity ADD function smithwatermangotoh(text, text); ALTER EXTENSION pg_similarity ADD function smithwatermangotoh_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~!~(text, text); -- Soundex ALTER EXTENSION pg_similarity ADD function soundex(text, text); ALTER EXTENSION pg_similarity ADD function soundex_op(text, text); ALTER EXTENSION pg_similarity ADD operator ~*~(text, text); -- GIN support ALTER EXTENSION pg_similarity ADD function gin_extract_value_token(internal, internal, internal); ALTER EXTENSION pg_similarity ADD function gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal); ALTER EXTENSION pg_similarity ADD function gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal); ALTER EXTENSION pg_similarity ADD operator class gin_similarity_ops using gin; pg-similarity/pg_similarity.conf.sample000066400000000000000000000053341276707706500207010ustar00rootroot00000000000000#----------------------------------------------------------------------- # postgresql.conf #----------------------------------------------------------------------- # shared_preload_libraries requires a restart every time you upgrade # pg_similarity; local_preload_libraries requires that you create a # $libdir/plugins directory and move pg_similarity.so into it (no restart # is needed; just open a new connection). 
#shared_preload_libraries = 'pg_similarity' # - or - #local_preload_libraries = 'pg_similarity' #----------------------------------------------------------------------- # pg_similarity #----------------------------------------------------------------------- # - Block - #pg_similarity.block_tokenizer = 'alnum' # alnum, camelcase, gram, or word #pg_similarity.block_threshold = 0.7 # 0.0 .. 1.0 #pg_similarity.block_is_normalized = true # - Cosine - #pg_similarity.cosine_tokenizer = 'alnum' #pg_similarity.cosine_threshold = 0.7 #pg_similarity.cosine_is_normalized = true # - Dice - #pg_similarity.dice_tokenizer = 'alnum' #pg_similarity.dice_threshold = 0.7 #pg_similarity.dice_is_normalized = true # - Euclidean - #pg_similarity.euclidean_tokenizer = 'alnum' #pg_similarity.euclidean_threshold = 0.7 #pg_similarity.euclidean_is_normalized = true # - Hamming - #pg_similarity.hamming_threshold = 0.7 #pg_similarity.hamming_is_normalized = true # - Jaccard - #pg_similarity.jaccard_tokenizer = 'alnum' #pg_similarity.jaccard_threshold = 0.7 #pg_similarity.jaccard_is_normalized = true # - Jaro - #pg_similarity.jaro_threshold = 0.7 #pg_similarity.jaro_is_normalized = true # - Jaro-Winkler - #pg_similarity.jarowinkler_threshold = 0.7 #pg_similarity.jarowinkler_is_normalized = true # - Levenshtein - #pg_similarity.levenshtein_threshold = 0.7 #pg_similarity.levenshtein_is_normalized = true # - Matching Coefficient - #pg_similarity.matching_tokenizer = 'alnum' #pg_similarity.matching_threshold = 0.7 #pg_similarity.matching_is_normalized = true # - Monge-Elkan - #pg_similarity.mongeelkan_tokenizer = 'alnum' #pg_similarity.mongeelkan_threshold = 0.7 #pg_similarity.mongeelkan_is_normalized = true # - Needleman-Wunsch - #pg_similarity.nw_threshold = 0.7 #pg_similarity.nw_is_normalized = true # - Overlap Coefficient - #pg_similarity.overlap_tokenizer = 'alnum' #pg_similarity.overlap_threshold = 0.7 
#pg_similarity.overlap_is_normalized = true # - Q-Gram - #pg_similarity.qgram_tokenizer = 'gram' #pg_similarity.qgram_threshold = 0.7 #pg_similarity.qgram_is_normalized = true # - Smith-Waterman - #pg_similarity.sw_threshold = 0.7 #pg_similarity.sw_is_normalized = true # - Smith-Waterman-Gotoh - #pg_similarity.swg_threshold = 0.7 #pg_similarity.swg_is_normalized = true pg-similarity/pg_similarity.control000066400000000000000000000002261276707706500201470ustar00rootroot00000000000000# pg_similarity extension comment = 'support similarity queries' default_version = '1.0' module_pathname = '$libdir/pg_similarity' relocatable = true pg-similarity/pg_similarity.sql.in000066400000000000000000000227701276707706500177030ustar00rootroot00000000000000-- keep this in sync with the pg_similarity--x.y.sql extension install file -- Adjust this setting to control where the objects get created. SET search_path = public; -- Block CREATE OR REPLACE FUNCTION block (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'block' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION block_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'block_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~++ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = block_op, COMMUTATOR = '~++', RESTRICT = contsel, JOIN = contjoinsel ); -- Cosine CREATE OR REPLACE FUNCTION cosine (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'cosine' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION cosine_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'cosine_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~## ( LEFTARG = text, RIGHTARG = text, PROCEDURE = cosine_op, COMMUTATOR = '~##', RESTRICT = contsel, JOIN = contjoinsel ); -- Dice CREATE OR REPLACE FUNCTION dice (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'dice' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION dice_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'dice_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~-~ ( LEFTARG = text, RIGHTARG = 
text, PROCEDURE = dice_op, COMMUTATOR = '~-~', RESTRICT = contsel, JOIN = contjoinsel ); -- Euclidean CREATE OR REPLACE FUNCTION euclidean (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'euclidean' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION euclidean_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'euclidean_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~!! ( LEFTARG = text, RIGHTARG = text, PROCEDURE = euclidean_op, COMMUTATOR = '~!!', RESTRICT = contsel, JOIN = contjoinsel ); -- Hamming CREATE OR REPLACE FUNCTION hamming (varbit, varbit) RETURNS float8 AS 'MODULE_PATHNAME','hamming' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION hamming_op (varbit, varbit) RETURNS bool AS 'MODULE_PATHNAME', 'hamming_op' LANGUAGE C STABLE STRICT; CREATE OR REPLACE FUNCTION hamming_text (text, text) RETURNS float8 AS 'MODULE_PATHNAME','hamming_text' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION hamming_text_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'hamming_text_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~@~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = hamming_text_op, COMMUTATOR = '~@~', RESTRICT = contsel, JOIN = contjoinsel ); -- Jaccard CREATE OR REPLACE FUNCTION jaccard (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'jaccard' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION jaccard_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'jaccard_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~?? 
( LEFTARG = text, RIGHTARG = text, PROCEDURE = jaccard_op, COMMUTATOR = '~??', RESTRICT = contsel, JOIN = contjoinsel ); -- Jaro CREATE OR REPLACE FUNCTION jaro (text, text) RETURNS float8 AS 'MODULE_PATHNAME','jaro' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION jaro_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'jaro_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~%% ( LEFTARG = text, RIGHTARG = text, PROCEDURE = jaro_op, COMMUTATOR = '~%%', RESTRICT = contsel, JOIN = contjoinsel ); -- Jaro-Winkler CREATE OR REPLACE FUNCTION jarowinkler (text, text) RETURNS float8 AS 'MODULE_PATHNAME','jarowinkler' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION jarowinkler_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'jarowinkler_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~@@ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = jarowinkler_op, COMMUTATOR = '~@@', RESTRICT = contsel, JOIN = contjoinsel ); -- Levenshtein CREATE OR REPLACE FUNCTION lev (text, text) RETURNS float8 AS 'MODULE_PATHNAME','lev' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION lev_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'lev_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~== ( LEFTARG = text, RIGHTARG = text, PROCEDURE = lev_op, COMMUTATOR = '~==', RESTRICT = contsel, JOIN = contjoinsel ); CREATE OR REPLACE FUNCTION levslow (text, text) RETURNS float8 AS 'MODULE_PATHNAME','levslow' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION levslow_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'levslow_op' LANGUAGE C STABLE STRICT; --CREATE OPERATOR ~=^ ( -- LEFTARG = text, -- RIGHTARG = text, -- PROCEDURE = levslow_op, -- COMMUTATOR = '~=^', -- RESTRICT = contsel, -- JOIN = contjoinsel --); -- Matching Coefficient CREATE OR REPLACE FUNCTION matchingcoefficient (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'matchingcoefficient' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION matchingcoefficient_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 
'matchingcoefficient_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~^^ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = matchingcoefficient_op, COMMUTATOR = '~^^', RESTRICT = contsel, JOIN = contjoinsel ); -- Monge-Elkan CREATE OR REPLACE FUNCTION mongeelkan (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'mongeelkan' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION mongeelkan_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'mongeelkan_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~|| ( LEFTARG = text, RIGHTARG = text, PROCEDURE = mongeelkan_op, COMMUTATOR = '~||', RESTRICT = contsel, JOIN = contjoinsel ); -- Needleman-Wunsch CREATE OR REPLACE FUNCTION needlemanwunsch (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'needlemanwunsch' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION needlemanwunsch_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'needlemanwunsch_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~#~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = needlemanwunsch_op, COMMUTATOR = '~#~', RESTRICT = contsel, JOIN = contjoinsel ); -- Overlap Coefficient CREATE OR REPLACE FUNCTION overlapcoefficient (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'overlapcoefficient' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION overlapcoefficient_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'overlapcoefficient_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~** ( LEFTARG = text, RIGHTARG = text, PROCEDURE = overlapcoefficient_op, COMMUTATOR = '~**', RESTRICT = contsel, JOIN = contjoinsel ); -- Q-Gram CREATE OR REPLACE FUNCTION qgram (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'qgram' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION qgram_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'qgram_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~~~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = qgram_op, COMMUTATOR = '~~~', RESTRICT = contsel, JOIN = contjoinsel ); -- Smith-Waterman CREATE OR REPLACE FUNCTION smithwaterman (text, 
text) RETURNS float8 AS 'MODULE_PATHNAME', 'smithwaterman' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION smithwaterman_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'smithwaterman_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~=~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = smithwaterman_op, COMMUTATOR = '~=~', RESTRICT = contsel, JOIN = contjoinsel ); -- Smith-Waterman-Gotoh CREATE OR REPLACE FUNCTION smithwatermangotoh (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'smithwatermangotoh' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION smithwatermangotoh_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'smithwatermangotoh_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~!~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = smithwatermangotoh_op, COMMUTATOR = '~!~', RESTRICT = contsel, JOIN = contjoinsel ); -- Soundex CREATE OR REPLACE FUNCTION soundex (text, text) RETURNS float8 AS 'MODULE_PATHNAME', 'soundex' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION soundex_op (text, text) RETURNS bool AS 'MODULE_PATHNAME', 'soundex_op' LANGUAGE C STABLE STRICT; CREATE OPERATOR ~*~ ( LEFTARG = text, RIGHTARG = text, PROCEDURE = soundex_op, COMMUTATOR = '~*~', RESTRICT = contsel, JOIN = contjoinsel ); -- -- GIN support -- CREATE OR REPLACE FUNCTION gin_extract_value_token(internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal) RETURNS internal AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; CREATE OR REPLACE FUNCTION gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal) RETURNS bool AS 'MODULE_PATHNAME' LANGUAGE C IMMUTABLE STRICT; CREATE OPERATOR CLASS gin_similarity_ops FOR TYPE text USING gin AS OPERATOR 1 ~++, -- block OPERATOR 2 ~##, -- cosine OPERATOR 3 ~-~, -- dice OPERATOR 4 ~!!, -- euclidean OPERATOR 5 ~??, -- jaccard -- OPERATOR 6 ~%%, -- 
jaro -- OPERATOR 7 ~@@, -- jarowinkler -- OPERATOR 8 ~==, -- lev OPERATOR 9 ~^^, -- matchingcoefficient -- OPERATOR 10 ~||, -- mongeelkan -- OPERATOR 11 ~#~, -- needlemanwunsch OPERATOR 12 ~**, -- overlapcoefficient OPERATOR 13 ~~~, -- qgram -- OPERATOR 14 ~=~, -- smithwaterman -- OPERATOR 15 ~!~, -- smithwatermangotoh -- OPERATOR 16 ~*~, -- soundex FUNCTION 1 bttextcmp(text, text), FUNCTION 2 gin_extract_value_token(internal, internal, internal), FUNCTION 3 gin_extract_query_token(internal, internal, int2, internal, internal, internal, internal), FUNCTION 4 gin_token_consistent(internal, int2, internal, int4, internal, internal, internal, internal), STORAGE text; pg-similarity/qgram.c000066400000000000000000000041721276707706500151500ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * qgram.c * * Q-Gram distance * * This function is the same as block (L1 distance); the only difference is that * it uses only tokenizeByGram. * * For example: * * x: euler = {eu, ul, le, er} * y: heuser = {he, eu, us, se, er} * t: {eu, ul, le, er, he, us, se} * * eu ul le er he us se * s = |1 - 1| + |1 - 0| + |1 - 0| + |1 - 1| + |0 - 1| + |0 - 1| + |0 - 1| = 5 * * PS> we call n-grams: (i) n-sequence of letters (ii) n-sequence of words * * http://en.wikipedia.org/wiki/Block_distance * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include "tokenizer.h" /* GUC variables */ int pgs_qgram_tokenizer = PGS_UNIT_GRAM; double pgs_qgram_threshold = 0.7f; bool pgs_qgram_is_normalized = true; PG_FUNCTION_INFO_V1(qgram); Datum qgram(PG_FUNCTION_ARGS) { float8 res; bool tmp; int tmp2; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ tmp = pgs_block_is_normalized; pgs_block_is_normalized = pgs_qgram_is_normalized; /* * store *_tokenizer value temporarily 'cause * we're 
using block function */ tmp2 = pgs_block_tokenizer; pgs_block_tokenizer = pgs_qgram_tokenizer; res = DatumGetFloat8(DirectFunctionCall2( block, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_block_is_normalized = tmp; pgs_block_tokenizer = tmp2; PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(qgram_op); Datum qgram_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_qgram_is_normalized; pgs_qgram_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( qgram, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_qgram_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_qgram_threshold); } pg-similarity/similarity.c000066400000000000000000000354011276707706500162260ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * similarity.c * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" #include <limits.h> PG_MODULE_MAGIC; /* * Monge-Elkan approximate sets */ static char *approx_set[7] = { "dt", "gj", "lr", "mn", "bpv", "aeiou", ",." 
}; /* * cost functions */ int levcost(char a, char b) { if (a == b) return PGS_LEV_MIN_COST; else return PGS_LEV_MAX_COST; } /* * TODO change it to a callback function */ int nwcost(char a, char b) { if (a == 'a' && b == 'a') return 10; else if (a == 'a' && b == 'g') return -1; else if (a == 'a' && b == 'c') return -3; else if (a == 'a' && b == 't') return -4; else if (a == 'g' && b == 'a') return -1; else if (a == 'g' && b == 'g') return 7; else if (a == 'g' && b == 'c') return -5; else if (a == 'g' && b == 't') return -3; else if (a == 'c' && b == 'a') return -3; else if (a == 'c' && b == 'g') return -5; else if (a == 'c' && b == 'c') return 9; else if (a == 'c' && b == 't') return 0; else if (a == 't' && b == 'a') return -4; else if (a == 't' && b == 'g') return -3; else if (a == 't' && b == 'c') return 0; else if (a == 't' && b == 't') return 8; else return -99; /* shouldn't happen */ } float swcost(char *a, char *b, int i, int j) { /* XXX paranoia? check for out-of-range index */ if (i < 0 || i >= strlen(a)) return 0.0; if (j < 0 || j >= strlen(b)) return 0.0; if (a[i] == b[j]) return PGS_SW_MAX_COST; else return PGS_SW_MIN_COST; } float swggapcost(int i, int j) { if (i >= j) return 0.0; else return (5.0 + ((j - 1) - i)); } float megapcost(char *a, char *b, int i, int j) { int k; /* XXX paranoia? 
check for out-of-range index */ if (i < 0 || i >= strlen(a)) return -3.0; if (j < 0 || j >= strlen(b)) return -3.0; if (a[i] == b[j]) return 5.0; for (k = 0; k < 7; k++) { if (strchr(approx_set[k], a[i]) != NULL && strchr(approx_set[k], b[j]) != NULL) return 3.0; } return -3.0; } /* * Module load callback * * Holds GUC variables that cause some behavior changes in similarity functions * */ void _PG_init(void) { static const struct config_enum_entry pgs_tokenizer_options[] = { {"alnum", PGS_UNIT_ALNUM, false}, {"gram", PGS_UNIT_GRAM, false}, {"word", PGS_UNIT_WORD, false}, {"camelcase", PGS_UNIT_CAMELCASE, false}, {NULL, 0, false} }; static const struct config_enum_entry pgs_gram_options[] = { {"gram", PGS_UNIT_GRAM, false}, {NULL, 0, false} }; /* Block */ DefineCustomEnumVariable("pg_similarity.block_tokenizer", "Sets the tokenizer for Block similarity function.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_block_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.block_threshold", "Sets the threshold used by the Block similarity function.", "Valid range is 0.0 .. 
1.0.", &pgs_block_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.block_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_block_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Cosine */ DefineCustomEnumVariable("pg_similarity.cosine_tokenizer", "Sets the tokenizer for Cosine similarity function.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_cosine_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.cosine_threshold", "Sets the threshold used by the Cosine similarity function.", "Valid range is 0.0 .. 1.0.", &pgs_cosine_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.cosine_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_cosine_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Dice */ DefineCustomEnumVariable("pg_similarity.dice_tokenizer", "Sets the tokenizer for Dice similarity measure.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_dice_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.dice_threshold", "Sets the threshold used by the Dice similarity measure.", "Valid range is 0.0 .. 
1.0.", &pgs_dice_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.dice_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_dice_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Euclidean */ DefineCustomEnumVariable("pg_similarity.euclidean_tokenizer", "Sets the tokenizer for Euclidean similarity measure.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_euclidean_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.euclidean_threshold", "Sets the threshold used by the Euclidean similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_euclidean_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.euclidean_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_euclidean_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Hamming */ DefineCustomRealVariable("pg_similarity.hamming_threshold", "Sets the threshold used by the Hamming similarity metric.", "Valid range is 0.0 .. 
1.0.", &pgs_hamming_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.hamming_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_hamming_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Jaccard */ DefineCustomEnumVariable("pg_similarity.jaccard_tokenizer", "Sets the tokenizer for Jaccard similarity measure.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_jaccard_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.jaccard_threshold", "Sets the threshold used by the Jaccard similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_jaccard_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.jaccard_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_jaccard_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Jaro */ DefineCustomRealVariable("pg_similarity.jaro_threshold", "Sets the threshold used by the Jaro similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_jaro_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.jaro_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_jaro_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Jaro-Winkler */ DefineCustomRealVariable("pg_similarity.jarowinkler_threshold", "Sets the threshold used by the Jaro-Winkler similarity measure.", "Valid range is 0.0 .. 
1.0.", &pgs_jarowinkler_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.jarowinkler_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_jarowinkler_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Levenshtein */ DefineCustomRealVariable("pg_similarity.levenshtein_threshold", "Sets the threshold used by the Levenshtein similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_levenshtein_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.levenshtein_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_levenshtein_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Matching Coefficient */ DefineCustomEnumVariable("pg_similarity.matching_tokenizer", "Sets the tokenizer for Matching Coefficient similarity measure.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_matching_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.matching_threshold", "Sets the threshold used by the Matching Coefficient similarity measure.", "Valid range is 0.0 .. 
1.0.", &pgs_matching_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.matching_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_matching_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Monge-Elkan */ DefineCustomEnumVariable("pg_similarity.mongeelkan_tokenizer", "Sets the tokenizer for Monge-Elkan similarity measure.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_mongeelkan_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.mongeelkan_threshold", "Sets the threshold used by the Monge-Elkan similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_mongeelkan_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.mongeelkan_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_mongeelkan_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Needleman-Wunsch */ DefineCustomRealVariable("pg_similarity.nw_threshold", "Sets the threshold used by the Needleman-Wunsch similarity measure.", "Valid range is 0.0 .. 
1.0.", &pgs_nw_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.nw_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_nw_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.nw_gap_penalty", "Sets the gap penalty used by the Needleman-Wunsch similarity measure.", NULL, &pgs_nw_gap_penalty, -5.0, LONG_MIN, LONG_MAX, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Overlap Coefficient */ DefineCustomEnumVariable("pg_similarity.overlap_tokenizer", "Sets the tokenizer for Overlap Coefficient similarity measure.", "Valid values are \"alnum\", \"gram\", \"word\", or \"camelcase\".", &pgs_overlap_tokenizer, PGS_UNIT_ALNUM, pgs_tokenizer_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.overlap_threshold", "Sets the threshold used by the Overlap Coefficient similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_overlap_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.overlap_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_overlap_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Q-Gram */ DefineCustomEnumVariable("pg_similarity.qgram_tokenizer", "Sets the tokenizer for Q-Gram similarity function.", "Valid value is \"gram\".", &pgs_qgram_tokenizer, PGS_UNIT_GRAM, pgs_gram_options, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomRealVariable("pg_similarity.qgram_threshold", "Sets the threshold used by the Q-Gram similarity function.", "Valid range is 0.0 .. 
1.0.", &pgs_qgram_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.qgram_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_qgram_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Smith-Waterman */ DefineCustomRealVariable("pg_similarity.sw_threshold", "Sets the threshold used by the Smith-Waterman similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_sw_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.sw_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_sw_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); /* Smith-Waterman-Gotoh */ DefineCustomRealVariable("pg_similarity.swg_threshold", "Sets the threshold used by the Smith-Waterman-Gotoh similarity measure.", "Valid range is 0.0 .. 1.0.", &pgs_swg_threshold, 0.7, 0.0, 1.0, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); DefineCustomBoolVariable("pg_similarity.swg_is_normalized", "Sets if the result value is normalized or not.", NULL, &pgs_swg_is_normalized, true, PGC_USERSET, 0, #if PG_VERSION_NUM >= 90100 NULL, #endif NULL, NULL); } pg-similarity/similarity.h000066400000000000000000000153331276707706500162350ustar00rootroot00000000000000#ifndef SIMILARITY_H #define SIMILARITY_H #include "postgres.h" #include "fmgr.h" #include "utils/builtins.h" #include "utils/guc.h" /* * XXX Windows workaround */ #ifndef WIN32 #define PGS_EXPORT #else #define PGS_EXPORT __declspec(dllexport) /* * PG_MODULE_MAGIC and PG_FUNCTION_INFO_V1 macros seems to be broken. * It uses PGDLLIMPORT, but those objects are not imported from postgres * and exported from the user module. So, it should be always dllexported. 
*/ #undef PG_MODULE_MAGIC #define PG_MODULE_MAGIC \ extern PGS_EXPORT const Pg_magic_struct *PG_MAGIC_FUNCTION_NAME(void); \ const Pg_magic_struct * \ PG_MAGIC_FUNCTION_NAME(void) \ { \ static const Pg_magic_struct Pg_magic_data = PG_MODULE_MAGIC_DATA; \ return &Pg_magic_data; \ } \ extern int no_such_variable #undef PG_FUNCTION_INFO_V1 #define PG_FUNCTION_INFO_V1(funcname) \ extern PGS_EXPORT const Pg_finfo_record * CppConcat(pg_finfo_,funcname)(void); \ const Pg_finfo_record * \ CppConcat(pg_finfo_,funcname) (void) \ { \ static const Pg_finfo_record my_finfo = { 1 }; \ return &my_finfo; \ } \ extern int no_such_variable #endif /* Windows workaround */ /* case insensitive ? */ #define PGS_IGNORE_CASE 1 /* maximum string length */ #define PGS_MAX_STR_LEN 1024 /* * Jaro */ /* operation's weight */ #define PGS_JARO_W1 1.0/3.0 #define PGS_JARO_W2 1.0/3.0 #define PGS_JARO_WT 1.0/3.0 /* size of the initial prefix considered */ #define PGS_JARO_PREFIX_SIZE 4 /* scaling factor */ #define PGS_JARO_SCALING_FACTOR 0.1 /* minimum score for a string that gets boosted */ #define PGS_JARO_BOOST_THRESHOLD 0.7 /* * Levenshtein */ #define PGS_LEV_MIN_COST 0 #define PGS_LEV_MAX_COST 1 /* * Needleman-Wunsch */ /* * Smith-Waterman */ /* XXX simmetrics uses these values #define PGS_SW_MIN_COST -2.0 #define PGS_SW_MAX_COST 1.0 #define PGS_SW_GAP_COST 0.5 */ #define PGS_SW_MIN_COST -1.0 #define PGS_SW_MAX_COST 2.0 #define PGS_SW_GAP_COST -1.0 /* * Smith-Waterman-Gotoh */ #define PGS_SWG_WINDOW_SIZE 100 /* * Soundex */ #define PGS_SOUNDEX_LEN 4 #define PGS_SOUNDEX_INV_CODE -1 /* * commonly used functions */ #define min2(a, b) ((a < b) ? a : b) #define max2(a, b) ((a > b) ? a : b) #define min3(a, b, c) ((a < b && a < c) ? a : ((b < c) ? b : c)) #define max3(a, b, c) ((a > b && a > c) ? a : ((b > c) ? b : c)) #define max4(a, b, c, d) ((a > b && a > c && a > d) ? a : ((b > c && b > d) ? b : ((c > d) ? c : d))) /* * normalized results? 
*/ extern bool pgs_block_is_normalized; extern bool pgs_cosine_is_normalized; extern bool pgs_dice_is_normalized; extern bool pgs_euclidean_is_normalized; extern bool pgs_hamming_is_normalized; extern bool pgs_jaccard_is_normalized; extern bool pgs_jaro_is_normalized; extern bool pgs_jarowinkler_is_normalized; extern bool pgs_levenshtein_is_normalized; extern bool pgs_matching_is_normalized; extern bool pgs_mongeelkan_is_normalized; extern bool pgs_nw_is_normalized; extern bool pgs_overlap_is_normalized; extern bool pgs_qgram_is_normalized; extern bool pgs_sw_is_normalized; extern bool pgs_swg_is_normalized; /* * how to separate things? */ enum { PGS_UNIT_WORD, /* tokenize by spaces */ PGS_UNIT_GRAM, /* tokenize by n-gram */ PGS_UNIT_ALNUM, /* tokenize by nonalnum characters */ PGS_UNIT_CAMELCASE /* tokenize by camel-case */ }; /* * tokenizers per function */ extern int pgs_block_tokenizer; extern int pgs_cosine_tokenizer; extern int pgs_dice_tokenizer; extern int pgs_euclidean_tokenizer; extern int pgs_jaccard_tokenizer; extern int pgs_matching_tokenizer; extern int pgs_mongeelkan_tokenizer; extern int pgs_overlap_tokenizer; extern int pgs_qgram_tokenizer; /* * thresholds per function */ extern float8 pgs_block_threshold; extern float8 pgs_cosine_threshold; extern float8 pgs_dice_threshold; extern float8 pgs_euclidean_threshold; extern float8 pgs_hamming_threshold; extern float8 pgs_jaccard_threshold; extern float8 pgs_jaro_threshold; extern float8 pgs_jarowinkler_threshold; extern float8 pgs_levenshtein_threshold; extern float8 pgs_matching_threshold; extern float8 pgs_mongeelkan_threshold; extern float8 pgs_nw_threshold; extern float8 pgs_overlap_threshold; extern float8 pgs_qgram_threshold; extern float8 pgs_sw_threshold; extern float8 pgs_swg_threshold; /* * gap penalty */ extern float8 pgs_nw_gap_penalty; /* * levenshtein.c */ int _lev(char *a, char *b, int icost, int dcost); int _lev_slow(char *a, char *b, int icost, int dcost); /* * similarity.c */ int 
levcost(char a, char b); int nwcost(char a, char b); float swcost(char *a, char *b, int i, int j); float swggapcost(int i, int j); float megapcost(char *a, char *b, int i, int j); void _PG_init(void); /* * external function declarations */ extern Datum PGS_EXPORT block(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT block_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT cosine(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT cosine_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT dice(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT dice_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT euclidean(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT euclidean_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT hamming(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT hamming_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT hamming_text(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT hamming_text_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT jaccard(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT jaccard_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT jaro(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT jaro_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT jarowinkler(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT jarowinkler_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT lev(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT lev_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT levslow(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT levslow_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT matchingcoefficient(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT matchingcoefficient_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT mongeelkan(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT mongeelkan_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT needlemanwunsch(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT needlemanwunsch_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT overlapcoefficient(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT overlapcoefficient_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT qgram(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT qgram_op(PG_FUNCTION_ARGS); extern Datum 
PGS_EXPORT smithwaterman(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT smithwaterman_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT smithwatermangotoh(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT smithwatermangotoh_op(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT soundex(PG_FUNCTION_ARGS); extern Datum PGS_EXPORT soundex_op(PG_FUNCTION_ARGS); #endif pg-similarity/similarity_gin.c000066400000000000000000000106621276707706500170650ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * similarity_gin.c * * GIN support routines * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "postgres.h" #include "access/gin.h" #include "access/skey.h" /* #include "access/reloptions.h" #include "utils/guc.h" #include "utils/syscache.h" */ #include "similarity.h" #include "tokenizer.h" /* choose one of them */ /* #define PGS_BY_WORD 1 */ #define PGS_BY_ALNUM 1 /* #define PGS_BY_GRAM 1 #define PGS_BY_CAMELCASE 1 */ PG_FUNCTION_INFO_V1(gin_extract_value_token); Datum gin_extract_value_token(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(gin_extract_query_token); Datum gin_extract_query_token(PG_FUNCTION_ARGS); PG_FUNCTION_INFO_V1(gin_token_consistent); Datum gin_token_consistent(PG_FUNCTION_ARGS); Datum gin_extract_value_token(PG_FUNCTION_ARGS) { text *value = (text *) PG_GETARG_TEXT_P(0); int32 *ntokens = (int32 *) PG_GETARG_POINTER(1); Datum *tokens = NULL; char *buf; elog(DEBUG3, "gin_extract_value_token() called"); buf = text_to_cstring(value); *ntokens = 0; if (buf != NULL) { TokenList *tlist; Token *t; tlist = initTokenList(1); /* * TODO we want to index according to out GUCs * TODO we need to store the tokenized-by information * TODO so we can decide to use or not the index for a * TODO given query */ #ifdef PGS_BY_WORD tokenizeBySpace(tlist, buf); #elif PGS_BY_ALNUM tokenizeByNonAlnum(tlist, buf); #elif PGS_BY_GRAM tokenizeByGram(tlist, buf); 
#elif PGS_BY_CAMELCASE tokenizeByCamelCase(tlist, buf); #else elog(ERROR, "choose a supported tokenizer"); #endif *ntokens = tlist->size; if (tlist->size > 0) { int i; tokens = (Datum *) palloc(sizeof(Datum) * tlist->size); t = tlist->head; for (i = 0; i < tlist->size; i++) { text *td; td = cstring_to_text_with_len(t->data, strlen(t->data)); tokens[i] = PointerGetDatum(td); t = t->next; } } destroyTokenList(tlist); } PG_FREE_IF_COPY(value, 0); PG_RETURN_POINTER(tokens); } Datum gin_extract_query_token(PG_FUNCTION_ARGS) { text *value = (text *) PG_GETARG_TEXT_P(0); int32 *ntokens = (int32 *) PG_GETARG_POINTER(1); /* StrategyNumber strategy = PG_GETARG_UINT16(2); bool **pmatch = (bool **) PG_GETARG_POINTER(3); Pointer **extra_data = (Pointer *) PG_GETARG_POINTER(4); */ #if PG_VERSION_NUM >= 90100 /* bool **null_flags = (bool **) PG_GETARG_POINTER(5); */ int32 *search_mode = (int32 *) PG_GETARG_POINTER(6); #endif Datum *tokens = NULL; char *buf; elog(DEBUG3, "gin_extract_query_token() called"); buf = text_to_cstring(value); *ntokens = 0; if (buf != NULL) { TokenList *tlist; Token *t; tlist = initTokenList(1); /* * TODO we want to index according to out GUCs * TODO we need to store the tokenized-by information * TODO so we can decide to use or not the index for a * TODO given query */ #ifdef PGS_BY_WORD tokenizeBySpace(tlist, buf); #elif PGS_BY_ALNUM tokenizeByNonAlnum(tlist, buf); #elif PGS_BY_GRAM tokenizeByGram(tlist, buf); #elif PGS_BY_CAMELCASE tokenizeByCamelCase(tlist, buf); #else elog(ERROR, "choose a supported tokenizer"); #endif *ntokens = tlist->size; if (tlist->size > 0) { int i; tokens = (Datum *) palloc(sizeof(Datum) * tlist->size); t = tlist->head; for (i = 0; i < tlist->size; i++) { text *td; td = cstring_to_text_with_len(t->data, strlen(t->data)); tokens[i] = PointerGetDatum(td); t = t->next; } } destroyTokenList(tlist); } #if PG_VERSION_NUM >= 90100 if (*ntokens == 0) *search_mode = GIN_SEARCH_MODE_ALL; #endif PG_FREE_IF_COPY(value, 0); 
PG_RETURN_POINTER(tokens); } Datum gin_token_consistent(PG_FUNCTION_ARGS) { /* bool *check = (bool *) PG_GETARG_POINTER(0); StrategyNumber strategy = PG_GETARG_UINT16(1); text *query = PG_GETARG_TEXT_P(2); int32 ntokens = PG_GETARG_INT32(3); Pointer *extra_data = (Pointer *) PG_GETARG_POINTER(4); */ bool *recheck = (bool *) PG_GETARG_POINTER(5); /* #if PG_VERSION_NUM >= 90100 Datum **query_tokens = PG_GETARG_POINTER(6); bool **null_flags = (bool **) PG_GETARG_POINTER(7); #endif */ elog(DEBUG3, "gin_token_consistent() called"); /* * The heap tuple might match the query, so request a recheck: the query * operator is evaluated directly against the originally indexed item. */ *recheck = true; PG_RETURN_BOOL(true); } pg-similarity/smithwaterman.c000066400000000000000000000143601276707706500167240ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * smithwaterman.c * * Smith-Waterman is an algorithm that performs a local alignment on two * sequences. * * It is a dynamic programming algorithm used for biological sequence * comparison. The operation costs (scores) are specified by a similarity * matrix. It also uses a linear gap penalty (like Levenshtein). 
* * For example: * * similarity matrix * * +-------------------+ * | | A | C | A | C | * +-------------------+ * | A | 2 | 1 | 2 | 1 | * +-------------------+ * | G | 1 | 1 | 1 | 1 | * +-------------------+ * | C | 0 | 3 | 2 | 3 | * +-------------------+ * | A | 2 | 2 | 5 | 4 | * +-------------------+ * * x: ACACACTA * y: AGCACACA * match cost: 2 * mismatch cost: -1 * insertion cost: -1 * deletion cost: -1 * * +---------------------------------------+ * | | | A | C | A | C | A | C | T | A | * +-------------------------------------------+ * | | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | * +-------------------------------------------+ * | A | 0 | 2 | 1 | 2 | 1 | 2 | 1 | 0 | 2 | * +-------------------------------------------+ * | G | 0 | 1 | 1 | 1 | 1 | 1 | 1 | 0 | 1 | * +-------------------------------------------+ * | C | 0 | 0 | 3 | 2 | 3 | 2 | 3 | 2 | 1 | * +-------------------------------------------+ * | A | 0 | 2 | 2 | 5 | 4 | 5 | 4 | 3 | 4 | * +-------------------------------------------+ * | C | 0 | 1 | 4 | 4 | 7 | 6 | 7 | 6 | 5 | * +-------------------------------------------+ * | A | 0 | 2 | 3 | 6 | 6 | 9 | 8 | 7 | 8 | * +-------------------------------------------+ * | C | 0 | 1 | 4 | 5 | 8 | 8 | 11 | 10 | 9 | * +-------------------------------------------+ * | A | 0 | 2 | 3 | 6 | 7 | 10 | 10 | 10 | 12 | * +-------------------------------------------+ * * http://en.wikipedia.org/wiki/Smith%E2%80%93Waterman_algorithm * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" /* GUC variables */ double pgs_sw_threshold = 0.7f; bool pgs_sw_is_normalized = true; /* * TODO move this function to similarity.c */ static double _smithwaterman(char *a, char *b) { float **matrix; /* dynamic programming matrix */ int alen, blen; int i, j; double maxvalue; alen = strlen(a); blen = strlen(b); elog(DEBUG2, "alen: %d; blen: %d", alen, blen); if (alen == 0) return blen; if 
(blen == 0) return alen; matrix = (float **) malloc((alen + 1) * sizeof(float *)); if (matrix == NULL) elog(ERROR, "memory exhausted for array size %d", alen); for (i = 0; i <= alen; i++) { matrix[i] = (float *) malloc((blen + 1) * sizeof(float)); if (matrix[i] == NULL) elog(ERROR, "memory exhausted for array size %d", blen); } #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case sensitivity is turned off"); for (i = 0; i < alen; i++) a[i] = tolower(a[i]); for (j = 0; j < blen; j++) b[j] = tolower(b[j]); #endif maxvalue = 0.0; /* initial values */ for (i = 0; i <= alen; i++) { /* XXX why does simmetrics do it this way? XXX original algorithm initializes first column with zeros float c = swcost(a, b, i, 0); if (i == 0) matrix[0][0] = max3(0.0, -1 * PGS_SW_GAP_COST, c); else matrix[i][0] = max3(0.0, matrix[i-1][0] - PGS_SW_GAP_COST, c); if (matrix[i][0] > maxvalue) maxvalue = matrix[i][0]; */ matrix[i][0] = 0.0; } for (j = 0; j <= blen; j++) { /* XXX why does simmetrics do it this way? XXX original algorithm initializes first row with zeros float c = swcost(a, b, 0, j); if (j == 0) matrix[0][0] = max3(0.0, -1 * PGS_SW_GAP_COST, c); else matrix[0][j] = max3(0.0, matrix[0][j-1] - PGS_SW_GAP_COST, c); if (matrix[0][j] > maxvalue) maxvalue = matrix[0][j]; */ matrix[0][j] = 0.0; } for (i = 1; i <= alen; i++) { for (j = 1; j <= blen; j++) { /* get operation cost */ float c = swcost(a, b, i - 1, j - 1); matrix[i][j] = max4(0.0, matrix[i-1][j] + PGS_SW_GAP_COST, matrix[i][j-1] + PGS_SW_GAP_COST, matrix[i-1][j-1] + c); elog(DEBUG2, "(i, j) = (%d, %d); cost(%c, %c): %.3f; max(zero, top, left, diag) = (0.0, %.3f, %.3f, %.3f) = %.3f -- %.3f (%d, %d)", i, j, a[i-1], b[j-1], c, matrix[i-1][j] + PGS_SW_GAP_COST, matrix[i][j-1] + PGS_SW_GAP_COST, matrix[i-1][j-1] + c, matrix[i][j], matrix[i][j-1], i, j-1); if (matrix[i][j] > maxvalue) maxvalue = matrix[i][j]; } } for (i = 0; i <= alen; i++) for (j = 0; j <= blen; j++) elog(DEBUG1, "(%d, %d) = %.3f", i, j, matrix[i][j]); for (i = 0; i <= alen; i++) 
free(matrix[i]); free(matrix); return maxvalue; } PG_FUNCTION_INFO_V1(smithwaterman); Datum smithwaterman(PG_FUNCTION_ARGS) { char *a, *b; double maxvalue; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); maxvalue = (float8) min2(strlen(a), strlen(b)); res = _smithwaterman(a, b); elog(DEBUG1, "is normalized: %d", pgs_sw_is_normalized); elog(DEBUG1, "maximum length: %.3f", maxvalue); elog(DEBUG1, "swdistance(%s, %s) = %.3f", a, b, res); if (maxvalue == 0.0) { res = 1.0; } if (pgs_sw_is_normalized) { if (PGS_SW_MAX_COST > (-1 * PGS_SW_GAP_COST)) maxvalue *= PGS_SW_MAX_COST; else maxvalue *= -1 * PGS_SW_GAP_COST; /* paranoia ? */ if (maxvalue == 0.0) res = 1.0; else res = (res / maxvalue); } elog(DEBUG1, "sw(%s, %s) = %.3f", a, b, res); PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(smithwaterman_op); Datum smithwaterman_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_sw_is_normalized; pgs_sw_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( smithwaterman, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_sw_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_sw_threshold); } pg-similarity/smithwatermangotoh.c000066400000000000000000000115001276707706500177560ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * smithwatermangotoh.c * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" double pgs_swg_threshold = 
0.7f; bool pgs_swg_is_normalized = true; /* * TODO move this function to similarity.c */ static double _smithwatermangotoh(char *a, char *b) { float **matrix; /* dynamic programming matrix */ int alen, blen; int i, j; double maxvalue; alen = strlen(a); blen = strlen(b); elog(DEBUG2, "alen: %d; blen: %d", alen, blen); if (alen == 0) return blen; if (blen == 0) return alen; matrix = (float **) malloc((alen + 1) * sizeof(float *)); if (matrix == NULL) elog(ERROR, "memory exhausted for array size %d", alen); for (i = 0; i <= alen; i++) { matrix[i] = (float *) malloc((blen + 1) * sizeof(float)); if (matrix[i] == NULL) elog(ERROR, "memory exhausted for array size %d", blen); } #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case sensitivity is turned off"); for (i = 0; i < alen; i++) a[i] = tolower(a[i]); for (j = 0; j < blen; j++) b[j] = tolower(b[j]); #endif maxvalue = 0.0; /* initial values */ for (i = 0; i <= alen; i++) { float c = megapcost(a, b, i, 0); if (i == 0) { matrix[0][0] = max2(0.0, c); } else { float maxgapcost = 0.0; int wstart = i - PGS_SWG_WINDOW_SIZE; int k; if (wstart < 1) wstart = 1; for (k = wstart; k < i; k++) maxgapcost = max2(maxgapcost, matrix[i - k][0] - swggapcost(i - k, i)); matrix[i][0] = max3(0.0, maxgapcost, c); } if (matrix[i][0] > maxvalue) maxvalue = matrix[i][0]; } for (j = 0; j <= blen; j++) { float c = megapcost(a, b, 0, j); if (j == 0) { matrix[0][0] = max2(0.0, c); } else { float maxgapcost = 0.0; int wstart = j - PGS_SWG_WINDOW_SIZE; int k; if (wstart < 1) wstart = 1; for (k = wstart; k < j; k++) maxgapcost = max2(maxgapcost, matrix[0][j - k] - swggapcost(j - k, j)); matrix[0][j] = max3(0.0, maxgapcost, c); } if (matrix[0][j] > maxvalue) maxvalue = matrix[0][j]; } for (i = 1; i <= alen; i++) { for (j = 1; j <= blen; j++) { int wstart; int k; float maxgapcost1 = 0.0, maxgapcost2 = 0.0; /* get operation cost */ float c = megapcost(a, b, i, j); wstart = i - PGS_SWG_WINDOW_SIZE; if (wstart < 1) wstart = 1; for (k = wstart; k < i; k++) maxgapcost1 = 
max2(maxgapcost1, matrix[i - k][0] - swggapcost(i - k, i)); wstart = j - PGS_SWG_WINDOW_SIZE; if (wstart < 1) wstart = 1; for (k = wstart; k < j; k++) maxgapcost2 = max2(maxgapcost2, matrix[0][j - k] - swggapcost(j - k, j)); matrix[i][j] = max4(0.0, maxgapcost1, maxgapcost2, matrix[i-1][j-1] + c); elog(DEBUG2, "(i, j) = (%d, %d); cost(%c, %c): %.3f; max(zero, top, left, diag) = (0.0, %.3f, %.3f, %.3f) = %.3f", i, j, a[i-1], b[j-1], c, maxgapcost1, maxgapcost2, matrix[i-1][j-1] + c, matrix[i][j]); if (matrix[i][j] > maxvalue) maxvalue = matrix[i][j]; } } for (i = 0; i <= alen; i++) free(matrix[i]); free(matrix); return maxvalue; } PG_FUNCTION_INFO_V1(smithwatermangotoh); Datum smithwatermangotoh(PG_FUNCTION_ARGS) { char *a, *b; double maxvalue; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); maxvalue = (float8) min2(strlen(a), strlen(b)); res = _smithwatermangotoh(a, b); elog(DEBUG1, "is normalized: %d", pgs_swg_is_normalized); elog(DEBUG1, "maximum length: %.3f", maxvalue); elog(DEBUG1, "swgdistance(%s, %s) = %.3f", a, b, res); if (maxvalue == 0) { res = 1.0; } if (pgs_swg_is_normalized) { if (PGS_SW_MAX_COST > (-1 * PGS_SW_GAP_COST)) maxvalue *= PGS_SW_MAX_COST; else maxvalue *= -1 * PGS_SW_GAP_COST; /* paranoia ? 
*/ if (maxvalue == 0.0) res = 1.0; else res = (res / maxvalue); } elog(DEBUG1, "swg(%s, %s) = %.3f", a, b, res); PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(smithwatermangotoh_op); Datum smithwatermangotoh_op(PG_FUNCTION_ARGS) { float8 res; /* * store *_is_normalized value temporarily 'cause * threshold (we're comparing against) is normalized */ bool tmp = pgs_swg_is_normalized; pgs_swg_is_normalized = true; res = DatumGetFloat8(DirectFunctionCall2( smithwatermangotoh, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); /* we're done; back to the previous value */ pgs_swg_is_normalized = tmp; PG_RETURN_BOOL(res >= pgs_swg_threshold); } pg-similarity/soundex.c000066400000000000000000000056071276707706500155320ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * soundex.c * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "similarity.h" static const char *stable = /* ABCDEFGHIJKLMNOPQRSTUVWXYZ */ "01230120022455012623010202"; /* * soundex codes are only defined for ASCII letters */ static char convert_soundex(char a) { a = toupper((unsigned char) a); /* any character outside A-Z is returned unchanged */ if (a >= 'A' && a <= 'Z') return stable[a - 'A']; else return a; } static char *_soundex(char *a) { int alen; int i; int len; char *scode; int lastcode = PGS_SOUNDEX_INV_CODE; alen = strlen(a); elog(DEBUG2, "alen: %d", alen); if (alen == 0) return NULL; #ifdef PGS_IGNORE_CASE elog(DEBUG2, "case sensitivity is turned off"); for (i = 0; i < alen; i++) a[i] = toupper(a[i]); #endif scode = palloc(PGS_SOUNDEX_LEN + 1); scode[PGS_SOUNDEX_LEN] = '\0'; /* ignoring non-alpha characters */ while (!isalpha(*a) && *a != '\0') a++; /* reaching the end of the string means no letter was found */ if (*a == '\0') elog(ERROR, "string doesn't contain alpha character(s)"); /* get the first letter */ scode[0] = *a++; len = 1; elog(DEBUG2, "The first letter is: %c", scode[0]); while (*a && len < 
PGS_SOUNDEX_LEN) { int curcode = convert_soundex(*a); elog(DEBUG3, "The code for '%c' is: %d", *a, curcode); if (isalpha(*a) && (curcode != lastcode) && curcode != '0') { scode[len] = curcode; elog(DEBUG2, "scode[%d] = %d", len, curcode); len++; } lastcode = curcode; a++; } /* fill with zeros (if necessary) */ while (len < PGS_SOUNDEX_LEN) { scode[len] = '0'; elog(DEBUG2, "scode[%d] = %d", len, scode[len]); len++; } return scode; } PG_FUNCTION_INFO_V1(soundex); Datum soundex(PG_FUNCTION_ARGS) { char *a, *b; char *resa; char *resb; float8 res; a = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(0)))); b = DatumGetPointer(DirectFunctionCall1(textout, PointerGetDatum(PG_GETARG_TEXT_P(1)))); if (strlen(a) > PGS_MAX_STR_LEN || strlen(b) > PGS_MAX_STR_LEN) ereport(ERROR, (errcode(ERRCODE_INVALID_PARAMETER_VALUE), errmsg("argument exceeds the maximum length of %d bytes", PGS_MAX_STR_LEN))); resa = _soundex(a); resb = _soundex(b); elog(DEBUG1, "soundex(%s) = %s", a, resa); elog(DEBUG1, "soundex(%s) = %s", b, resb); /* * we don't have threshold in soundex algorithm, instead same code means strings * are similar (i.e. threshold is 1.0) or it is not (i.e. threshold is 0.0). 
*/ if (strncmp(resa, resb, PGS_SOUNDEX_LEN) == 0) res = 1.0; else res = 0.0; PG_RETURN_FLOAT8(res); } PG_FUNCTION_INFO_V1(soundex_op); Datum soundex_op(PG_FUNCTION_ARGS) { float8 res; res = DatumGetFloat8(DirectFunctionCall2( soundex, PG_GETARG_DATUM(0), PG_GETARG_DATUM(1))); PG_RETURN_BOOL(res == 1.0); } pg-similarity/sql/000077500000000000000000000000001276707706500144705ustar00rootroot00000000000000pg-similarity/sql/test1.sql000066400000000000000000000032061276707706500162520ustar00rootroot00000000000000-- -- pg_similarity -- testing similarity functions and operators -- -- -- Turn off echoing so that expected file does not depend on contents of -- this file -- SET client_min_messages to warning; \set ECHO none \i pg_similarity.sql RESET client_min_messages; \set ECHO all \set a '\'Euler Taveira de Oliveira\'' \set b '\'Euler T Oliveira\'' \set c '\'Oiler Taviera do Oliviera\'' select block(:a, :b), block_op(:a, :b), :a ~++ :b as operator; select cosine(:a, :b), cosine_op(:a, :b), :a ~## :b as operator; select dice(:a, :b), dice_op(:a, :b), :a ~-~ :b as operator; select euclidean(:a, :b), euclidean_op(:a, :b), :a ~!! :b as operator; select hamming_text(:a, :c), hamming_text_op(:a, :c), :a ~@~ :c as operator; select jaccard(:a, :b), jaccard_op(:a, :b), :a ~?? 
:b as operator; select jaro(:a, :b), jaro_op(:a, :b), :a ~%% :b as operator; select jarowinkler(:a, :b), jarowinkler_op(:a, :b), :a ~@@ :b as operator; select lev(:a, :b), lev_op(:a, :b), :a ~== :b as operator; select levslow(:a, :b), levslow_op(:a, :b); select matchingcoefficient(:a, :b), matchingcoefficient_op(:a, :b), :a ~^^ :b as operator; --select mongeelkan(:a, :b), mongeelkan_op(:a, :b), :a ~|| :b as operator; --select needlemanwunsch(:a, :b), needlemanwunsch_op(:a, :b), :a ~#~ :b as operator; select overlapcoefficient(:a, :b), overlapcoefficient_op(:a, :b), :a ~** :b as operator; select qgram(:a, :b), qgram_op(:a, :b), :a ~~~ :b as operator; --select smithwaterman(:a, :b), smithwaterman_op(:a, :b), :a ~=~ :b as operator; --select smithwatermangotoh(:a, :b), smithwatermangotoh_op(:a, :b), :a ~!~ :b as operator; select soundex(:a, :b), soundex_op(:a, :b), :a ~*~ :b as operator; pg-similarity/sql/test2.sql000066400000000000000000000012441276707706500162530ustar00rootroot00000000000000-- -- pg_similarity -- testing similarity variables -- -- -- Clean up in case a prior regression run failed -- RESET client_min_messages; \set ECHO all -- -- errors -- SHOW pg_similarity.foo_tokenizer; SHOW pg_similarity.foo_is_normalized; SET pg_similarity.cosine_threshold to 1.1; SET pg_similarity.qgram_tokenizer to 'alnum'; SHOW pg_similarity.jaro_tokenizer; -- -- valid values -- SET pg_similarity.block_is_normalized to true; SET pg_similarity.cosine_threshold = 0.72; SET pg_similarity.dice_tokenizer to 'alnum'; SET pg_similarity.euclidean_is_normalized to false; SET pg_similarity.jaro_winkler_is_normalized to false; SET pg_similarity.qgram_tokenizer to 'gram'; pg-similarity/sql/test3.sql000066400000000000000000000022221276707706500162510ustar00rootroot00000000000000-- -- pg_similarity -- testing similarity functions and operators -- -- -- Clean up in case a prior regression run failed -- RESET client_min_messages; \set ECHO all \set a '\'Euler Taveira de Oliveira\'' CREATE 
TABLE simtst (a text); INSERT INTO simtst (a) VALUES ('Euler Taveira de Oliveira'), ('EULER TAVEIRA DE OLIVEIRA'), ('Euler T. de Oliveira'), ('Oliveira, Euler T.'), ('Euler Oliveira'), ('Euler Taveira'), ('EULER TAVEIRA OLIVEIRA'), ('Oliveira, Euler'), ('Oliveira, E. T.'), ('ETO'); -- Levenshtein SHOW pg_similarity.levenshtein_threshold; SELECT a FROM simtst WHERE a ~== :a; SET pg_similarity.levenshtein_threshold to 0.4; SHOW pg_similarity.levenshtein_threshold; SELECT a FROM simtst WHERE a ~== :a; -- Cosine SHOW pg_similarity.cosine_threshold; SELECT a FROM simtst WHERE a ~## :a; SET pg_similarity.cosine_threshold to 0.9; SHOW pg_similarity.cosine_threshold; SELECT a FROM simtst WHERE a ~## :a; -- Overlap Coefficient SHOW pg_similarity.overlap_tokenizer; SELECT a FROM simtst WHERE a ~** :a; SET pg_similarity.overlap_tokenizer to 'gram'; SET pg_similarity.overlap_threshold to 0.8; SELECT a FROM simtst WHERE a ~** :a; DROP TABLE simtst; pg-similarity/sql/test4.sql000066400000000000000000000016431276707706500162600ustar00rootroot00000000000000-- -- pg_similarity -- testing similarity variables -- -- -- Clean up in case a prior regression run failed -- RESET client_min_messages; \set ECHO all \set a '\'Euler Taveira de Oliveira\'' CREATE TABLE simtst (a text); INSERT INTO simtst (a) VALUES ('Euler Taveira de Oliveira'), ('EULER TAVEIRA DE OLIVEIRA'), ('Euler T. de Oliveira'), ('Oliveira, Euler T.'), ('Euler Oliveira'), ('Euler Taveira'), ('EULER TAVEIRA OLIVEIRA'), ('Oliveira, Euler'), ('Oliveira, E. 
T.'), ('ETO'); \copy simtst FROM 'data/similarity.data' SELECT a, block(a, :a) FROM simtst WHERE a ~++ :a; SELECT a, cosine(a, :a) FROM simtst WHERE a ~## :a; CREATE INDEX simtsti ON simtst USING gin (a gin_similarity_ops); SELECT a, block(a, :a) FROM simtst WHERE a ~++ :a; SET enable_bitmapscan TO OFF; SELECT a, block(a, :a) FROM simtst WHERE a ~++ :a; SET enable_bitmapscan TO ON; SELECT a, cosine(a, :a) FROM simtst WHERE a ~## :a; DROP TABLE simtst; pg-similarity/tokenizer.c000066400000000000000000000234331276707706500160540ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * tokenizer.c * * Tokenization support functions * * Tokens are stored in a linked list to make manipulation easy. * We have support for four types of tokenization: * (i) space: whitespace is treated as delimiter; * (ii) non-alphanumeric: any non-alphanumeric character is treated as * delimiter; * (iii) n-gram: token is an n-character "window"; * (iv) camel case: an uppercase letter starts a new token. * * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "tokenizer.h" TokenList *initTokenList(int a) { TokenList *t; t = (TokenList *) malloc(sizeof(TokenList)); t->isset = a; t->size = 0; t->head = NULL; t->tail = NULL; elog(DEBUG4, "t->isset: %d", t->isset); return t; } void destroyTokenList(TokenList *t) { char *n; int i; int len; while (t->size > 0) { len = strlen(t->head->data); n = (char *) malloc(sizeof(char) * len + 1); strcpy(n, t->head->data); i = removeToken(t); if (i == 0) elog(DEBUG3, "token \"%s\" removed; actual token list size: %d", n, t->size); else elog(DEBUG3, "failed to remove token: \"%s\"", n); free(n); } free(t); } int addToken(TokenList *t, char *s) { Token *n; if (t->isset) { Token *x = searchToken(t, s); if (x != NULL) { x->freq++; elog(DEBUG3, "token \"%s\" is already in the list; frequency: %d", s, x->freq); return -1; } } n = (Token *) malloc(sizeof(Token)); if (n == NULL) return -1; /* * memory is allocated by 
tokenizeByXXX() */ n->data = s; n->freq = 1; /* first token */ if (t->size == 0) t->tail = n; n->next = t->head; t->head = n; t->size++; return 0; } /* * free up the head node and its content */ int removeToken(TokenList *t) { Token *n; if (t->size == 0) { elog(DEBUG3, "list is empty"); return -1; } n = t->head; t->head = n->next; if (t->size == 1) t->tail = NULL; free(n->data); free(n); t->size--; return 0; } Token *searchToken(TokenList *t, char *s) { Token *n; n = t->head; while (n != NULL) { #ifdef PGS_IGNORE_CASE /* * For portability reason, use pg_strcasecmp instead of strcasecmp * (Windows doesn't provide this function). */ if (pg_strcasecmp(n->data, s) == 0) #else if (strcmp(n->data, s) == 0) #endif { elog(DEBUG4, "\"%s\" found", n->data); return n; } n = n->next; } return NULL; } void printToken(TokenList *t) { Token *n; elog(DEBUG3, "==================================================="); if (t->size == 0) elog(DEBUG3, "word list is empty"); n = t->head; while (n != NULL) { elog(DEBUG3, "addr: %p; next: %p; word: %s; freq: %d", n, n->next, n->data, n->freq); n = n->next; } if (t->head != NULL) elog(DEBUG3, "head: %s", t->head->data); if (t->tail != NULL) elog(DEBUG3, "tail: %s", t->tail->data); elog(DEBUG3, "==================================================="); } /* * XXX non alnum characters are ignored in this function * XXX because they are treated as delimiter characters */ void tokenizeByNonAlnum(TokenList *t, char *s) { const char *cptr, /* current pointer */ *sptr; /* start token pointer */ int c = 0; /* number of bytes */ elog(DEBUG3, "sentence: \"%s\"", s); if (t->size == 0) elog(DEBUG3, "token list is empty"); else elog(DEBUG3, "token list contains %d tokens", t->size); if (t->head == NULL) elog(DEBUG3, "there is no head token yet"); else elog(DEBUG3, "head token is \"%s\"", t->head->data); if (t->tail == NULL) elog(DEBUG3, "there is no tail token yet"); else elog(DEBUG3, "tail token is \"%s\"", t->tail->data); cptr = sptr = s; while (*cptr) { 
while (!isalnum(*cptr) && *cptr != '\0') { elog(DEBUG4, "\"%c\" is non alnum", *cptr); cptr++; } if (*cptr == '\0') elog(DEBUG4, "end of sentence"); #ifdef PGS_IGNORE_CASE *sptr = tolower(*sptr); #endif sptr = cptr; elog(DEBUG4, "token's first char: \"%c\"", *sptr); while (isalnum(*cptr) && *cptr != '\0') { c++; #ifdef PGS_IGNORE_CASE *cptr = tolower(*cptr); #endif elog(DEBUG4, "char: \"%c\"; actual token size: %d", *cptr, c); cptr++; } if (*cptr == '\0') elog(DEBUG4, "end of sentence (2)"); if (c > 0) { char *tok = malloc(sizeof(char) * c + 1); strncpy(tok, sptr, c); tok[c] = '\0'; elog(DEBUG3, "token: \"%s\"; size: %lu", tok, sizeof(char) * c); addToken(t, tok); elog(DEBUG4, "actual token list size: %d", t->size); elog(DEBUG4, "tok: \"%s\"; size: %u", tok, (unsigned int) strlen(tok)); Assert(strlen(tok) <= PGS_MAX_TOKEN_LEN); /* * XXX don't do that! * free(tok); */ c = 0; } } } void tokenizeBySpace(TokenList *t, char *s) { const char *cptr, /* current pointer */ *sptr; /* start token pointer */ int c = 0; /* number of bytes */ elog(DEBUG3, "sentence: \"%s\"", s); if (t->size == 0) elog(DEBUG3, "token list is empty"); else elog(DEBUG3, "token list contains %d tokens", t->size); if (t->head == NULL) elog(DEBUG3, "there is no head token yet"); else elog(DEBUG3, "head token is \"%s\"", t->head->data); if (t->tail == NULL) elog(DEBUG3, "there is no tail token yet"); else elog(DEBUG3, "tail token is \"%s\"", t->tail->data); cptr = sptr = s; while (*cptr) { while (isspace(*cptr) && *cptr != '\0') { elog(DEBUG4, "\"%c\" is a space", *cptr); cptr++; } if (*cptr == '\0') elog(DEBUG4, "end of sentence"); #ifdef PGS_IGNORE_CASE *sptr = tolower(*sptr); #endif sptr = cptr; elog(DEBUG4, "token's first char: \"%c\"", *sptr); while (!isspace(*cptr) && *cptr != '\0') { c++; #ifdef PGS_IGNORE_CASE *cptr = tolower(*cptr); #endif elog(DEBUG4, "char: \"%c\"; actual token size: %d", *cptr, c); cptr++; } if (*cptr == '\0') elog(DEBUG4, "end of sentence (2)"); if (c > 0) { char *tok = 
malloc(sizeof(char) * c + 1); strncpy(tok, sptr, c); tok[c] = '\0'; elog(DEBUG3, "token: \"%s\"; size: %lu", tok, sizeof(char) * c); addToken(t, tok); elog(DEBUG4, "actual token list size: %d", t->size); elog(DEBUG4, "tok: \"%s\"; size: %u", tok, (unsigned int) strlen(tok)); Assert(strlen(tok) <= PGS_MAX_TOKEN_LEN); /* * XXX don't do that! * free(tok); */ c = 0; } } } /* * our n-grams are letter level and we have: * (i) full n-gram: euler = {" e", eu, ul, le, er, "r "} * (ii) normal n-gram: euler = {eu, ul, le, er} */ void tokenizeByGram(TokenList *t, char *s) { char *p; int slen; int i; slen = strlen(s); p = s; /* * n-grams with starting character */ #ifdef PGS_FULL_NGRAM for (i = (PGS_GRAM_LEN - 1); i > 0; i--) { char *buf; buf = (char *) malloc((PGS_GRAM_LEN + 1) * sizeof(char)); memset(buf, PGS_BLANK_CHAR, i); strncpy((buf + i), s, PGS_GRAM_LEN - i); buf[PGS_GRAM_LEN] = '\0'; addToken(t, buf); elog(DEBUG1, "qgram (b): \"%s\"", buf); } #else { char *buf; buf = (char *) malloc((PGS_GRAM_LEN + 1) * sizeof(char)); memset(buf, PGS_BLANK_CHAR, 1); strncpy((buf + 1), s, PGS_GRAM_LEN - 1); buf[PGS_GRAM_LEN] = '\0'; addToken(t, buf); elog(DEBUG1, "qgram (b): \"%s\"", buf); } #endif for (i = 0; i <= (slen - PGS_GRAM_LEN); i++) { char *buf; buf = (char *) malloc((PGS_GRAM_LEN + 1) * sizeof(char)); strncpy(buf, p, PGS_GRAM_LEN); buf[PGS_GRAM_LEN] = '\0'; addToken(t, buf); p++; elog(DEBUG1, "qgram (m): \"%s\"", buf); } /* * n-grams with ending character */ #ifdef PGS_FULL_NGRAM for (i = 1; i < PGS_GRAM_LEN; i++) { char *buf; buf = (char *) malloc((PGS_GRAM_LEN + 1) * sizeof(char)); strncpy(buf, p, PGS_GRAM_LEN - i); memset((buf + (PGS_GRAM_LEN - i)), PGS_BLANK_CHAR, i); buf[PGS_GRAM_LEN] = '\0'; addToken(t, buf); p++; elog(DEBUG1, "qgram (a): \"%s\"", buf); } #else { char *buf; buf = (char *) malloc((PGS_GRAM_LEN + 1) * sizeof(char)); strncpy(buf, p, PGS_GRAM_LEN - 1); memset((buf + (PGS_GRAM_LEN - 1)), PGS_BLANK_CHAR, 1); buf[PGS_GRAM_LEN] = '\0'; addToken(t, buf); 
elog(DEBUG1, "qgram (a): \"%s\"", buf); } #endif } void tokenizeByCamelCase(TokenList *t, char *s) { const char *cptr, /* current pointer */ *sptr; /* start token pointer */ int c = 0; /* number of bytes */ elog(DEBUG3, "sentence: \"%s\"", s); if (t->size == 0) elog(DEBUG3, "token list is empty"); else elog(DEBUG3, "token list contains %d tokens", t->size); if (t->head == NULL) elog(DEBUG3, "there is no head token yet"); else elog(DEBUG3, "head token is \"%s\"", t->head->data); if (t->tail == NULL) elog(DEBUG3, "there is no tail token yet"); else elog(DEBUG3, "tail token is \"%s\"", t->tail->data); cptr = sptr = s; while (*cptr) { while (isspace(*cptr) && *cptr != '\0') { elog(DEBUG4, "\"%c\" is a space", *cptr); cptr++; } if (*cptr == '\0') elog(DEBUG4, "end of sentence"); #ifdef PGS_IGNORE_CASE *sptr = tolower(*sptr); #endif sptr = cptr; elog(DEBUG4, "token's first char: \"%c\"", *sptr); if (isupper(*cptr)) elog(DEBUG4, "\"%c\" is uppercase", *cptr); else elog(DEBUG4, "\"%c\" is not uppercase", *cptr); /* * always consume the first character of a token, even if it is * uppercase, because sometimes the first char in a camel-case * notation is uppercase */ while (c == 0 || (!isupper(*cptr) && *cptr != '\0')) { c++; #ifdef PGS_IGNORE_CASE *cptr = tolower(*cptr); #endif elog(DEBUG4, "char: \"%c\"; actual token size: %d", *cptr, c); cptr++; } if (*cptr == '\0') elog(DEBUG4, "end of sentence (2)"); if (c > 0) { char *tok = malloc(sizeof(char) * c + 1); strncpy(tok, sptr, c); tok[c] = '\0'; elog(DEBUG3, "token: \"%s\"; size: %lu", tok, sizeof(char) * c); addToken(t, tok); elog(DEBUG4, "actual token list size: %d", t->size); elog(DEBUG4, "tok: \"%s\"; size: %u", tok, (unsigned int) strlen(tok)); Assert(strlen(tok) <= PGS_MAX_TOKEN_LEN); /* * XXX don't do that! 
* free(tok); */ c = 0; } } } pg-similarity/tokenizer.h000066400000000000000000000021451276707706500160560ustar00rootroot00000000000000/*---------------------------------------------------------------------------- * * tokenizer.h * * Copyright (c) 2008-2012, Euler Taveira de Oliveira * *---------------------------------------------------------------------------- */ #include "postgres.h" #include <ctype.h> #include <stdlib.h> #include <string.h> #define PGS_MAX_TOKEN_LEN 1024 #define PGS_GRAM_LEN 3 #define PGS_BLANK_CHAR ' ' #define PGS_FULL_NGRAM typedef struct Token { char *data; /* token data */ int freq; /* frequency */ struct Token *next; /* next token */ } Token; typedef struct TokenList { int isset; /* is a set? */ int size; /* list size */ Token *head; /* first token */ Token *tail; /* last token */ } TokenList; TokenList *initTokenList(int isset); void destroyTokenList(TokenList *t); int addToken(TokenList *t, char *s); int removeToken(TokenList *t); Token *searchToken(TokenList *t, char *s); void printToken(TokenList *t); void tokenizeByNonAlnum(TokenList *t, char *s); void tokenizeBySpace(TokenList *t, char *s); void tokenizeByGram(TokenList *t, char *s); void tokenizeByCamelCase(TokenList *t, char *s); pg-similarity/uninstall_pg_similarity.sql000066400000000000000000000042731276707706500213650ustar00rootroot00000000000000/* $PostgreSQL $ */ -- Adjust this setting to control where the objects get dropped. SET search_path = public; DROP OPERATOR ~++ (text, text); DROP FUNCTION block (text, text); DROP FUNCTION block_op (text, text); DROP OPERATOR ~## (text, text); DROP FUNCTION cosine (text, text); DROP FUNCTION cosine_op (text, text); DROP OPERATOR ~-~ (text, text); DROP FUNCTION dice (text, text); DROP FUNCTION dice_op (text, text); DROP OPERATOR ~!! 
(text, text); DROP FUNCTION euclidean (text, text); DROP FUNCTION euclidean_op (text, text); DROP OPERATOR ~@~ (text, text); DROP FUNCTION hamming_text (text, text); DROP FUNCTION hamming_text_op (text, text); DROP FUNCTION hamming (varbit, varbit); DROP FUNCTION hamming_op (varbit, varbit); DROP OPERATOR ~?? (text, text); DROP FUNCTION jaccard (text, text); DROP FUNCTION jaccard_op (text, text); DROP OPERATOR ~%% (text, text); DROP FUNCTION jaro (text, text); DROP FUNCTION jaro_op (text, text); DROP OPERATOR ~@@ (text, text); DROP FUNCTION jarowinkler (text, text); DROP FUNCTION jarowinkler_op (text, text); DROP OPERATOR ~== (text, text); DROP FUNCTION lev (text, text); DROP FUNCTION lev_op (text, text); --DROP OPERATOR ~@@ (text, text); DROP FUNCTION levslow (text, text); DROP FUNCTION levslow_op (text, text); DROP OPERATOR ~^^ (text, text); DROP FUNCTION matchingcoefficient (text, text); DROP FUNCTION matchingcoefficient_op (text, text); DROP OPERATOR ~|| (text, text); DROP FUNCTION mongeelkan (text, text); DROP FUNCTION mongeelkan_op (text, text); DROP OPERATOR ~#~ (text, text); DROP FUNCTION needlemanwunsch (text, text); DROP FUNCTION needlemanwunsch_op (text, text); DROP OPERATOR ~** (text, text); DROP FUNCTION overlapcoefficient (text, text); DROP FUNCTION overlapcoefficient_op (text, text); DROP OPERATOR ~~~ (text, text); DROP FUNCTION qgram (text, text); DROP FUNCTION qgram_op (text, text); DROP OPERATOR ~=~ (text, text); DROP FUNCTION smithwaterman (text, text); DROP FUNCTION smithwaterman_op (text, text); DROP OPERATOR ~!~ (text, text); DROP FUNCTION smithwatermangotoh (text, text); DROP FUNCTION smithwatermangotoh_op (text, text); DROP OPERATOR ~*~ (text, text); DROP FUNCTION soundex (text, text); DROP FUNCTION soundex_op (text, text);