pg_similarity-pg_similarity_1_0/.gitignore

    *.so
    *.so.0
    *.so.0.0
    *.o
    pg_similarity.sql
    *.diffs
    *.out
    results/

pg_similarity-pg_similarity_1_0/COPYRIGHT

PostgreSQL Similarity Functions

Copyright (c) 2008-2018 Euler Taveira de Oliveira
All rights reserved.

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

* Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.
* Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.
* Neither the name of the Euler Taveira de Oliveira nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

pg_similarity-pg_similarity_1_0/Makefile

    # pg_similarity extension

    EXTENSION = pg_similarity
    MODULE_big = pg_similarity
    OBJS = tokenizer.o similarity.o similarity_gin.o \
        block.o cosine.o dice.o euclidean.o hamming.o jaccard.o \
        jaro.o levenshtein.o matching.o mongeelkan.o needlemanwunsch.o \
        overlap.o qgram.o smithwaterman.o smithwatermangotoh.o soundex.o
    DATA_built = pg_similarity.sql
    DATA = pg_similarity--1.0.sql pg_similarity--unpackaged--1.0.sql
    REGRESS = test1 test2 test3 test4
    #DOCS = README.md

    ifdef USE_PGXS
    PG_CONFIG = pg_config
    PGXS := $(shell $(PG_CONFIG) --pgxs)
    include $(PGXS)
    else
    subdir = contrib/pg_similarity
    top_builddir = ../..
    include $(top_builddir)/src/Makefile.global
    include $(top_srcdir)/contrib/contrib-global.mk
    endif

pg_similarity-pg_similarity_1_0/README.md

[](https://scan.coverity.com/projects/pg_similarity)

Introduction
============

**pg\_similarity** is an extension to support similarity queries on [PostgreSQL](http://www.postgresql.org/). The implementation is tightly integrated in the RDBMS in the sense that it defines operators so instead of the traditional operators (= and <>) you can use ~~~ and ~!~ (any of these operators represents a similarity function).

**pg\_similarity** has three main components:

- **Functions**: a set of functions that implement similarity algorithms available in the literature. These functions can be used as UDFs and are the basis for implementing the similarity operators;
- **Operators**: a set of operators defined on top of the similarity functions. They use the similarity functions to obtain a similarity value and compare it to a user-defined threshold to decide whether it is a match;
- **Session Variables**: a set of variables that store similarity function parameters. These variables can be set at run time.

Installation
============

**pg\_similarity** is supported on [those platforms](http://www.postgresql.org/docs/current/static/supported-platforms.html) that PostgreSQL is. The installation steps depend on your operating system.

You can also keep up with the latest fixes and features by cloning the Git repository.

```
$ git clone https://github.com/eulerto/pg_similarity.git
```

UNIX based Operating Systems
----------------------------

Before you are able to use the extension, you should build it and load it into the desired database.

The new way (9.1 or later):

```
$ tar -zxf pg_similarity-0.0.19.tgz
$ cd pg_similarity-0.0.19
$ $EDITOR Makefile # edit PG_CONFIG if necessary
$ USE_PGXS=1 make
$ USE_PGXS=1 make install
$ psql mydb
psql (9.3.5)
Type "help" for help.

mydb=# CREATE EXTENSION pg_similarity;
CREATE EXTENSION
```

And the old way:

```
$ tar -zxf pg_similarity-0.0.19.tgz
$ cd pg_similarity-0.0.19
$ $EDITOR Makefile # edit PG_CONFIG if necessary
$ USE_PGXS=1 make
$ USE_PGXS=1 make install
$ psql -f SHAREDIR/contrib/pg_similarity.sql mydb # SHAREDIR is pg_config --sharedir
```

The typical usage is to copy the sample file in the tarball (*pg_similarity.conf.sample*) to PGDATA (as *pg_similarity.conf*) and include the following line in *postgresql.conf*:

```
include 'pg_similarity.conf'
```

Windows
-------

Sorry, never tried^H^H^H^H^H Actually I tried that but it is not as easy as on UNIX. :( There are two ways to build PostgreSQL on Windows: (i) MinGW and (ii) MSVC. The former is supported but it is not widely used and the latter is popular because Windows binaries (officially distributed) are built using MSVC. If you choose to use MinGW, just follow the UNIX instructions above to build pg_similarity. Otherwise, the MSVC steps are below:

- Download and untar the *same* PostgreSQL version you are using;
- Download and untar pg_similarity under PostgreSQL contrib directory;
- Edit contrib/Makefile and add *pg_similarity* to SUBDIRS variable;
- Follow [Installation from Source Code on Windows](http://www.postgresql.org/docs/current/static/install-windows.html) for building but do not install it;
- Instead of executing install (if you already have Windows binaries installed), just copy pg_similarity.dll to LIBDIR (get it executing `pg_config --libdir`) and pg_similarity.control and *--1.0.sql to SHAREDIR/extension (get it executing `pg_config --sharedir`).
- That is it! Do not forget to follow the instructions above to load the library and CREATE EXTENSION.

Functions and Operators
=======================

This extension supports a set of similarity algorithms. The best-known algorithms are covered by this extension. You must be aware that each algorithm is suited for a specific domain. The following algorithms are provided.

- L1 Distance (also known as City Block or Manhattan Distance);
- Cosine Distance;
- Dice Coefficient;
- Euclidean Distance;
- Hamming Distance;
- Jaccard Coefficient;
- Jaro Distance;
- Jaro-Winkler Distance;
- Levenshtein Distance;
- Matching Coefficient;
- Monge-Elkan Coefficient;
- Needleman-Wunsch Coefficient;
- Overlap Coefficient;
- Q-Gram Distance;
- Smith-Waterman Coefficient;
- Smith-Waterman-Gotoh Coefficient;
- Soundex Distance.

Algorithm | Function | Operator | Use Index? | Parameters |
---|---|---|---|---|
L1 Distance | block(text, text) returns float8 | ~++ | yes | pg_similarity.block_tokenizer (enum) pg_similarity.block_threshold (float8) pg_similarity.block_is_normalized (bool) |
Cosine Distance | cosine(text, text) returns float8 | ~## | yes | pg_similarity.cosine_tokenizer (enum) pg_similarity.cosine_threshold (float8) pg_similarity.cosine_is_normalized (bool) |
Dice Coefficient | dice(text, text) returns float8 | ~-~ | yes | pg_similarity.dice_tokenizer (enum) pg_similarity.dice_threshold (float8) pg_similarity.dice_is_normalized (bool) |
Euclidean Distance | euclidean(text, text) returns float8 | ~!! | yes | pg_similarity.euclidean_tokenizer (enum) pg_similarity.euclidean_threshold (float8) pg_similarity.euclidean_is_normalized (bool) |
Hamming Distance | hamming(bit varying, bit varying) returns float8; hamming_text(text, text) returns float8 | ~@~ | no | pg_similarity.hamming_threshold (float8) pg_similarity.hamming_is_normalized (bool) |
Jaccard Coefficient | jaccard(text, text) returns float8 | ~?? | yes | pg_similarity.jaccard_tokenizer (enum) pg_similarity.jaccard_threshold (float8) pg_similarity.jaccard_is_normalized (bool) |
Jaro Distance | jaro(text, text) returns float8 | ~%% | no | pg_similarity.jaro_threshold (float8) pg_similarity.jaro_is_normalized (bool) |
Jaro-Winkler Distance | jarowinkler(text, text) returns float8 | ~@@ | no | pg_similarity.jarowinkler_threshold (float8) pg_similarity.jarowinkler_is_normalized (bool) |
Levenshtein Distance | lev(text, text) returns float8 | ~== | no | pg_similarity.levenshtein_threshold (float8) pg_similarity.levenshtein_is_normalized (bool) |
Matching Coefficient | matchingcoefficient(text, text) returns float8 | ~^^ | yes | pg_similarity.matching_tokenizer (enum) pg_similarity.matching_threshold (float8) pg_similarity.matching_is_normalized (bool) |
Monge-Elkan Coefficient | mongeelkan(text, text) returns float8 | ~\|\| | no | pg_similarity.mongeelkan_tokenizer (enum) pg_similarity.mongeelkan_threshold (float8) pg_similarity.mongeelkan_is_normalized (bool) |
Needleman-Wunsch Coefficient | needlemanwunsch(text, text) returns float8 | ~#~ | no | pg_similarity.nw_threshold (float8) pg_similarity.nw_is_normalized (bool) |
Overlap Coefficient | overlapcoefficient(text, text) returns float8 | ~** | yes | pg_similarity.overlap_tokenizer (enum) pg_similarity.overlap_threshold (float8) pg_similarity.overlap_is_normalized (bool) |
Q-Gram Distance | qgram(text, text) returns float8 | ~~~ | yes | pg_similarity.qgram_threshold (float8) pg_similarity.qgram_is_normalized (bool) |
Smith-Waterman Coefficient | smithwaterman(text, text) returns float8 | ~=~ | no | pg_similarity.sw_threshold (float8) pg_similarity.sw_is_normalized (bool) |
Smith-Waterman-Gotoh Coefficient | smithwatermangotoh(text, text) returns float8 | ~!~ | no | pg_similarity.swg_threshold (float8) pg_similarity.swg_is_normalized (bool) |
Soundex Distance | soundex(text, text) returns float8 | ~*~ | no | |

To use it, simply load it into the server. You can load it into an individual session:
$ psql mydb
psql (9.0.3)
Type "help" for help.
mydb=# load 'pg_similarity';
LOAD
But the typical usage is to preload it into all sessions by including pg_similarity in shared_preload_libraries in postgresql.conf. Keep in mind that this adds overhead to each new connection.
On Windows: sorry, never tried that!
Algorithm | Function | Operator | Parameters |
---|---|---|---|
L1 Distance | block(text, text) returns float4 | text ~++ text | pg_similarity.block_tokenizer (enum) pg_similarity.block_threshold (float4) pg_similarity.block_is_normalized (bool) |
Cosine Distance | cosine(text, text) returns float4 | text ~## text | pg_similarity.cosine_tokenizer (enum) pg_similarity.cosine_threshold (float4) pg_similarity.cosine_is_normalized (bool) |
Dice Coefficient | dice(text, text) returns float4 | text ~-~ text | pg_similarity.dice_tokenizer (enum) pg_similarity.dice_threshold (float4) pg_similarity.dice_is_normalized (bool) |
Euclidean Distance | euclidean(text, text) returns float4 | text ~!! text | pg_similarity.euclidean_tokenizer (enum) pg_similarity.euclidean_threshold (float4) pg_similarity.euclidean_is_normalized (bool) |
Hamming Distance | hamming(bit varying, bit varying) returns float4 | | pg_similarity.hamming_threshold (float4) pg_similarity.hamming_is_normalized (bool) |
Jaccard Coefficient | jaccard(text, text) returns float4 | text ~?? text | pg_similarity.jaccard_tokenizer (enum) pg_similarity.jaccard_threshold (float4) pg_similarity.jaccard_is_normalized (bool) |
Jaro Distance | jaro(text, text) returns float4 | text ~%% text | pg_similarity.jaro_threshold (float4) pg_similarity.jaro_is_normalized (bool) |
Jaro-Winkler Distance | jarowinkler(text, text) returns float4 | text ~@@ text | pg_similarity.jarowinkler_threshold (float4) pg_similarity.jarowinkler_is_normalized (bool) |
Levenshtein Distance | lev(text, text) returns float4 | text ~== text | pg_similarity.levenshtein_threshold (float4) pg_similarity.levenshtein_is_normalized (bool) |
Matching Coefficient | matchingcoefficient(text, text) returns float4 | text ~^^ text | pg_similarity.matching_tokenizer (enum) pg_similarity.matching_threshold (float4) pg_similarity.matching_is_normalized (bool) |
Monge-Elkan Coefficient | mongeelkan(text, text) returns float4 | text ~\|\| text | pg_similarity.mongeelkan_tokenizer (enum) pg_similarity.mongeelkan_threshold (float4) pg_similarity.mongeelkan_is_normalized (bool) |
Needleman-Wunsch Coefficient | needlemanwunsch(text, text) returns float4 | text ~#~ text | pg_similarity.needlemanwunsch_threshold (float4) pg_similarity.needlemanwunsch_is_normalized (bool) |
Overlap Coefficient | overlapcoefficient(text, text) returns float4 | text ~** text | pg_similarity.overlap_tokenizer (enum) pg_similarity.overlap_threshold (float4) pg_similarity.overlap_is_normalized (bool) |
Q-Gram Distance | qgram(text, text) returns float4 | text ~~~ text | pg_similarity.qgram_threshold (float4) pg_similarity.qgram_is_normalized (bool) |
Smith-Waterman Coefficient | smithwaterman(text, text) returns float4 | text ~=~ text | pg_similarity.smithwaterman_threshold (float4) pg_similarity.smithwaterman_is_normalized (bool) |
Smith-Waterman-Gotoh Coefficient | smithwatermangotoh(text, text) returns float4 | text ~!~ text | pg_similarity.smithwatermangotoh_threshold (float4) pg_similarity.smithwatermangotoh_is_normalized (bool) |
Several parameters control the behavior of the pg_similarity functions and operators. I don't explain each parameter in detail because they fall into three classes: tokenizer, threshold, and normalized.
Set parameters at run time.
    mydb=# show pg_similarity.levenshtein_threshold;
     pg_similarity.levenshtein_threshold
    -------------------------------------
     0.7
    (1 row)

    mydb=# set pg_similarity.levenshtein_threshold to 0.5;
    SET
    mydb=# show pg_similarity.levenshtein_threshold;
     pg_similarity.levenshtein_threshold
    -------------------------------------
     0.5
    (1 row)

    mydb=# set pg_similarity.cosine_tokenizer to camelcase;
    SET
    mydb=# set pg_similarity.euclidean_is_normalized to false;
    SET

Simple tables for examples.
    mydb=# create table foo (a text);
    CREATE TABLE
    mydb=# insert into foo values('Euler'),('Oiler'),('Euler Taveira de Oliveira'),('Maria Taveira dos Santos'),('Carlos Santos Silva');
    INSERT 0 5
    mydb=# create table bar (b text);
    CREATE TABLE
    mydb=# insert into bar values('Euler T. de Oliveira'),('Euller'),('Oliveira, Euler Taveira'),('Sr. Oliveira');
    INSERT 0 4

Example 1: Using similarity functions cosine, jaro, and euclidean.
    mydb=# select a, b, cosine(a,b), jaro(a, b), euclidean(a, b) from foo, bar;
                 a             |            b            |  cosine  |   jaro   | euclidean
    ---------------------------+-------------------------+----------+----------+-----------
     Euler                     | Euler T. de Oliveira    |      0.5 |     0.75 |  0.579916
     Euler                     | Euller                  |        0 | 0.944444 |         0
     Euler                     | Oliveira, Euler Taveira |  0.57735 | 0.605797 |  0.552786
     Euler                     | Sr. Oliveira            |        0 | 0.505556 |  0.225403
     Oiler                     | Euler T. de Oliveira    |        0 | 0.472222 |  0.457674
     Oiler                     | Euller                  |        0 |      0.7 |         0
     Oiler                     | Oliveira, Euler Taveira |        0 | 0.672464 |  0.367544
     Oiler                     | Sr. Oliveira            |        0 | 0.672222 |  0.225403
     Euler Taveira de Oliveira | Euler T. de Oliveira    |     0.75 |  0.79807 |      0.75
     Euler Taveira de Oliveira | Euller                  |        0 | 0.677778 |  0.457674
     Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.866025 | 0.773188 |       0.8
     Euler Taveira de Oliveira | Sr. Oliveira            | 0.353553 | 0.592222 |  0.552786
     Maria Taveira dos Santos  | Euler T. de Oliveira    |        0 |  0.60235 |       0.5
     Maria Taveira dos Santos  | Euller                  |        0 | 0.305556 |  0.457674
     Maria Taveira dos Santos  | Oliveira, Euler Taveira | 0.288675 | 0.535024 |  0.552786
     Maria Taveira dos Santos  | Sr. Oliveira            |        0 | 0.634259 |  0.452277
     Carlos Santos Silva       | Euler T. de Oliveira    |        0 | 0.542105 |   0.47085
     Carlos Santos Silva       | Euller                  |        0 | 0.312865 |  0.367544
     Carlos Santos Silva       | Oliveira, Euler Taveira |        0 | 0.606662 |   0.42265
     Carlos Santos Silva       | Sr. Oliveira            |        0 | 0.507728 |  0.379826
    (20 rows)

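The cosine column in Example 1 is consistent with a set-based cosine over lowercased word tokens. The Python sketch below is not the extension's C code: the tokenizer (split on non-alphanumeric characters, ignore case) and the helper names `tokens` and `cosine` are assumptions, chosen because they reproduce the cosine values shown for this sample data.

```python
import math
import re


def tokens(text):
    """Assumed tokenizer: lowercase, split on runs of non-word characters."""
    return {t for t in re.split(r"\W+", text.lower()) if t}


def cosine(a, b):
    """Set-based cosine: |A & B| / sqrt(|A| * |B|)."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / math.sqrt(len(ta) * len(tb))


pairs = [
    ("Euler", "Euler T. de Oliveira"),                          # 0.5
    ("Euler", "Oliveira, Euler Taveira"),                       # 0.57735
    ("Euler Taveira de Oliveira", "Oliveira, Euler Taveira"),   # 0.866025
    ("Euler Taveira de Oliveira", "Sr. Oliveira"),              # 0.353553
    ("Euler", "Euller"),                                        # 0.0
]
for a, b in pairs:
    print(f"{a!r} vs {b!r}: {round(cosine(a, b), 6)}")
```

Because whole words are compared, 'Euler' and 'Euller' share no token and score 0, which is why a token-based measure is a poor fit for catching single-character typos.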
Example 2: Using operator levenshtein (~==) and changing its threshold at run time.
    mydb=# show pg_similarity.levenshtein_threshold;
     pg_similarity.levenshtein_threshold
    -------------------------------------
     0.7
    (1 row)

    mydb=# select a, b, lev(a,b) from foo, bar where a ~== b;
                 a             |          b           |   lev
    ---------------------------+----------------------+----------
     Euler                     | Euller               | 0.833333
     Euler Taveira de Oliveira | Euler T. de Oliveira |     0.76
    (2 rows)

    mydb=# set pg_similarity.levenshtein_threshold to 0.5;
    SET
    mydb=# select a, b, lev(a,b) from foo, bar where a ~== b;
                 a             |          b           |   lev
    ---------------------------+----------------------+----------
     Euler                     | Euller               | 0.833333
     Oiler                     | Euller               |      0.5
     Euler Taveira de Oliveira | Euler T. de Oliveira |     0.76
    (3 rows)

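The lev values in Example 2 agree with a normalized similarity of 1 - distance / max(length), and the `~==` operator appears to keep pairs whose similarity is at least the threshold (the 0.5 row appears once the threshold is lowered to 0.5). The following Python sketch works under those assumptions; the case folding and the helper names `levenshtein`, `lev_similarity`, and `matches` are illustrative and are not taken from the extension's source.

```python
def levenshtein(a, b):
    """Classic edit distance (insert, delete, substitute each cost 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def lev_similarity(a, b):
    """Assumed normalization: 1 - distance / length of the longer string."""
    a, b = a.lower(), b.lower()  # assumption: comparison ignores case
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def matches(a, b, threshold):
    """Assumed operator semantics: similarity >= threshold."""
    return lev_similarity(a, b) >= threshold


print(round(lev_similarity("Euler", "Euller"), 6))   # 0.833333
print(round(lev_similarity("Euler Taveira de Oliveira",
                           "Euler T. de Oliveira"), 6))  # 0.76
print(round(lev_similarity("Oiler", "Euller"), 6))   # 0.5
print(matches("Oiler", "Euller", 0.7), matches("Oiler", "Euller", 0.5))
```

With the threshold at 0.7 the 'Oiler' / 'Euller' pair is rejected, and at 0.5 it is accepted, mirroring the extra row in the second query.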
Example 3: Using operator qgram (~~~) and changing its threshold at run time.
    mydb=# set pg_similarity.qgram_threshold to 0.7;
    SET
    mydb=# show pg_similarity.qgram_threshold;
     pg_similarity.qgram_threshold
    -------------------------------
     0.7
    (1 row)

    mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b;
                 a             |            b            |  qgram
    ---------------------------+-------------------------+----------
     Euler                     | Euller                  |      0.8
     Euler Taveira de Oliveira | Euler T. de Oliveira    |  0.77551
     Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692
    (3 rows)

    mydb=# set pg_similarity.qgram_threshold to 0.35;
    SET
    mydb=# select a, b,qgram(a, b) from foo, bar where a ~~~ b;
                 a             |            b            |  qgram
    ---------------------------+-------------------------+----------
     Euler                     | Euler T. de Oliveira    | 0.413793
     Euler                     | Euller                  |      0.8
     Oiler                     | Euller                  |      0.4
     Euler Taveira de Oliveira | Euler T. de Oliveira    |  0.77551
     Euler Taveira de Oliveira | Oliveira, Euler Taveira | 0.807692
     Euler Taveira de Oliveira | Sr. Oliveira            | 0.439024
    (6 rows)

Example 4: Using a set of operators with the same threshold (0.7) to illustrate that some similarity functions are appropriate for certain data domains.
    mydb=# select * from bar where b ~@@ 'euler'; -- jaro-winkler operator
              b
    ----------------------
     Euler T. de Oliveira
     Euller
    (2 rows)

    mydb=# select * from bar where b ~~~ 'euler'; -- qgram operator
     b
    ---
    (0 rows)

    mydb=# select * from bar where b ~== 'euler'; -- levenshtein operator
       b
    --------
     Euller
    (1 row)

    mydb=# select * from bar where b ~## 'euler'; -- cosine operator
     b
    ---
    (0 rows)
