Text-Soundex-3.05/000755 000767 000024 00000000000 12620420313 014106 5ustar00rjbsstaff000000 000000 Text-Soundex-3.05/Changes000644 000767 000024 00000003546 12105012162 015406 0ustar00rjbsstaff000000 000000 Revision history for Perl extension Text::Soundex. 3.04 Thu Feb 7 15:53:09 EST 2013 The module is going to be removed from the core distribution of perl, and will now warn (under warnings) if loaded from its installed-to-core location. 3.02 Sun Feb 02 02:54:00 EST 2003 The U8 type was over-used in 3.00 and 3.01. Now, "U8 *" is used only as a pointer into the UTF-8 string. Also, unicode now works properly on Perl 5.6.x as the utf8_to_uv() function is used instead of utf8n_to_uvchr() when compiled under a version of Perl earlier than 5.8.0. 3.01 Sun Jan 26 16:30:00 EST 2003 A bug with non-UTF 8 strings that contain non-ASCII alphabetic characters was fixed. The soundex_unicode() and soundex_nara_unicode() wrapper routines were included and the documentation refers the user to the excellent Text::Unidecode module to perform soundex encodings using unicode strings. The Perl versions of the routines have been further optimized, and correct a border case involving non-alphabetic characters at the beginning of the string. 3.00 Sun Jan 26 04:08:00 EST 2003 Updated documentation, simplified the Perl interface, and updated the XS code to be faster, and to properly work with UTF-8 strings. UNICODE characters outside the ASCII range (0x00 - 0x7F) are considered to be non-alphabetic for the purposes of the soundex algorithms. 2.10 Sun Feb 15 15:29:38 EST 1998 I've put in a version of my XS code and fully integrated it with the existing 100% perl mechanism. The change should be virtually transparent to the user. XS code is approx 7.5 times faster. - Mark Mielke 2.00 Thu Jan 1 16:22:11 1998 Incorporated Mark Mielke's rewritten version of the main soundex routine and made the test.pl file simpler. 
Text-Soundex-3.05/LICENSE000644 000767 000024 00000043671 12620416041 015131 0ustar00rjbsstaff000000 000000 This software is copyright (c) 1998-2003 by Mark Mielke. This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself. Terms of the Perl programming language system itself a) the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version, or b) the "Artistic License" --- The GNU General Public License, Version 1, February 1989 --- This software is Copyright (c) 1998-2003 by Mark Mielke. This is free software, licensed under: The GNU General Public License, Version 1, February 1989 GNU GENERAL PUBLIC LICENSE Version 1, February 1989 Copyright (C) 1989 Free Software Foundation, Inc. 51 Franklin St, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The license agreements of most software companies try to keep users at the mercy of those companies. By contrast, our General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. The General Public License applies to the Free Software Foundation's software and to any other program whose authors commit to using it. You can use it for your programs, too. When we speak of free software, we are referring to freedom, not price. Specifically, the General Public License is designed to make sure that you have the freedom to give away or sell copies of free software, that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. 
These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of a such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must tell them their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License Agreement applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any work containing the Program or a portion of it, either verbatim or with modifications. Each licensee is addressed as "you". 1. 
You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this General Public License and to the absence of any warranty; and give any other recipients of the Program a copy of this General Public License along with the Program. You may charge a fee for the physical act of transferring a copy. 2. You may modify your copy or copies of the Program or any portion of it, and copy and distribute such modifications under the terms of Paragraph 1 above, provided that you also do the following: a) cause the modified files to carry prominent notices stating that you changed the files and the date of any change; and b) cause the whole of any work that you distribute or publish, that in whole or in part contains the Program or any part thereof, either with or without modifications, to be licensed at no charge to all third parties under the terms of this General Public License (except that you may choose to grant warranty protection to some or all third parties, at your option). c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the simplest and most usual way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this General Public License. d) You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 
Mere aggregation of another independent work with the Program (or its derivative) on a volume of a storage or distribution medium does not bring the other work under the scope of these terms. 3. You may copy and distribute the Program (or a portion or derivative of it, under Paragraph 2) in object code or executable form under the terms of Paragraphs 1 and 2 above provided that you also do one of the following: a) accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Paragraphs 1 and 2 above; or, b) accompany it with a written offer, valid for at least three years, to give any third party free (except for a nominal charge for the cost of distribution) a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Paragraphs 1 and 2 above; or, c) accompany it with the information you received as to where the corresponding source code may be obtained. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form alone.) Source code for a work means the preferred form of the work for making modifications to it. For an executable file, complete source code means all the source code for all modules it contains; but, as a special exception, it need not include source code for modules which are standard libraries that accompany the operating system on which the executable file runs, or for standard header files or definitions files that accompany that operating system. 4. You may not copy, modify, sublicense, distribute or transfer the Program except as expressly provided under this General Public License. Any attempt otherwise to copy, modify, sublicense, distribute or transfer the Program is void, and will automatically terminate your rights to use the Program under this License. 
However, parties who have received copies, or rights to use copies, from you under this General Public License will not have their licenses terminated so long as such parties remain in full compliance. 5. By copying, distributing or modifying the Program (or any work based on the Program) you indicate your acceptance of this license to do so, and all its terms and conditions. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. 7. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of the license which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of the license, you may choose any version ever published by the Free Software Foundation. 8. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 9. 
BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 10. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS Appendix: How to Apply These Terms to Your New Programs If you develop a new program, and you want it to be of the greatest possible use to humanity, the best way to achieve this is to make it free software which everyone can redistribute and change under these terms. To do so, attach the following notices to the program. It is safest to attach them to the start of each source file to most effectively convey the exclusion of warranty; and each file should have at least the "copyright" line and a pointer to where the full notice is found. 
Copyright (C) 19yy This program is free software; you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation; either version 1, or (at your option) any later version. This program is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details. You should have received a copy of the GNU General Public License along with this program; if not, write to the Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston MA 02110-1301 USA Also add information on how to contact you by electronic and paper mail. If the program is interactive, make it output a short notice like this when it starts in an interactive mode: Gnomovision version 69, Copyright (C) 19xx name of author Gnomovision comes with ABSOLUTELY NO WARRANTY; for details type `show w'. This is free software, and you are welcome to redistribute it under certain conditions; type `show c' for details. The hypothetical commands `show w' and `show c' should show the appropriate parts of the General Public License. Of course, the commands you use may be called something other than `show w' and `show c'; they could even be mouse-clicks or menu items--whatever suits your program. You should also get your employer (if you work as a programmer) or your school, if any, to sign a "copyright disclaimer" for the program, if necessary. Here a sample; alter the names: Yoyodyne, Inc., hereby disclaims all copyright interest in the program `Gnomovision' (a program to direct compilers to make passes at assemblers) written by James Hacker. , 1 April 1989 Ty Coon, President of Vice That's all there is to it! --- The Artistic License 1.0 --- This software is Copyright (c) 1998-2003 by Mark Mielke. 
This is free software, licensed under: The Artistic License 1.0 The Artistic License Preamble The intent of this document is to state the conditions under which a Package may be copied, such that the Copyright Holder maintains some semblance of artistic control over the development of the package, while giving the users of the package the right to use and distribute the Package in a more-or-less customary fashion, plus the right to make reasonable modifications. Definitions: - "Package" refers to the collection of files distributed by the Copyright Holder, and derivatives of that collection of files created through textual modification. - "Standard Version" refers to such a Package if it has not been modified, or has been modified in accordance with the wishes of the Copyright Holder. - "Copyright Holder" is whoever is named in the copyright or copyrights for the package. - "You" is you, if you're thinking about copying or distributing this Package. - "Reasonable copying fee" is whatever you can justify on the basis of media cost, duplication charges, time of people involved, and so on. (You will not be required to justify it to the Copyright Holder, but only to the computing community at large as a market that must bear the fee.) - "Freely Available" means that no fee is charged for the item itself, though there may be fees involved in handling the item. It also means that recipients of the item may redistribute it under the same conditions they received it. 1. You may make and give away verbatim copies of the source form of the Standard Version of this Package without restriction, provided that you duplicate all of the original copyright notices and associated disclaimers. 2. You may apply bug fixes, portability fixes and other modifications derived from the Public Domain or from the Copyright Holder. A Package modified in such a way shall still be considered the Standard Version. 3. 
You may otherwise modify your copy of this Package in any way, provided that you insert a prominent notice in each changed file stating how and when you changed that file, and provided that you do at least ONE of the following: a) place your modifications in the Public Domain or otherwise make them Freely Available, such as by posting said modifications to Usenet or an equivalent medium, or placing the modifications on a major archive site such as ftp.uu.net, or by allowing the Copyright Holder to include your modifications in the Standard Version of the Package. b) use the modified Package only within your corporation or organization. c) rename any non-standard executables so the names do not conflict with standard executables, which must also be provided, and provide a separate manual page for each non-standard executable that clearly documents how it differs from the Standard Version. d) make other distribution arrangements with the Copyright Holder. 4. You may distribute the programs of this Package in object code or executable form, provided that you do at least ONE of the following: a) distribute a Standard Version of the executables and library files, together with instructions (in the manual page or equivalent) on where to get the Standard Version. b) accompany the distribution with the machine-readable source of the Package with your modifications. c) accompany any non-standard executables with their corresponding Standard Version executables, giving the non-standard executables non-standard names, and clearly documenting the differences in manual pages (or equivalent), together with instructions on where to get the Standard Version. d) make other distribution arrangements with the Copyright Holder. 5. You may charge a reasonable copying fee for any distribution of this Package. You may charge any fee you choose for support of this Package. You may not charge a fee for this Package itself. 
However, you may distribute this Package in aggregate with other (possibly commercial) programs as part of a larger (possibly commercial) software distribution provided that you do not advertise this Package as a product of your own. 6. The scripts and library files supplied as input to or produced as output from the programs of this Package do not automatically fall under the copyright of this Package, but belong to whomever generated them, and may be sold commercially, and may be aggregated with this Package. 7. C or perl subroutines supplied by you and linked into this Package shall not be considered part of this Package. 8. The name of the Copyright Holder may not be used to endorse or promote products derived from this software without specific prior written permission. 9. THIS PACKAGE IS PROVIDED "AS IS" AND WITHOUT ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, WITHOUT LIMITATION, THE IMPLIED WARRANTIES OF MERCHANTIBILITY AND FITNESS FOR A PARTICULAR PURPOSE. The End Text-Soundex-3.05/Makefile.PL000644 000767 000024 00000002340 12620416122 016062 0ustar00rjbsstaff000000 000000 # -*- perl -*- use strict; use ExtUtils::MakeMaker; my $useXS = 1; foreach (@ARGV) { /^--use-xs$/ && do { $useXS = 1; next; }; /^--no-xs$/ && do { $useXS = 0; next; }; warn "WARNING: Option $_ was not recognized. (ignoring)\n"; } my %BuildOptions; $BuildOptions{'NAME'} = "Text::Soundex"; $BuildOptions{'LICENSE'} = "perl"; $BuildOptions{'VERSION_FROM'} = "Soundex.pm"; # Finds $VERSION $BuildOptions{'INSTALLDIRS'} = $] > 5.011 ? 'site' : "perl"; # Install in CORE. $BuildOptions{'NORECURS'} = 1; # No need to recurse. 
$BuildOptions{'dist'} = { 'COMPRESS' => "gzip -v9Nf", 'SUFFIX' => "gz", }; $BuildOptions{'PREREQ_PM'} = { 'if' => 0, }; if ($useXS) { print "The XS code will be compiled.\n" if -t STDOUT; $BuildOptions{'XS'} = {'Soundex.xs' => 'Soundex.c'}; $BuildOptions{'C'} = ['Soundex.c']; } else { print "The XS code will not be compiled.\n" if -t STDOUT; $BuildOptions{'XS'} = {}; $BuildOptions{'C'} = []; } WriteMakefile(%BuildOptions); Text-Soundex-3.05/MANIFEST000644 000767 000024 00000000364 12620420313 015242 0ustar00rjbsstaff000000 000000 Changes LICENSE Makefile.PL MANIFEST README Soundex.pm Soundex.xs t/basic.t META.yml Module YAML meta-data (added by MakeMaker) META.json Module JSON meta-data (added by MakeMaker) Text-Soundex-3.05/META.json000644 000767 000024 00000001560 12620420313 015531 0ustar00rjbsstaff000000 000000 { "abstract" : "unknown", "author" : [ "unknown" ], "dynamic_config" : 1, "generated_by" : "ExtUtils::MakeMaker version 7.1, CPAN::Meta::Converter version 2.150005", "license" : [ "perl_5" ], "meta-spec" : { "url" : "http://search.cpan.org/perldoc?CPAN::Meta::Spec", "version" : "2" }, "name" : "Text-Soundex", "no_index" : { "directory" : [ "t", "inc" ] }, "prereqs" : { "build" : { "requires" : { "ExtUtils::MakeMaker" : "0" } }, "configure" : { "requires" : { "ExtUtils::MakeMaker" : "0" } }, "runtime" : { "requires" : { "if" : "0" } } }, "release_status" : "stable", "version" : "3.05", "x_serialization_backend" : "JSON::PP version 2.27300" } Text-Soundex-3.05/META.yml000644 000767 000024 00000000761 12620420313 015363 0ustar00rjbsstaff000000 000000 --- abstract: unknown author: - unknown build_requires: ExtUtils::MakeMaker: '0' configure_requires: ExtUtils::MakeMaker: '0' dynamic_config: 1 generated_by: 'ExtUtils::MakeMaker version 7.1, CPAN::Meta::Converter version 2.150005' license: perl meta-spec: url: http://module-build.sourceforge.net/META-spec-v1.4.html version: '1.4' name: Text-Soundex no_index: directory: - t - inc requires: if: '0' version: 
'3.05' x_serialization_backend: 'CPAN::Meta::YAML version 0.017' Text-Soundex-3.05/README000644 000767 000024 00000011150 12620420260 014765 0ustar00rjbsstaff000000 000000 Text::Soundex - Implementation of the soundex algorithm. Basic Usage: Soundex is used to do a one way transformation of a name, converting a character string given as input into a set of codes representing the identifiable sounds those characters might make in the output. For example: use Text::Soundex; print soundex("Mark"), "\n"; # prints: M620 print soundex("Marc"), "\n"; # prints: M620 print soundex("Hansen"), "\n"; # prints: H525 print soundex("Hanson"), "\n"; # prints: H525 print soundex("Henson"), "\n"; # prints: H525 In many situations, code such as the following: if ($name1 eq $name2) { ... } Can be substituted with: if (soundex($name1) eq soundex($name2)) { ... } Installation: Once the archive has been unpacked then the following steps are needed to build, test and install the module (to be done in the directory which contains the Makefile.PL) perl Makefile.PL make make test If the make test succeeds then the next step may need to be run as root (on a Unix-like system) or with special privileges on other systems. make install If you do not want to use the XS code (for whatever reason) do the following instead of the above: perl Makefile.PL --no-xs make make test make install If any of the tests report 'not ok' and you are running perl 5.6.0 or later then please contact Mark Mielke History: Version 3.03: Updated to allow the XS implementation to work properly under an EBCDIC/EBCDIC-UTF8 character set environment. Updated documentation to better describe the history of the soundex algorithm and how it applies to this module. Version 3.02: 3.01 and 3.00 used the 'U8' type incorrectly causing some strict compilers to complain or refuse to compile the XS code. Also, Unicode support did not work properly for Perl 5.6.x. Both of these problems are now fixed. 
Version 3.01: A bug with non-UTF 8 strings that contain non-ASCII alphabetic characters was fixed. The soundex_unicode() and soundex_nara_unicode() wrapper routines were included and the documentation refers the user to the excellent Text::Unidecode module to perform soundex encodings using unicode strings. The Perl versions of the routines have been further optimized, and correct a border case involving non-alphabetic characters at the beginning of the string. Version 3.00: Support for UTF-8 strings (unicode strings) is now in place. Note that this allows UTF-8 strings to be passed to the XS version of the soundex() routine. The Soundex algorithm treats characters outside the ASCII range (0x00 - 0x7F) as if they were not alphabetical. The interface has been simplified. In order to explicitly use the non-XS implementation of soundex(): use Text::Soundex (); $code = Text::Soundex::soundex_noxs($name); In order to use the NARA soundex algorithm: use Text::Soundex 'soundex_nara'; $code = soundex_nara($name); Use of the ':NARA-Ruleset' import directive is now obsolete. To emulate the old behaviour: use Text::Soundex (); *soundex = \&Text::Soundex::soundex_nara; $code = soundex($name); Version 2.20: This version includes support for the algorithm used to index the U.S. Federal Censuses. There is a slight discrepancy in the definition of a soundex code, which is not commonly known or recognized, involving similar-sounding letters being separated by the characters H or W. This is defined as the NARA ruleset, as this discrepancy was discovered by them. (Calling it "the US Census ruleset" was too unwieldy...) NARA can be found at: http://www.archives.gov/research/genealogy/index.html The algorithm used by NARA can be found at: http://www.archives.gov/research/census/soundex.html Version 2.00: This version is a full re-write of the 1.0 engine by Mark Mielke. The goal was for speed... and this was achieved. 
There is an optional XS module which can be used completely transparently by the user which offers a further speed increase of a factor of more than 7.5X. Version 1.00: This version can be found in the perl core distribution from at least Perl 5.8.0 and down. It was written by Mike Stok. It can be identified by the fact that it does not contain a $VERSION in the beginning of the module, and as well it uses an RCS tag with a version of 1.x. This version, before some perl5'ish packaging was introduced, was actually written for perl4. Text-Soundex-3.05/Soundex.pm000644 000767 000024 00000020231 12620420246 016074 0ustar00rjbsstaff000000 000000 # -*- perl -*- # (c) Copyright 1998-2007 by Mark Mielke # # Freedom to use these sources for whatever you want, as long as credit # is given where credit is due, is hereby granted. You may make modifications # where you see fit but leave this copyright somewhere visible. As well, try # to initial any changes you make so that if I like the changes I can # incorporate them into later versions. # # - Mark Mielke # package Text::Soundex; require 5.006; use Exporter (); use XSLoader (); use strict; use if $] > 5.016, 'deprecate'; our $VERSION = '3.05'; our @EXPORT_OK = qw(soundex soundex_unicode soundex_nara soundex_nara_unicode $soundex_nocode); our @EXPORT = qw(soundex soundex_nara $soundex_nocode); our @ISA = qw(Exporter); our $nocode; # Previous releases of Text::Soundex made $nocode available as $soundex_nocode. # For now, this part of the interface is exported and maintained. # In the future, $soundex_nocode will be deprecated. 
*Text::Soundex::soundex_nocode = \$nocode; sub soundex_noxs { # Original Soundex algorithm my @results = map { my $code = uc($_); $code =~ tr/AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr//cd; if (length($code)) { my $firstchar = substr($code, 0, 1); $code =~ tr[AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr] [0000000000000000111111112222222222222222333344555566]s; ($code = substr($code, 1)) =~ tr/0//d; substr($firstchar . $code . '000', 0, 4); } else { $nocode; } } @_; wantarray ? @results : $results[0]; } sub soundex_nara { # US census (NARA) algorithm. my @results = map { my $code = uc($_); $code =~ tr/AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr//cd; if (length($code)) { my $firstchar = substr($code, 0, 1); $code =~ tr[AaEeHhIiOoUuWwYyBbFfPpVvCcGgJjKkQqSsXxZzDdTtLlMmNnRr] [0000990000009900111111112222222222222222333344555566]s; $code =~ s/(.)9\1/$1/gs; ($code = substr($code, 1)) =~ tr/09//d; substr($firstchar . $code . '000', 0, 4); } else { $nocode } } @_; wantarray ? @results : $results[0]; } sub soundex_unicode { require Text::Unidecode unless defined &Text::Unidecode::unidecode; soundex(Text::Unidecode::unidecode(@_)); } sub soundex_nara_unicode { require Text::Unidecode unless defined &Text::Unidecode::unidecode; soundex_nara(Text::Unidecode::unidecode(@_)); } eval { XSLoader::load(__PACKAGE__, $VERSION) }; if (defined(&soundex_xs)) { *soundex = \&soundex_xs; } else { *soundex = \&soundex_noxs; *soundex_xs = sub { require Carp; Carp::croak("XS implementation of Text::Soundex::soundex_xs() ". "could not be loaded"); }; } 1; __END__ # Implementation of the soundex algorithm. # # Some of this documentation was written by Mike Stok. # # Examples: # # Euler, Ellery -> E460 # Gauss, Ghosh -> G200 # Hilbert, Heilbronn -> H416 # Knuth, Kant -> K530 # Lloyd, Ladd -> L300 # Lukasiewicz, Lissajous -> L222 # =head1 NAME Text::Soundex - Implementation of the soundex algorithm. =head1 SYNOPSIS use Text::Soundex; # Original algorithm. 
$code = soundex($name); # Get the soundex code for a name. @codes = soundex(@names); # Get the list of codes for a list of names. # American Soundex variant (NARA) - Used for US census data. $code = soundex_nara($name); # Get the soundex code for a name. @codes = soundex_nara(@names); # Get the list of codes for a list of names. # Redefine the value that soundex() will return if the input string # contains no identifiable sounds within it. $Text::Soundex::nocode = 'Z000'; =head1 DESCRIPTION Soundex is a phonetic algorithm for indexing names by sound, as pronounced in English. The goal is for names with the same pronunciation to be encoded to the same representation so that they can be matched despite minor differences in spelling. Soundex is the most widely known of all phonetic algorithms and is often used (incorrectly) as a synonym for "phonetic algorithm". Improvements to Soundex are the basis for many modern phonetic algorithms. (Wikipedia, 2007) This module implements the original soundex algorithm developed by Robert Russell and Margaret Odell, patented in 1918 and 1922, as well as a variation called "American Soundex" used for US census data, and currently maintained by the National Archives and Records Administration (NARA). The soundex algorithm may be recognized from Donald Knuth's B<The Art of Computer Programming>. The algorithm described by Knuth is the NARA algorithm. The value returned for strings which have no soundex encoding is defined using C<$Text::Soundex::nocode>. The default value is C<undef>; however, values such as C<'Z000'> are commonly used alternatives. For backward compatibility with older versions of this module the C<$Text::Soundex::nocode> is exported into the caller's namespace as C<$soundex_nocode>. In scalar context, C<soundex()> returns the soundex code of its first argument. In list context, a list is returned in which each element is the soundex code for the corresponding argument passed to C<soundex()>. 
For example, the following code assigns @codes the value C<('M200', 'S320')>: @codes = soundex qw(Mike Stok); To use this module to generate codes that can be used to search one of the publicly available US Censuses, a variant of the soundex algorithm must be used: use Text::Soundex; $code = soundex_nara($name); An example of where these algorithms differ follows: use Text::Soundex; print soundex("Ashcraft"), "\n"; # prints: A226 print soundex_nara("Ashcraft"), "\n"; # prints: A261 =head1 EXAMPLES Donald Knuth's examples of names and the soundex codes they map to are listed below: Euler, Ellery -> E460 Gauss, Ghosh -> G200 Hilbert, Heilbronn -> H416 Knuth, Kant -> K530 Lloyd, Ladd -> L300 Lukasiewicz, Lissajous -> L222 so: $code = soundex 'Knuth'; # $code contains 'K530' @list = soundex qw(Lloyd Gauss); # @list contains 'L300', 'G200' =head1 LIMITATIONS As the soundex algorithm was originally used a B<long> time ago in the US it considers only the English alphabet and pronunciation. In particular, non-ASCII characters will be ignored. The recommended method of dealing with characters that have accents, or other unicode characters, is to use the Text::Unidecode module available from CPAN. Either use the module explicitly: use Text::Soundex; use Text::Unidecode; print soundex(unidecode("Fran\xE7ais")), "\n"; # Prints "F652\n" Or use the convenient wrapper routine: use Text::Soundex 'soundex_unicode'; print soundex_unicode("Fran\xE7ais"), "\n"; # Prints "F652\n" Since the soundex algorithm maps a large space (strings of arbitrary length) onto a small space (single letter plus 3 digits) no inference can be made about the similarity of two strings which end up with the same soundex code. For example, both C<Knuth> and C<Kant> end up with a soundex code of C<K530>. =head1 COPYRIGHT AND LICENSE This software is copyright (c) 1998-2003 by Mark Mielke. This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself. 

=head1 MAINTAINER

This module is currently maintained by Mark Mielke (C).

=head1 HISTORY

Version 3 is a significant update to provide support for versions of
Perl later than Perl 5.004. Specifically, the XS version of the
soundex() subroutine understands strings that are encoded using UTF-8
(unicode strings).

Version 2 of this module was a re-write by Mark Mielke (C) to improve
the speed of the subroutines. The XS version of the soundex()
subroutine was introduced in 2.10.

Version 1 of this module was written by Mike Stok (C) and was included
into the Perl core library set.

Dave Carlsen (C) made the request for the NARA algorithm to be
included. The NARA soundex page can be viewed at: C

Ian Phillips (C) and Rich Pinder (C) supplied ideas and spotted
mistakes for v1.x.

=cut

Text-Soundex-3.05/Soundex.xs000644 000767 000024 00000012570 12076144613 016127 0ustar00rjbsstaff000000 000000 
/* -*- c -*- */

/* (c) Copyright 1998-2003 by Mark Mielke
 *
 * Freedom to use these sources for whatever you want, as long as credit
 * is given where credit is due, is hereby granted. You may make modifications
 * where you see fit but leave this copyright somewhere visible. As well try
 * to initial any changes you make so that if i like the changes i can
 * incorporate them into any later versions of mine.
 *
 *      - Mark Mielke
 */

#include "EXTERN.h"
#include "perl.h"
#include "XSUB.h"

#define SOUNDEX_ACCURACY (4) /* The maximum code length...
(should be>=2) */

#if !(PERL_REVISION >= 5 && PERL_VERSION >= 8)
#  define utf8n_to_uvchr utf8_to_uv
#endif

static char sv_soundex_table[0x100];

static void sv_soundex_initialize (void)
{
  memset(&sv_soundex_table[0], '\0', sizeof(sv_soundex_table));
  sv_soundex_table['A'] = '0';  sv_soundex_table['a'] = '0';
  sv_soundex_table['E'] = '0';  sv_soundex_table['e'] = '0';
  sv_soundex_table['H'] = '0';  sv_soundex_table['h'] = '0';
  sv_soundex_table['I'] = '0';  sv_soundex_table['i'] = '0';
  sv_soundex_table['O'] = '0';  sv_soundex_table['o'] = '0';
  sv_soundex_table['U'] = '0';  sv_soundex_table['u'] = '0';
  sv_soundex_table['W'] = '0';  sv_soundex_table['w'] = '0';
  sv_soundex_table['Y'] = '0';  sv_soundex_table['y'] = '0';
  sv_soundex_table['B'] = '1';  sv_soundex_table['b'] = '1';
  sv_soundex_table['F'] = '1';  sv_soundex_table['f'] = '1';
  sv_soundex_table['P'] = '1';  sv_soundex_table['p'] = '1';
  sv_soundex_table['V'] = '1';  sv_soundex_table['v'] = '1';
  sv_soundex_table['C'] = '2';  sv_soundex_table['c'] = '2';
  sv_soundex_table['G'] = '2';  sv_soundex_table['g'] = '2';
  sv_soundex_table['J'] = '2';  sv_soundex_table['j'] = '2';
  sv_soundex_table['K'] = '2';  sv_soundex_table['k'] = '2';
  sv_soundex_table['Q'] = '2';  sv_soundex_table['q'] = '2';
  sv_soundex_table['S'] = '2';  sv_soundex_table['s'] = '2';
  sv_soundex_table['X'] = '2';  sv_soundex_table['x'] = '2';
  sv_soundex_table['Z'] = '2';  sv_soundex_table['z'] = '2';
  sv_soundex_table['D'] = '3';  sv_soundex_table['d'] = '3';
  sv_soundex_table['T'] = '3';  sv_soundex_table['t'] = '3';
  sv_soundex_table['L'] = '4';  sv_soundex_table['l'] = '4';
  sv_soundex_table['M'] = '5';  sv_soundex_table['m'] = '5';
  sv_soundex_table['N'] = '5';  sv_soundex_table['n'] = '5';
  sv_soundex_table['R'] = '6';  sv_soundex_table['r'] = '6';
}

static SV *sv_soundex (source)
     SV *source;
{
  char *source_p;
  char *source_end;

  {
    STRLEN source_len;
    source_p = SvPV(source, source_len);
    source_end = &source_p[source_len];
  }

  while (source_p != source_end) {
    char codepart_last =
      sv_soundex_table[(unsigned char) *source_p];

    if (codepart_last != '\0') {
      SV *code = newSV(SOUNDEX_ACCURACY);
      char *code_p = SvPVX(code);
      char *code_end = &code_p[SOUNDEX_ACCURACY];

      SvCUR_set(code, SOUNDEX_ACCURACY);
      SvPOK_only(code);

      /* The first alphabetic character is kept as-is (upper-cased). */
      *code_p++ = toupper(*source_p++);

      while (source_p != source_end && code_p != code_end) {
        char c = *source_p++;
        char codepart = sv_soundex_table[(unsigned char) c];

        /* Skip non-alphabetic characters and repeated codes; a '0'
           (vowel-like) code resets codepart_last but emits nothing. */
        if (codepart != '\0')
          if (codepart != codepart_last && (codepart_last = codepart) != '0')
            *code_p++ = codepart;
      }

      /* Pad the code with zeroes up to SOUNDEX_ACCURACY characters. */
      while (code_p != code_end)
        *code_p++ = '0';

      *code_end = '\0';

      return code;
    }

    source_p++;
  }

  return SvREFCNT_inc(perl_get_sv("Text::Soundex::nocode", FALSE));
}

static SV *sv_soundex_utf8 (source)
     SV *source;
{
  U8 *source_p;
  U8 *source_end;

  {
    STRLEN source_len;
    source_p = (U8 *) SvPV(source, source_len);
    source_end = &source_p[source_len];
  }

  while (source_p < source_end) {
    STRLEN offset;
    UV c = utf8n_to_uvchr(source_p, source_end-source_p, &offset, 0);
    char codepart_last = (c <= 0xFF) ? sv_soundex_table[c] : '\0';
    source_p = (offset >= 1) ? &source_p[offset] : source_end;

    if (codepart_last != '\0') {
      SV *code = newSV(SOUNDEX_ACCURACY);
      char *code_p = SvPVX(code);
      char *code_end = &code_p[SOUNDEX_ACCURACY];

      SvCUR_set(code, SOUNDEX_ACCURACY);
      SvPOK_only(code);

      *code_p++ = toupper(c);

      while (source_p != source_end && code_p != code_end) {
        char codepart;
        c = utf8n_to_uvchr(source_p, source_end-source_p, &offset, 0);
        codepart = (c <= 0xFF) ? sv_soundex_table[c] : '\0';
        source_p = (offset >= 1) ? &source_p[offset] : source_end;

        if (codepart != '\0')
          if (codepart != codepart_last && (codepart_last = codepart) != '0')
            *code_p++ = codepart;
      }

      while (code_p != code_end)
        *code_p++ = '0';

      *code_end = '\0';

      return code;
    }

    /* No extra increment here: source_p was already advanced past the
       decoded character above, so incrementing again would skip the
       first byte of the next character. */
  }

  return SvREFCNT_inc(perl_get_sv("Text::Soundex::nocode", FALSE));
}

MODULE = Text::Soundex  PACKAGE = Text::Soundex

PROTOTYPES: DISABLE

void
soundex_xs (...)
INIT:
{
  sv_soundex_initialize();
}
PPCODE:
{
  int i;
  for (i = 0; i < items; i++) {
    SV *sv = ST(i);

    if (DO_UTF8(sv))
      sv = sv_soundex_utf8(sv);
    else
      sv = sv_soundex(sv);

    PUSHs(sv_2mortal(sv));
  }
}
Text-Soundex-3.05/t/000755 000767 000024 00000000000 12620420313 014351 5ustar00rjbsstaff000000 000000 
Text-Soundex-3.05/t/basic.t000644 000767 000024 00000006076 12620420103 015625 0ustar00rjbsstaff000000 000000 
use strict;

my $test_counter;

BEGIN {
    $test_counter = 0;
    sub t (&);
    sub tsoundex;
    sub test_label;
}

END {
    print "1..$test_counter\n";
}

t {
    test_label "use Text::Soundex 'soundex'";
    eval "use Text::Soundex 'soundex'";
    die if $@;
};

t {
    test_label "use Text::Soundex 'soundex_nara'";
    eval "use Text::Soundex 'soundex_nara'";
    die if $@;
};

t {
    test_label "use Text::Soundex;";
    eval "use Text::Soundex";
    die if $@;
};

# Knuth's test cases, scalar in, scalar out
tsoundex("Euler"       => "E460");
tsoundex("Gauss"       => "G200");
tsoundex("Hilbert"     => "H416");
tsoundex("Knuth"       => "K530");
tsoundex("Lloyd"       => "L300");
tsoundex("Lukasiewicz" => "L222");

# check default "no code" code on a bad string and undef
tsoundex("2 + 2 = 4" => undef);
tsoundex(undef()     => undef);

# check list context with and without "no code"
tsoundex([qw/Ellery Ghosh Heilbronn Kant Ladd Lissajous/],
         [qw/E460   G200  H416      K530 L300 L222     /]);
tsoundex(['Mark', 'Mielke'], ['M620', 'M420']);
tsoundex(['Mike', undef, 'Stok'], ['M200', undef, 'S320']);

# check the deprecated $soundex_nocode and make sure it's reflected in
# the $Text::Soundex::nocode variable.
{
    our $soundex_nocode;
    my $nocodeValue = 'Z000';
    $soundex_nocode = $nocodeValue;
    t {
        test_label "setting \$soundex_nocode";
        die if soundex(undef) ne $nocodeValue;
    };
    t {
        test_label "\$soundex_nocode eq \$Text::Soundex::nocode";
        die if $Text::Soundex::nocode ne $soundex_nocode;
    };
}

# make sure an empty argument list returns an undefined scalar
t {
    test_label "empty list";
    die if defined(soundex());
};

# test to detect an error in Mike Stok's original implementation, the
# error isn't in Mark Mielke's at all but the test should be kept anyway.
# originally spotted by Rich Pinder
tsoundex("CZARKOWSKA" => "C622");

exit 0;

my $test_label;

sub t (&) {
    my($test_f) = @_;
    $test_label = undef;
    eval {&$test_f};
    my $ok = $@ ? "not ok" : "ok";
    $test_counter++;
    print "$ok - $test_counter $test_label\n";
}

sub tsoundex {
    my($string, $expected) = @_;
    if (ref($string) eq 'ARRAY') {
        t {
            my $s = scalar2string(@$string);
            my $e = scalar2string(@$expected);
            $test_label = "soundex($s) eq ($e)";
            my @codes = soundex(@$string);
            for (my $i = 0; $i < @$string; $i++) {
                my $success = !(defined($codes[$i])||defined($expected->[$i]));
                if (defined($codes[$i]) && defined($expected->[$i])) {
                    $success = ($codes[$i] eq $expected->[$i]);
                }
                die if !$success;
            }
        };
    } else {
        t {
            my $s = scalar2string($string);
            my $e = scalar2string($expected);
            $test_label = "soundex($s) eq $e";
            my $code = soundex($string);
            my $success = !(defined($code) || defined($expected));
            if (defined($code) && defined($expected)) {
                $success = ($code eq $expected);
            }
            die if !$success;
        };
    }
}

sub test_label {
    $test_label = $_[0];
}

sub scalar2string {
    join(", ", map {defined($_) ? qq{'$_'} : qq{undef}} @_);
}