greekocr-1.0.1/ 0000755 0001750 0001750 00000000000 11635564234 012663 5 ustar dalitz dalitz greekocr-1.0.1/TODO 0000644 0001750 0001750 00000001116 11536114132 013336 0 ustar dalitz dalitz Tasks for future versions of the GreekOCR toolkit ------------------------------------------------- - optionally enable grouping for the separatistic approach - add new wholistic recognition which attaches accents to characters with Gamera's grouping algorithm - do thorough tests for the different approaches and measure the performances on various documents - Why does the recognition result contain spurious additional lines? - integrate the improved attachment of diacritical signs to characters into the basic OCR toolkit - disambiguate between quotes and accents greekocr-1.0.1/MANIFEST.in 0000644 0001750 0001750 00000000717 11530742470 014420 0 ustar dalitz dalitz recursive-include src *.cpp *.c *.h makefile.* *.hpp *.hxx *.cxx *.txt ANNOUNCE CHANGES INSTALL KNOWNBUG LICENSE README TODO recursive-include include *.cpp *.c *.h makefile.* *.hpp *.hxx *.cxx *.txt ANNOUNCE CHANGES INSTALL KNOWNBUG LICENSE README TODO recursive-include scripts greekocr include ACKNOWLEDGEMENTS CHANGES TODO INSTALL LICENSE README KNOWN_BUGS MANIFEST.in version recursive-include doc *.txt *.html *.css *.py *.jpg *.jpeg *.png *.gif *.fig greekocr-1.0.1/LICENSE 0000644 0001750 0001750 00000035423 11635556523 013701 0 ustar dalitz dalitz GNU GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. 
This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. 
We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. 
You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. 
But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. 
For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. 
Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. 
Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. 
Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally.

NO WARRANTY

11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION.

12. IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES.

END OF TERMS AND CONDITIONS

greekocr-1.0.1/setup.py 0000755 0001750 0001750 00000003041 11530744025 014366 0 ustar dalitz dalitz
#!/usr/bin/env python

from distutils.core import setup, Extension
from gamera import gamera_setup

# This constant should be the name of the toolkit
TOOLKIT_NAME = "greekocr"
VERSION = open("version", 'r').readlines()[0].strip()
AUTHOR = "Christian Brandt and Christoph Dalitz"
HOMEPAGE = "http://gamera.sourceforge.net/"
DESCRIPTION = "An addon Greek OCR toolkit for the Gamera framework for document analysis and recognition."
LICENSE = "GNU GPL version 2"

# ----------------------------------------------------------------------------
# You should not usually have to edit anything below, but it is
# implemented here and not in the Gamera core so that you can edit it
# if you need to do something more complicated (for example, building
# and linking to a third-party library).
# ----------------------------------------------------------------------------

PLUGIN_PATH = 'gamera/toolkits/%s/plugins/' % TOOLKIT_NAME
PACKAGE = 'gamera.toolkits.%s' % TOOLKIT_NAME
PLUGIN_PACKAGE = PACKAGE + ".plugins"
plugins = gamera_setup.get_plugin_filenames(PLUGIN_PATH)
plugin_extensions = gamera_setup.generate_plugins(plugins, PLUGIN_PACKAGE)

# This is a standard distutils setup initializer. If you need to do
# anything more complex here, refer to the Python distutils documentation.
setup(name=TOOLKIT_NAME,
      version=VERSION,
      license=LICENSE,
      url=HOMEPAGE,
      author=AUTHOR,
      description=DESCRIPTION,
      ext_modules = plugin_extensions,
      packages = [PACKAGE, PLUGIN_PACKAGE],
      scripts = ['scripts/greekocr4gamera.py'])

greekocr-1.0.1/gamera/ 0000755 0001750 0001750 00000000000 11635564234 014117 5 ustar dalitz dalitz
greekocr-1.0.1/gamera/toolkits/ 0000755 0001750 0001750 00000000000 11635564234 015767 5 ustar dalitz dalitz
greekocr-1.0.1/gamera/toolkits/greekocr/ 0000755 0001750 0001750 00000000000 11635564234 017570 5 ustar dalitz dalitz
greekocr-1.0.1/gamera/toolkits/greekocr/singlediacritics.py 0000644 0001750 0001750 00000013744 11635564123 023470 0 ustar dalitz dalitz
# -*- mode: python; indent-tabs-mode: nil; tab-width: 3 -*-
# vim: set tabstop=3 shiftwidth=3 expandtab:

# Copyright (C) 2010-2011 Christian Brandt
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

from gamera.core import *
from gamera.plugins.pagesegmentation import textline_reading_order
from gamera.toolkits.ocr.ocr_toolkit import *
from gamera.toolkits.ocr.classes import Textline,Page,ClassifyCCs
import gamera.kdtree as kdtree
import unicodedata
import sys

class SinglePage(Page):
   def lines_to_chars(self):
      subccs = self.img.sub_cc_analysis(self.ccs_lines)
      for i,segment in enumerate(self.ccs_lines):
         self.textlines.append(SingleTextline(segment, subccs[1][i]))

class Character(object):
   def __init__(self, glyph):
      self.maincharacter = glyph
      self.unicodename = glyph.get_main_id()
      self.unicodename = self.unicodename.replace(".", " ").upper()
      #print self.unicodename
      #self.unicodename =
      self.combinedwith = []
      #print self.maincharacter

   def addCombiningDiacritics(self, diacrit):
      self.combinedwith.append(diacrit)

   def toUnicodeString(self):
      try:
         # "text" instead of "str" so that the builtin is not shadowed
         text = u""
         mainids = self.maincharacter.get_main_id().split(".and.")
         for charid in mainids:
            if charid == "skip" or charid == "unclassified":
               continue
            text = text + u"%c" % return_char(charid)
         #text = u"" + return_char(self.unicodename)
         for diacrit in self.combinedwith:
            #char = char.get_main_id().replace(".", " ").upper()
            mainids = diacrit.get_main_id().split(".and.")
            #print mainids
            for charid in mainids:
               if charid == "skip":
                  continue
               #print "added %s to output" % charid
               text = text + u"%c" % return_char(charid)
         return unicodedata.normalize('NFD', text)
      except:
         #print self.unicodename
         return u"E"

class SingleTextline(Textline):
   def sort_glyphs(self):
      self.glyphs.sort(lambda x,y: cmp(x.ul_x, y.ul_x))
      #begin calculating threshold for word-spacing
      glyphs = []
      for g in self.glyphs:
         if self.is_combining_glyph(g):
            continue
         glyphs.append(g)
      spacelist = []
      total_space = 0
      for i in range(len(glyphs) - 1):
         spacelist.append(glyphs[i + 1].ul_x - glyphs[i].lr_x)
      if(len(spacelist) > 0):
         threshold = median(spacelist)
         threshold = threshold * 2.0
      else:
         threshold = 0
      #end calculating threshold for word-spacing
      self.words = chars_make_words(self.glyphs, threshold)

   def is_combining_glyph(self, glyph):
      ret = glyph.get_main_id().find("combining") != -1
      return ret

   def to_string(self):
      k = 3
      max_k = 10
      output = ""
      for word in self.words:
         med_center = median([g.center.y for g in word])
         glyphs_combining = []
         characters = []
         nodes_normal = []
         skipids = ["manual.xi.upper", "manual.xi.lower", "manual.theta.outer", "_split.splitx", "skip"]
         for glyph in word:
            mainid = glyph.get_main_id()
            if skipids.count(mainid) > 0:
               continue
            elif mainid == "manual.xi.middle":
               glyph.classify_automatic("greek.capital.letter.xi")
            elif mainid == "manual.theta.inner":
               glyph.classify_automatic("greek.capital.letter.theta")
            elif mainid == "comma" or mainid == "combining.comma.above":
               #print "%s - center_y: %d - med_center: %d" % (mainid, glyph.center.y, med_center)
               if glyph.center.y > self.bbox.center.y:
                  glyph.classify_automatic("comma")
               else:
                  glyph.classify_automatic("combining.comma.above")
            elif mainid.find("manual") != -1 or mainid.find("split") != -1:
               continue

            if self.is_combining_glyph(glyph):
               glyphs_combining.append(glyph)
            else:
               c = Character(glyph)
               characters.append(c)
               #print c
               nodes_normal.append(kdtree.KdNode((glyph.center.x, glyph.center.y), c))

         if (nodes_normal == None or len(nodes_normal) == 0):
            continue
         tree = kdtree.KdTree(nodes_normal)

         for g in glyphs_combining:
            fast = True
            if fast:
               knn = tree.k_nearest_neighbors((g.center.x, g.center.y), k)
               knn[0].data.addCombiningDiacritics(g)
            else:
               found = False
               while (not found) and k < max_k:
                  knn = tree.k_nearest_neighbors((g.center.x, g.center.y), k)
                  for nn in knn:
                     if (nn.data.maincharacter.get_main_id().split(".").count("greek") > 0) and not found:
                        nn.data.addCombiningDiacritics(g)
                        found = True
                        break
                  k = k + 2

         for c in characters:
            output = output + c.toUnicodeString()
         output = output + " "
      return output

greekocr-1.0.1/gamera/toolkits/greekocr/compare.py 0000644 0001750 0001750 00000006017 11635564123 021571 0 ustar dalitz dalitz
# -*- mode: python; indent-tabs-mode: nil; tab-width: 3 -*-
# vim: set tabstop=3 shiftwidth=3 expandtab:

# Copyright (C) 2010-2011 Christian Brandt
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.

import unicodedata
import codecs

def levenshtein(s1, s2):
   """Computes the Levenshtein distance (aka edit distance) between the two
   Unicode strings *s1* and *s2*.

   Signature:

      ``levenshtein(s1, s2)``

   This implementation differs from the plugin *edit_distance* in the Gamera
   core in two points:

   - the Gamera core function is implemented in C++ and currently only works
     with ASCII strings

   - this implementation is written in pure Python and therefore somewhat
     slower, but it works with Unicode strings.

   For details about the algorithm see
   http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
   """
   if len(s1) < len(s2):
      return levenshtein(s2, s1)
   if not s1:
      return len(s2)

   previous_row = xrange(len(s2) + 1)
   for i, c1 in enumerate(s1):
      current_row = [i + 1]
      for j, c2 in enumerate(s2):
         insertions = previous_row[j + 1] + 1 # j+1 instead of j since previous_row and current_row are one character longer
         deletions = current_row[j] + 1       # than s2
         substitutions = previous_row[j] + (c1 != c2)
         current_row.append(min(insertions, deletions, substitutions))
      previous_row = current_row

   return previous_row[-1]

def levenshtein_multi_unicode(str1, str2):
   #remove linebreaks
   str1 = str1.replace("\n", "")
   str2 = str2.replace("\n", "")
   #remove spaces
   str1 = str1.replace(" ", "")
   str2 = str2.replace(" ", "")
   str1_n = unicodedata.normalize("NFD", str1)
   str2_n = unicodedata.normalize("NFD", str2)
   return levenshtein(str1_n, str2_n), len(str1_n), len(str2_n)

def errorrate(groundtruth, ocr):
   """For the two given Unicode strings, the edit distance divided by the
   length of the first string is returned.

   Signature:

      ``errorrate(groundtruth, ocr)``
   """
   errorcount, gtlength, ocrlength = levenshtein_multi_unicode(groundtruth, ocr)
   rate = float(errorcount) / gtlength
   print "Errorcount: %d" % errorcount
   print "Characters in GT: %d" % gtlength
   print "Characters in OCR: %d" % ocrlength
   print "Error Rate: %.2f %%" % (rate * 100)
   #print "=%5d %5d %5d %3.2f" % (errorcount, gtlength, ocrlength, rate*100)
   return rate

greekocr-1.0.1/gamera/toolkits/greekocr/unicode_teubner.py 0000644 0001750 0001750 00000017420 11635564123 023315 0 ustar dalitz dalitz
#encoding: utf-8
# -*- mode: python; indent-tabs-mode: nil; tab-width: 3 -*-
# vim: set tabstop=3 shiftwidth=3 expandtab:

# Copyright (C) 2010-2011 Christian Brandt
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
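#
# Illustration (not part of the original module): the conversion in this
# file looks up accents by their Unicode names, lowercased and with spaces
# replaced by dots, and compares the sorted list of accent names against
# the entries of accentmap. The following is only a minimal sketch of that
# naming step, assuming Python 3 and NFD-normalized input; the helper name
# accent_keys is hypothetical and does not exist in the toolkit.

```python
import unicodedata

def accent_keys(text):
    """Split *text* into (base character, sorted accent keys) pairs."""
    pairs = []
    for ch in unicodedata.normalize("NFD", text):
        # unicodedata.name() takes a default for characters without a name
        key = unicodedata.name(ch, "").lower().replace(" ", ".")
        if "combining" in key:
            if pairs:  # a stray accent without a base character is ignored
                pairs[-1][1].append(key)
        else:
            pairs.append((ch, []))
    # sort the keys so they can be compared with a sorted lookup table
    return [(base, sorted(keys)) for base, keys in pairs]

if __name__ == "__main__":
    # epsilon with psili and oxia, followed by a plain theta
    for base, keys in accent_keys("\u1f14\u03b8"):
        print(unicodedata.name(base), keys)
```

# Sorting the keys makes the result independent of the order in which the
# combining characters appear in the decomposed string.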
import sys

charactermap = {
   "GREEK CAPITAL LETTER ALPHA": "A","GREEK CAPITAL LETTER BETA": "B",
   "GREEK CAPITAL LETTER GAMMA": "G","GREEK CAPITAL LETTER DELTA": "D",
   "GREEK CAPITAL LETTER EPSILON": "E","GREEK CAPITAL LETTER ZETA": "Z",
   "GREEK CAPITAL LETTER ETA": "H","GREEK CAPITAL LETTER THETA": "J",
   "GREEK CAPITAL LETTER IOTA": "I","GREEK CAPITAL LETTER KAPPA": "K",
   "GREEK CAPITAL LETTER LAMDA": "L","GREEK CAPITAL LETTER MU": "M",
   "GREEK CAPITAL LETTER NU": "N","GREEK CAPITAL LETTER XI": "X",
   "GREEK CAPITAL LETTER OMICRON": "O","GREEK CAPITAL LETTER PI": "P",
   "GREEK CAPITAL LETTER RHO": "R","GREEK CAPITAL LETTER SIGMA": "C",
   "GREEK CAPITAL LETTER TAU": "T","GREEK CAPITAL LETTER UPSILON": "U",
   "GREEK CAPITAL LETTER PHI": "F","GREEK CAPITAL LETTER CHI": "Q",
   "GREEK CAPITAL LETTER PSI": "Y","GREEK CAPITAL LETTER OMEGA": "W",
   "GREEK SMALL LETTER ALPHA": "a","GREEK SMALL LETTER BETA": "b",
   "GREEK SMALL LETTER GAMMA": "g","GREEK SMALL LETTER DELTA": "d",
   "GREEK SMALL LETTER EPSILON": "e","GREEK SMALL LETTER ZETA": "z",
   "GREEK SMALL LETTER ETA": "h","GREEK SMALL LETTER THETA": "j","GREEK THETA SYMBOL": "j",
   "GREEK SMALL LETTER IOTA": "i","GREEK SMALL LETTER KAPPA": "k",
   "GREEK SMALL LETTER LAMDA": "l","GREEK SMALL LETTER MU": "m",
   "GREEK SMALL LETTER NU": "n","GREEK SMALL LETTER XI": "x",
   "GREEK SMALL LETTER OMICRON": "o","GREEK SMALL LETTER PI": "p",
   "GREEK SMALL LETTER RHO": "r","GREEK SMALL LETTER FINAL SIGMA": "c",
   "GREEK SMALL LETTER SIGMA": "s","GREEK SMALL LETTER TAU": "t",
   "GREEK SMALL LETTER UPSILON": "u","GREEK SMALL LETTER PHI": "f",
   "GREEK SMALL LETTER CHI": "q","GREEK SMALL LETTER PSI": "y",
   "GREEK SMALL LETTER OMEGA": "w",
   "SPACE": " ", "FULL STOP": ".", "COMMA": ",", "HYPHEN-MINUS": "-"
}

# Each entry is [format string, sorted list of accent names]. The names are
# the Unicode character names in lower case with dots instead of spaces.
accentmap = [
   ['\\`%c', ['combining.grave.accent']],
   ['\\\'%c', ['combining.acute.accent']],
   ['\\~%c', ['combining.greek.perispomeni']],
   ['\\"%c', ['combining.diaeresis']],
   ['\\u{%c}', ['combining.breve']],
   ['\\U{%c%c}', ['combining.double.breve']],
   ['\\=%c', ['combining.overline']],
   ['\\r{%c}', ['combining.comma.above']],
   ['\\s{%c}', ['combining.reversed.comma.above']],
   ['\\Ad{%c}', ['combining.acute.accent', 'combining.diaeresis']],
   ['\\Gd{%c}', ['combining.diaeresis', 'combining.grave.accent']],
   ['\\Cd{%c}', ['combining.diaeresis', 'combining.greek.perispomeni']],
   ['\\Ar{%c}', ['combining.acute.accent', 'combining.reversed.comma.above']],
   ['\\Gr{%c}', ['combining.grave.accent', 'combining.reversed.comma.above']],
   ['\\Cr{%c}', ['combining.greek.perispomeni', 'combining.reversed.comma.above']],
   ['\\As{%c}', ['combining.acute.accent', 'combining.comma.above']],
   ['\\Gs{%c}', ['combining.comma.above', 'combining.grave.accent']],
   ['\\Cs{%c}', ['combining.comma.above', 'combining.greek.perispomeni']],
   ['\\c{%c}', ['combining.inverted.breve.below']],
   ['\\ut{%cw}', ['combining.double.breve.below']],
   ['\\Ab{%c}', ['combining.acute.accent', 'combining.breve']],
   ['\\Gb{%c}', ['combining.breve', 'combining.grave.accent']],
   ['\\Arb{%c}', ['combining.acute.accent', 'combining.breve', 'combining.reversed.comma.above']],
   ['\\Grb{%c}', ['combining.breve', 'combining.grave.accent', 'combining.reversed.comma.above']],
   ['\\Asb{%c}', ['combining.acute.accent', 'combining.breve', 'combining.comma.above']],
   ['\\Gsb{%c}', ['combining.breve', 'combining.comma.above', 'combining.grave.accent']],
   ['\\Am{%c}', ['combining.acute.accent', 'combining.overline']],
   ['\\Gm{%c}', ['combining.grave.accent', 'combining.overline']],
   ['\\Cm{%c}', ['combining.greek.perispomeni', 'combining.overline']],
   ['\\Arm{%c}', ['combining.acute.accent', 'combining.overline', 'combining.reversed.comma.above']],
   ['\\Grm{%c}', ['combining.grave.accent', 'combining.overline', 'combining.reversed.comma.above']],
   ['\\Crm{%c}', ['combining.greek.perispomeni', 'combining.overline', 'combining.reversed.comma.above']],
   ['\\Asm{%c}', ['combining.acute.accent', 'combining.comma.above', 'combining.overline']],
   ['\\Gsm{%c}', ['combining.comma.above', 'combining.grave.accent', 'combining.overline']],
   ['\\Csm{%c}', ['combining.comma.above', 'combining.greek.perispomeni', 'combining.overline']],
   ['\\Sm{%c}', ['combining.comma.above', 'combining.overline']],
   ['\\Rm{%c}', ['combining.overline', 'combining.reversed.comma.above']],
   ['\\iS{%c}', ['combining.greek.ypogegrammeni']],
   ['\\d{%c}', ['combining.dot.below']],
   ['\\bd{%c}', ['combining.breve', 'combining.diaeresis']],
   ['\\ring{%c}', ['combining.ring.below']]
]
accentmap.sort(key=lambda s: s[1])

def unicode_to_teubner(unicode_text):
   """Converts the given Unicode string to a LaTeX document body using the
   Teubner style for representing Greek characters and accents.

   Signature:

      ``unicode_to_teubner (unicode_text)``

   The returned LaTeX code does not contain the LaTeX header. To create a
   complete LaTeX document, you can use the following code:

   .. code:: Python

      # LaTeX header
      print "\documentclass[10pt]{article}"
      print "\usepackage[polutonikogreek]{babel}"
      print "\usepackage[or]{teubner}"
      print "\\\\begin{document}"
      print "\selectlanguage{greek}"

      # document body
      print unicode_to_teubner(unicode_string)

      # LaTeX footer
      print "\end{document}"
   """
   import unicodedata
   output = u""
   combinewith = []
   maincharacter = None
   i = 0
   # the extra pass with i == len(unicode_text) only flushes the last
   # pending character
   while i <= len(unicode_text):
      name_unicode = None
      separator = u""
      if i < len(unicode_text):
         character = unicode_text[i]
         try:
            name_unicode = unicodedata.name(character)
         except ValueError:
            # character without a Unicode name, e.g. a line break
            if character == "\n":
               separator = u" "
      name = (name_unicode or "").lower()
      name = name.replace(" ", ".")

      if name.find("combining") == -1: #non-combining character
         if maincharacter != None and len(combinewith) > 0:
            #do lookup
            combinewith.sort()
            for format,combination in accentmap:
               if combination == combinewith:
                  try:
                     output += format % charactermap[maincharacter]
                  except KeyError:
                     sys.stderr.write("Teubner: Unknown character '%s'\n" % maincharacter)
                  break
            maincharacter = None
            combinewith = []
         elif maincharacter != None:
            try:
               output += charactermap[maincharacter]
               maincharacter = None
            except KeyError:
               sys.stderr.write("Teubner: Unknown character '%s'\n" % maincharacter)
         # a line break becomes a space after the pending character
         output += separator
         maincharacter = name_unicode
      else: #combining character
         if maincharacter != None:
            combinewith.append(name)
         else:
            output += "e"
      i += 1
   return output

if __name__ == "__main__":
   import unicodedata
   teststr = u"ἔθαψε, ὡς οἰκὸς ἦν"
   print unicode_to_teubner(unicodedata.normalize("NFD", teststr))
   for a in accentmap:
      sort = sorted(a[1])
   #unicodedata.normalize(u"aäb", "NFD"))

greekocr-1.0.1/gamera/toolkits/greekocr/plugins/ 0000755 0001750 0001750 00000000000 11635564234 021251 5 ustar dalitz dalitz
greekocr-1.0.1/gamera/toolkits/greekocr/plugins/__init__.py 0000644 0001750 0001750 00000000207 11511157302 023344 0 ustar dalitz dalitz
# You need to have some sort of __init__.py file here
# in order to import modules in this directory.
# It is deliberately empty.

greekocr-1.0.1/gamera/toolkits/greekocr/greekocr.py 0000644 0001750 0001750 00000017676 11635564123 021761 0 ustar dalitz dalitz
# -*- mode: python; indent-tabs-mode: nil; tab-width: 3 -*-
# vim: set tabstop=3 shiftwidth=3 expandtab:

# Copyright (C) 2010-2011 Christian Brandt
#
# This program is free software; you can redistribute it and/or
# modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2
# of the License, or (at your option) any later version.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA.
from gamera.core import * init_gamera() from gamera import knn from gamera.plugins import pagesegmentation from gamera.plugins.pagesegmentation import textline_reading_order from gamera.classify import ShapedGroupingFunction from gamera.plugins.image_utilities import union_images from gamera.toolkits.ocr.ocr_toolkit import * from gamera.toolkits.ocr.classes import Textline,Page,ClassifyCCs from gamera.gamera_xml import glyphs_to_xml from gamera.knn_editing import edit_cnn from gamera.toolkits.greekocr.singlediacritics import * from gamera.toolkits.greekocr.wholisticdiacritics import * import unicodedata import codecs def clean_classifier(cknn): glyphs = cknn.get_glyphs() print "old %d" % len(glyphs) sorted_glyphs = sorted(glyphs, key=lambda g: g.to_rle()) new_glyphs = [] last_rle = sorted_glyphs[0].to_rle() new_glyphs.append(sorted_glyphs[0]) for i in range(1, len(sorted_glyphs) -1): if last_rle != sorted_glyphs[i].to_rle(): new_glyphs.append(sorted_glyphs[i]) else: print sorted_glyphs[i].get_main_id() print "new after removing duplicates: %d" % len(new_glyphs) cknn.set_glyphs(new_glyphs) cknn = edit_cnn(cknn) print "new after cnn: %d" % len(cknn.get_glyphs()) return cknn class GreekOCR(object): """Provides the functionality for GreekOCR. The following parameters control the recognition process: **cknn** The kNNInteractive classifier. **mode** The mode for dealing with accents. Can be ``wholistic`` or ``separatistic``. """ def __init__(self, mode="wholistic"): """Signature: ``init (mode="wholistic")`` where *mode* can be "wholistic" or "separatistic". """ self.optimizeknn = False self.debug = False self.cknn = knn.kNNInteractive([], ["aspect_ratio", "volume64regions", "moments", "nholes_extended"], 0) self.autogroup = None self.output = "" self.mode = mode def load_trainingdata(self, trainfile): """Loads the training data. Signature: ``load_trainingdata (trainfile)`` where *trainfile* is an Gamera XML file containing training data. 
Make sure that the training file matches the *mode* (wholistic or separatistic). """ self.cknn.from_xml_filename(trainfile) if self.optimizeknn: self.cknn = clean_classifier(self.cknn) def segment_page(self): if(self.mode == "separatistic"): self.page = SinglePage(self.img) else: self.page = WholisticPage(self.img) if self.debug: print "start page segmentation..." t = time.time() self.page.segment() if self.debug: t = time.time() - t print "\t segmentation done [",t,"sec]" def get_page_glyphs(self, image): """Returns a list of segmented CCs using the selected segmentation approach on the given image. This list can be used for creating training data. Signature: ``get_page_glyphs (image)`` where *image* is a Gamera image. """ if image.data.pixel_type != ONEBIT: image = image.to_onebit() self.img = image self.segment_page() glyphs = [] for line in self.page.textlines: for g in line.glyphs: glyphs.append(g) return glyphs def save_debug_images(self): """Saves the following images to the current working directory: **debug_lines.png** Has a frame drawn around each detected line. **debug_chars.png** Has a frame drawn around each detected character. **debug_words.png** Has a frame drawn around each detected word. 
""" rgbfilename = "debug_lines.png" rgb = self.page.show_lines() rgb.save_PNG(rgbfilename) print "file '%s' written" % rgbfilename rgbfilename = "debug_chars.png" rgb = self.page.show_glyphs() rgb.save_PNG(rgbfilename) print "file '%s' written" % rgbfilename rgbfilename = "debug_words.png" rgb = self.page.show_words() rgb.save_PNG(rgbfilename) print "file '%s' written" % rgbfilename def classify_text(self): self.output = "" for line in self.page.textlines: line.glyphs = \ self.cknn.classify_and_update_list_automatic(line.glyphs) line.sort_glyphs() self.output = self.output + line.to_string() + "\n" self.output = self._normalize(self.output) def get_text(self): return self.output def process_image(self, image): """Recognizes the given image and returns the recognized text as Unicode string. Signature: ``process_image (image)`` where *image* is a Gamera image. The recognized text is additionally stored in the ``GreekOCR`` property *output*, which can subsequently be written to a file with save_text_unicode_ or save_text_teubner_. Make sure that you have called load_trainingdata_ before! """ if image.data.pixel_type != ONEBIT: image = image.to_onebit() self.img = image if self.debug: print "Doing page Segmentation" self.segment_page() if self.debug: print "Classifying Text" self.classify_text() if self.debug: print "Returning Output" return self.get_text() def save_text_xetex(self, filename): data = \ '''\documentclass[11pt]{article} \usepackage{xltxtra} \setmainfont[Mapping=tex-text]{GFS Porson} \\begin{document} %s \end{document}''' % self.output.replace("\n", "\n\n") f = codecs.open(filename, "w", encoding='utf-8') f.write(data) f.close() def save_text_unicode(self, filename): """Stores the recognized text to the given *filename* as Unicode string. Signature ``save_text_unicode(filename)`` Make sure that you have called process_image_ before! 
""" f = codecs.open(filename, "w", encoding='utf-8') f.write(self.output) f.close() def save_text_teubner(self, filename): """Stores the recognized text to the given *filename* as a LaTeX document utilizing the Teubner style for representing Greek characters and accents. Signature ``save_text_teubner(filename)`` Make sure that you have called process_image_ before! """ from unicode_teubner import unicode_to_teubner data = ''' \documentclass[10pt]{article} \usepackage[polutonikogreek]{babel} \usepackage[or]{teubner} \\begin{document} \selectlanguage{greek} %s \end{document} ''' % unicode_to_teubner(self.output).replace("\n", "\n\n") f = codecs.open(filename, "w", encoding='utf-8') f.write(data) f.close() def _normalize(self,str): str = unicodedata.normalize("NFD", str) output = u"" combined = [] for i in str: is_combining = True try: is_combining = unicodedata.combining(i) > 0 or unicodedata.name(i).find("ACCENT") >= 0 except: is_combining = False if not is_combining: for j in sorted(combined): output += j combined = [] output += i else: combined.append(i) if len(combined) > 0: for j in sorted(combined): output += j return unicodedata.normalize("NFD", output) greekocr-1.0.1/gamera/toolkits/greekocr/wholisticdiacritics.py 0000644 0001750 0001750 00000014643 11635564123 024213 0 ustar dalitz dalitz # -*- mode: python; indent-tabs-mode: nil; tab-width: 3 -*- # vim: set tabstop=3 shiftwidth=3 expandtab: # Copyright (C) 2010-2011 Christian Brandt # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. 
# # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. from gamera.core import * from gamera.plugins.pagesegmentation import textline_reading_order from gamera.toolkits.ocr.ocr_toolkit import * from gamera.toolkits.ocr.classes import Textline,Page,ClassifyCCs import gamera.kdtree as kdtree import unicodedata import sys class WholisticPage(Page): def __init__(self, img): self.img = img #cknn = knn.kNNInteractive([], ["aspect_ratio", "volume64regions", "moments", "nholes_extended"], 0) #cknn.from_xml_filename("x01/classifier-all-2/classifier_glyphs.xml") #if(opt.ccsfilter): # the_ccs = ccs #else: the_ccs = img.cc_analysis() self.median_cc = int(median([cc.nrows for cc in the_ccs])) #autogroup = ClassifyCCs(cknn) #autogroup.parts_to_group = 3 #autogroup.grouping_distance = max([2,median_cc / 8]) Page.__init__(self, img)#, classify_ccs=autogroup) #print "autogrouping glyphs activated." #print "maximal autogroup distance:", autogroup.grouping_distance def lines_to_chars(self): self.textlines = self.get_line_glyphs(self.img, self.ccs_lines) def check_glyph_greek_accent(self, item,glyph): remove = [] add = [] result = [] if((glyph.ul_x == item.ul_x and glyph.ul_y == item.ul_y and glyph.lr_x == item.lr_x and glyph.lr_y == item.lr_y) or \ glyph.intersects_x(item) or \ (glyph.distance_bb(item) < 3 and \ (glyph.distance_cx(item) < (self.median_cc / 2) or 2*glyph.nrows < item.nrows or 2*item.nrows < glyph.nrows ) \ )\ ): ##nebeinander? 
# print "y" remove.append(glyph) remove.append(item) new = union_images([item,glyph]) add.append(new) result.append(add) #result[0] == ADD result.append(remove) #result[1] == REMOVE return result def get_line_glyphs(self,image,textlines): i=0 show = [] lines = [] ret,sub_ccs = image.sub_cc_analysis(textlines) #print "doc has %d lines" % len(sub_ccs) linenumber = 0 for ccs in sub_ccs: linenumber = linenumber + 1 #print "line %d" % linenumber line_bbox = Rect(textlines[i]) i = i + 1 glyphs = ccs[:] newlist = [] remove = [] add = [] result = [] glyphs.sort(lambda x,y: cmp(x.ul_x, y.ul_x)) #print "first run" for position, item in enumerate(glyphs): olditem = item left = max(0,position - 5) right = min(position + 5, len(glyphs)) checklist = glyphs[left:right] for glyph in checklist: if(item == glyph): continue result = self.check_glyph_greek_accent(item,glyph) if(len(result[0]) > 0): #something has been joind... item = result[0][0] #add.append(result[0][0]) #joind glyph remove.append(result[1][0]) #first part of joind one remove.append(result[1][1]) #second part of joind one if olditem != item: add.append(item) for elem in remove: if(elem in glyphs): glyphs.remove(elem) for elem in add: glyphs.append(elem) remove = [] add = [] glyphs = textline_reading_order(glyphs) glyphs = list(set(glyphs)) #print len(glyphs) new_line = WholisticTextline(line_bbox) final = [] if(len(glyphs) > 0): for glyph in glyphs: final.append(glyph) new_line.add_glyphs(final,False) #new_line.sort_glyphs() #reading order -- from left to right lines.append(new_line) for glyph in glyphs: show.append(glyph) return lines class WholisticTextline(Textline): #called after classification def sort_glyphs(self): self.glyphs = textline_reading_order(self.glyphs) #begin calculating threshold for word-spacing spacelist = [] for i in range(len(self.glyphs) - 1): spacelist.append(self.glyphs[i + 1].ul_x - self.glyphs[i].lr_x) if(len(spacelist) > 0): threshold = median(spacelist) threshold = threshold * 2.0 else: 
threshold = 0 #end calculatin threshold for word-spacing self.words = chars_make_words(self.glyphs, threshold) def to_string(self): k = 3 max_k = 10 output = u"" for word in self.words: characters = [] skipids = ["manual.xi.upper", "manual.xi.lower", "manual.theta.outer"] for glyph in word: mainid = glyph.get_main_id() if mainid == "comma" or mainid == "combining.comma.above": #print "%s - center_y: %d - med_center: %d" % (mainid, glyph.center.y, med_center) if glyph.center.y > self.bbox.center.y: glyph.classify_automatic("comma") else: glyph.classify_automatic("combining.comma.above") mainid = glyph.get_main_id() mainid = mainid.split(".and.") for a in mainid: char = return_char(a) #print "added %s to output" % char output = output + char#unicodedata.normalize('NFD', char) output = output + " " return output greekocr-1.0.1/gamera/toolkits/greekocr/__init__.py 0000644 0001750 0001750 00000002505 11635564123 021700 0 ustar dalitz dalitz # -*- mode: python; indent-tabs-mode: nil; tab-width: 3 -*- # vim: set tabstop=3 shiftwidth=3 expandtab: # Copyright (C) 2010-2011 Christian Brandt # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. """ Toolkit setup This file is run on importing anything within this directory. Its purpose is only to help with the Gamera GUI shell, and may be omitted if you are not concerned with that. 
""" from gamera import toolkit from gamera.toolkits.greekocr import compare from gamera.toolkits.greekocr.greekocr import * # Let's import all our plugins here so that when this toolkit # is imported using the "Toolkit" menu in the Gamera GUI # everything works. greekocr-1.0.1/PKG-INFO 0000644 0001750 0001750 00000000510 11635564234 013754 0 ustar dalitz dalitz Metadata-Version: 1.0 Name: greekocr Version: 1.0.1 Summary: An addon Greek OCR toolkit for the Gamera framework for document analysis and recognition. Home-page: http://gamera.sourceforge.net/ Author: Christian Brandt and Christoph Dalitz Author-email: UNKNOWN License: GNU GPL version 2 Description: UNKNOWN Platform: UNKNOWN greekocr-1.0.1/doc/ 0000755 0001750 0001750 00000000000 11635564234 013430 5 ustar dalitz dalitz greekocr-1.0.1/doc/html/ 0000755 0001750 0001750 00000000000 11635564234 014374 5 ustar dalitz dalitz greekocr-1.0.1/doc/html/functions.html 0000644 0001750 0001750 00000012562 11535422120 017262 0 ustar dalitz dalitz
Last modified: March 08, 2011
The toolkit defines a number of free functions that are not image methods. They are defined in different modules and can be imported in a Python script with
from gamera.toolkits.greekocr.compare import levenshtein, errorrate
from gamera.toolkits.greekocr.unicode_teubner import unicode_to_teubner
Converts the given Unicode string into a LaTeX document body, using the Teubner style for representing Greek characters and accents. Signature:
unicode_to_teubner (unicode_text)
The returned LaTeX code does not contain the LaTeX header. To create a complete LaTeX document, you can use the following code:
# LaTeX header
print "\documentclass[10pt]{article}"
print "\usepackage[polutonikogreek]{babel}"
print "\usepackage[or]{teubner}"
print "\\begin{document}"
print "\selectlanguage{greek}"
# document body
print unicode_to_teubner(unicode_string)
# LaTeX footer
print "\end{document}"
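The toolkit's own self-test passes the text through unicodedata.normalize("NFD", ...) before calling unicode_to_teubner, and the function looks accents up by the lower-cased, dot-separated Unicode names of the combining characters. The following is a minimal sketch of that decomposition step only. The helper name combining_names is hypothetical and is not part of the toolkit.

```python
from __future__ import print_function
import unicodedata

def combining_names(text):
    """Pair each base character with the sorted, dotted names of its
    combining marks (hypothetical helper, not part of the toolkit)."""
    groups = []
    for ch in unicodedata.normalize("NFD", text):
        # e.g. "COMBINING ACUTE ACCENT" -> "combining.acute.accent"
        name = unicodedata.name(ch, "").lower().replace(" ", ".")
        if "combining" in name and groups:
            groups[-1][1].append(name)
        else:
            groups.append((ch, []))
    return [(base, sorted(marks)) for base, marks in groups]

# U+1F14 is GREEK SMALL LETTER EPSILON WITH PSILI AND OXIA
for base, marks in combining_names(u"\u1f14"):
    print(repr(base), marks)
```

For this character the sorted list is combining.acute.accent followed by combining.comma.above, which is the key the toolkit's accent table associates with the \As{...} macro.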
Computes the Levenshtein distance (aka edit distance) between the two Unicode strings s1 and s2. Signature:
levenshtein(s1, s2)
This implementation differs from the plugin edit_distance in the Gamera core in two respects:
- the Gamera core function is implemented in C++ and currently only works with ASCII strings
- this implementation is written in pure Python and therefore somewhat slower, but it works with Unicode strings.
For details about the algorithm see http://en.wikibooks.org/wiki/Algorithm_implementation/Strings/Levenshtein_distance
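To illustrate the kind of computation involved, here is a minimal pure-Python sketch of the edit distance using the usual two-row dynamic programming scheme. It is not the toolkit's code. The name levenshtein_sketch is an assumption chosen to avoid confusion with the real levenshtein function.

```python
def levenshtein_sketch(s1, s2):
    """Edit distance between two (Unicode) strings, computed row by row.
    Illustrative only; not the toolkit's implementation."""
    if len(s1) < len(s2):
        s1, s2 = s2, s1                      # keep the rows as short as possible
    previous = list(range(len(s2) + 1))      # distances from the empty prefix of s1
    for i, c1 in enumerate(s1, 1):
        current = [i]
        for j, c2 in enumerate(s2, 1):
            insert_cost = current[j - 1] + 1
            delete_cost = previous[j] + 1
            replace_cost = previous[j - 1] + (c1 != c2)
            current.append(min(insert_cost, delete_cost, replace_cost))
        previous = current
    return previous[-1]

# Precomposed omicron with tonos (U+03CC) against plain omicron (U+03BF)
print(levenshtein_sketch(u"\u03bb\u03cc\u03b3\u03bf\u03c2",
                         u"\u03bb\u03bf\u03b3\u03bf\u03c2"))
```

Because the comparison works on code points, a precomposed accented letter and its decomposed form count as different strings, so both inputs should use the same normalization form.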
Returns the edit distance between the two given Unicode strings divided by the length of the first string (the ground truth). Signature:
errorrate(groundtruth, ocr)
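The definition above can be sketched as follows. This is an illustration under stated assumptions and not the toolkit's code: _edit_distance is a compact stand-in for levenshtein, errorrate_sketch is a hypothetical name, and raising ValueError for an empty ground truth is a choice made for this sketch only.

```python
from __future__ import print_function

def _edit_distance(a, b):
    """Compact stand-in for the toolkit's levenshtein() (assumption)."""
    row = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        new_row = [i]
        for j, cb in enumerate(b, 1):
            new_row.append(min(new_row[j - 1] + 1,        # insertion
                               row[j] + 1,                # deletion
                               row[j - 1] + (ca != cb)))  # substitution
        row = new_row
    return row[-1]

def errorrate_sketch(groundtruth, ocr):
    """Edit distance divided by the length of the ground truth."""
    if not groundtruth:
        # Handling of an empty ground truth is an assumption of this sketch.
        raise ValueError("groundtruth must not be empty")
    return _edit_distance(groundtruth, ocr) / float(len(groundtruth))

print(errorrate_sketch(u"abcd", u"abxd"))
print(errorrate_sketch(u"ab", u"abcdef"))
```

Since the divisor is the length of the ground truth only, the value can exceed 1 when the OCR output is much longer than the ground truth, as in the second call.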