ocr-1.0.6/ 0000755 0001750 0001750 00000000000 11716212264 011643 5 ustar dalitz dalitz ocr-1.0.6/MANIFEST.in 0000644 0001750 0001750 00000000725 11716171513 013406 0 ustar dalitz dalitz recursive-include src *.cpp *.c *.h makefile.* *.hpp *.hxx *.cxx *.txt ANNOUNCE CHANGES INSTALL KNOWN_BUGS LICENSE README TODO recursive-include include *.cpp *.c *.h makefile.* *.hpp *.hxx *.cxx *.txt ANNOUNCE CHANGES INSTALL KNOWN_BUGS LICENSE README TODO recursive-include scripts ocr4gamera include ACKNOWLEDGEMENTS CHANGES TODO INSTALL LICENSE README KNOWN_BUGS MANIFEST.in version recursive-include doc *.txt *.html *.css *.py *.jpg *.jpeg *.png *.gif *.fig ocr-1.0.6/LICENSE 0000644 0001750 0001750 00000035423 11716171513 012660 0 ustar dalitz dalitz GNU GENERAL PUBLIC LICENSE Version 2, June 1991 Copyright (C) 1989, 1991 Free Software Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA Everyone is permitted to copy and distribute verbatim copies of this license document, but changing it is not allowed. Preamble The licenses for most software are designed to take away your freedom to share and change it. By contrast, the GNU General Public License is intended to guarantee your freedom to share and change free software--to make sure the software is free for all its users. This General Public License applies to most of the Free Software Foundation's software and to any other program whose authors commit to using it. (Some other Free Software Foundation software is covered by the GNU Library General Public License instead.) You can apply it to your programs, too. When we speak of free software, we are referring to freedom, not price. 
Our General Public Licenses are designed to make sure that you have the freedom to distribute copies of free software (and charge for this service if you wish), that you receive source code or can get it if you want it, that you can change the software or use pieces of it in new free programs; and that you know you can do these things. To protect your rights, we need to make restrictions that forbid anyone to deny you these rights or to ask you to surrender the rights. These restrictions translate to certain responsibilities for you if you distribute copies of the software, or if you modify it. For example, if you distribute copies of such a program, whether gratis or for a fee, you must give the recipients all the rights that you have. You must make sure that they, too, receive or can get the source code. And you must show them these terms so they know their rights. We protect your rights with two steps: (1) copyright the software, and (2) offer you this license which gives you legal permission to copy, distribute and/or modify the software. Also, for each author's protection and ours, we want to make certain that everyone understands that there is no warranty for this free software. If the software is modified by someone else and passed on, we want its recipients to know that what they have is not the original, so that any problems introduced by others will not reflect on the original authors' reputations. Finally, any free program is threatened constantly by software patents. We wish to avoid the danger that redistributors of a free program will individually obtain patent licenses, in effect making the program proprietary. To prevent this, we have made it clear that any patent must be licensed for everyone's free use or not licensed at all. The precise terms and conditions for copying, distribution and modification follow. GNU GENERAL PUBLIC LICENSE TERMS AND CONDITIONS FOR COPYING, DISTRIBUTION AND MODIFICATION 0. 
This License applies to any program or other work which contains a notice placed by the copyright holder saying it may be distributed under the terms of this General Public License. The "Program", below, refers to any such program or work, and a "work based on the Program" means either the Program or any derivative work under copyright law: that is to say, a work containing the Program or a portion of it, either verbatim or with modifications and/or translated into another language. (Hereinafter, translation is included without limitation in the term "modification".) Each licensee is addressed as "you". Activities other than copying, distribution and modification are not covered by this License; they are outside its scope. The act of running the Program is not restricted, and the output from the Program is covered only if its contents constitute a work based on the Program (independent of having been made by running the Program). Whether that is true depends on what the Program does. 1. You may copy and distribute verbatim copies of the Program's source code as you receive it, in any medium, provided that you conspicuously and appropriately publish on each copy an appropriate copyright notice and disclaimer of warranty; keep intact all the notices that refer to this License and to the absence of any warranty; and give any other recipients of the Program a copy of this License along with the Program. You may charge a fee for the physical act of transferring a copy, and you may at your option offer warranty protection in exchange for a fee. 2. You may modify your copy or copies of the Program or any portion of it, thus forming a work based on the Program, and copy and distribute such modifications or work under the terms of Section 1 above, provided that you also meet all of these conditions: a) You must cause the modified files to carry prominent notices stating that you changed the files and the date of any change. 
b) You must cause any work that you distribute or publish, that in whole or in part contains or is derived from the Program or any part thereof, to be licensed as a whole at no charge to all third parties under the terms of this License. c) If the modified program normally reads commands interactively when run, you must cause it, when started running for such interactive use in the most ordinary way, to print or display an announcement including an appropriate copyright notice and a notice that there is no warranty (or else, saying that you provide a warranty) and that users may redistribute the program under these conditions, and telling the user how to view a copy of this License. (Exception: if the Program itself is interactive but does not normally print such an announcement, your work based on the Program is not required to print an announcement.) These requirements apply to the modified work as a whole. If identifiable sections of that work are not derived from the Program, and can be reasonably considered independent and separate works in themselves, then this License, and its terms, do not apply to those sections when you distribute them as separate works. But when you distribute the same sections as part of a whole which is a work based on the Program, the distribution of the whole must be on the terms of this License, whose permissions for other licensees extend to the entire whole, and thus to each and every part regardless of who wrote it. Thus, it is not the intent of this section to claim rights or contest your rights to work written entirely by you; rather, the intent is to exercise the right to control the distribution of derivative or collective works based on the Program. In addition, mere aggregation of another work not based on the Program with the Program (or with a work based on the Program) on a volume of a storage or distribution medium does not bring the other work under the scope of this License. 3. 
You may copy and distribute the Program (or a work based on it, under Section 2) in object code or executable form under the terms of Sections 1 and 2 above provided that you also do one of the following: a) Accompany it with the complete corresponding machine-readable source code, which must be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, b) Accompany it with a written offer, valid for at least three years, to give any third party, for a charge no more than your cost of physically performing source distribution, a complete machine-readable copy of the corresponding source code, to be distributed under the terms of Sections 1 and 2 above on a medium customarily used for software interchange; or, c) Accompany it with the information you received as to the offer to distribute corresponding source code. (This alternative is allowed only for noncommercial distribution and only if you received the program in object code or executable form with such an offer, in accord with Subsection b above.) The source code for a work means the preferred form of the work for making modifications to it. For an executable work, complete source code means all the source code for all modules it contains, plus any associated interface definition files, plus the scripts used to control compilation and installation of the executable. However, as a special exception, the source code distributed need not include anything that is normally distributed (in either source or binary form) with the major components (compiler, kernel, and so on) of the operating system on which the executable runs, unless that component itself accompanies the executable. 
If distribution of executable or object code is made by offering access to copy from a designated place, then offering equivalent access to copy the source code from the same place counts as distribution of the source code, even though third parties are not compelled to copy the source along with the object code. 4. You may not copy, modify, sublicense, or distribute the Program except as expressly provided under this License. Any attempt otherwise to copy, modify, sublicense or distribute the Program is void, and will automatically terminate your rights under this License. However, parties who have received copies, or rights, from you under this License will not have their licenses terminated so long as such parties remain in full compliance. 5. You are not required to accept this License, since you have not signed it. However, nothing else grants you permission to modify or distribute the Program or its derivative works. These actions are prohibited by law if you do not accept this License. Therefore, by modifying or distributing the Program (or any work based on the Program), you indicate your acceptance of this License to do so, and all its terms and conditions for copying, distributing or modifying the Program or works based on it. 6. Each time you redistribute the Program (or any work based on the Program), the recipient automatically receives a license from the original licensor to copy, distribute or modify the Program subject to these terms and conditions. You may not impose any further restrictions on the recipients' exercise of the rights granted herein. You are not responsible for enforcing compliance by third parties to this License. 7. If, as a consequence of a court judgment or allegation of patent infringement or for any other reason (not limited to patent issues), conditions are imposed on you (whether by court order, agreement or otherwise) that contradict the conditions of this License, they do not excuse you from the conditions of this License. 
If you cannot distribute so as to satisfy simultaneously your obligations under this License and any other pertinent obligations, then as a consequence you may not distribute the Program at all. For example, if a patent license would not permit royalty-free redistribution of the Program by all those who receive copies directly or indirectly through you, then the only way you could satisfy both it and this License would be to refrain entirely from distribution of the Program. If any portion of this section is held invalid or unenforceable under any particular circumstance, the balance of the section is intended to apply and the section as a whole is intended to apply in other circumstances. It is not the purpose of this section to induce you to infringe any patents or other property right claims or to contest validity of any such claims; this section has the sole purpose of protecting the integrity of the free software distribution system, which is implemented by public license practices. Many people have made generous contributions to the wide range of software distributed through that system in reliance on consistent application of that system; it is up to the author/donor to decide if he or she is willing to distribute software through any other system and a licensee cannot impose that choice. This section is intended to make thoroughly clear what is believed to be a consequence of the rest of this License. 8. If the distribution and/or use of the Program is restricted in certain countries either by patents or by copyrighted interfaces, the original copyright holder who places the Program under this License may add an explicit geographical distribution limitation excluding those countries, so that distribution is permitted only in or among countries not thus excluded. In such case, this License incorporates the limitation as if written in the body of this License. 9. 
The Free Software Foundation may publish revised and/or new versions of the General Public License from time to time. Such new versions will be similar in spirit to the present version, but may differ in detail to address new problems or concerns. Each version is given a distinguishing version number. If the Program specifies a version number of this License which applies to it and "any later version", you have the option of following the terms and conditions either of that version or of any later version published by the Free Software Foundation. If the Program does not specify a version number of this License, you may choose any version ever published by the Free Software Foundation. 10. If you wish to incorporate parts of the Program into other free programs whose distribution conditions are different, write to the author to ask for permission. For software which is copyrighted by the Free Software Foundation, write to the Free Software Foundation; we sometimes make exceptions for this. Our decision will be guided by the two goals of preserving the free status of all derivatives of our free software and of promoting the sharing and reuse of software generally. NO WARRANTY 11. BECAUSE THE PROGRAM IS LICENSED FREE OF CHARGE, THERE IS NO WARRANTY FOR THE PROGRAM, TO THE EXTENT PERMITTED BY APPLICABLE LAW. EXCEPT WHEN OTHERWISE STATED IN WRITING THE COPYRIGHT HOLDERS AND/OR OTHER PARTIES PROVIDE THE PROGRAM "AS IS" WITHOUT WARRANTY OF ANY KIND, EITHER EXPRESSED OR IMPLIED, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE. THE ENTIRE RISK AS TO THE QUALITY AND PERFORMANCE OF THE PROGRAM IS WITH YOU. SHOULD THE PROGRAM PROVE DEFECTIVE, YOU ASSUME THE COST OF ALL NECESSARY SERVICING, REPAIR OR CORRECTION. 12. 
IN NO EVENT UNLESS REQUIRED BY APPLICABLE LAW OR AGREED TO IN WRITING WILL ANY COPYRIGHT HOLDER, OR ANY OTHER PARTY WHO MAY MODIFY AND/OR REDISTRIBUTE THE PROGRAM AS PERMITTED ABOVE, BE LIABLE TO YOU FOR DAMAGES, INCLUDING ANY GENERAL, SPECIAL, INCIDENTAL OR CONSEQUENTIAL DAMAGES ARISING OUT OF THE USE OR INABILITY TO USE THE PROGRAM (INCLUDING BUT NOT LIMITED TO LOSS OF DATA OR DATA BEING RENDERED INACCURATE OR LOSSES SUSTAINED BY YOU OR THIRD PARTIES OR A FAILURE OF THE PROGRAM TO OPERATE WITH ANY OTHER PROGRAMS), EVEN IF SUCH HOLDER OR OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGES. END OF TERMS AND CONDITIONS ocr-1.0.6/setup.py 0000700 0001750 0001750 00000002774 11716171513 013361 0 ustar dalitz dalitz #!/usr/bin/env python from distutils.core import setup, Extension from gamera import gamera_setup # Some meta data of the toolkit TOOLKIT_NAME = "ocr" VERSION = open("version", 'r').readlines()[0].strip() AUTHOR = "Rene Baston and Christoph Dalitz" HOMEPAGE = "http://gamera.sourceforge.net/" DESCRIPTION = "An addon OCR toolkit for the Gamera framework for document analysis and recognition." LICENSE = "GNU GPL version 2" # ---------------------------------------------------------------------------- # You should not usually have to edit anything below, but it is # implemented here and not in the Gamera core so that you can edit it # if you need to do something more complicated (for example, building # and linking to a third- party library). # ---------------------------------------------------------------------------- PLUGIN_PATH = 'gamera/toolkits/%s/plugins/' % TOOLKIT_NAME PACKAGE = 'gamera.toolkits.%s' % TOOLKIT_NAME PLUGIN_PACKAGE = PACKAGE + ".plugins" plugins = gamera_setup.get_plugin_filenames(PLUGIN_PATH) plugin_extensions = gamera_setup.generate_plugins(plugins, PLUGIN_PACKAGE) # This is a standard distutils setup initializer. If you need to do # anything more complex here, refer to the Python distutils documentation. 
setup(name=TOOLKIT_NAME, version=VERSION, license=LICENSE, url=HOMEPAGE, author=AUTHOR, description=DESCRIPTION, ext_modules = plugin_extensions, packages = [PACKAGE, PLUGIN_PACKAGE], scripts = ['scripts/ocr4gamera.py']) ocr-1.0.6/gamera/ 0000755 0001750 0001750 00000000000 11716212264 013077 5 ustar dalitz dalitz ocr-1.0.6/gamera/toolkits/ 0000755 0001750 0001750 00000000000 11716212264 014747 5 ustar dalitz dalitz ocr-1.0.6/gamera/toolkits/ocr/ 0000755 0001750 0001750 00000000000 11716212264 015532 5 ustar dalitz dalitz ocr-1.0.6/gamera/toolkits/ocr/ocr_toolkit.py 0000644 0001750 0001750 00000036324 11716210324 020437 0 ustar dalitz dalitz # # Copyright (C) 2009-2010 Rene Baston, Christoph Dalitz # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. # from gamera.core import * init_gamera() from gamera import knn from gamera.plugins import pagesegmentation from gamera.classify import ShapedGroupingFunction from gamera.plugins.image_utilities import union_images from gamera.plugins.listutilities import median from gamera.toolkits.ocr.classes import Textline import unicodedata import sys import time def return_char(unicode_str, extra_chars_dict={}): """Converts a unicode character name to a unicode symbol. 
Signature: ``return_char (classname, extra_chars_dict={})`` with *classname*: A class name derived from a unicode character name. Example: ``latin.small.letter.a`` returns the character ``a``. *extra_chars_dict* A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. The character 'code' does not need to be an actual code, but can be any string. This can be useful, e.g. for ligatures: .. code:: Python return_char(glyph.get_main_id(), {'latin.small.ligature.st':'st'}) When *classname* is not listed in *extra_chars_dict*, it must correspond to a `standard unicode character name`_, as in the examples of the following table: .. _`standard unicode character name`: http://www.unicode.org/charts/ +-----------+----------------------------+----------------------------+ | Character | Unicode Name | Class Name | +===========+============================+============================+ | ``!`` | ``EXCLAMATION MARK`` | ``exclamation.mark`` | +-----------+----------------------------+----------------------------+ | ``2`` | ``DIGIT TWO`` | ``digit.two`` | +-----------+----------------------------+----------------------------+ | ``A`` | ``LATIN CAPITAL LETTER A`` | ``latin.capital.letter.a`` | +-----------+----------------------------+----------------------------+ | ``a`` | ``LATIN SMALL LETTER A`` | ``latin.small.letter.a`` | +-----------+----------------------------+----------------------------+ """ if len(extra_chars_dict) > 0: try: return extra_chars_dict[unicode_str] except KeyError: pass name = unicode_str.upper() # some xml-files might be corrupted due to wrong grouping if name.startswith('_GROUP.'): name = name[len('_GROUP.'):] if name.startswith('_PART.'): name = name[len('_PART.'):] name = name.replace(".", " ") try: return unicodedata.lookup(name) except KeyError: strings = unicode_str.split(".") if(strings[0] == "collated"): return strings[1] if(strings[0] == "cursive"): return 
return_char(unicode_str[8:]) else: print "ERROR: Name not found:", name return "" def chars_make_words(lines_glyphs,threshold=None): """Groups the given glyphs into words based upon the horizontal distance between adjacent glyphs. Signature: ``chars_make_words (glyphs, threshold=None)`` with *glyphs*: A list of ``Cc`` data types, each of which represents a character. All glyphs must stem from the same single line of text. *threshold*: Horizontal white space greater than *threshold* will be considered a word-separating gap. When ``None``, the threshold value is calculated automatically as 2.5 times the median white space between adjacent glyphs. The result is a nested list of glyphs with each sublist representing a word. This is the same data structure as used in `Textline.words`_ .. _`Textline.words`: gamera.toolkits.ocr.classes.Textline.html """ glyphs = lines_glyphs[:] wordlist = [] if(threshold == None): spacelist = [] total_space = 0 for i in range(len(glyphs) - 1): spacelist.append(glyphs[i + 1].ul_x - glyphs[i].lr_x) if(len(spacelist) > 0): threshold = median(spacelist) threshold = threshold * 2.5 else: threshold = 0 word = [] for i in range(len(glyphs)): if i > 0: if((glyphs[i].ul_x - glyphs[i - 1].lr_x) > threshold): wordlist.append(word) word = [] word.append(glyphs[i]) if(len(word) > 0): wordlist.append(word) return wordlist def __char_touches_top(glyph, line): """Returns true when the top of the character is close to the top of the line.""" #if glyph.ul_y < line.bbox.center_y-(line.bbox.nrows/4): if glyph.ul_y <= line.bbox.ul_y+(line.bbox.nrows/5): return True else: return False def textline_to_string(line, heuristic_rules="roman", extra_chars_dict={}): """Returns a unicode string of the text in the given ``Textline``. Signature: ``textline_to_string (textline, heuristic_rules="roman", extra_chars_dict={})`` with *textline*: A ``Textline`` object containing the glyphs. The glyphs must already be classified. 
*heuristic_rules*: Depending on the alphabet, some characters can be very similar and need further heuristic rules for disambiguation, like apostrophe and comma, which have the same shape and only differ in their position relative to the baseline. When set to \"roman\", several rules specific to Latin alphabets are applied. *extra_chars_dict* A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. Will be passed to `return_char`_. As this function uses `return_char`_, the class names of the glyphs in *textline* must correspond to unicode character names, as described in the documentation of `return_char`_. .. _`return_char`: #return-char """ wordlist = line.words s = "" char = "" for i in range(len(wordlist)): if(i): s = s + " " for glyph in wordlist[i]: char = return_char(glyph.get_main_id(), extra_chars_dict) if (heuristic_rules == "roman"): # disambiguation of similar roman characters if (char == "x" or char == "X"): if __char_touches_top(glyph, line): glyph.classify_heuristic("latin.capital.letter.x") else: glyph.classify_heuristic("latin.small.letter.x") char = return_char(glyph.get_main_id()) if (char == "p" or char == "P"): if __char_touches_top(glyph, line): glyph.classify_heuristic("latin.capital.letter.p") else: glyph.classify_heuristic("latin.small.letter.p") char = return_char(glyph.get_main_id()) if (char == "o" or char == "O"): if __char_touches_top(glyph, line): glyph.classify_heuristic("latin.capital.letter.o") else: glyph.classify_heuristic("latin.small.letter.o") char = return_char(glyph.get_main_id()) if (char == "w" or char == "W"): if __char_touches_top(glyph, line): glyph.classify_heuristic("latin.capital.letter.w") else: glyph.classify_heuristic("latin.small.letter.w") char = return_char(glyph.get_main_id()) if (char == "v" or char == "V"): if __char_touches_top(glyph, line): glyph.classify_heuristic("latin.capital.letter.v") else: 
glyph.classify_heuristic("latin.small.letter.v") char = return_char(glyph.get_main_id()) if (char == "z" or char == "Z"): if __char_touches_top(glyph, line): glyph.classify_heuristic("latin.capital.letter.z") else: glyph.classify_heuristic("latin.small.letter.z") char = return_char(glyph.get_main_id()) if (char == "s" or char == "S"): # not for long s if (glyph.get_main_id().upper() != "LATIN.SMALL.LETTER.LONG.S"): if __char_touches_top(glyph, line): glyph.classify_heuristic("latin.capital.letter.s") else: glyph.classify_heuristic("latin.small.letter.s") char = return_char(glyph.get_main_id()) #if(char == "T" and (float(glyph.nrows)/float(glyph.ncols)) > 1.5): # glyph.classify_heuristic("LATIN SMALL LETTER F") # char = return_char(glyph.get_main_id()) if (char == "'" or char == ","): if (glyph.ul_y < line.bbox.center_y): glyph.classify_heuristic("APOSTROPHE") char = "'" else: glyph.classify_heuristic("COMMA") char = "," s = s + char return s def check_upper_neighbors(item,glyph,line): """Checks for small signs grouped beside each other, like quotation marks. Signature: ``check_upper_neighbors(item,glyph,line)`` with *item*: Some connected component. *glyph*: Some connected component. *line*: The ``Textline`` object which includes ``item`` and ``glyph``. Returns an array with two elements. The first element is a list of characters (images that have been united into a single image) and the second element is a list of characters which have to be removed because they have been united into a single character. 
""" remove = [] add = [] result = [] minheight = min([item.nrows,glyph.nrows]) # glyphs must be small, of similar size and on the same height if(not(glyph.lr_y >= line.center_y and glyph.lr_y-(glyph.nrows/3) <= line.lr_y)): if (glyph.contains_y(item.center_y) and item.contains_y(glyph.center_y)): minwidth = min([item.ncols,glyph.ncols]) distance = item.lr_x - glyph.lr_x if(distance > 0 and distance <= minwidth*3): remove.append(item) remove.append(glyph) new = union_images([item,glyph]) add.append(new) result.append(add) #result[0] == ADD result.append(remove) #result[1] == REMOVE return result def check_glyph_accent(item,glyph): """Checks whether two glyphs should be grouped into one single character. This function is for uniting connected components, as in i, j or colon. Signature: ``check_glyph_accent(item,glyph)`` with *item*: Some connected component. *glyph*: Some connected component. Returns an array with two elements. The first element is a list of characters (images that have been united into a single image) and the second element is a list of characters which have to be removed because they have been united into a single character. """ remove = [] add = [] result = [] if(glyph.contains_x(item.ul_x) or glyph.contains_x(item.lr_x) or glyph.contains_x(item.center_x)): ##overlapping in x direction? if(not(item.contains_y(glyph.ul_y) or item.contains_y(glyph.lr_y) or item.contains_y(glyph.center_y))): ##y ranges do not overlap remove.append(item) remove.append(glyph) new = union_images([item,glyph]) add.append(new) result.append(add) #result[0] == ADD result.append(remove) #result[1] == REMOVE return result def get_line_glyphs(image,textlines): """Splits image regions representing text lines into characters. Signature: ``get_line_glyphs (image, segments)`` with *image*: The document image that is to be further segmented. 
It must contain the same underlying image data as the second argument *segments*. *segments*: A list of ``Cc`` data types, each of which represents a text line region. The image views must correspond to *image*, i.e. each pixel has a value that is the unique label of the text line it belongs to. This is the interface used by the plugins in the \"PageSegmentation\" section of the Gamera core. The result is returned as a list of Textline_ objects. .. _Textline: gamera.toolkits.ocr.classes.Textline.html """ i=0 show = [] lines = [] ret,sub_ccs = image.sub_cc_analysis(textlines) for ccs in sub_ccs: line_bbox = Rect(textlines[i]) i = i + 1 glyphs = ccs[:] newlist = [] remove = [] add = [] result = [] glyphs.sort(lambda x,y: cmp(x.ul_x, y.ul_x)) for position, item in enumerate(glyphs): if(True): #if(not(glyph.lr_y >= line_bbox.center_y and glyph.lr_y-(glyph.nrows/3) <= line_bbox.lr_y)): ## is this part of the glyph higher than line.center_y ? left = position - 2 if(left < 0): left = 0 right = position + 2 if(right > len(glyphs)): right = len(glyphs) checklist = glyphs[left:right] for glyph in checklist: if (item == glyph): continue result = check_upper_neighbors(glyph,item,line_bbox) if(len(result[0]) > 0): #something has been joined... joind_upper_connection = result[0][0] #joined glyph add.append(joind_upper_connection) remove.append(result[1][0]) #first part of the joined glyph remove.append(result[1][1]) #second part of the joined glyph for glyph2 in checklist: #maybe the joined upper glyph fits to a glyph below... 
if(glyph2 == joind_upper_connection): continue if(joind_upper_connection.contains_x(glyph2.center_x)): #fits for example the umlauts ae, oe, ue in the German alphabet new = union_images([glyph2,joind_upper_connection]) add.append(new) remove.append(glyph2) add.remove(joind_upper_connection) break for elem in remove: if (elem in checklist): checklist.remove(elem) for glyph in checklist: if(item == glyph): continue result = check_glyph_accent(item,glyph) if(len(result[0]) > 0): #something has been joined... add.append(result[0][0]) #joined glyph remove.append(result[1][0]) #first part of the joined glyph remove.append(result[1][1]) #second part of the joined glyph for elem in remove: if(elem in glyphs): glyphs.remove(elem) for elem in add: glyphs.append(elem) new_line = Textline(line_bbox) final = [] if(len(glyphs) > 0): for glyph in glyphs: final.append(glyph) new_line.add_glyphs(final,False) new_line.sort_glyphs() #reading order -- from left to right lines.append(new_line) for glyph in glyphs: show.append(glyph) return lines def show_bboxes(image,glyphs): """Returns an RGB image with bounding boxes of the given glyphs as hollow rects. Useful for visualization and debugging of a segmentation. Signature: ``show_bboxes (image, glyphs)`` with: *image*: An image of the text document that is to be segmented. *glyphs*: List of rects which will be drawn on ``image`` as hollow rects. As all image types are derived from ``Rect``, any image list can be passed. 
""" rgb = image.to_rgb() if(len(glyphs) > 0): for glyph in glyphs: rgb.draw_hollow_rect(glyph, RGBPixel(255,0,0), 1.0) return rgb ocr-1.0.6/gamera/toolkits/ocr/classes.py 0000644 0001750 0001750 00000030733 11716171513 017550 0 ustar dalitz dalitz # # Copyright (C) 2009-2010 Rene Baston, Christoph Dalitz # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. 
# # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. # from gamera.core import * init_gamera() from gamera.plugins import pagesegmentation from gamera.plugins.listutilities import median class Textline: ######################################################################### """The ``Textline`` object stores information about a text line in its following properties: **bbox** A ``Rect`` object representing the bounding box of the text line. **glyphs** A list of ``Cc`` objects, each representing a character in the line. **words** A nested list of ``Cc`` objects, where each sublist represents the characters of a single word. """ bbox = [] glyphs = [] words = [] text = "" ####################################################################### # constructor # def __init__(self,bbox,glyphs = None): """Signature: ``init (bbox, glyphs=None)`` with *bbox*: ``Rect`` object representing position and size of the text line *glyphs*: A list of ``Cc`` objects representing the characters in the text line """ self.bbox = Rect(bbox) if(glyphs == None): self.glyphs = [] else: self.glyphs = glyphs self.text = "" def add_glyph(self,glyph,extend=True): """Adds the given *glyph* to the Textline. Signature: ``add_glyph (glyph, extend=True)`` When *extend* is ``True``, the text line bounding box *bbox* is extended by the glyph's bounding box. """ self.glyphs.append(glyph) if (extend): self.bbox.union(glyph) def add_glyphs(self,glyphs,extend=True): """Adds the given *glyphs* to the Textline. 
Signature: ``add_glyphs (glyphs, extend=True)`` When *extend* is ``True``, the text line bounding box *bbox* is extended by the union of the glyphs' bounding boxes. """ for glyph in glyphs: self.glyphs.append(glyph) if (extend): self.bbox.union(glyph) def sort_glyphs(self): """Sorts the characters in *Textline.glyphs* from left to right. """ self.glyphs.sort(lambda x,y: cmp(x.ul_x, y.ul_x)) class ClassifyCCs: ############################################################################### """This is a callable class that can optionally be passed to the constructor of Page_, so that it will be called during the segmentation process. .. _Page: gamera.toolkits.ocr.classes.Page.html Its standard definition should generally be sufficient for using a kNN classifier. Should you need to write your own classification function (e.g. one that additionally uses heuristic rules for classification), make sure that you overwrite the `__call__`_ method with the same signature. For fine tuning the classification, the follwoing attributes can be used: **knn** The knn classifier; this is passed in the constructor **parts_to_group** Corresponds to *max_parts_per_group* in *kNNInteractive.group_list_automatic*. Default value is 3. **grouping_distance** Corresponds to the *distance* argument of the *grouping_function* in *kNNInteractive.group_list_automatic*. Only CCs closer than this distance are considered for grouping. Default value is -1, which means that it will be calculated automatically as in `__call__`__. .. __: #call """ ############################################################################# # constructor # def __init__(self,knn): """Signature: ``__init__ (knn)`` where *knn* is a kNN classifier which has already loaded training data. """ self.knn = knn self.grouping_distance = -1 self.parts_to_group = 3 def __call__(self,ccs): """This method will be called in `Page.segment`_. Signature: .. 
_`Page.segment`: gamera.toolkits.ocr.classes.Page.html#segment ``__call__ (ccs)`` where *ccs* is the list of glyphs that is to be classified. See the documentation of Gamera's classifier API how the classification result is stored in the glpyhs. How the classification is done is controled by the following attributes of ``ClassifyCCs``: - When *parts_to_group* > 1, the classification is done with Gamera's grouping algorithm; otherwise no grouping of broken characters is done. - In case of grouping, the property *distance* is passed to the grouping function. When it is -1 (default), it is set to the median height of the *ccs*. """ from gamera.classify import ShapedGroupingFunction, BoundingBoxGroupingFunction distance = self.grouping_distance if (self.parts_to_group > 1 and distance < 0): distance = int(median([c.nrows for c in ccs])) if (self.parts_to_group > 1): ccs = self.knn.group_and_update_list_automatic(ccs,grouping_function=ShapedGroupingFunction(distance),max_parts_per_group=self.parts_to_group) #ccs = self.knn.group_and_update_list_automatic(ccs,grouping_function=BoundingBoxGroupingFunction(distance),max_parts_per_group=self.parts_to_group) else: ccs = self.knn.classify_and_update_list_automatic(ccs) return ccs class Page: ##################################################################################### """The ``Page`` object offers the page segmentation functionality by providing a ``segment`` method. See `its documentation`__ for more information on how to overwrite specific steps of the segmentation process. .. __: #segment After the call of ``segment``, the segmentation results are stored in the following attributes of ``Page``: **textlines** List of Textline_ objects representing all text lines **img** The image to which Ccs in the *textlines* refer. .. 
_Textline: gamera.toolkits.ocr.classes.Textline.html """ ccs_glyphs = [] ccs_lines = [] textlines = [] img = None classify_ccs = None #################################################################################### # constructor # def __init__(self, image, glyphs=None, classify_ccs=None): """The only required argument in the constructor is the image that is to be segmented. Note that the constructor does *not* do the segmentation; for this, you must call the segment__ method. .. __: #segment Signature: ``init (image, glyphs=None, classify_ccs=None)`` with *image*: The image to be segmented. *glyphs*: An optional list of connected components representing the characters in the image. In general, this is not needed, but it can be useful for bottom up methods starting from already detected characters (e.g. by Gamera's classification based character grouping. *classify_ccs*: A callable class with the same interface as ClassifyCCs_. If given, it will be called during the segmentation process, right after the splitting of lines to characters. .. _ClassifyCCs: gamera.toolkits.ocr.classes.ClassifyCCs.html """ self.img = image self.textlines = [] if (classify_ccs != None): self.classify_ccs = classify_ccs else: self.classify_ccs = None if (glyphs != None): self.ccs_glyphs = glyphs else: self.ccs_glyphs = [] def segment(self): """Segments *Page.img* and stores the result in *Page.textlines*. This method has no arguments. It calls the following methods in the given order: - page_to_lines_ for splitting the page into segments representing text lines - order_lines_ for sorting the lines into reading order - lines_to_chars_ for splitting all lines into characters - *Page.classify_ccs* when it is set, i.e., has been passed to the constructor (default is that it is not set) - chars_to_words_ for grouping the characters to words .. _page_to_lines: #page-to-lines .. _order_lines: #order-lines .. _lines_to_chars: #lines-to-chars .. 
_chars_to_words: #chars-to-words By overwriting one (or several) of the above functions, you can replace specific steps of the segmentation process with custom algorithms. """ self.page_to_lines() self.order_lines() self.lines_to_chars() if(self.classify_ccs != None): for line in self.textlines: line.glyphs = self.classify_ccs(line.glyphs) # grouping in classification may change glyph order line.sort_glyphs() self.chars_to_words() def page_to_lines(self): """Splits the image into segments representing text lines. This method has no arguments. The current implementation simply calls the *bbox_merging* plugin from the Gamera core with *Ey=0*, such that the page is not split into paragraphs, but into lines. The segmentation result is stored in the variable *Page.ccs_lines*, which is a list of the data type ``Cc``, i.e., with each segment (line) represented by a different label in the image. This is the interface used by all page segmentation plugins in the Gamera core. .. note:: When you overwrite this method, make sure that write the segmentation result to *self.ccs_lines*. This member variable will then be further processed by lines_to_chars_. .. _lines_to_chars: #lines-to-chars """ self.ccs_lines = self.img.bbox_merging(Ey=0) def order_lines(self): """Sorts the segments in *Page.ccs_lines* into reading order. This method has no arguments. The current implementation uses the plugin *textline_reading_order* from the Gamera core. """ from gamera.plugins.pagesegmentation import textline_reading_order self.ccs_lines = textline_reading_order(self.ccs_lines) def lines_to_chars(self, lines=None): """Splits text lines into characters. Signature: ``lines_to_chars (lines=None)`` *lines* must be a list of ``Cc`` data types, each of them representing a text line. When not given (default), *Page.ccs_lines* is used instead. The current implementation calls *get_line_glyphs* as defined in the module ocr_toolkit_. .. 
_ocr_toolkit: functions.html The result is stored in *Page.textlines*; the characters are stored for each textline in *Textline.glyphs*. """ from gamera.toolkits.ocr.ocr_toolkit import get_line_glyphs if(lines != None): seg_lines = lines else: seg_lines = self.ccs_lines self.textlines = get_line_glyphs(self.img, seg_lines) def chars_to_words(self, lines=None): """Groups the characters in each ``Textline`` from *Page.textlines* to words and stores the result for each ``Textline`` in the property *Textline.words*. This method has an optional but generally useless argument for the list of textlines. It is therefore usually called without arguments. The current implementation calls *chars_make_words* as defined in the module ocr_toolkit_. .. _ocr_toolkit: functions.html """ from gamera.toolkits.ocr.ocr_toolkit import chars_make_words if(lines != None): lines = lines else: lines = self.textlines for line in lines: line.words = chars_make_words(line.glyphs) def show_lines(self): """Returns an RGB image with all segmented text lines marked by hollow rects. Makes only sense after *page_to_lines* (or *segment*) has been called. """ from gamera.toolkits.ocr.ocr_toolkit import show_bboxes return show_bboxes(self.img, self.ccs_lines) def show_glyphs(self): """Returns an RGB image with all segmented/grouped characters marked by hollow rects. Makes only sense after *lines_to_chars* (or *segment*) has been called. """ glyphs = [] for line in self.textlines: if(len(line.glyphs) > 0): for glyph in line.glyphs: glyphs.append(glyph) from gamera.toolkits.ocr.ocr_toolkit import show_bboxes return show_bboxes(self.img, glyphs) def show_words(self): """Returns an RGB image with all grouped words marked by hollow rects. Makes only sense after *chars_to_words* (or *segment*) has been called.. 
""" words = [] for line in self.textlines: for word in line.words: words.append(word) final_bboxes = [] if(len(words) > 0): for word in words: cc = word[:1] word_bbox = Rect(cc[0]) for glyph in word[1:]: word_bbox.union(glyph) final_bboxes.append(word_bbox) from gamera.toolkits.ocr.ocr_toolkit import show_bboxes return show_bboxes(self.img, final_bboxes) ocr-1.0.6/gamera/toolkits/ocr/plugins/ 0000755 0001750 0001750 00000000000 11716212264 017213 5 ustar dalitz dalitz ocr-1.0.6/gamera/toolkits/ocr/plugins/bbox_merging_mcmillan.py 0000644 0001750 0001750 00000023767 11716171513 024123 0 ustar dalitz dalitz # # Copyright (C) 2010 Robert Butz # # This program is free software; you can redistribute it and/or # modify it under the terms of the GNU General Public License # as published by the Free Software Foundation; either version 2 # of the License, or (at your option) any later version. # # This program is distributed in the hope that it will be useful, # but WITHOUT ANY WARRANTY; without even the implied warranty of # MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the # GNU General Public License for more details. # # You should have received a copy of the GNU General Public License # along with this program; if not, write to the Free Software # Foundation, Inc., 51 Franklin Street, Fifth Floor, Boston, MA 02110-1301 USA. # from gamera.plugin import * from gamera.args import NoneDefault class bbox_mcmillan(PluginFunction): # overwrite find_tall_glyphs to adjust deviation """Returns the textlines from image as connected components. The segmentation method is adapted from McMillan's segmentation method in roman_text.py. It allows a more individual segmentation through parameterization. Options: *glyphs*: This list can be build out of a ``cc_analysis``. On default, this parameter is blank, which will cause the function to call ``cc_analysis`` itself. 
*section_search_size* This optional parameter adjusts the calculated avg_glyph_size by multipling its value (default=1). *noise_mltplk* With this optional parameter one can adjust the noise_recognition rate independently from the calculated avg_glyph_size (default = 1). Values greater than 1 let the noise_removal detect bigger noise (but maybe even glyphs). Chose smaller values to avoid assigning small glyphs to noise. *large_mltplk* Analog to noise_mltplk one can set this parameter to manipulate the recognition of very large ccs according to the avg_glyph_size (default=20). Higher values lead to a better acceptance of above-average ccs. Beneficial, for example for big capital initials at the beginning of paragraphs such as seen in bibles. *stdev_mltplk* This parameter affects the line finding algorithm by excluding abnormally tall glyphs (default=5). The standard deviation will be calculated and multiplied by this parameter. """ pure_python = True category="PageSegmentation" self_type = ImageType([ONEBIT]) args = Args([ImageList("glyphs", default=NoneDefault), Float("section_search_size", default=1.0), Float("noise_mltplk", default=1.0), Float("large_mltplk", default=20.0), Float("stdev_mltplk", default=5.0)]) return_type = ImageList("line_cc_list") author = "Robert Butz, Karl MacMillan" def __call__(self, glyphs=None, section_search_size=1, noise_mltplk=1, large_mltplk=20, stdev_mltplk=5): from gamera import core from gamera.roman_text import Section as Roman_Section #from gamera.plugins.image_utilities import union_images def find_sections(image, glyphs, section_search_size=1, noise_mltplk=1, large_mltplk=20, stdev_mltplk=5): """Find the sections within an image - this finds large blocks of text making it possible to find the lines within complex text layouts.""" FUDGE = __avg_glyph_size(glyphs) * section_search_size # remove noise and large objects noise_size = FUDGE * noise_mltplk large_size = FUDGE * large_mltplk new_glyphs = [] for g in glyphs: if 
__section_size_test(image, g, noise_size, large_size): new_glyphs.append(g) # Sort the glyphs left-to-right and top-to-bottom new_glyphs.sort(lambda x, y: cmp(x.ul_x, y.ul_x)) new_glyphs.sort(lambda x, y: cmp(x.ul_y, y.ul_y)) # Create rectangles for each glyph that are bigger by FUDGE big_rects = [] for g in new_glyphs: ul_y = max(0, g.ul_y - FUDGE) ul_x = max(0, g.ul_x - FUDGE) lr_y = min(image.lr_y, g.lr_y + FUDGE) lr_x = min(image.lr_x, g.lr_x + FUDGE) ul_x = int(ul_x); ul_y = int(ul_y) nrows = int(lr_y - ul_y + 1) ncols = int(lr_x - ul_x + 1) big_rects.append(core.Rect(core.Point(ul_x, ul_y), core.Dim(ncols, nrows))) # Search for intersecting glyphs and merge them. This is # harder than it seems at first because we want everything # to merge together that intersects regardless of the order # in the list. It ends up being similar to connected-component # labeling. This is prone to be kind-of slow. current = 0 rects = big_rects while(1): # Find the indexes of any rects that interesect with current inter = __find_intersecting_rects(rects, current) # If we found intersecting rectangles merge them with them current # rect, remove them from the list, and start the whole process # over. We start over to make certain that everything that should # be merged is. if len(inter): g = rects[current] new_rects = [g] for i in range(len(rects)): if i == current: continue if i in inter: g.union(rects[i]) else: new_rects.append(rects[i]) rects = new_rects current = 0 # If we didn't find anything that intersected move on to the next # rectangle. else: current += 1 # Bail when we are done. 
if current >= len(rects): break # Create the sections sections = [] for rect in rects: sections.append(Section(rect, stdev_mltplk)) # Place the original (small) glyphs into the sections for g in glyphs: if __section_size_test(image, g, noise_size, large_size): for s in sections: if s.bbox.intersects(g): s.add_glyph(g) break # Fix up the bounding boxes for s in sections: s.calculate_bbox() return sections def __avg_glyph_size(glyphs): """Compute the average glyph size for the page""" total = 0.0 for g in glyphs: total += g.nrows total += g.ncols return total / (2 * len(glyphs)) def __section_size_test(image, glyph, noise_size, large_size): """Filter for section finding - removes very small and very large glyphs""" black_area = glyph.black_area()[0] if black_area > noise_size and \ glyph.nrows < large_size and \ glyph.ncols < large_size: return 1 else: return 0 def __find_intersecting_rects(glyphs, index): """For section finding - return the index of glyphs intersecting the glyph and the index passed in.""" g = glyphs[index] inter = [] for i in range(len(glyphs)): if i == index: continue if g.intersects(glyphs[i]): inter.append(i) return inter # overwrite find_tall_glyphs to adjust deviation class Section(Roman_Section): def __init__(self, bbox, stdev_mltplk=5): self.bbox = core.Rect(bbox) self.lines = [] self.glyphs = [] # stats self.avg_glyph_area = 0 self.avg_glyph_height = 0 self.avg_glyph_width = 0 self.avg_line_height = 0 self.agv_line_width = 0 self.stdev = 0 self.stdev_mltplk = stdev_mltplk def find_tall_glyphs(self): from gamera import stats if self.stdev == 0: self.stdev = stats.samplestdev([g.nrows for g in self.glyphs]) tall = [] for i in range(len(self.glyphs)): g = self.glyphs[i] if (g.nrows - self.avg_glyph_height) > self.stdev*self.stdev_mltplk: tall.append(i) return tall # this is the actual beginning of the __call__-method if glyphs == None: glyphs = self.cc_analysis() sections = find_sections(self, glyphs, section_search_size, noise_mltplk, 
large_mltplk, stdev_mltplk) for s in sections: s.find_lines() # create a Cc for each line lines = [] label = 1 for s in sections: for l in s.lines: if len(l.glyphs) == 0: continue # label the lines in input image label += 1 for g in l.glyphs: self.highlight(g, label) line_rect = l.glyphs[0].union_rects(l.glyphs) lines.append(core.Cc(self, label, line_rect)) return lines __call__ = staticmethod(__call__) class BboxModule(PluginModule): category = "OCR" functions = [bbox_mcmillan] author = "Robert Butz, Karl MacMillan" url = "http://gamera.sourceforge.net/" module = BboxModule() ocr-1.0.6/gamera/toolkits/ocr/plugins/__init__.py 0000700 0001750 0001750 00000000036 11716171513 021315 0 ustar dalitz dalitz import bbox_merging_mcmillan ocr-1.0.6/gamera/toolkits/ocr/__init__.py 0000700 0001750 0001750 00000002010 11716171513 017626 0 ustar dalitz dalitz """ Toolkit setup This file is run on importing anything within this directory. Its purpose is only to help with the Gamera GUI shell, and may be omitted if you are not concerned with that. """ from gamera import toolkit import plugins import wx # You can inherit from toolkit.CustomMenu to create a menu # for your toolkit. Create a list of menu option in the # member _items, and a series of callback functions that # correspond to them. The name of the callback function # should be the same as the menu item, prefixed by '_On' # and with all spaces converted to underscores. 
# class OcrMenu(toolkit.CustomMenu): # _items = ["Ocr Toolkit", # "Ocr Toolkit 2"] # def _OnOcr_Toolkit(self, event): # wx.MessageDialog(None, "You clicked on Ocr Toolkit!").ShowModal() # main.main() # def _OnOcr_Toolkit_2(self, event): # wx.MessageDialog(None, "You clicked on Ocr Toolkit 2!").ShowModal() # main.main() # ocr_menu = OcrMenu() ocr-1.0.6/PKG-INFO 0000644 0001750 0001750 00000000470 11716212264 012741 0 ustar dalitz dalitz Metadata-Version: 1.0 Name: ocr Version: 1.0.6 Summary: An addon OCR toolkit for the Gamera framework for document analysis and recognition. Home-page: http://gamera.sourceforge.net/ Author: Rene Baston and Christoph Dalitz Author-email: UNKNOWN License: GNU GPL version 2 Description: UNKNOWN Platform: UNKNOWN ocr-1.0.6/doc/ 0000755 0001750 0001750 00000000000 11716212264 012410 5 ustar dalitz dalitz ocr-1.0.6/doc/html/ 0000755 0001750 0001750 00000000000 11716212264 013354 5 ustar dalitz dalitz ocr-1.0.6/doc/html/functions.html 0000644 0001750 0001750 00000026332 11716171513 016261 0 ustar dalitz dalitz
Last modified: June 08, 2010
The toolkit defines a number of free functions which are not image methods. These are defined in ocr_toolkit.py and can be imported in a Python script with
from gamera.toolkits.ocr.ocr_toolkit import *
While the class Page splits the image into Textline objects and possibly classifies the characters, it does not generate an output string. For this purpose, you can use the function textline_to_string.
Returns a unicode string of the text in the given Textline.
Signature:
textline_to_string (textline, heuristic_rules="roman", extra_chars_dict={})
with
- textline:
- A Textline object containing the glyphs. The glyphs must already be classified.
- heuristic_rules:
Depending on the alphabet, some characters can be very similar and need further heuristic rules for disambiguation, such as apostrophe and comma, which have the same shape and differ only in their position relative to the baseline.
When set to "roman", several rules specific to Latin alphabets are applied.
- extra_chars_dict
- A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. Will be passed to return_char.
As this function uses return_char, the class names of the glyphs in textline must correspond to unicode character names, as described in the documentation of return_char.
Converts a unicode character name to a unicode symbol.
Signature:
return_char (classname, extra_chars_dict={})
with
- classname:
- A class name derived from a unicode character name. Example: latin.small.letter.a returns the character a.
- extra_chars_dict
- A dictionary of additional translations of classnames to character codes. This is necessary when you use class names that are not unicode names. The character 'code' does not need to be an actual code, but can be any string. This can be useful, e.g. for ligatures:
return_char(glyph.get_main_id(), {'latin.small.ligature.st':'st'})
classname must correspond to the standard unicode character names, as in the examples of the following table:
Character | Unicode Name | Class Name |
---|---|---|
! | EXCLAMATION MARK | exclamation.mark |
2 | DIGIT TWO | digit.two |
A | LATIN CAPITAL LETTER A | latin.capital.letter.a |
a | LATIN SMALL LETTER A | latin.small.letter.a |
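The naming scheme in the table above can be illustrated with Python's standard unicodedata module. This is only a sketch of the convention, not the toolkit's implementation: the helper name class_name_to_char is hypothetical, it is written for Python 3, and it assumes that a class name is the Unicode character name in lower case with spaces replaced by dots and that entries in extra_chars_dict take precedence.

```python
import unicodedata


def class_name_to_char(classname, extra_chars_dict=None):
    """Hypothetical stand-in for return_char, shown only to illustrate
    the class naming convention."""
    extra = extra_chars_dict or {}
    # Assumption: user supplied translations win over the Unicode lookup.
    if classname in extra:
        return extra[classname]
    # "latin.small.letter.a" -> "LATIN SMALL LETTER A"
    unicode_name = classname.replace(".", " ").upper()
    # unicodedata.lookup raises KeyError for names that Unicode does not define
    return unicodedata.lookup(unicode_name)


print(class_name_to_char("latin.small.letter.a"))
print(class_name_to_char("latin.small.ligature.st",
                         {"latin.small.ligature.st": "st"}))
```

Class names that are not valid Unicode names raise a KeyError in this sketch, which is the situation extra_chars_dict is meant to cover.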
Groups the given glyphs to words based upon the horizontal distance between adjacent glyphs.
with
- glyphs:
- A list of Cc data types, each of which representing a character. All glyphs must stem from the same single line of text.
- threshold:
- Horizontal white space greater than threshold will be considered a word separating gap. When None, the threshold value is calculated automatically as 2.5 times the median white space between adjacent glyphs.
The result is a nested list of glyphs with each sublist representing a word. This is the same data structure as used in Textline.words
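The gap rule described above can be sketched without Gamera. The following is a minimal illustration under stated assumptions, not the toolkit's code: each glyph is reduced to a hypothetical (left, right) column pair, the glyphs are already sorted from left to right, the gap is taken as the difference between the next left edge and the current right edge, and the median comes from Python 3's statistics module (which may treat even-length lists differently from Gamera's median).

```python
from statistics import median


def group_into_words(spans, threshold=None):
    """Group (left, right) glyph spans into words by horizontal gap.

    spans must be sorted from left to right; returns a nested list
    with one sublist per word.
    """
    if not spans:
        return []
    # white space between each pair of adjacent glyphs
    gaps = [nxt[0] - cur[1] for cur, nxt in zip(spans, spans[1:])]
    if threshold is None:
        # automatic threshold: 2.5 times the median gap
        threshold = 2.5 * median(gaps) if gaps else 0
    words = [[spans[0]]]
    for gap, span in zip(gaps, spans[1:]):
        if gap > threshold:
            words.append([span])      # gap is wide enough: start a new word
        else:
            words[-1].append(span)    # same word as the previous glyph
    return words


line = [(0, 10), (12, 20), (22, 30), (50, 60), (62, 70)]
print(group_into_words(line))
```

Because the automatic threshold depends on the median gap, a line with only one or two glyphs gives little information; passing an explicit threshold is the more predictable choice in that case.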
These functions are used in the segmentation methods of class Page. You will generally not need to call them, unless you are implementing a custom segmentation method.
Splits image regions representing text lines into characters.
Signature:
get_line_glyphs (image, segments)
with
- image:
- The document image that is to be further segmented. It must contain the same underlying image data as the second argument segments
- segments:
- A list of Cc data types, each of which represents a text line region. The image views must correspond to image, i.e. each pixel has a value that is the unique label of the text line it belongs to. This is the interface used by the plugins in the "PageSegmentation" section of the Gamera core.
The result is returned as a list of Textline objects.
Returns an RGB image with bounding boxes of the given glyphs as hollow rects. Useful for visualization and debugging of a segmentation.
Signature:
show_bboxes (image, glyphs)
with:
- image:
- An image of the text document that is to be segmented.
- glyphs:
- List of rects which will be drawn on image as hollow rects. As all image types are derived from Rect, any image list can be passed.
Last modified: February 13, 2012
In module gamera.toolkits.ocr.classes
The Textline object stores information about a text line in its following properties:
- bbox
- A Rect object representing the bounding box of the text line.
- glyphs
- A list of Cc objects, each representing a character in the line.
- words
- A nested list of Cc objects, where each sublist represents the characters of a single word.
Signature:
init (bbox, glyphs=None)
with
- bbox:
- Rect object representing position and size of the text line
- glyphs:
- A list of Cc objects representing the characters in the text line
Adds the given glyph to the Textline. Signature:
add_glyph (glyph, extend=True)
When extend is True, the text line bounding box bbox is extended by the glyph's bounding box.
Adds the given glyphs to the Textline. Signature:
add_glyphs (glyphs, extend=True)
When extend is True, the text line bounding box bbox is extended by the union of the glyphs' bounding boxes.
Sorts the characters in Textline.glyphs from left to right.
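In the toolkit source, sort_glyphs orders the glyphs with a cmp-style comparison on ul_x, which only works under Python 2. The same left-to-right ordering can be sketched for Python 3 with a key function. The Glyph tuple below is a hypothetical stand-in for a Gamera Cc, used only so that the example runs without Gamera.

```python
from collections import namedtuple

# Hypothetical stand-in for a Cc: only the left edge matters for sorting.
Glyph = namedtuple("Glyph", ["name", "ul_x"])


def sort_left_to_right(glyphs):
    """Sort glyphs in place by their left edge and return the list."""
    glyphs.sort(key=lambda g: g.ul_x)
    return glyphs


glyphs = [Glyph("c", 40), Glyph("a", 3), Glyph("b", 21)]
print([g.name for g in sort_left_to_right(glyphs)])
```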
Last modified: May 28, 2010
[object] bbox_mcmill ([object glyphs], float section_search_size = 1.00, float noise_mltplk = 1.00, float large_mltplk = 20.00, float stdev_mltplk = 5.00)
Operates on: | Image [OneBit] |
---|---|
Returns: | [object] |
Category: | OCR/segmentation |
Defined in: | bbox_merging_mcmillan.py |
Author: | Robert Butz, Karl MacMillan |
Returns the textlines from image as connected components. The segmentation method is adapted from McMillan's segmentation method in roman_text.py. It allows a more individual segmentation through parameterization.
Options:
- glyphs:
- This list can be built from the result of cc_analysis. By default, this parameter is not set, which causes the function to call cc_analysis itself.
- section_search_size
- This optional parameter adjusts the calculated avg_glyph_size by multiplying it with this value (default=1).
- noise_mltplk
- With this optional parameter one can adjust the noise recognition rate independently of the calculated avg_glyph_size (default=1). Values greater than 1 let the noise removal detect bigger noise (but possibly also glyphs). Choose smaller values to avoid classifying small glyphs as noise.
- large_mltplk
- Analogous to noise_mltplk, this parameter controls the recognition of very large CCs relative to avg_glyph_size (default=20). Higher values lead to a better acceptance of above-average CCs. This is beneficial, for example, for big capital initials at the beginning of paragraphs, as seen in bibles.
- stdev_mltplk
- This parameter affects the line finding algorithm by excluding abnormally tall glyphs (default=5). The standard deviation will be calculated and multiplied by this parameter.
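The effect of stdev_mltplk can be sketched as follows. In the plugin source, a glyph counts as abnormally tall when its height exceeds the average glyph height by more than the sample standard deviation times stdev_mltplk. The sketch below is Gamera-free and written for Python 3; it works on a plain list of glyph heights, and it assumes that the average glyph height is the arithmetic mean of those heights.

```python
from statistics import mean, stdev


def find_tall_glyphs(heights, stdev_mltplk=5.0):
    """Return the indexes of glyphs that are abnormally tall."""
    if len(heights) < 2:
        return []  # a sample standard deviation needs at least two values
    avg_height = mean(heights)
    limit = stdev(heights) * stdev_mltplk
    return [i for i, h in enumerate(heights) if (h - avg_height) > limit]


# twenty ordinary glyphs and one large initial
heights = [10] * 20 + [100]
print(find_tall_glyphs(heights, stdev_mltplk=2.0))
print(find_tall_glyphs(heights, stdev_mltplk=5.0))
```

With a single outlier among 21 glyphs, the outlier lies about 4.4 standard deviations above the mean, so it is only reported as tall when the multiplier is below that value.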
Last modified: May 21, 2010
This documentation is for those who want to extend the functionality of the OCR toolkit, or who want to customize specific steps of the recognition process. For a comprehensive overview of the architecture of this toolkit, see section 3 of
C. Dalitz, R. Baston: Optical Character Recognition with the Gamera Framework. In C. Dalitz (Ed.): "Document Image Analysis with the Gamera Framework." Schriftenreihe des Fachbereichs Elektrotechnik und Informatik, Hochschule Niederrhein, vol. 8, pp. 53-65, Shaker Verlag (2009)
The core functionality of this toolkit is implemented in the Page class. This class provides a method segment, which segments the page into lines, and the lines into characters and words. The segmentation result is stored in the property textlines, which is a list of objects of type Textline.
To customize the page segmentation process, you can derive a custom class from Page and overwrite some of its methods. While it is theoretically possible to overwrite the segment method directly, it is in most cases more desirable to overwrite only one of the methods called in segment, so that only a specific part of the segmentation process is replaced. See the documentation of Page.segment for information on which other methods are called in this method.
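The idea of replacing a single step can be illustrated without Gamera. The sketch below is not the real Page class: SketchPage is a hypothetical stand-in whose steps only record their names, in the order in which the toolkit source calls them in segment. The optional classification step between lines_to_chars and chars_to_words is left out for brevity.

```python
class SketchPage:
    """Gamera-free stand-in that mimics only the call order of Page.segment."""

    def __init__(self):
        self.calls = []

    def segment(self):
        self.page_to_lines()
        self.order_lines()
        self.lines_to_chars()
        self.chars_to_words()

    def page_to_lines(self):
        self.calls.append("default page_to_lines")

    def order_lines(self):
        self.calls.append("default order_lines")

    def lines_to_chars(self):
        self.calls.append("default lines_to_chars")

    def chars_to_words(self):
        self.calls.append("default chars_to_words")


class MySketchPage(SketchPage):
    # Only this step is replaced; segment and all other steps are inherited.
    def page_to_lines(self):
        self.calls.append("custom page_to_lines")


page = MySketchPage()
page.segment()
print(page.calls)
```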
In the subsequent sections, we describe two typical use cases:
Let us assume you want to use the Gamera core plugin projection_cutting for segmenting the page into text lines. To do so, simply derive a custom class MyPage from Page and overwrite the page_to_lines method:
class MyPage(Page):
    def page_to_lines(self):
        self.ccs_lines = self.img.projection_cutting()
This example is obviously very basic; in practice you might want to experiment with the input arguments of projection_cutting. You can then use MyPage just like Page, and the following code does the same segmentation as Page.segment, but with only page_to_lines replaced:
result = MyPage(image)
result.segment()
Now let us assume that you want to let Gamera's classification based grouping algorithm join connected components to characters, rather than the rule based method built into Page.lines_to_chars. To do so, derive a custom class MyPage from Page, that segments the line into characters only by a connected component analysis, without any joining of CCs to characters (this will be done at a later point):
# segment lines into chars only by CC analysis
class MyPage(Page):
    def lines_to_chars(self):
        # sub_cc_analysis returns the relabeled image and one CC list per line
        dummy, subccs = self.img.sub_cc_analysis(self.ccs_lines)
        self.textlines = []
        for i, segment in enumerate(self.ccs_lines):
            self.textlines.append(Textline(segment, subccs[i]))
Then you must make sure that a classification with grouping is done during Page.segment. This is done by passing a callable class derived from ClassifyCCs to the constructor of MyPage. As the default definition of ClassifyCCs already does what we need, we simply need to create an instance thereof:
# create an instance of ClassifyCCs ...
from gamera import knn
cknn = knn.kNNInteractive([],
    ["aspect_ratio", "moments", "volume64regions"], 0)
cknn.from_xml_filename("trainingdata.xml")
classify = ClassifyCCs(cknn)
# ... and set its property parts_to_group such that the
# grouping algorithm will be used during classification
classify.parts_to_group = 4
# pass the ClassifyCCs instance to the constructor of MyPage
page = MyPage(image, classify_ccs=classify)
page.segment() # will call classify
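How the grouping distance is chosen when grouping_distance is left at its default of -1 can be sketched as follows. According to the toolkit source, the distance is then set to the integer median height of the connected components, but only when parts_to_group is greater than 1. The function name effective_grouping_distance is hypothetical, the sketch works on a plain list of heights instead of Cc objects, and it uses Python 3's statistics.median, which may differ from Gamera's median for lists of even length.

```python
from statistics import median


def effective_grouping_distance(heights, grouping_distance=-1, parts_to_group=3):
    """Return the distance that would be passed to the grouping function."""
    if parts_to_group > 1 and grouping_distance < 0:
        # automatic choice: median height of the connected components
        return int(median(heights))
    # explicit value, or no grouping at all (the distance is then unused)
    return grouping_distance


heights = [12, 30, 14, 13, 15]
print(effective_grouping_distance(heights))
print(effective_grouping_distance(heights, grouping_distance=20))
```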
Last modified: February 13, 2012
[object] bbox_mcmillan ([object glyphs] = None, float section_search_size = 1.00, float noise_mltplk = 1.00, float large_mltplk = 20.00, float stdev_mltplk = 5.00)
Operates on: | Image [OneBit] |
---|---|
Returns: | [object] |
Category: | PageSegmentation |
Defined in: | bbox_merging_mcmillan.py |
Author: | Robert Butz, Karl MacMillan |
Returns the textlines from image as connected components. The segmentation method is adapted from McMillan's segmentation method in roman_text.py. It allows a more individual segmentation through parameterization.
Options:
- glyphs:
- This list can be built from the result of cc_analysis. By default, this parameter is not set, which causes the function to call cc_analysis itself.
- section_search_size
- This optional parameter adjusts the calculated avg_glyph_size by multiplying it with this value (default=1).
- noise_mltplk
- With this optional parameter one can adjust the noise recognition rate independently of the calculated avg_glyph_size (default=1). Values greater than 1 let the noise removal detect bigger noise (but possibly also glyphs). Choose smaller values to avoid classifying small glyphs as noise.
- large_mltplk
- Analogous to noise_mltplk, this parameter controls the recognition of very large CCs relative to avg_glyph_size (default=20). Higher values lead to a better acceptance of above-average CCs. This is beneficial, for example, for big capital initials at the beginning of paragraphs, as seen in bibles.
- stdev_mltplk
- This parameter affects the line finding algorithm by excluding abnormally tall glyphs (default=5). The standard deviation will be calculated and multiplied by this parameter.
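The interplay of section_search_size, noise_mltplk and large_mltplk can be sketched without Gamera. The logic below follows the size test in the plugin source: the average glyph size is the mean of all glyph heights and widths, a glyph is kept when its black area exceeds the noise size and both its height and width stay below the large size. The Glyph tuple and the function names are hypothetical stand-ins, and the sketch is written for Python 3.

```python
from collections import namedtuple

# Hypothetical stand-in for a Cc with only the values the size test needs.
Glyph = namedtuple("Glyph", ["nrows", "ncols", "black_area"])


def average_glyph_size(glyphs):
    """Mean over all heights and widths of the glyphs."""
    total = sum(g.nrows + g.ncols for g in glyphs)
    return total / (2 * len(glyphs))


def keep_for_sections(glyphs, section_search_size=1.0,
                      noise_mltplk=1.0, large_mltplk=20.0):
    """Drop glyphs that count as noise or as too large."""
    fudge = average_glyph_size(glyphs) * section_search_size
    noise_size = fudge * noise_mltplk
    large_size = fudge * large_mltplk
    return [g for g in glyphs
            if g.black_area > noise_size
            and g.nrows < large_size
            and g.ncols < large_size]


glyphs = [Glyph(20, 10, 100)] * 3 + [Glyph(2, 2, 3), Glyph(400, 300, 50000)]
print(len(keep_for_sections(glyphs)))
print(len(keep_for_sections(glyphs, large_mltplk=2.0)))
```

In this example the speck is always dropped as noise, while the large initial is kept with the default large_mltplk and dropped once the multiplier is lowered to 2.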