ocrodjvu-0.7.9/MANIFEST.in
include MANIFEST.in
include COPYING
include doc/*.1
include doc/*.txt
include doc/*.xml
include doc/changelog
include ocrodjvu.py
include tools/manpage-fixup
recursive-include tests *.bmp *.pbm *.ppm *.tif
recursive-include tests *.py
recursive-include tests *.html
recursive-include tests *.djvused
recursive-include tests *.test[0-9]
recursive-include tests *.djvu
recursive-include tests fake-executable
prune tests/local/
ocrodjvu-0.7.9/doc/ocrodjvu.xml
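&p; manual
Jakub Wilk <jwilk@jwilk.net>
&p;(1) &version;
&p; - OCR for DjVu files
&p; {-o | --save-bundled} output-djvu-file [option...] djvu-file
&p; {-i | --save-indirect} index-djvu-file [option...] djvu-file
&p; --save-script script-file [option...] djvu-file
&p; --in-place [option...] djvu-file
&p; --dry-run [option...] djvu-file
&p; {--version | --help | -h | --list-engines | --list-languages}
Description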
&p; is a wrapper for OCR systems that allows you to perform OCR on DjVu files.
The following OCR engines are supported:
OCRopus (internally,
&p; calls ocroscript's recognize (or rec-tess)
command, so that ultimately Tesseract acts as the OCR backend);
Cuneiform for Linux.
Ocrad.
GOCR.
Stand-alone Tesseract.
Options
OCR engine options
Use this OCR engine. The default is ‘ocropus’ (OCRopus).
Print list of available OCR engines.
Options controlling output
It is mandatory to use exactly one of the following options:
Save OCR results as a bundled multi-page document into
output-djvu-file.
Save OCR results as an indirect multi-page document. Use
index-djvu-file as the index file name; put the
component files into the same directory. The directory must exist and be writable.
Save a djvused script with OCR results into
script-file.
Save OCR results in place.
(Use this option to retain compatibility with &p; < 0.2.)
Don't change any files, throw OCR results away.
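The "exactly one of the following options" rule can be sketched with argparse, which is among the listed dependencies. This is a hypothetical re-creation for illustration only, not the parser the program actually uses; the function names and the `ocrodjvu-sketch` program name are assumptions.

```python
import argparse

def build_parser():
    # Hypothetical sketch: a required, mutually exclusive group enforces
    # that exactly one output option is given.
    parser = argparse.ArgumentParser(prog='ocrodjvu-sketch')
    group = parser.add_mutually_exclusive_group(required=True)
    group.add_argument('-o', '--save-bundled', metavar='output-djvu-file')
    group.add_argument('-i', '--save-indirect', metavar='index-djvu-file')
    group.add_argument('--save-script', metavar='script-file')
    group.add_argument('--in-place', action='store_true')
    group.add_argument('--dry-run', action='store_true')
    parser.add_argument('djvu_file', metavar='djvu-file')
    return parser

def parse(argv):
    # argparse exits with an error if none, or more than one,
    # of the output options is present.
    return build_parser().parse_args(argv)
```

With this sketch, `parse(['--dry-run', 'in.djvu'])` succeeds, while omitting every output option or combining two of them makes argparse exit with a usage error.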
Text segmentation options
Record location of every line. Don't record locations of particular words or characters.
This is the default for OCRopus 0.2.
The option is ineffective with stand-alone Tesseract 2.0.
Record location of every line and every word. Don't record locations of particular characters.
This is the default for most OCR engines.
This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.
Record location of every line, every word and every character.
This option is ineffective with OCRopus 0.2 and stand-alone Tesseract 2.0.
Consider each non-empty sequence of non-whitespace characters a single word.
This is the default, despite being linguistically incorrect.
Use the Unicode Text Segmentation algorithm
to break lines into words.
This option breaks assumptions of some DjVu tools that words are separated by spaces,
and therefore it is not recommended.
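The default ("simple") segmentation described above amounts to splitting on whitespace. A minimal sketch, assuming nothing about the program's internals (the function name is hypothetical):

```python
import re

def simple_words(line):
    # Each maximal run of non-whitespace characters counts as one word,
    # so punctuation stays attached to the neighbouring letters.
    return [(m.start(), m.end(), m.group()) for m in re.finditer(r'\S+', line)]
```

Because punctuation is never split off, a token such as `stop,` is treated as a single word, which is why the manual calls this default linguistically incorrect. The UAX #29 alternative needs a Unicode segmentation library and is not shown here.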
Other options
Remove existing hidden text if present in the pages not selected for OCR.
(Use this option to retain compatibility with &p; < 0.2.)
Don't save pages that were not processed.
Set recognition language. language-id is typically an ISO 639-2/T
three-letter code.
For OCRopus, the default is ‘eng’ (English), unless the
tesslanguage environment variable is set.
For other OCR engines, the default is always ‘eng’.
Print list of available languages for the currently selected OCR engine.
Render only masks of page images.
This is the default.
Render only foreground layers of page images.
Render all layers of page images.
This option is necessary to OCR DjVu files with invalid foreground/background separation.
Specifies pages to process. page-range is a comma-separated list of
sub-ranges. Each sub-range is either a single page (e.g. 17) or a contiguous
range of pages (e.g. 37-42). Pages are numbered from 1.
The default is to process all pages.
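The page-range syntax can be illustrated with a short parser. This is a sketch of the format as documented above, not the program's own code; the function name is hypothetical.

```python
def parse_page_range(spec):
    # Expand a comma-separated list of sub-ranges into page numbers.
    pages = []
    for sub in spec.split(','):
        if '-' in sub:
            first, last = sub.split('-', 1)
            first, last = int(first), int(last)
        else:
            first = last = int(sub)
        # Pages are numbered from 1, and a range must not run backwards.
        if first < 1 or last < first:
            raise ValueError('invalid sub-range: %r' % sub)
        pages.extend(range(first, last + 1))
    return pages
```

For example, `parse_page_range('17,37-42')` yields pages 17 and 37 through 42.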
Start up to n OCR processes.
Output version information and exit.
Display help and exit.
Advanced options
To ease debugging, don't delete intermediate files.
This option allows you to control some details of how &p; operates.
Stop program execution when an exceptional situation (e.g., malformed output from the OCR engine,
internal &p; error, etc.) occurs.
This is the default.
Attempt to recover from exceptional situations.
This option is strongly discouraged.
Use an HTML5
parser, which is more robust but slower than the default parser.
Environment
The following environment variables affect &p;:
tesslanguage
Recognition language for Tesseract.
(Use of this variable is deprecated in favor of the --language option.)
TMPDIR
&p; makes heavy use of temporary files. It will store them in a directory
specified by this variable. The default is /tmp.
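A minimal sketch of how a scratch directory could honour TMPDIR with the documented /tmp fallback. The helper name, the directory prefix and the `keep` flag (loosely mirroring the debugging option that keeps intermediate files) are assumptions made for illustration.

```python
import contextlib
import os
import shutil
import tempfile

@contextlib.contextmanager
def scratch_directory(environ=os.environ, keep=False):
    # Use TMPDIR when set and non-empty; otherwise fall back to /tmp.
    root = environ.get('TMPDIR') or '/tmp'
    path = tempfile.mkdtemp(prefix='ocrodjvu-sketch.', dir=root)
    try:
        yield path
    finally:
        # Intermediate files are removed unless the caller asks to keep them.
        if not keep:
            shutil.rmtree(path)
```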
Bugs
Tesseract 3.00 is affected by a bug making it produce invalid
hOCR output in certain circumstances. &p; does not try to recover from this fault (which couldn't be done reliably
anyway) unless you pass the -X fix-html=1 option.
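The workaround can be sketched as follows. This is a Python 3 adaptation of the `fix_html` function found in the Tesseract engine source in this archive, with `html.escape` standing in for the Python 2 `cgi.escape`; the function and variable names here are illustrative.

```python
import html
import re

_CHUNK = re.compile(
    r'''
    ( <[!/]?[a-z]+(?:\s+[^<>]*)?>   # something that looks like a tag
    | &[a-z]+;                      # named character reference
    | &[#][0-9]+;                   # decimal character reference
    | &[#]x[0-9a-f]+;               # hexadecimal character reference
    | [^<>&]+                       # ordinary text
    )
    ''', re.IGNORECASE | re.VERBOSE
)

def escape_stray_markup(s):
    # re.split() with a capturing group puts recognised chunks at odd
    # indices; whatever is left at even indices is stray markup
    # (a bare '<', '>' or '&') and gets escaped.
    return ''.join(
        chunk if n & 1 else html.escape(chunk, quote=False)
        for n, chunk in enumerate(_CHUNK.split(s))
    )
```

The approach is heuristic: it only escapes characters that cannot be part of a well-formed tag or character reference, which is consistent with the manual's remark that recovery cannot be done reliably.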
When using Tesseract ≥ 3.00, extracting bounding boxes of particular characters (which happens when either
--details=chars or --word-segmentation=uax29 is used) is inefficient. This is due to
limitations of the Tesseract command-line interface.
See also
djvu(1),
ocroscript(1),
tesseract(1),
cuneiform(1),
ocrad(1),
gocr(1)
ocrodjvu-0.7.9/doc/hocr2djvused.1
'\" t
.\" Title: hocr2djvused
.\" Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.76.1
.\" Date: 03/10/2012
.\" Manual: hocr2djvused manual
.\" Source: hocr2djvused 0.7.9
.\" Language: English
.\"
.TH "HOCR2DJVUSED" "1" "03/10/2012" "hocr2djvused 0\&.7\&.9" "hocr2djvused manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
hocr2djvused \- hOCR to \fBdjvused\fR script converter
.SH "SYNOPSIS"
.HP \w'\fBhocr2djvused\fR\ 'u
\fBhocr2djvused\fR [\fIoption\fR...]
.SH "DESCRIPTION"
.PP
hocr2djvused reads a
\m[blue]\fIhOCR\fR\m[]\&\s-2\u[1]\d\s+2
file (as produced by
\m[blue]\fIOCRopus\fR\m[]\&\s-2\u[2]\d\s+2
or
\m[blue]\fICuneiform\fR\m[]\&\s-2\u[3]\d\s+2
or
\m[blue]\fITesseract\fR\m[]\&\s-2\u[4]\d\s+2) from the standard input and converts it to a
\fBdjvused\fR
script\&.
.SH "OPTIONS"
.SS "Text segmentation options"
.PP
\fB\-t lines\fR, \fB\-\-details lines\fR
.RS 4
Record location of every line\&. Don\*(Aqt record locations of particular words or characters\&.
.RE
.PP
\fB\-t words\fR, \fB\-\-details=words\fR
.RS 4
Record location of every line and every word\&. Don\*(Aqt record locations of particular characters\&.
.sp
This is the default\&.
.RE
.PP
\fB\-t chars\fR, \fB\-\-details=chars\fR
.RS 4
Record location of every line, every word and every character\&.
.RE
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Consider each non\-empty sequence of non\-whitespace characters a single word\&.
.sp
This is the default, despite being linguistically incorrect\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[5]\d\s+2
algorithm to break lines into words\&.
.sp
This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended\&.
.RE
.SS "Other options"
.PP
\fB\-\-rotation=\fR\fB\fIn\fR\fR
.RS 4
Assume that DjVu pages are rotated by
\fIn\fR
degrees\&.
.RE
.PP
\fB\-\-page\-size=\fR\fB\fIwidth\fR\fR\fBx\fR\fB\fIheight\fR\fR
.RS 4
Specifies that page size is
\fIwidth\fR
pixels \(mu
\fIheight\fR
pixels\&.
.sp
This option is required for hOCR generated by Cuneiform (< 0\&.8) and superfluous otherwise\&.
.RE
.PP
\fB\-\-html5\fR
.RS 4
Use an
\m[blue]\fIHTML5 parser\fR\m[]\&\s-2\u[6]\d\s+2, which is more robust but slower than the default parser\&.
.RE
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SH "SEE ALSO"
.PP
\fBocrodjvu\fR(1),
\fBdjvused\fR(1)
.SH "AUTHOR"
.PP
\fBJakub Wilk\fR <\&jwilk@jwilk\&.net\&>
.RS 4
Author.
.RE
.SH "NOTES"
.IP " 1." 4
hOCR
.RS 4
\m[blue]\fI\%http://docs.google.com/View?docid=dfxcv4vc_67g844kf\fR\m[]
.RE
.IP " 2." 4
OCRopus
.RS 4
\m[blue]\fI\%http://ocropus.googlecode.com/\fR\m[]
.RE
.IP " 3." 4
Cuneiform
.RS 4
\m[blue]\fI\%http://launchpad.net/cuneiform-linux\fR\m[]
.RE
.IP " 4." 4
Tesseract
.RS 4
\m[blue]\fI\%http://tesseract-ocr.googlecode.com/\fR\m[]
.RE
.IP " 5." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
.IP " 6." 4
HTML5 parser
.RS 4
\m[blue]\fI\%http://www.whatwg.org/specs/web-apps/current-work/#html-parser\fR\m[]
.RE
ocrodjvu-0.7.9/doc/djvu2hocr.1
'\" t
.\" Title: djvu2hocr
.\" Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.76.1
.\" Date: 03/10/2012
.\" Manual: djvu2hocr manual
.\" Source: djvu2hocr 0.7.9
.\" Language: English
.\"
.TH "DJVU2HOCR" "1" "03/10/2012" "djvu2hocr 0\&.7\&.9" "djvu2hocr manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
djvu2hocr \- DjVu to hOCR converter
.SH "SYNOPSIS"
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBdjvu2hocr\fR\ 'u
\fBdjvu2hocr\fR {\fB\-\-version\fR | \fB\-\-help\fR | \fB\-h\fR}
.SH "DESCRIPTION"
.PP
djvu2hocr converts hidden text from a DjVu file to the
\m[blue]\fIhOCR\fR\m[]\&\s-2\u[1]\d\s+2
format\&.
.SH "OPTIONS"
.SS "Text segmentation options"
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Use the same word segmentation as found in the DjVu file\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[2]\d\s+2
algorithm to break lines into words, possibly fixing word segmentation found in the DjVu file\&.
.RE
.SS "Other options"
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SH "PORTABILITY"
.PP
djvu2hocr uses a custom extension to hOCR to retain characters which cannot be directly represented in an HTML/XML document\&. For example, control character BEL (^G, U+0007), is converted into the following HTML chunk:
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1)
.SH "AUTHOR"
.PP
\fBJakub Wilk\fR <\&jwilk@jwilk\&.net\&>
.RS 4
Author.
.RE
.SH "NOTES"
.IP " 1." 4
hOCR
.RS 4
\m[blue]\fI\%http://docs.google.com/View?docid=dfxcv4vc_67g844kf\fR\m[]
.RE
.IP " 2." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
ocrodjvu-0.7.9/doc/credits.txt
Since May 2009 ocrodjvu development has been supported by the Polish Ministry
of Science and Higher Education's grant no. N N519 384036 (2009 - 2012,
https://bitbucket.org/jsbien/ndt).
ocrodjvu-0.7.9/doc/todo.txt
Missing tests
=============
* For non-ASCII filenames.
* For Cuneiform hOCR with inline formatting.
* For Cuneiform hOCR with bounding boxes for whitespace characters.
* For Cuneiform hOCR with empty pages.
* For Ocrad, in particular:
- for non-ASCII characters;
- for text close to a page boundary;
- for empty pages;
- for characters with no interpretations.
* For GOCR, in particular:
- for non-ASCII characters;
- for empty pages.
* For http://bugs.debian.org/575484#35
Documentation
=============
* Write better documentation for -X.
Nice-to-have things
===================
http://bugs.debian.org/575490
.. vim:ft=rst ts=3 sw=3 et tw=72
ocrodjvu-0.7.9/doc/dependencies.txt
Dependencies
============
* Python ≥ 2.5
* An OCR engine:
+ OCRopus_ ≥ 0.2 (tested with 0.2 and 0.3.1) —
document analysis and OCR system
+ Cuneiform_ ≥ 0.7 (tested with 0.7, 0.8, 0.9, 1.0) —
document analysis and OCR system
+ Ocrad_ (tested with 0.17 and 0.21) —
document analysis and OCR system
+ GOCR_ ≥ 0.40 (tested with 0.48) —
document analysis and OCR system
+ Tesseract_ ≥ 2.00 (tested with 2.04 and 3.00) —
an OCR system
* DjVuLibre_ ≥ 3.5.21 —
library for the DjVu_ file format
* python-djvulibre_ ≥ 0.1.14 —
Python bindings for DjVuLibre_
* PyICU_ -
Python bindings for IBM's ICU_ C++ API
* lxml_ —
Python bindings for libxml2_
* html5lib_ -
HTML parser based on the HTML5_ specification
* argparse_ —
Python command line parser
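A quick way to see which of the Python-level dependencies are importable is sketched below. The mapping from package names to importable module names is an assumption (in particular, PyICU has been importable as ``icu`` in recent releases, and python-djvulibre as ``djvu``); adjust it to the versions actually installed.

```python
import importlib.util

# Assumed package-name -> module-name mapping; not taken from the project.
MODULES = {
    'python-djvulibre': 'djvu',
    'PyICU': 'icu',
    'lxml': 'lxml',
    'html5lib': 'html5lib',
    'argparse': 'argparse',
}

def missing_dependencies(modules=MODULES):
    # find_spec() returns None for a top-level module that cannot be found.
    return sorted(
        name for name, module in modules.items()
        if importlib.util.find_spec(module) is None
    )
```

This only checks that the modules can be located; it does not verify the minimum versions listed above, nor the presence of an OCR engine or DjVuLibre itself.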
.. _OCRopus:
http://code.google.com/p/ocropus/
.. _Cuneiform:
http://launchpad.net/cuneiform-linux
.. _Ocrad:
http://www.gnu.org/software/ocrad/
.. _GOCR:
http://jocr.sourceforge.net/
.. _Tesseract:
http://code.google.com/p/tesseract-ocr/
.. _DjVuLibre:
http://djvu.sourceforge.net/
.. _DjVu:
http://djvu.org/
.. _python-djvulibre:
http://jwilk.net/software/python-djvulibre.html
.. _PyICU:
http://pyicu.osafoundation.org/
.. _ICU:
http://www-306.ibm.com/software/globalization/icu/
.. _lxml:
http://codespeak.net/lxml/
.. _libxml2:
http://xmlsoft.org/
.. _html5lib:
http://code.google.com/p/html5lib/
.. _HTML5:
http://www.whatwg.org/specs/web-apps/current-work/
.. _argparse:
http://code.google.com/p/argparse/
.. vim:ft=rst ts=3 sw=3 et tw=72
ocrodjvu-0.7.9/doc/djvu2hocr.xml
&p; manual
Jakub Wilk <jwilk@jwilk.net>
&p;(1) &version;
&p; - DjVu to hOCR converter
&p; [option...] djvu-file
&p; {--version | --help | -h}
Description
&p; converts hidden text from a DjVu file to the
hOCR format.
Options
Text segmentation options
Use the same word segmentation as found in the DjVu file.
This is the default.
Use the Unicode Text Segmentation algorithm
to break lines into words, possibly fixing word segmentation found in the DjVu file.
Other options
Output version information and exit.
Display help and exit.
Portability
&p; uses a custom extension to hOCR to retain characters which cannot be directly represented in an HTML/XML
document. For example, control character BEL (^G, U+0007), is converted into the following HTML chunk:
See also
djvu(1)
ocrodjvu-0.7.9/doc/hocr2djvused.xml
&p; manual
Jakub Wilk <jwilk@jwilk.net>
&p;(1) &version;
&p; - hOCR to djvused script converter
&p; [option...]
Description
&p; reads a hOCR file (as produced by
OCRopus or
Cuneiform or
Tesseract)
from the standard input and converts it to a djvused script.
Options
Text segmentation options
Record location of every line. Don't record locations of particular words or characters.
Record location of every line and every word. Don't record locations of particular characters.
This is the default.
Record location of every line, every word and every character.
Consider each non-empty sequence of non-whitespace characters a single word.
This is the default, despite being linguistically incorrect.
Use the Unicode Text Segmentation algorithm
to break lines into words.
This option breaks assumptions of some DjVu tools that words are separated by spaces,
and therefore it is not recommended.
Other options
Assume that DjVu pages are rotated by n degrees.
Specifies that page size is width pixels ×
height pixels.
This option is required for hOCR generated by Cuneiform (< 0.8) and superfluous otherwise.
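The page-size value has the form width x height in pixels. A minimal parsing sketch, assuming plain decimal integers on both sides (the function name is hypothetical and this is not the tool's own parser):

```python
import re

def parse_page_size(value):
    # Accept strings such as '2550x3300' and return (width, height).
    match = re.fullmatch(r'(\d+)x(\d+)', value)
    if match is None:
        raise ValueError('page size must look like WIDTHxHEIGHT: %r' % value)
    return int(match.group(1)), int(match.group(2))
```

For example, `parse_page_size('2550x3300')` gives a width of 2550 and a height of 3300 pixels.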
Use an HTML5
parser, which is more robust but slower than the default parser.
Output version information and exit.
Display help and exit.
See also
ocrodjvu(1),
djvused(1)
ocrodjvu-0.7.9/doc/ocrodjvu.1
'\" t
.\" Title: ocrodjvu
.\" Author: Jakub Wilk
.\" Generator: DocBook XSL Stylesheets v1.76.1
.\" Date: 03/10/2012
.\" Manual: ocrodjvu manual
.\" Source: ocrodjvu 0.7.9
.\" Language: English
.\"
.TH "OCRODJVU" "1" "03/10/2012" "ocrodjvu 0\&.7\&.9" "ocrodjvu manual"
.\" -----------------------------------------------------------------
.\" * Define some portability stuff
.\" -----------------------------------------------------------------
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.\" http://bugs.debian.org/507673
.\" http://lists.gnu.org/archive/html/groff/2009-02/msg00013.html
.\" ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
.ie \n(.g .ds Aq \(aq
.el .ds Aq '
.\" -----------------------------------------------------------------
.\" * set default formatting
.\" -----------------------------------------------------------------
.\" disable hyphenation
.nh
.\" disable justification (adjust text to left margin only)
.ad l
.\" -----------------------------------------------------------------
.\" * MAIN CONTENT STARTS HERE *
.\" -----------------------------------------------------------------
.SH "NAME"
ocrodjvu \- OCR for DjVu files
.SH "SYNOPSIS"
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-o\fR | \fB\-\-save\-bundled\fR} \fIoutput\-djvu\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-i\fR | \fB\-\-save\-indirect\fR} \fIindex\-djvu\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-save\-script\fR \fIscript\-file\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-in\-place\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR \fB\-\-dry\-run\fR [\fIoption\fR...] \fIdjvu\-file\fR
.HP \w'\fBocrodjvu\fR\ 'u
\fBocrodjvu\fR {\fB\-\-version\fR | \fB\-\-help\fR | \fB\-h\fR | \fB\-\-list\-engines\fR | \fB\-\-list\-languages\fR}
.SH "DESCRIPTION"
.PP
ocrodjvu is a wrapper for OCR systems that allows you to perform OCR on DjVu files\&.
.PP
The following OCR engines are supported:
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIOCRopus\fR\m[]\&\s-2\u[1]\d\s+2
(internally, ocrodjvu calls
\fBocroscript\fR\*(Aqs
\fBrecognize\fR
(or
\fBrec\-tess\fR) command, so that ultimately
Tesseract
acts as the OCR backend);
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fICuneiform for Linux\fR\m[]\&\s-2\u[2]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIOcrad\fR\m[]\&\s-2\u[3]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
\m[blue]\fIGOCR\fR\m[]\&\s-2\u[4]\d\s+2\&.
.RE
.sp
.RS 4
.ie n \{\
\h'-04'\(bu\h'+03'\c
.\}
.el \{\
.sp -1
.IP \(bu 2.3
.\}
Stand\-alone
\m[blue]\fITesseract\fR\m[]\&\s-2\u[5]\d\s+2\&.
.RE
.sp
.SH "OPTIONS"
.SS "OCR engine options"
.PP
\fB\-e\fR, \fB\-\-engine=\fR\fB\fIengine\-id\fR\fR
.RS 4
Use this OCR engine\&. The default is \(oqocropus\(cq (OCRopus)\&.
.RE
.PP
\fB\-\-list\-engines\fR
.RS 4
Print list of available OCR engines\&.
.RE
.SS "Options controlling output"
.PP
It is mandatory to use exactly one of the following options:
.PP
\fB\-o\fR, \fB\-\-save\-bundled=\fR\fB\fIoutput\-djvu\-file\fR\fR
.RS 4
Save OCR results as a bundled multi\-page document into
\fIoutput\-djvu\-file\fR\&.
.RE
.PP
\fB\-i\fR, \fB\-\-save\-indirect=\fR\fB\fIindex\-djvu\-file\fR\fR
.RS 4
Save OCR results as an indirect multi\-page document\&. Use
\fIindex\-djvu\-file\fR
as the index file name; put the component files into the same directory\&. The directory must exist and be writable\&.
.RE
.PP
\fB\-\-save\-script=\fR\fB\fIscript\-file\fR\fR
.RS 4
Save a
\fBdjvused\fR
script with OCR results into
\fIscript\-file\fR\&.
.RE
.PP
\fB\-\-in\-place\fR
.RS 4
Save OCR results in place\&.
.sp
(Use this option to retain compatibility with ocrodjvu < 0\&.2\&.)
.RE
.PP
\fB\-\-dry\-run\fR
.RS 4
Don\*(Aqt change any files, throw OCR results away\&.
.RE
.SS "Text segmentation options"
.PP
\fB\-t lines\fR, \fB\-\-details lines\fR
.RS 4
Record location of every line\&. Don\*(Aqt record locations of particular words or characters\&.
.sp
This is the default for OCRopus 0\&.2\&. The option is ineffective with stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-t words\fR, \fB\-\-details=words\fR
.RS 4
Record location of every line and every word\&. Don\*(Aqt record locations of particular characters\&.
.sp
This is the default for most OCR engines\&.
.sp
This option is ineffective with OCRopus 0\&.2 and stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-t chars\fR, \fB\-\-details=chars\fR
.RS 4
Record location of every line, every word and every character\&.
.sp
This option is ineffective with OCRopus 0\&.2 and stand\-alone Tesseract 2\&.0\&.
.RE
.PP
\fB\-\-word\-segmentation=simple\fR
.RS 4
Consider each non\-empty sequence of non\-whitespace characters a single word\&.
.sp
This is the default, despite being linguistically incorrect\&.
.RE
.PP
\fB\-\-word\-segmentation=uax29\fR
.RS 4
Use the
\m[blue]\fIUnicode Text Segmentation\fR\m[]\&\s-2\u[6]\d\s+2
algorithm to break lines into words\&.
.sp
This option breaks assumptions of some DjVu tools that words are separated by spaces, and therefore it is not recommended\&.
.RE
.SS "Other options"
.PP
\fB\-\-clear\-text\fR
.RS 4
Remove existing hidden text if present in the pages not selected for OCR\&.
.sp
(Use this option to retain compatibility with ocrodjvu < 0\&.2\&.)
.RE
.PP
\fB\-\-ocr\-only\fR
.RS 4
Don\*(Aqt save pages that were not processed\&.
.RE
.PP
\fB\-l\fR, \fB\-\-language=\fR\fB\fIlanguage\-id\fR\fR
.RS 4
Set recognition language\&.
\fIlanguage\-id\fR
is typically an ISO 639\-2/T three\-letter code\&.
.sp
For OCRopus, the default is \(oqeng\(cq (English), unless the
\fItesslanguage\fR
environment variable is set\&. For other OCR engines, the default is always \(oqeng\(cq\&.
.RE
.PP
\fB\-\-list\-languages\fR
.RS 4
Print list of available languages for the currently selected OCR engine\&.
.RE
.PP
\fB\-\-render=mask\fR
.RS 4
Render only masks of page images\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-render=foreground\fR
.RS 4
Render only foreground layers of page images\&.
.RE
.PP
\fB\-\-render=all\fR
.RS 4
Render all layers of page images\&.
.sp
This option is necessary to OCR DjVu files with invalid foreground/background separation\&.
.RE
.PP
\fB\-p\fR, \fB\-\-pages=\fR\fB\fIpage\-range\fR\fR
.RS 4
Specifies pages to process\&.
\fIpage\-range\fR
is a comma\-separated list of sub\-ranges\&. Each sub\-range is either a single page (e\&.g\&.\ \&17) or a contiguous range of pages (e\&.g\&.\ \&37\-42)\&. Pages are numbered from 1\&.
.sp
The default is to process all pages\&.
.RE
.PP
\fB\-j\fR, \fB\-\-jobs=\fR\fB\fIn\fR\fR
.RS 4
Start up to
\fIn\fR
OCR processes\&.
.RE
.PP
\fB\-\-version\fR
.RS 4
Output version information and exit\&.
.RE
.PP
\fB\-h\fR, \fB\-\-help\fR
.RS 4
Display help and exit\&.
.RE
.SS "Advanced options"
.PP
\fB\-D\fR, \fB\-\-debug\fR
.RS 4
To ease debugging, don\*(Aqt delete intermediate files\&.
.RE
.PP
\fB\-X \fR\fB\fIkey\fR\fR\fB=\fR\fB\fIvalue\fR\fR
.RS 4
This option allows you to control some details of how ocrodjvu operates\&.
.RE
.PP
\fB\-\-on\-error=abort\fR
.RS 4
Stop program execution when an exceptional situation (e\&.g\&., malformed output from the OCR engine, internal ocrodjvu error, etc\&.) occurs\&.
.sp
This is the default\&.
.RE
.PP
\fB\-\-on\-error=resume\fR
.RS 4
Attempt to recover from exceptional situations\&.
.sp
This option is strongly discouraged\&.
.RE
.PP
\fB\-\-html5\fR
.RS 4
Use an
\m[blue]\fIHTML5 parser\fR\m[]\&\s-2\u[7]\d\s+2, which is more robust but slower than the default parser\&.
.RE
.SH "ENVIRONMENT"
.PP
The following environment variables affect ocrodjvu:
.PP
\fItesslanguage\fR
.RS 4
Recognition language for Tesseract\&.
.sp
(Use of this variable is deprecated in favor of the
\fB\-\-language\fR
option\&.)
.RE
.PP
\fITMPDIR\fR
.RS 4
ocrodjvu makes heavy use of temporary files\&. It will store them in a directory specified by this variable\&. The default is
/tmp\&.
.RE
.SH "BUGS"
.PP
Tesseract 3\&.00 is affected by a bug
\&\s-2\u[8]\d\s+2
making it produce invalid hOCR output in certain circumstances\&. ocrodjvu does not try to recover from this fault (which couldn\*(Aqt be done reliably anyway) unless you pass the
\fB\-X fix\-html=1\fR
option\&.
.PP
When using Tesseract \(>= 3\&.00, extracting bounding boxes of particular characters (which happens when either
\fB\-\-details=chars\fR
or
\fB\-\-word\-segmentation=uax29\fR is used) is inefficient\&. This is due to limitations of the Tesseract command\-line interface\&.
.SH "SEE ALSO"
.PP
\fBdjvu\fR(1),
\fBocroscript\fR(1),
\fBtesseract\fR(1),
\fBcuneiform\fR(1),
\fBocrad\fR(1),
\fBgocr\fR(1)
.SH "AUTHOR"
.PP
\fBJakub Wilk\fR <\&jwilk@jwilk\&.net\&>
.RS 4
Author.
.RE
.SH "NOTES"
.IP " 1." 4
OCRopus
.RS 4
\m[blue]\fI\%http://ocropus.googlecode.com/\fR\m[]
.RE
.IP " 2." 4
Cuneiform for Linux
.RS 4
\m[blue]\fI\%http://launchpad.net/cuneiform-linux\fR\m[]
.RE
.IP " 3." 4
Ocrad
.RS 4
\m[blue]\fI\%http://www.gnu.org/software/ocrad/\fR\m[]
.RE
.IP " 4." 4
GOCR
.RS 4
\m[blue]\fI\%http://jocr.sourceforge.net/\fR\m[]
.RE
.IP " 5." 4
Tesseract
.RS 4
\m[blue]\fI\%http://code.google.com/p/tesseract-ocr/\fR\m[]
.RE
.IP " 6." 4
Unicode Text Segmentation
.RS 4
\m[blue]\fI\%http://unicode.org/reports/tr29/\fR\m[]
.RE
.IP " 7." 4
HTML5 parser
.RS 4
\m[blue]\fI\%http://www.whatwg.org/specs/web-apps/current-work/#html-parser\fR\m[]
.RE
.IP " 8." 4
\m[blue]\fI\%http://code.google.com/p/tesseract-ocr/issues/detail?id=376\fR\m[]
ocrodjvu-0.7.9/doc/changelog
ocrodjvu (0.7.9) unstable; urgency=low
* Improve error handling.
* Fix compatibility with Tesseract > 3.01.
-- Jakub Wilk Sat, 10 Mar 2012 23:36:03 +0100
ocrodjvu (0.7.8) unstable; urgency=low
* Improve test suite.
-- Jakub Wilk Sun, 22 Jan 2012 00:04:16 +0100
ocrodjvu (0.7.7) unstable; urgency=low
* Raise proper import error if html5lib is not installed. Thanks to Kyrill
Detinov for the bug report.
-- Jakub Wilk Sun, 11 Dec 2011 23:08:05 +0100
ocrodjvu (0.7.6) unstable; urgency=low
* Improve error handling.
* ocrodjvu:
+ Fix a regression in gocr, ocrad and tesseract engines, which made them
unusable.
-- Jakub Wilk Thu, 27 Oct 2011 18:06:38 +0200
ocrodjvu (0.7.5) unstable; urgency=low
* Check Python version in setup.py.
* Accept slightly malformed hOCR documents (with a text zone not completely
within the page area).
http://bugs.debian.org/575484#35
* Fix compatibility with Tesseract > 3.00.
Thanks to Janusz S. Bień for the bug report.
* ocrodjvu, hocr2djvused:
+ Add the --html5 option.
-- Jakub Wilk Sat, 27 Aug 2011 01:25:33 +0200
ocrodjvu (0.7.4) unstable; urgency=low
* Use a better method to detect Debian-based systems.
* hocr2djvused:
+ Ignore comments and
'''
def _wait_for_worker(worker):
    stderr = worker.stderr.readlines()
    try:
        worker.wait()
    except Exception:
        for line in stderr:
            sys.stderr.write(line)
        raise
    if len(stderr) == 1:
        [line] = stderr
        if line.startswith(('Tesseract Open Source OCR Engine', 'Page')):
            # Annoyingly, Tesseract prints its own name on standard error even
            # if nothing went wrong. Filter out such an unhelpful message.
            return
    for line in stderr:
        sys.stderr.write(line)

def fix_html(s):
    '''
    Work around buggy hOCR output:
    http://code.google.com/p/tesseract-ocr/issues/detail?id=376
    '''
    regex = re.compile(
        r'''
        ( <[!/]?[a-z]+(?:\s+[^<>]*)?>
        | &[a-z]+;
        | &[#][0-9]+;
        | &[#]x[0-9a-f]+;
        | [^<>&]+
        )
        ''', re.IGNORECASE | re.VERBOSE
    )
    return ''.join(
        chunk if n & 1 else cgi.escape(chunk)
        for n, chunk in enumerate(regex.split(s))
    )

class ExtractSettings(object):

    def __init__(self, rotation=0, page_size=None, **kwargs):
        self.rotation = rotation
        self.page_size = page_size

class Engine(common.Engine):

    name = 'tesseract'
    image_format = image_io.TIFF

    executable = utils.property('tesseract')
    extra_args = utils.property([], shlex.split)
    use_hocr = utils.property(None, int)
    fix_html = utils.property(0, int)

    def __init__(self, *args, **kwargs):
        assert args == ()
        common.Engine.__init__(self, **kwargs)
        try:
            self._directory, self._extension = self.get_filesystem_info()
        except errors.UnknownLanguageList:
            raise errors.EngineNotFound(self.name)
        if self.use_hocr is None:
            self.use_hocr = self._extension == 'traineddata'
        if self.use_hocr:
            # Import hocr late, so that importing lxml is not triggered if hOCR
            # output is not used.
            from .. import hocr
            self._hocr = hocr
            self.output_format = 'html'
        else:
            self._hocr = None
            self.output_format = 'txt'

    def get_filesystem_info(self):
        try:
            tesseract = ipc.Subprocess([self.executable, '', '', '-l', 'nonexistent'],
                stdout=ipc.PIPE,
                stderr=ipc.PIPE,
            )
        except OSError:
            raise errors.UnknownLanguageList
        try:
            line = tesseract.stderr.read()
            match = _error_pattern.match(line)
            if match is None:
                raise errors.UnknownLanguageList
            directory = match.group('dir')
            extension = match.group('ext')
            if not os.path.isdir(directory):
                raise errors.UnknownLanguageList
        finally:
            try:
                tesseract.wait()
            except ipc.CalledProcessError:
                pass
            else:
                raise errors.UnknownLanguageList
        return directory, extension

    def list_languages(self):
        for filename in glob.glob(os.path.join(self._directory, '*.%s' % self._extension)):
            filename = os.path.basename(filename)
            language = os.path.splitext(filename)[0]
            if _language_pattern.match(language):
                yield language

    def has_language(self, language):
        if not _language_pattern.match(language):
            raise errors.InvalidLanguageId(language)
        return os.path.exists(os.path.join(self._directory, '%s.%s' % (language, self._extension)))

    @classmethod
    def get_default_language(cls):
        return os.getenv('tesslanguage') or 'eng'

    @contextlib.contextmanager
    def recognize_plain_text(self, image, language, details=None, uax29=None):
        with temporary.directory() as output_dir:
            worker = ipc.Subprocess(
                [self.executable, image.name, os.path.join(output_dir, 'tmp'), '-l', language] + self.extra_args,
                stdout=ipc.PIPE,
                stderr=ipc.PIPE,
            )
            _wait_for_worker(worker)
            with open(os.path.join(output_dir, 'tmp.txt'), 'rt') as file:
                yield file

    @contextlib.contextmanager
    def recognize_hocr(self, image, language, details=text_zones.TEXT_DETAILS_WORD, uax29=None):
        character_details = details < text_zones.TEXT_DETAILS_WORD or (uax29 and details <= text_zones.TEXT_DETAILS_WORD)
        with temporary.directory() as output_dir:
            tessconf_path = os.path.join(output_dir, 'tessconf')
            with open(tessconf_path, 'wt') as tessconf:
                # Tesseract 3.00 doesn't come with any config file to enable hOCR
                # output. Let's create our own one.
                print >>tessconf, 'tessedit_create_hocr T'
            worker = ipc.Subprocess(
                [self.executable, image.name, os.path.join(output_dir, 'tmp'), '-l', language, tessconf_path] + self.extra_args,
                stdout=ipc.PIPE,
                stderr=ipc.PIPE,
            )
            _wait_for_worker(worker)
            with open(os.path.join(output_dir, 'tmp.html'), 'r') as hocr_file:
                if self.fix_html or character_details:
                    contents = hocr_file.read()
                else:
                    yield hocr_file
                    return
            if character_details:
                worker = ipc.Subprocess(
                    [self.executable, image.name, os.path.join(output_dir, 'tmp'), '-l', language, 'makebox'] + self.extra_args,
                    stderr=ipc.PIPE,
                )
                _wait_for_worker(worker)
                with open(os.path.join(output_dir, 'tmp.box'), 'r') as box_file:
                    contents = contents.replace('