debian/0000755000000000000000000000000011625154357007177 5ustar debian/source/0000755000000000000000000000000011625154357010477 5ustar debian/source/format0000644000000000000000000000001411402434101011662 0ustar 3.0 (quilt) debian/changelog0000644000000000000000000000644111625154345011053 0ustar pdfminer (20110515+dfsg-1) unstable; urgency=low * New upstream release * Upload to unstable * debian/control - Removed Jakub and added Debian Python Modules Team to Maintainer - Added myself to Uploaders (Closes: #629178) - Bumped Standards-Version to 3.9.2 (no changes needed) * debian/{control,rules} - Switched to dh_python2 -- Daniele Tricoli Wed, 24 Aug 2011 12:56:37 +0200 pdfminer (20110227+dfsg-1) experimental; urgency=low * New upstream release. + Document the -V option in pdf2txt manual page. * Correct a few grammatical errors in the manual pages and in the package description. Thanks to Stefano Rivera for help. * Remove byte-compiled files from (repackaged) upstream tarball. * Use $() constructs rather than backticks in shell scripts. * Rename some private variables in debian/rules to make them lowercase. -- Jakub Wilk Sat, 05 Mar 2011 18:39:32 +0100 pdfminer (20101226+dfsg-1) experimental; urgency=low * New upstream release. + Drop fix-test-psparser.diff, applied upstream. + Prevent upstream Makefile from using ‘python2’ binary. [python2.diff] -- Jakub Wilk Tue, 28 Dec 2010 11:18:13 +0100 pdfminer (20101017+dfsg-1) experimental; urgency=low * New upstream release. * Fix a typo in the pdf2txt manual page. * Backport an upstream patch to fix test failures. [fix-test-psparser.diff] * To fix FTBFS when built twice in a row: + force dh_auto_clean to use distutils build system; + add samples/{*.txt,*.html,*.xml} to debian/clean. -- Jakub Wilk Thu, 02 Dec 2010 17:56:37 +0100 pdfminer (20100829+dfsg-1) experimental; urgency=low * New upstream release. 
* Add mutual Breaks to ensure that if python-pdfminer and pdfminer-data are installed together, they have the same version. * Use pickle protocol 2 for serializing data. [pickle-protocol-2.diff] -- Jakub Wilk Sun, 29 Aug 2010 11:43:51 +0200 pdfminer (20100619p1+dfsg-1) experimental; urgency=low * New upstream release. + Drop all patches: either applied upstream or not needed anymore. + Recreate non-empty cmap/__init__.py in the build target and remove it in the clean target. + Update debian/get-orig-source and debian/rules to take into account new location of non-free samples. + Relax debian/watch and debian/rules to allow versions with pN suffix. + Explain copyright status of samples/jo.* in debian/copyright. * Bump standards version to 3.9.1 (no changes needed). -- Jakub Wilk Thu, 26 Aug 2010 18:17:31 +0200 pdfminer (20100424+dfsg-1) experimental; urgency=low * Initial release (closes: #584555). * Strip non-DFSG-free test documents from the .orig.tar.gz. + Run tests only on those files that are actually available. [dfsg-testsuite.diff] * Disable test suite for psparser.py, as it is currently broken. [psparser-testsuite.diff] * Store encoding data in gzipped pickles rather than in Python modules. This way we can save lots of disk space. [encoding-data.diff] * Backport upstream patches: + to fix a bug in layout analysis [layout.diff]; + to allow extraction of nested tags [nested-tags.diff]. 
-- Jakub Wilk Sun, 13 Jun 2010 12:27:50 +0200 debian/control0000644000000000000000000000352511613640510010573 0ustar Source: pdfminer Section: python Priority: optional Maintainer: Debian Python Modules Team Uploaders: Daniele Tricoli Build-Depends: debhelper (>= 7.0.50~), docbook-xml, docbook-xsl, elinks-lite | elinks, libxml2-utils, python-all (>= 2.6.6-3~), python-nose, xsltproc, X-Python-Version: >= 2.4 Standards-Version: 3.9.2 Homepage: http://www.unixuser.org/~euske/python/pdfminer/ Vcs-Svn: svn://svn.debian.org/python-modules/packages/pdfminer/trunk/ Vcs-Browser: http://svn.debian.org/viewsvn/python-modules/packages/pdfminer/trunk/ Package: python-pdfminer Architecture: all Depends: ${misc:Depends}, ${python:Depends} Suggests: pdfminer-data Breaks: pdfminer-data (<< ${source:Version}) Description: PDF parser and analyser PDFMiner is a tool for extracting information from PDF documents, which focuses entirely on getting and analyzing text data. It allows one to obtain the exact location of text portions in a page, as well as other information such as fonts or lines. It includes a PDF converter that can transform PDF files into other text formats (such as HTML). It has an extensible PDF parser that can be used for other purposes than text analysis. . This package provides the Python module and the command-line tools: pdf2txt and dumppdf. Package: pdfminer-data Architecture: all Depends: ${misc:Depends} Recommends: python-pdfminer Breaks: python-pdfminer (<< ${source:Version}) Description: PDF parser and analyser (encoding data) PDFMiner is a tool for extracting information from PDF documents, which focuses entirely on getting and analyzing text data. . This package contains the encoding data needed to read some PDF documents in CJK (Chinese, Japanese, Korean) languages. 
debian/manpages/0000755000000000000000000000000011625154357010772 5ustar debian/manpages/Makefile0000644000000000000000000000054211403537200012415 0ustar XML_FILES = $(wildcard *.xml) MAN_FILES = $(XML_FILES:.xml=) XSL = http://docbook.sourceforge.net/release/xsl/current/manpages/docbook.xsl XSL_PARAMS = --param man.charmap.use.subset 0 .PHONY: all all: $(MAN_FILES) %: %.xml xmllint --valid --noout $(<) xsltproc $(XSL_PARAMS) $(XSL) $(<) .PHONY: clean clean: rm $(MAN_FILES) # vim:ts=4 sw=4 noet debian/manpages/dumppdf.1.xml0000644000000000000000000001155411532453270013311 0ustar PDFMiner Manual dumppdf Jakub Wilk Wrote this manual page for the Debian system.
jwilk@debian.org
Yusuke Shinyama Author of PDFMiner and its original HTML documentation.
yusuke@cs.nyu.edu
dumppdf 1 dumppdf dumps internal contents of PDF files dumppdf option file Description dumppdf dumps the internal contents of a PDF file in pseudo-XML format. This program is primarily for debugging purposes, but it's also possible to extract some meaningful contents. Options Dump all the objects. By default only the document trailer is printed. Specifies PDF object IDs to display. Comma-separated IDs, or multiple options are accepted. Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all the pages. Specifies the output format of stream contents. Because the contents of stream objects can be very large, they are omitted when none of the options above is specified. With option, the “raw” stream contents are dumped without decompression. With option, the decompressed contents are dumped as a binary blob. With option, the decompressed contents are dumped in a text format, in a manner similar to repr(). When or option is given, no stream header is displayed, to make it easier to save the output to a file. Show the table of contents. Provides the user password to access PDF contents. Increase the debug level. Examples Dump all the headers and contents, except stream objects: $ dumppdf -a test.pdf Dump the table of contents: $ dumppdf -T test.pdf Extract a JPEG image: $ dumppdf -r -i6 test.pdf > image.jpeg See also pdf2txt1
debian/manpages/pdf2txt.1.xml0000644000000000000000000002250211532453270013240 0ustar PDFMiner Manual pdf2txt Jakub Wilk Wrote this manual page for the Debian system.
jwilk@debian.org
Yusuke Shinyama Author of PDFMiner and its original HTML documentation.
yusuke@cs.nyu.edu
pdf2txt 1 pdf2txt extracts text contents of PDF files pdf2txt option file Description pdf2txt extracts text contents from a PDF file. It extracts all the text that is to be rendered programmatically, i.e. text represented as ASCII or Unicode strings. It cannot recognize text drawn as images that would require optical character recognition. It also extracts the corresponding locations, font names, font sizes, writing direction (horizontal or vertical) for each text portion. You need to provide a password for protected PDF documents whose access is restricted. You cannot extract any text from a PDF document which does not have extraction permission. Options Specifies the output file name. The default is to print the extracted contents to standard output in text format. Specifies the comma-separated list of the page numbers to be extracted. Page numbers start at one. By default, it extracts text from all the pages. Specifies the output codec. Specifies the output format. The following formats are currently supported: text Text format. This is the default. html HTML format. It is not recommended. xml XML format. It provides the most information. tag “Tagged PDF” format. A tagged PDF has its own contents annotated with HTML-like tags. pdf2txt tries to extract its content streams rather than inferring its text locations. Tags used here are defined in the PDF Reference, Sixth Edition (§10.7 “Tagged PDF”). Specifies the writing mode of text outputs: lr-tb Left-to-right, top-to-bottom. tb-rl Top-to-bottom, right-to-left. auto Determine writing mode automatically. These are the parameters used for layout analysis. In an actual PDF file, text portions might be split into several chunks in the middle of its running, depending on the authoring software. Therefore, text extraction needs to splice text chunks. In the figure below, two text chunks whose distance is closer than the char-margin are considered continuous and grouped into one. 
Also, two lines whose distance is closer than the line-margin are grouped as a text box, which is a rectangular area that contains a “cluster” of text portions. Furthermore, it may be required to insert blank characters (spaces) as necessary if the distance between two words is greater than the word-margin, as a blank between words might not be represented as a space, but indicated by the positioning of each word. Each value is specified not as an actual length, but as a proportion of the length to the size of each character in question. The default values are char-margin = 1.0, line-margin = 0.3, and word-margin = 0.2, respectively. Suppress layout analysis. Force layout analysis for all the text strings, including text contained in figures. Enable detection of vertical writing. Specifies the output scale. This option can be used in HTML format only. Specifies the maximum number of pages to extract. By default, all the pages in a document are extracted. Provides the user password to access PDF contents. Increase the debug level. Examples Extract text as an HTML file whose filename is output.html: $ pdf2txt -o output.html samples/naacl06-shinyama.pdf Extract a Japanese HTML file in vertical writing: $ pdf2txt -c euc-jp -D tb-rl -o output.html samples/jo.pdf Extract text from an encrypted PDF file: $ pdf2txt -P mypassword -o output.txt secret.pdf See also dumppdf1
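The char-margin and word-margin rules described in the layout analysis paragraph can be sketched in Python as follows. This is only an illustration under stated assumptions, not PDFMiner's actual implementation: the Chunk type, the group_chunks function and the sample coordinates are hypothetical, and the sketch handles a single horizontal line only. Margins are scaled by the character size, as the manual page describes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Chunk:
    """Hypothetical text chunk on one horizontal line (not a PDFMiner class)."""
    text: str
    x0: float    # left edge
    x1: float    # right edge
    size: float  # character size, used to scale the margins


def group_chunks(chunks: List[Chunk],
                 char_margin: float = 1.0,
                 word_margin: float = 0.2) -> List[str]:
    """Splice chunks whose gap is below char_margin * size.

    Within a spliced group, a space is inserted when the gap
    exceeds word_margin * size.
    """
    groups: List[str] = []
    current = ""
    prev: Optional[Chunk] = None
    for chunk in sorted(chunks, key=lambda c: c.x0):
        if prev is None:
            current = chunk.text
        else:
            gap = chunk.x0 - prev.x1
            if gap < char_margin * chunk.size:
                # Close enough to be the same run of text.
                if gap > word_margin * chunk.size:
                    current += " "  # gap is wide enough to be a word break
                current += chunk.text
            else:
                # Too far apart: start a new group.
                groups.append(current)
                current = chunk.text
        prev = chunk
    if prev is not None:
        groups.append(current)
    return groups


sample = [
    Chunk("Hel", 0, 30, 10),
    Chunk("lo", 31, 50, 10),
    Chunk("world", 55, 100, 10),
    Chunk("far", 200, 230, 10),
]
print(group_chunks(sample))
```

With the default margins, the first three sample chunks are spliced into one run (with a space before the third), while the last chunk is far enough away to start a new group.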
debian/python-pdfminer.doc-base0000644000000000000000000000032511402543270013706 0ustar Document: pdfminer-documentation Title: PDFMiner documentation Author: Yusuke Shinyama Section: Text Format: HTML Index: /usr/share/doc/python-pdfminer/index.html Files: /usr/share/doc/python-pdfminer/index.html debian/python-pdfminer.examples0000644000000000000000000000002311402466356014053 0ustar tools/pdf2html.cgi debian/python-pdfminer.docs0000644000000000000000000000002711402433652013162 0ustar docs/*.html README.txt debian/copyright0000644000000000000000000001361111625073376011135 0ustar Format: http://anonscm.debian.org/viewvc/dep/web/deps/dep5.mdwn?revision=174 Upstream-Name: PDFminer Upstream-Contact: Yusuke Shinyama Source: http://pypi.python.org/pypi/pdfminer/ Comment: All but trivial test PDF documents were removed from the sources/ directory because of lack of source code for them. License: Expat Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: . The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. . THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE. 
Files: * Copyright: 2004-2011, Yusuke Shinyama License: Expat Files: cmapsrc/*.txt Copyright: 1990-2010, Adobe Systems Incorporated License: BSD-Adobe Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met: . Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer. . Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution. . Neither the name of Adobe Systems Incorporated nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission. . THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS “AS IS” AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE. Files: pdfminer/glyphlist.py Copyright: 1997-2007, Adobe Systems Incorporated License: other-1 Permission is hereby granted, free of charge, to any person obtaining a copy of this documentation file to use, copy, publish, distribute, sublicense, and/or sell copies of the documentation, and to permit others to do the same, provided that: . 
- No modification, editing or other alteration of this document is allowed; and . - The above copyright notice and this permission notice shall be included in all copies of the documentation. . Permission is hereby granted, free of charge, to any person obtaining a copy of this documentation file, to create their own derivative works from the content of this document to use, copy, publish, distribute, sublicense, and/or sell the derivative works, and to permit others to do the same, provided that the derived work is not represented as being a copy or version of this document. . Adobe shall not be liable to any party for any loss of revenue or profit or for indirect, incidental, special, consequential, or other similar damages, whether based on tort (including without limitation negligence or strict liability), contract or other legal or equitable grounds even if Adobe has been advised or had reason to know of the possibility of such damages. The Adobe materials are provided on an "AS IS" basis. Adobe specifically disclaims all express, statutory, or implied warranties relating to the Adobe materials, including but not limited to those concerning merchantability or fitness for a particular purpose or non-infringement of any third party rights regarding the Adobe materials. Files: pdfminer/fontmetrics.py Copyright: 1985-1999, Adobe Systems Incorporated. License: other-2 This file and the 35 PostScript® AFM files it accompanies may be used, copied, and distributed for any purpose and without charge, with or without modification, provided that all copyright notices are retained; that the AFM files are not distributed without this file; that all modifications to this file or any of the AFM files are prominently noted in the modified file(s); and that this paragraph is not modified. Adobe Systems has no responsibility or obligation to support the use of the AFM files. 
Files: pdfminer/rijndael.py pdfminer/runlength.py Copyright: Public Domain License: public-domain This code is in the public domain. . rijndael.py is based on a public domain C implementation by Philip J. Erdelsky: http://www.efgh.com/software/rijndael.htm . runlength.py: RunLength decoder (Adobe version) implementation based on PDF Reference version 1.4 section 3.3.4. Files: samples/jo.* Copyright: expired Comment: Kenji Miyazawa, preface of "Haru to Shura" License: public-domain These files are in the public domain because copyright expired: Kenji Miyazawa died on 21 September 1933. Files: debian/* Copyright: 2010, Jakub Wilk 2011, Daniele Tricoli License: Expat debian/rules0000755000000000000000000000276711614146257010271 0ustar #!/usr/bin/make -f here = $(dir $(firstword $(MAKEFILE_LIST)))/.. upstream_version = $(shell cd $(here) && dpkg-parsechangelog | sed -n -r -e '/^Version: (.+)([+]dfsg).*/ { s//\1/; p; q; }') .PHONY: override_dh_auto_build override_dh_auto_build: $(MAKE) cmap dh_auto_build echo '#' > pdfminer/cmap/__init__.py .PHONY: override_dh_install override_dh_install: rename.ul .py '' debian/tmp/usr/bin/*.py dh_install .PHONY: override_dh_installman override_dh_installman: $(MAKE) -C debian/manpages/ dh_installman .PHONY: override_dh_auto_test override_dh_auto_test: ifeq ($(filter nocheck,$(DEB_BUILD_OPTIONS)),) set -e -x; \ for python in $(shell pyversions -r); do \ $$python /usr/bin/nosetests --with-doctest --verbose pdfminer/*.py; \ $(MAKE) -C samples clean; \ $(MAKE) -C samples PYTHON=$$python CMP="diff -u" HTMLS_NONFREE= TEXTS_NONFREE= XMLS_NONFREE= test; \ done endif .PHONY: override_dh_installchangelogs override_dh_installchangelogs: elinks -config-file /dev/null -dump -no-numbering -no-references docs/index.html \ | sed -n -e '/^Changes/,/^ ---/ { /^ / s/// p }' \ > docs/changelog dh_installchangelogs docs/changelog .PHONY: get-orig-source: sh -x $(here)/debian/get-orig-source.sh $(upstream_version) .PHONY: build build-arch build-indep binary 
binary-arch binary-indep clean build build-arch build-indep binary binary-indep clean: dh $(@) --with python2 -Spython_distutils # In order not to confuse lintian, binary-arch is a separate target: binary-arch: # vim:ts=4 sw=4 noet debian/watch0000644000000000000000000000021111435322552010213 0ustar version=3 opts=dversionmangle=s/\+dfsg// \ http://pypi.python.org/packages/source/p/pdfminer/pdfminer-([0-9.]+(?:p[0-9.]+)?)[.]tar[.]gz debian/get-orig-source.sh0000644000000000000000000000127411534471513012545 0ustar set -e export TAR_OPTIONS="--owner root --group root --mode a+rX" export GZIP="-9n" pwd=$(pwd) version="$1" if [ -z "$version" ] then printf 'Usage: %s \n' "$0" exit 1 fi cd "$(dirname "$0")/../" tmpdir=$(mktemp -d get-orig-source.XXXXXX) uscan --noconf --force-download --rename --download-version="$version" --destdir="$tmpdir" cd "$tmpdir" tar -xzf pdfminer_*.orig.tar.gz rm *.tar.gz # Remove test documents without source: rm -Rf pdfminer-*/samples/nonfree/ # Remove byte-compiled files find pdfminer-* -name '*.py[co]' -delete mv pdfminer-*/ "pdfminer-$version.orig" tar -czf "$pwd/pdfminer_$version+dfsg.orig.tar.gz" pdfminer-*.orig/ cd .. rm -Rf "$tmpdir" # vim:ts=4 sw=4 et debian/patches/0000755000000000000000000000000011625154357010626 5ustar debian/patches/series0000644000000000000000000000004411506210370012023 0ustar python2.diff pickle-protocol-2.diff debian/patches/python2.diff0000644000000000000000000000034711506210370013051 0ustar Description: Use ‘python’ rather than ‘python2’ binary. Forwarded: no Last-Update: 2010-12-27 --- a/Makefile +++ b/Makefile @@ -3,7 +3,7 @@ PACKAGE=pdfminer -PYTHON=python2 +PYTHON=python GIT=git RM=rm -f CP=cp -f debian/patches/pickle-protocol-2.diff0000644000000000000000000000054511436425720014724 0ustar Description: Use pickle protocol 2 for serializing data. 
Author: Jakub Wilk --- a/tools/conv_cmap.py +++ b/tools/conv_cmap.py @@ -148,7 +148,7 @@ CID2UNICHR_H=cid2unichr_h, CID2UNICHR_V=cid2unichr_v, ) - fp.write(pickle.dumps(data)) + fp.write(pickle.dumps(data, protocol=2)) fp.close() return 0 debian/compat0000644000000000000000000000000211402433652010364 0ustar 7 debian/pdfminer-data.install0000644000000000000000000000011211404743671013272 0ustar usr/lib/python*/*-packages/pdfminer/cmap/*.pickle.gz /usr/share/pdfminer/ debian/python-pdfminer.manpages0000644000000000000000000000003011403537200014012 0ustar debian/manpages/*.[0-9] debian/python-pdfminer.install0000644000000000000000000000025311404743671013710 0ustar /usr/bin/pdf2txt /usr/bin/dumppdf /usr/lib/python*/*-packages/pdfminer-*.egg-info /usr/lib/python*/*-packages/pdfminer/*.py /usr/lib/python*/*-packages/pdfminer/cmap/*.py debian/clean0000644000000000000000000000014211475747631010207 0ustar debian/manpages/*.[0-9] docs/changelog pdfminer/cmap/* samples/*.xml samples/*.html samples/*.txt
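The effect of the pickle-protocol-2.diff patch, combined with the gzipped pickles (*.pickle.gz) that pdfminer-data installs, can be illustrated with a short Python sketch. This is a minimal example under stated assumptions, not the code of tools/conv_cmap.py: the dump_cmap and load_cmap helpers and the sample mapping are hypothetical, and only the CID2UNICHR_H and CID2UNICHR_V keys are taken from the patch context. The sketch is written for Python 3, whereas the package itself targets Python 2.

```python
import gzip
import io
import pickle


def dump_cmap(data):
    """Serialize data with pickle protocol 2 and gzip-compress it (hypothetical helper)."""
    buf = io.BytesIO()
    # mtime=0 keeps the gzip header free of the current timestamp.
    with gzip.GzipFile(fileobj=buf, mode="wb", mtime=0) as fp:
        fp.write(pickle.dumps(data, protocol=2))
    return buf.getvalue()


def load_cmap(blob):
    """Inverse of dump_cmap: decompress and unpickle (hypothetical helper)."""
    with gzip.GzipFile(fileobj=io.BytesIO(blob), mode="rb") as fp:
        return pickle.loads(fp.read())


# Made-up sample data; real CMap tables are far larger.
sample = dict(
    CID2UNICHR_H={1: u"\u3042"},
    CID2UNICHR_V={1: u"\u3044"},
)

blob = dump_cmap(sample)
restored = load_cmap(blob)
print(restored == sample)
```

Protocol 2 is a binary format that both Python 2 (2.3 and later) and Python 3 can read, and it is more compact than the text-based protocol 0 that Python 2's pickle.dumps uses by default.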